The Digamma.ai Natural Language Processing (NLP) Framework
The traditional natural language processing (NLP) development process is complex and time-consuming. At Digamma.ai, we set out to make it better, faster, and more efficient, and developed our own proprietary NLP framework.
What is NLP?

NLP allows machines to understand human language. This human-computer interaction enables real-world applications such as sentiment analysis, automatic text summarization, topic extraction, named entity recognition, part-of-speech tagging, relationship extraction, stemming, and many more.

NLP is commonly used for automated question answering, information extraction (e.g. from legal or financial documents), automated spam filtering, and machine translation (e.g. Google Translate).



Why Use the Digamma.ai NLP Framework?

NLP is a challenging field. Human language is rarely precise: it is often ambiguous and frequently requires context. To understand human language is to understand not only the words, but the concepts being communicated and how they are connected to create meaning. Although language is one of the easiest things for humans to learn, its ambiguity makes natural language processing a difficult problem for computers to master.

Conducting NLP is often a laborious undertaking. In practice, it is not a matter of simply developing an algorithm and letting a computer do the work. First, the data must be prepared and cleaned. Then, repetitive, time-consuming, hands-on trial-and-error experiments must be run manually. Our proprietary, scalable framework changes all of this by ‘gluing’ various NLP methods together, automating them, and letting you apply them iteratively to refine your model. When applied, our framework allows you to bootstrap your data, assess it quickly, and get to market faster than your competitors.

Our Framework
Our framework consists of three parts: text preprocessing, exploratory analysis, and text-to-features. Each part includes a variety of models and algorithms that are customizable through options and parameters. A typical workflow requires multiple iterations, trying different models and algorithms, tweaking key parameters, and applying them in many different combinations. Done manually, this is very time-consuming; our framework automates the entire process. It lets you apply a variety of efficient methods quickly and iteratively to refine your model instead of relying on onerous trial-and-error experiments. The framework is also scalable to very large datasets and can be deployed to the cloud (e.g. Amazon AWS and other providers).
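To give a concrete sense of what trying parameters in many different combinations involves, here is a minimal Python sketch that enumerates configurations across the three stages. The stage options, parameter names, and run_pipeline function are illustrative assumptions, not the framework's actual API; the real framework automates, caches, and scales this kind of enumeration.

```python
from itertools import product

# Hypothetical configuration spaces for the three stages (names are illustrative only).
preprocessing_options = [
    {"lowercase": True,  "strip_html": True},
    {"lowercase": False, "strip_html": True},
]
feature_options = [
    {"features": "tfidf",        "ngram_range": (1, 1)},
    {"features": "tfidf",        "ngram_range": (1, 2)},
    {"features": "bag_of_words", "ngram_range": (1, 1)},
]
model_options = [
    {"model": "logistic_regression"},
    {"model": "linear_svm"},
]

def run_pipeline(prep_cfg, feat_cfg, model_cfg):
    """Placeholder for one end-to-end run: preprocess, extract features, train, evaluate."""
    return 0.0  # a real implementation would return a validation score

# Enumerate every combination of stage configurations and keep the best-scoring one.
best = max(
    product(preprocessing_options, feature_options, model_options),
    key=lambda cfgs: run_pipeline(*cfgs),
)
print("best configuration:", best)
```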

Text Preprocessing

  • Custom readers for common formats (XML, CSV/Excel, Databases, etc.)
  • Regex replacing/cleaning
  • Optimized keyword replacing/cleaning
  • HTML parsing/cleaning
  • Character filtering (ASCII, printable, user-defined cleaners)
  • Character replacing/cleaning (includes both predefined and custom filters)
  • Language detection
  • Encoding detection
  • Sentence tokenization
  • Word tokenization
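
As a rough, self-contained illustration of a few of these steps, the Python sketch below performs HTML cleaning, regex-based noise removal, character filtering, and naive sentence and word tokenization. It uses only the standard library and merely stands in for the framework's optimized, configurable components.

```python
import re
from html import unescape

def strip_html(text: str) -> str:
    """Remove HTML tags and decode entities (a simple stand-in for full HTML parsing)."""
    return unescape(re.sub(r"<[^>]+>", " ", text))

def clean(text: str) -> str:
    """Character filtering and noise removal: keep printable ASCII, collapse whitespace."""
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def sentence_tokenize(text: str) -> list[str]:
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def word_tokenize(sentence: str) -> list[str]:
    """Naive word tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

raw = "<p>Digamma.ai builds NLP pipelines.&nbsp;They scale to large corpora!</p>"
for sent in sentence_tokenize(clean(strip_html(raw))):
    print(word_tokenize(sent))
```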

Exploratory Analysis

  • Dataset statistics for tabular data (e.g. missing values, unique values)
  • Statistics for entire raw corpora (e.g. text sizes and length distributions)
  • Language and encoding distributions across both the full corpus and individual documents
  • N-grams and word clouds
  • Data visualization via a variety of plots and charts that helps users assess data quickly
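
The sketch below illustrates the kind of output this stage produces: a token-length distribution over a corpus and the most frequent bigrams. The toy corpus is made up for the example; the framework computes these statistics (and the corresponding plots) automatically over the full dataset.

```python
from collections import Counter
from statistics import mean, median

# Toy corpus standing in for a real pre-processed dataset.
corpus = [
    "the contract was signed in new york",
    "the contract was terminated early",
    "payment terms were defined in the contract",
]

# Text length distribution (in tokens) across the corpus.
lengths = [len(doc.split()) for doc in corpus]
print(f"docs={len(corpus)} min={min(lengths)} max={max(lengths)} "
      f"mean={mean(lengths):.1f} median={median(lengths)}")

# Most frequent bigrams: the raw material for n-gram plots and word clouds.
bigrams = Counter()
for doc in corpus:
    tokens = doc.split()
    bigrams.update(zip(tokens, tokens[1:]))
print(bigrams.most_common(3))
```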

Text-to-Features

  • Syntactic parsing
  • POS tagging
  • Named Entity Recognition
  • TF-IDF
  • Frequency features
  • Bag-of-Words
  • Word embeddings (e.g. word2vec, GloVe)
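
As a generic illustration of the TF-IDF, frequency, and bag-of-words features above, these representations can be produced in a few lines with scikit-learn; this is a common open-source approach, not the framework's own implementation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the contract was signed in new york",
    "payment terms were defined in the contract",
]

# Bag-of-words: raw token counts per document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)

# TF-IDF: counts re-weighted so that terms frequent in one document
# but rare across the corpus receive higher weights.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(bow.get_feature_names_out())
print(counts.toarray())
print(weights.toarray().round(2))
```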

Applications

  • Sentiment Analysis
  • Named Entity Recognition
  • Text Summarization
  • Topic Extraction
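
To make one of these applications concrete, the sketch below trains a tiny sentiment classifier with a generic scikit-learn pipeline. The training texts and labels are made up for illustration, and the model choice is an assumption rather than the approach our framework necessarily uses for a given client.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny, made-up training set: 1 = positive sentiment, 0 = negative sentiment.
texts = [
    "great product, works perfectly",
    "excellent support team",
    "terrible experience, would not recommend",
    "the device broke after a week",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["support was excellent", "it broke immediately"]))
```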

Our Methodology

First, we obtain data from the client in whatever format they provide. We use automatic routines to transform this raw data into a standardized format to make processing easier. Then we begin a typical data science processing flow, starting with data cleaning (e.g. removing HTML tags, regex-based noise removal, keyword replacement, character cleaning and character filtering). From there, we perform sentence and word tokenization.
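
The standardization step might look something like the following sketch, which normalizes CSV and XML inputs into a single list of {id, text} records. The column and tag names here are hypothetical; in practice they depend entirely on the client's data.

```python
import csv
import xml.etree.ElementTree as ET

def read_csv(path, id_column="id", text_column="text"):
    """Read a CSV file into standardized {'id', 'text'} records (column names are assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return [{"id": row[id_column], "text": row[text_column]}
                for row in csv.DictReader(f)]

def read_xml(path, record_tag="document", text_tag="body"):
    """Read an XML file into the same record format (tag names are assumed)."""
    root = ET.parse(path).getroot()
    return [{"id": node.get("id"), "text": node.findtext(text_tag, default="")}
            for node in root.iter(record_tag)]

# Every downstream stage then operates on one uniform record structure,
# regardless of the client's original format.
```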

Next, we begin the exploratory process, where we analyze the data, draw charts and histograms, extract task-related features and calculate various statistics. Then we proceed to feature engineering. We have models for Named Entity Recognition, POS tagging, dependency parsing, TF-IDF and word embeddings. At this point the pre-processed dataset is ready to be fed into ML models, and the first iteration ends. We then analyze the results and, if they are not adequate, tweak the previous stages to improve overall performance; each such pass is a new iteration.

This process is generally sequential, with one stage following another. In practice, however, each stage runs with multiple sets of parameters, and the result of every run is passed to the next stage. The process is therefore more like a graph or a tree than a simple sequence, organized somewhat like the computing nodes of a neural network. The framework also stores all intermediate results, so if only the second-to-last stage of the pipeline is changed, only the last two stages are re-computed.
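
A minimal sketch of this caching idea, assuming a simple linear pipeline in which each stage's result is keyed by its own configuration plus everything upstream of it, might look like the following; the framework's actual graph execution and result storage are proprietary.

```python
import hashlib
import json

cache = {}  # maps (stage name, key) -> stored intermediate result

def cache_key(config, upstream_key):
    """A stage's output is identified by its own config plus everything upstream of it."""
    blob = json.dumps({"config": config, "upstream": upstream_key}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_stage(name, func, config, data, upstream_key):
    key = cache_key(config, upstream_key)
    if (name, key) not in cache:          # recompute only on a cache miss
        cache[(name, key)] = func(data, **config)
    return cache[(name, key)], key

def run_pipeline(stages, data):
    """stages: list of (name, func, config). Changing one stage's config
    invalidates only that stage and everything after it."""
    upstream_key = ""
    for name, func, config in stages:
        data, upstream_key = run_stage(name, func, config, data, upstream_key)
    return data

# Toy stages: lowercase the text, tokenize it, count the tokens.
stages = [
    ("clean",    lambda d, lower: d.lower() if lower else d, {"lower": True}),
    ("tokenize", lambda d: d.split(),                        {}),
    ("count",    lambda d: len(d),                           {}),
]
print(run_pipeline(stages, "Digamma builds NLP pipelines"))
```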

Once these steps are complete and the desired level of accuracy is reached, the end result is a custom NLP framework based on the original input data.

Please reach out to us directly to learn more about our NLP Framework.