NLP Pipeline from Data Acquisition to Deployment
What is an NLP Pipeline?
An NLP pipeline is the set of steps followed to build end-to-end NLP software.
The pipeline provides the thinking needed to build any app, from a basic level up to an advanced, real-time application.
NLP software consists of the following steps:
Data Acquisition
Text Preparation
Text Cleanup
Basic Preprocessing
Advanced Preprocessing
Feature Engineering
Modelling
Process of applying models / algorithms
Evaluation
Deployment
Deploy
Monitoring
Model Update
Points to remember
It’s not universal
DL Pipelines are slightly different.
The pipeline is non-linear.
We can move between stages in any direction, e.g., from deployment back to feature engineering or data preprocessing, at any stage.
Data Acquisition
The major sub-parts of this step are:
Available Data
Table
Database
- Data Engineering
Less Data
Data Augmentation
Synonym
Bigram Flip
Back Translate
Addition of noise
Others
Use public dataset
Web scraping
Explore API
From PDFs
From Images
From Audio
No data found
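Two of the data augmentation tricks listed above, synonym replacement and bigram flip, can be sketched in plain Python. Both helper names and the tiny synonym dictionary are my own, illustrative choices:

```python
import random

def synonym_replace(sentence, synonyms):
    """Replace words using a small hand-built synonym dictionary."""
    return " ".join(synonyms.get(w, w) for w in sentence.split())

def bigram_flip(sentence, seed=0):
    """Swap one random pair of adjacent words (a 'bigram flip')."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.Random(seed).randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(synonym_replace("the movie was good", {"good": "great"}))  # the movie was great
print(bigram_flip("the movie was good"))
```

Each augmented sentence keeps roughly the same meaning and label, so it can be added to the training set when data is scarce.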
Text Preparation
Cleaning
HTML Tag Cleaning
Emoji Encoding/Cleaning
Spelling Checks
Basic Text pre-processing
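A minimal cleanup sketch for the first two items, using regular expressions. The function name is mine; spelling correction is omitted because it needs an external library:

```python
import re

def clean_text(text):
    """Strip HTML tags, drop emojis/non-ASCII characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = text.encode("ascii", "ignore").decode()  # remove emojis / non-ASCII
    return re.sub(r"\s+", " ", text).strip()        # normalize spacing

print(clean_text("<p>Great product! 😍</p>"))  # Great product!
```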
Basic pre-processing
Tokenization
Sentence
Word
Optional pre-processing
Stop words removal
Stemming
Lemmatization
Removing digits, punctuation
Lower casing
Language detection
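The basic and optional steps above can be sketched in plain Python. The stop-word list here is a tiny illustrative subset; real pipelines use NLTK or spaCy, which also provide stemming and lemmatization:

```python
import re
import string

# Tiny illustrative stop-word list (real lists have ~150+ English words).
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}

def preprocess(text):
    """Lower-case, remove digits and punctuation, tokenize, drop stop words."""
    text = text.lower()                                           # lower casing
    text = re.sub(r"\d+", "", text)                               # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                         # word tokenization
    return [t for t in tokens if t not in STOP_WORDS]             # stop word removal

print(preprocess("The 2 movies were a delight to watch!"))
# ['movies', 'were', 'delight', 'watch']
```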
Advanced Text pre-processing
Parts of Speech (POS) tagging
Parsing (Parse Tree formation)
Coreference resolution
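To illustrate POS tagging, here is a toy dictionary-based tagger. The tag dictionary is purely hypothetical; real pipelines use NLTK's `pos_tag` or spaCy, which tag from trained statistical models:

```python
# Toy lookup table -- real taggers learn tags from annotated corpora.
TAGS = {"the": "DET", "dog": "NOUN", "barks": "VERB", "loud": "ADJ"}

def toy_pos_tag(tokens):
    """Tag each token from the lookup table, defaulting unknown words to NOUN."""
    return [(t, TAGS.get(t.lower(), "NOUN")) for t in tokens]

print(toy_pos_tag(["The", "dog", "barks"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```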
Feature Engineering
Let us learn this topic using an example. Suppose I have a dataset containing 50,000 product reviews to analyze, with 2 columns: review_text and sentiment, where review_text contains the review of the product, and sentiment contains either 0 or 1, i.e., a positive or a negative sentiment.
Now, how do we build a model that classifies the reviews correctly from the text?
The best and simplest starting method is to create your own features, let's say, counts of +ve words, -ve words, and neutral words.
Now the dataset looks like:

| +ve words | -ve words | neutral words | sentiment |
| --- | --- | --- | --- |
| 5 | 6 | 6 | 0 |
| 3 | 5 | 2 | 1 |
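A minimal sketch of these hand-crafted features. The word lists are tiny illustrative ones, and treating everything else as "neutral" is my assumption:

```python
# Illustrative word lists -- a real feature would use full sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_features(review_text):
    """Count +ve, -ve, and neutral words in one review."""
    tokens = review_text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return {"pos_words": pos, "neg_words": neg,
            "neutral_words": len(tokens) - pos - neg}

print(sentiment_features("great phone but terrible battery"))
# {'pos_words': 1, 'neg_words': 1, 'neutral_words': 3}
```

Applying this function to every row of review_text produces exactly the feature table shown above.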
Moving ahead, we'll learn some algorithms and Python packages that make it easy to create new features, like Bag of Words, TF-IDF, One-Hot Encoding, and word2vec.
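As a sketch of what Bag of Words does under the hood (in practice, scikit-learn's CountVectorizer and TfidfVectorizer implement this and TF-IDF for you):

```python
from collections import Counter

def bag_of_words(corpus):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["good movie", "bad movie bad acting"])
print(vocab)    # ['acting', 'bad', 'good', 'movie']
print(vectors)  # [[0, 0, 1, 1], [1, 2, 0, 1]]
```

Each review becomes a fixed-length numeric vector, which is what ML algorithms need as input.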
In NLP, we'll solve our problem using either a heuristic (jugadu) pipeline, an ML pipeline, or a DL pipeline. But one question arises: which method is suitable for our problem, and which will make our model the healthiest? Let's understand this by looking at the differences between these methods.
Heuristic or jugadu pipeline: we inspect the dataset ourselves, search for patterns, and hard-code rules based on them. This often becomes lengthy, and it is toughest on large datasets.
ML pipeline: here we explore the data and create our own features, based on our knowledge of the field given in the problem statement.
It requires prior knowledge of that problem.
The model is easy to tune, because the features are created by us, so we can easily identify which feature is disturbing our model and which one boosts efficiency or reduces errors.
We are able to justify the model, because the features were created from our domain knowledge.
DL pipeline: we do just simple pre-processing and feed the whole dataset into a DL model, which creates the features itself inside the algorithm and builds our model.
No prior knowledge of that problem is needed.
Evaluation is difficult, as we don't know which features were built inside the algorithm or on what patterns the model produces its results.
Tuning sometimes becomes difficult, and there is no proper explanation of the model.
We are not able to justify the model, because we don't know which features are generated internally during DL model building.
Modelling
Modelling mainly depends on 2 factors,
Amount of Data
Nature of Problem
Approaches to Modelling:
Heuristic Methods
- Suitable when the amount of data is small and we can observe the patterns ourselves.
ML Algorithms
- The amount of data is larger.
- We have good knowledge of the data and the problem.
DL Algorithms
- Here, we can use transfer learning, which is today's trend.
Cloud API
- If an existing cloud solution for that problem exists, we can use it, provided we have the budget.
Note: Many times we'll combine 2 or more methods to build our model. It depends on what output is required and how you achieve it.
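Assuming scikit-learn is available, an ML-pipeline model for the review dataset from the Feature Engineering section could be sketched like this (the toy reviews stand in for the 50,000-row dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real review dataset.
reviews = ["great product, loved it", "terrible quality, hated it",
           "excellent value", "poor build, very bad"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (assumed convention)

# TF-IDF features -> Naive Bayes classifier, chained into one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, labels)
print(model.predict(["great excellent product"]))
```

Naive Bayes on TF-IDF vectors is a classic first baseline for text classification; a DL pipeline would replace both steps with a neural network that learns its own features.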
Evaluation
There are 2 types of evaluation:
Intrinsic evaluation
- Using Accuracy, Confusion Matrix, Recall, Precision, etc.
Extrinsic evaluation
- Evaluation from the real world i.e., getting the performance feedback from the users.
NOTE: If a model performs well at the EXTRINSIC level, it will also be good at the INTRINSIC level. But the reverse isn't necessarily true.
To analyze an NLP (language) model, we can also measure its PERPLEXITY, i.e., how confused the model is when performing a task; lower is better.
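A minimal sketch of two intrinsic measures. Accuracy is computed directly; perplexity uses the standard formula exp of the mean negative log-probability per token (helper names are mine):

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def perplexity(token_probs):
    """exp(mean negative log-probability) of the tokens; lower = less confused."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))                # 0.75
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))      # 4.0
```

A model that assigns each token a uniform probability of 1/4 has perplexity 4, as if it were guessing among 4 equally likely choices.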
Deployment
There are 3 major parts of deployment:
Deploy
It depends on which type of product is being built and how the user will use it. On that basis, we can decide how to deploy the product, e.g., using microservices, API creation, etc.
Monitoring
There is a dashboard where we can analyze how customers are reaching our product and how it is performing.
It also helps us track whether any conflict or error may arise in the future, so we can rectify it ahead of time.
Update
After monitoring, we have to update our product and fix any bugs or errors in it.
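As a sketch of the Deploy step, the model could be exposed through a small HTTP API. This assumes Flask is installed; the one-line heuristic scorer here is only a stand-in for loading the real trained model:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for a trained model (a real service would load a saved pipeline).
POSITIVE = {"good", "great", "love", "excellent"}

@app.route("/predict", methods=["POST"])
def predict():
    """Accept a JSON review and return a sentiment label."""
    text = request.get_json().get("review_text", "")
    label = 1 if any(w in POSITIVE for w in text.lower().split()) else 0
    return jsonify({"sentiment": label})
```

Running `app.run()` starts the service; clients then POST `{"review_text": "..."}` to `/predict` and get back `{"sentiment": 0 or 1}`.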
These are the major steps of the NLP pipeline, giving an overview of the whole NLP topic with an easy explanation.
Thank you very much!
If it was helpful for you, like it!
Written by
Avdhesh Varshney
I am an aspiring data scientist. Currently, I'm pursuing a B.Tech at Dr. B R Ambedkar NIT Jalandhar. I have contributed a lot to many open-source programs and secured top ranks among them.