NLP Pipeline from Data Acquisition to Deployment
What is an NLP Pipeline?
An NLP pipeline is the set of steps followed to build end-to-end NLP software.
The pipeline provides the thinking needed to build any app, from a basic level up to an advanced, real-time application.
NLP software consists of the following steps:
Data Acquisition
Text Preparation
Text Cleanup
Basic Preprocessing
Advanced Preprocessing
Feature Engineering
Modelling
Process of applying models / algorithms
Evaluation
Deployment
Deploy
Monitoring
Model Update
Points to remember
It’s not universal
DL Pipelines are slightly different.
The pipeline is non-linear.
We can move between stages in any direction, e.g., from deployment back to feature engineering or data preprocessing, at any stage.
Data Acquisition
The major sub-parts of this step are:
Available Data
Table
Database
- Data Engineering
Less Data
Data Augmentation
Synonym
Bigram Flip
Back Translate
Addition of noise
Others
Use public dataset
Web scraping
Explore API
From PDFs
From Images
From Audio
No data found
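Two of the data augmentation tricks listed above, synonym replacement and bigram flip, can be sketched in plain Python. Both helper names and the tiny synonym dictionary are my own, illustrative choices:

```python
import random

def synonym_replace(sentence, synonyms):
    """Replace words using a small hand-built synonym dictionary."""
    return " ".join(synonyms.get(w, w) for w in sentence.split())

def bigram_flip(sentence, seed=0):
    """Swap one random pair of adjacent words (a 'bigram flip')."""
    words = sentence.split()
    if len(words) < 2:
        return sentence
    i = random.Random(seed).randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

print(synonym_replace("the movie was good", {"good": "great"}))  # the movie was great
print(bigram_flip("the movie was good"))
```

Each augmented sentence keeps roughly the same meaning and label, so it can be added to the training set when data is scarce.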
Text Preparation
Cleaning
HTML Tag Cleaning
Emoji Encoding/Cleaning
Spelling Checks
Basic Text pre-processing
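A minimal cleanup sketch for the first two items, using regular expressions. The function name is mine; spelling correction is omitted because it needs an external library:

```python
import re

def clean_text(text):
    """Strip HTML tags, drop emojis/non-ASCII characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = text.encode("ascii", "ignore").decode()  # remove emojis / non-ASCII
    return re.sub(r"\s+", " ", text).strip()        # normalize spacing

print(clean_text("<p>Great product! 😍</p>"))  # Great product!
```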
Basic pre-processing
Tokenization
Sentence
Word
Optional pre-processing
Stop words removal
Stemming
Lemmatization
Removing digits, punctuation
Lower casing
Language detection
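The basic and optional steps above can be sketched in plain Python. The stop-word list here is a tiny illustrative subset; real pipelines use NLTK or spaCy, which also provide stemming and lemmatization:

```python
import re
import string

# Tiny illustrative stop-word list (real lists have ~150+ English words).
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}

def preprocess(text):
    """Lower-case, remove digits and punctuation, tokenize, drop stop words."""
    text = text.lower()                                           # lower casing
    text = re.sub(r"\d+", "", text)                               # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    tokens = text.split()                                         # word tokenization
    return [t for t in tokens if t not in STOP_WORDS]             # stop word removal

print(preprocess("The 2 movies were a delight to watch!"))
# ['movies', 'were', 'delight', 'watch']
```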
Advanced Text pre-processing
Parts of Speech (POS) tagging
Parsing (Parse Tree formation)
Coreference resolution
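To illustrate POS tagging, here is a toy dictionary-based tagger. The tag dictionary is purely hypothetical; real pipelines use NLTK's `pos_tag` or spaCy, which tag from trained statistical models:

```python
# Toy lookup table -- real taggers learn tags from annotated corpora.
TAGS = {"the": "DET", "dog": "NOUN", "barks": "VERB", "loud": "ADJ"}

def toy_pos_tag(tokens):
    """Tag each token from the lookup table, defaulting unknown words to NOUN."""
    return [(t, TAGS.get(t.lower(), "NOUN")) for t in tokens]

print(toy_pos_tag(["The", "dog", "barks"]))
# [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')]
```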
Feature Engineering
Let us learn this topic using an example. Suppose I have a dataset containing 50,000 product reviews to analyze, with 2 columns: review_text and sentiment, where review_text contains the review of the product, and sentiment contains either 0 or 1, i.e., a positive or a negative sentiment.
Now, how do we build a model that classifies the reviews correctly from the text?
The best and simplest starting method is to create your own features, let's say, counts of +ve words, -ve words, and neutral words.
Now the dataset looks like:

| +ve words | -ve words | neutral words | sentiment |
| --- | --- | --- | --- |
| 5 | 6 | 6 | 0 |
| 3 | 5 | 2 | 1 |
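A minimal sketch of these hand-crafted features. The word lists are tiny illustrative ones, and treating everything else as "neutral" is my assumption:

```python
# Illustrative word lists -- a real feature would use full sentiment lexicons.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def sentiment_features(review_text):
    """Count +ve, -ve, and neutral words in one review."""
    tokens = review_text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return {"pos_words": pos, "neg_words": neg,
            "neutral_words": len(tokens) - pos - neg}

print(sentiment_features("great phone but terrible battery"))
# {'pos_words': 1, 'neg_words': 1, 'neutral_words': 3}
```

Applying this function to every row of review_text produces exactly the feature table shown above.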
Moving ahead, we'll learn some algorithms and Python packages that make it easy to create new features, like Bag of Words, TF-IDF, One-Hot Encoding, and word2vec.
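As a sketch of what Bag of Words does under the hood (in practice, scikit-learn's CountVectorizer and TfidfVectorizer implement this and TF-IDF for you):

```python
from collections import Counter

def bag_of_words(corpus):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.lower().split())
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["good movie", "bad movie bad acting"])
print(vocab)    # ['acting', 'bad', 'good', 'movie']
print(vectors)  # [[0, 0, 1, 1], [1, 2, 0, 1]]
```

Each review becomes a fixed-length numeric vector, which is what ML algorithms need as input.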
In NLP, we'll solve our problem using either a heuristic (jugadu) pipeline, an ML pipeline, or a DL pipeline. But one question arises: which method is suitable for our problem, and which will make our model the healthiest? Let's understand this by looking at the differences between these methods.
Heuristic or jugadu pipeline: we inspect the dataset ourselves, search for patterns, and hard-code rules based on them. This often becomes lengthy, and it is toughest on large datasets.
ML pipeline: here we explore the data and create our own features, based on our knowledge of the field given in the problem statement.
It requires prior knowledge of that problem.
The model is easy to tune, because the features are created by us, so we can easily identify which feature is disturbing our model and which one boosts efficiency or reduces errors.
We are able to justify the model, because the features were created from our domain knowledge.
DL pipeline: we do just simple pre-processing and feed the whole dataset into a DL model, which creates the features itself inside the algorithm and builds our model.
No prior knowledge of that problem is needed.
Evaluation is difficult, as we don't know which features were built inside the algorithm or on what patterns the model produces its results.
Tuning sometimes becomes difficult, and there is no proper explanation of the model.
We are not able to justify the model, because we don't know which features are generated internally during DL model building.
Modelling
Modelling mainly depends on 2 factors,
Amount of Data
Nature of Problem
Approaches to Modelling:
Heuristic Methods
- Suitable when the amount of data is small and we can observe the patterns ourselves.
ML Algorithms
- The amount of data is larger.
- We have good knowledge of the data and the problem.
DL Algorithms
- Here, we can use transfer learning, which is today's trend.
Cloud API
- If an existing cloud solution for that problem exists, we can use it, provided we have the budget.
Note: Many times we'll combine 2 or more methods to build our model. It depends on what output is required and how you achieve it.
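Assuming scikit-learn is available, an ML-pipeline model for the review dataset from the Feature Engineering section could be sketched like this (the toy reviews stand in for the 50,000-row dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real review dataset.
reviews = ["great product, loved it", "terrible quality, hated it",
           "excellent value", "poor build, very bad"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (assumed convention)

# TF-IDF features -> Naive Bayes classifier, chained into one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(reviews, labels)
print(model.predict(["great excellent product"]))
```

Naive Bayes on TF-IDF vectors is a classic first baseline for text classification; a DL pipeline would replace both steps with a neural network that learns its own features.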
Evaluation
There are 2 types of evaluation:
Intrinsic evaluation
- Using Accuracy, Confusion Matrix, Recall, Precision, etc.
Extrinsic evaluation
- Evaluation from the real world i.e., getting the performance feedback from the users.
NOTE: If a model performs well at the EXTRINSIC level, it will also be good at the INTRINSIC level. But the reverse isn't necessarily true.
To analyze an NLP (language) model, we can also measure its PERPLEXITY, i.e., how confused the model is when performing a task; lower is better.
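A minimal sketch of two intrinsic measures. Accuracy is computed directly; perplexity uses the standard formula exp of the mean negative log-probability per token (helper names are mine):

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def perplexity(token_probs):
    """exp(mean negative log-probability) of the tokens; lower = less confused."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))                # 0.75
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 2))      # 4.0
```

A model that assigns each token a uniform probability of 1/4 has perplexity 4, as if it were guessing among 4 equally likely choices.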
Deployment
There are 3 major parts of deployment:
Deploy
It depends on which type of product is being built and how the user will use it. On that basis, we can decide how to deploy the product, e.g., using microservices, API creation, etc.
Monitoring
There is a dashboard where we can analyze how customers are reaching our product and how it is performing.
It also helps us track whether any conflict or error may arise in the future, so we can rectify it ahead of time.
Update
After monitoring, we have to update our product and fix any bugs or errors in it.
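As a sketch of the Deploy step, the model could be exposed through a small HTTP API. This assumes Flask is installed; the one-line heuristic scorer here is only a stand-in for loading the real trained model:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Stand-in for a trained model (a real service would load a saved pipeline).
POSITIVE = {"good", "great", "love", "excellent"}

@app.route("/predict", methods=["POST"])
def predict():
    """Accept a JSON review and return a sentiment label."""
    text = request.get_json().get("review_text", "")
    label = 1 if any(w in POSITIVE for w in text.lower().split()) else 0
    return jsonify({"sentiment": label})
```

Running `app.run()` starts the service; clients then POST `{"review_text": "..."}` to `/predict` and get back `{"sentiment": 0 or 1}`.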
These are the major steps of the NLP pipeline, giving an overview of the whole NLP topic with an easy explanation.
Thank you very much!
If it was helpful for you, like it!
Written by
Avdhesh Varshney
I am an aspiring data scientist. Currently, I'm pursuing a B.Tech at Dr. B R Ambedkar NIT Jalandhar. I have contributed a lot to many open-source programs and secured top ranks among them.