Text Representation in NLP
Introduction
What is Feature Extraction from text?
- To text representation
- To text recognition
Why do we need it?
Why is it difficult?
What is the core idea?
What are the techniques?
- OHE (One Hot Encoding)
- BOW (Bag of Words)
- ngrams
- TfIdf
- Custom features
- Word2Vec (Word Embeddings)
Common Terms used in NLP
What is Corpus (C)?
A corpus is a collection of authentic texts or audio, organized into datasets.
In other words, a corpus is a collection of machine-readable texts that is representative of a specific language or language variety.
What is Vocabulary (V)?
Vocabulary is the collection of all unique words or linguistic units that appear in a given dataset.
In other words, the set of unique words used in the text corpus.
What is Document (D)?
A document is a single text object; the collection of all documents makes up your corpus.
If you are working on search or topic modeling, the documents are the objects you compare for similarity in order to group them by topic.
What is Word (W)?
A word is the basic unit of text within a document. In NLP, a word is often represented as a vector of real numbers, called a word embedding, to help computers work with the meaning of words.
One Hot Encoding (OHE)
To understand this topic, we will use an example with four documents:
D1 | people watch campusx |
D2 | campusx watch campusx |
D3 | people write comment |
D4 | campusx write comment |
The corpus is the text of all four documents taken together:
people watch campusx campusx watch campusx people write comment campusx write comment
The vocabulary is the set of unique words in the corpus:
people watch campusx write comment
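As a minimal sketch, the corpus and vocabulary above can be built in Python, and each word can then be one-hot encoded. This assumes the standard definition of one hot encoding: every word becomes a vector as long as the vocabulary, with a single 1 at that word's position and 0 everywhere else. The names documents, word_to_index, one_hot and encode_document are illustrative choices, not part of any library.

```python
# The four example documents (D1 to D4) from the table above.
documents = {
    "D1": "people watch campusx",
    "D2": "campusx watch campusx",
    "D3": "people write comment",
    "D4": "campusx write comment",
}

# Corpus: every word of every document, in order.
corpus = [word for text in documents.values() for word in text.split()]

# Vocabulary: unique words, kept in order of first appearance.
vocabulary = list(dict.fromkeys(corpus))

# Position of each word in the vocabulary (hypothetical helper mapping).
word_to_index = {word: index for index, word in enumerate(vocabulary)}


def one_hot(word):
    """Return a vector of length len(vocabulary) with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector


def encode_document(text):
    """Encode a document as a list of one-hot vectors, one per word."""
    return [one_hot(word) for word in text.split()]


encoded_d1 = encode_document(documents["D1"])

print(vocabulary)
for word, vector in zip(documents["D1"].split(), encoded_d1):
    print(word, vector)
```

With five words in the vocabulary, every word in D1 maps to a vector of length 5, so a three-word document becomes a 3 x 5 grid of zeros and ones.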
Written by
Avdhesh Varshney
I am an aspiring data scientist, currently pursuing a B.Tech at Dr. B R Ambedkar NIT Jalandhar. I have contributed to many open-source programs and secured top ranks among them.