Text Representation in NLP

Introduction

  1. What is Feature Extraction from text?

    • Also called text representation

    • Also called text vectorization

  2. Why do we need it?

  3. Why is it difficult?

  4. What is the core idea?

  5. What are the techniques?

    • OHE (One Hot Encoding)

    • BOW (Bag of Words)

    • ngrams

    • TfIdf

    • Custom features

    • Word2Vec (Word Embeddings)


Common Terms used in NLP

  1. What is Corpus (C)?

    A collection of authentic texts or audio that’s organized into datasets.

    In other words, a corpus is a collection of machine-readable texts that is representative of a specific language or language variety.

  2. What is Vocabulary (V)?

    Vocabulary is the collection of all unique words or linguistic units that appear in a given dataset.

    In other words, the set of unique words used in the text corpus.

  3. What is Document (D)?

    A single text object; the collection of all documents makes up your corpus.

    If you are working on search or topic modeling, documents are the objects you compare for similarity in order to group them by topic.

  4. What is Word (W)?

    A word is the basic unit of text within a document. In NLP, each word is typically represented as a vector of real numbers, called a word embedding, so that a model can work with its meaning.
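
To make these terms concrete, here is a minimal sketch in Python. It assumes plain whitespace tokenization and uses two made-up sample sentences; the variable names are illustrative and not taken from any library.

```python
# Hypothetical toy data: each string is one document (D).
documents = [
    "nlp is fun",
    "nlp is useful",
]

# Split each document into words (W) on whitespace.
# This is an assumption; real pipelines usually use a proper tokenizer.
tokenized = [doc.split() for doc in documents]

# Corpus (C): every word from every document, in order.
corpus = [word for doc in tokenized for word in doc]

# Vocabulary (V): the unique words, kept in first-seen order.
vocabulary = list(dict.fromkeys(corpus))

print(len(documents))  # number of documents
print(len(corpus))     # total number of words in the corpus
print(vocabulary)      # unique words
```

The corpus keeps repeated words, while the vocabulary keeps each word only once, so the vocabulary is never larger than the corpus.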


One Hot Encoding (OHE)

To understand this topic, we will use an example with four documents:

D1: people watch campusx
D2: campusx watch campusx
D3: people write comment
D4: campusx write comment

The corpus is all four documents joined together:

people watch campusx campusx watch campusx people write comment campusx write comment

The vocabulary is the set of unique words in the corpus (5 words here):

people watch campusx write comment
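
With the vocabulary fixed, one hot encoding gives each word a vector as long as the vocabulary, with a 1 at that word's position and 0 everywhere else. The following is a minimal sketch in plain Python, assuming whitespace tokenization and the vocabulary order listed above; the helper names `one_hot` and `encode` are illustrative, not part of any library.

```python
documents = {
    "D1": "people watch campusx",
    "D2": "campusx watch campusx",
    "D3": "people write comment",
    "D4": "campusx write comment",
}

# Vocabulary in first-seen order: people, watch, campusx, write, comment.
vocabulary = list(dict.fromkeys(
    word for text in documents.values() for word in text.split()
))

# Map each word to its position in the vocabulary.
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of len(vocabulary) with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

def encode(text):
    """Encode a document as a list of one-hot vectors, one per word."""
    return [one_hot(word) for word in text.split()]

encoded = {name: encode(text) for name, text in documents.items()}

for row in encoded["D1"]:
    print(row)
# [1, 0, 0, 0, 0]
# [0, 1, 0, 0, 0]
# [0, 0, 1, 0, 0]
```

Each document becomes a matrix with one row per word and one column per vocabulary entry, so D1 is 3 x 5 here. The number of columns grows with the vocabulary, and most entries are 0.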
Written by

Avdhesh Varshney

I am an aspiring data scientist, currently pursuing a B.Tech at Dr. B R Ambedkar NIT Jalandhar. I have contributed to many open-source programs and secured top ranks among them.