NLP - Semantics and Sentiment Analysis


Overview of Semantics and Word Vectors
word2vec is a two-layer neural net that processes text: the input is a text corpus and the output is a set of vectors. Word2vec's purpose is to group the vectors of similar words together. It detects similarities mathematically, producing a numerical representation in which similar words end up with similar vectors.
word2vec is all about training words against other words. There are two ways of doing this:
continuous bag of words (CBOW)
skip-gram
They are the opposite of each other. In CBOW, the model predicts the target word from its context words. In skip-gram, it predicts the context words from the target word.
Whichever method is used, each word ends up as a vector. In spaCy, each of these vectors has 300 dimensions. Training a word2vec model yourself can take a long time on a large text corpus, which is why most people use pre-trained word vectors instead of training their own. However, if you do want to train your own model, you can choose fewer or more dimensions; typically, dimensions range from 100 to 1000. The more dimensions you have, the longer the training time will be, but more dimensions also let you capture more context around each word, since there is more space to store information.
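If you do want to train your own word2vec model, here is a minimal sketch using the gensim library (an assumption here; it isn't used elsewhere in this post), where sg=0 gives CBOW and sg=1 gives skip-gram:
# Minimal word2vec training sketch with gensim (assumed installed: pip install gensim)
from gensim.models import Word2Vec
# A tiny toy corpus: a list of tokenized sentences
sentences = [
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog'],
    ['i', 'love', 'my', 'pet', 'cat'],
    ['the', 'lion', 'is', 'a', 'big', 'cat'],
]
# vector_size sets the number of dimensions (100-1000 is typical)
# sg=0 trains CBOW, sg=1 trains skip-gram
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=1)
print(model.wv['cat'].shape)         # (100,)
print(model.wv.most_similar('cat'))  # nearest words by cosine similarity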
Semantics and Word Vectors (spaCy)
Important thing to notice: the larger spaCy English models contain word vectors. The smaller versions do not.
# Directly download the large model via spaCy (recommended)
!python -m spacy download en_core_web_lg
# Import spaCy and load the model
import spacy
nlp = spacy.load("en_core_web_lg")
# Word vector for a piece of text
nlp('BRAC University').vector
Doc and Span objects also have vectors, which are created by averaging the vectors of their individual tokens. This means you get not only word vectors but document vectors too, where a document's vector is the average of all its word vectors.
nlp('The quick brown fox jumps over the lazy dog.').vector.shape
nlp('dog').vector.shape
Essentially, both are 300-dimensional vectors.
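To see that a document's vector really is the average of its token vectors, here is a quick sanity-check sketch (numpy is assumed; it ships as a spaCy dependency):
# Sanity check: doc.vector is (approximately) the mean of its token vectors
import numpy as np
doc = nlp('The quick brown fox jumps over the lazy dog.')
manual_average = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, manual_average))  # Expect True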
# Identifying similar words
tokens = nlp('lion cat pet')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
The words "lion lion," "cat cat," and "pet pet" have a 100% similarity with each other. The similarity values range between zero and one. It makes sense that a word is completely similar to itself. What's interesting is that the word vectors have enough information to show that "lion" and "cat" have some similarity, with a score of 0.52. "Cat" and "pet" tend to have a high similarity because most cats are pets, so it makes sense they have a high similarity score. It also makes sense that "lion" and "pet" have a similarity of less than 0.5, as having a lion as a pet isn't common, though some people do. We can see that relationships are established just from the word vectors. Essentially, this process checks the cosine similarity between token one and token two.
Love and hate are very different words with very different meanings. However, they're often used in the same context. You either love a movie or hate a movie, love a book or hate a book. In this way, they are similar because they're frequently used in similar situations. So, words like these can often have very similar vectors.
So, remember that if words are used in a similar context, they might be similar even if they have opposite meanings in regular English.
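A quick sketch to check this for yourself:
# Opposite meanings, similar usage contexts
love, hate = nlp('love hate')
print(love.similarity(hate))  # expect a fairly high score, since both words appear in similar contexts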
Sometimes it's useful to collapse a vector's 300 dimensions into a single number: its Euclidean (L2) norm.
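spaCy exposes this as token.vector_norm, which is simply the L2 norm of the 300-dimensional vector; a small sketch (assuming numpy) to verify:
# token.vector_norm is the Euclidean (L2) norm of the word vector
import numpy as np
token = nlp('dog')[0]
print(token.vector_norm)             # a single number summarizing the vector
print(np.linalg.norm(token.vector))  # should match, up to floating-point precision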
nlp.vocab.vectors.shape
# Output: (342918, 300)
Sometimes you may encounter words that fall outside this 342,918-word vocabulary. spaCy has attributes to check for this, such as token.is_oov (out of vocabulary).
# Outside of vocab attribute checking
tokens = nlp('dog cat nargle')
for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)
Nargle is a made-up word, so it doesn't have a vector, which means it returns a norm of 0.0. Is it outside the vocabulary? Yes. Keep in mind that common names might actually have vectors. For example, karen does have a vector associated with it, and so do some uncommon names; my own middle name is actually in the vocabulary, which is quite interesting.
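If you want to try this with other names yourself, a quick sketch:
# Checking whether arbitrary names are in the vocabulary
tokens = nlp('karen nargle')
for token in tokens:
    print(token.text, token.has_vector, token.is_oov)
# With the large model, 'karen' should have a vector while 'nargle' should not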
Another thing is vector arithmetic. You can calculate a new vector by adding and subtracting related vectors. A famous example is: king - man + woman = queen.
# NOTE: THIS IS A PERFECT CODE BUT IT'S NOT GIVING THE EXPECTED OUTPUT.
# MOST LIKELY, ISSUES RELATED WITH THE LARGE ENGLISH MODEL
# King - Man + Woman = Queen.

# Importing spatial to perform arithmetic
from scipy import spatial
cosine_similarity = lambda vec1, vec2: 1 - spatial.distance.cosine(vec1, vec2)

# Grabbing their vectors
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

# King - Man + Woman = Queen --> new_vector should be similar to queen, princess, highness
new_vector = king - man + woman
computed_similarities = []

# Going through the entire vocabulary
# FOR ALL WORDS IN MY VOCAB
for word in nlp.vocab:
    if word.has_vector:
        if word.is_lower:
            if word.is_alpha:  # letters only (no numbers or punctuation)
                if word.is_stop == False:
                    similarity = cosine_similarity(new_vector, word.vector)
                    computed_similarities.append((word, similarity))

# Sorting in descending order
# Without the minus, it sorts in ascending order, showing the least similar words.
computed_similarities = sorted(computed_similarities, key=lambda item: -item[1])

# Printing the top 10 words from the tuples
# Grab the first word of each tuple, ask for its text
# Then do that for every tuple in computed_similarities
# Go through the top 10 with [:10]
print([t[0].text for t in computed_similarities[:10]])
# Expected output: ['queen', 'monarch', 'princess', 'royal', 'throne', ...]
There's some understanding of royalty related to king, and some understanding of gender here. We understand that king is a man. And if we subtract the gender dimension from king and add woman to it, we get something royal as well.
Sentiment Analysis
We have a tool called VADER (Valence Aware Dictionary and sEntiment Reasoner). It processes raw text data with a rule-based algorithm to determine sentiment; no pre-labeled training data is needed. VADER is a model used for text sentiment analysis, and it is sensitive to both the polarity (positive or negative) and the intensity of the emotion expressed.
VADER is available in NLTK, and we can apply it directly to unlabeled text data. VADER relies on a dictionary that maps lexical features to emotion intensities, also known as sentiment scores. The sentiment score of a text is obtained by summing up the intensity of each word, so every single word contributes to the score. Words like love, like, enjoy, happy all convey positive sentiment. Because VADER takes every word into account, it is smart enough to understand that 'did not enjoy' is a negative sentiment, and it will mark LOVE!!! as more intense than love. But VADER can't detect sarcasm; it's really difficult to tell when positive words are being used in a negative way.
Okay, with that caveat in mind, let's see how to use VADER in NLTK.
Sentiment Analysis (NLTK)
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
review = "This is a good anime"
sid.polarity_scores(review) # Only one +ve word
review = "BEST ANIME! This was the best, most AWESOME anime MADE EVER IN HISTORY!!!"
sid.polarity_scores(review) # Many positive words, capitalization, and even '!!!' marks
review = "This was the WORST anime!! YUCK! waste of my time!"
sid.polarity_scores(review) # It has no positive word, rather bunch of negative words.
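To verify the 'did not enjoy' negation point from earlier, here is a quick sketch using the same analyzer:
review = "I did not enjoy this anime"
sid.polarity_scores(review) # Expect a negative compound score, since VADER handles simple negation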
Let's do it on Amazon reviews.
Get your resources here: https://github.com/fatimajannet/NLP-with-Fatima
The whole process is written in the Colab file; there's not much to explain here, so please check out the Colab file.
That's it! Please also run this model with moviereviews.tsv, which is given in the GitHub repo as well.
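For reference, here is a rough sketch of what that workflow usually looks like (the file and column names below are assumptions; check the repo's TSV files for the actual names):
# A rough sketch of scoring a TSV of reviews with VADER (column names are assumptions)
import pandas as pd
df = pd.read_csv('amazonreviews.tsv', sep='\t')  # or moviereviews.tsv
df = df.dropna()                                 # drop empty reviews if any
# Score every review and keep the compound score
df['scores'] = df['review'].apply(lambda text: sid.polarity_scores(text))
df['compound'] = df['scores'].apply(lambda d: d['compound'])
# Call a review positive if the compound score is >= 0, negative otherwise
df['prediction'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
print(df[['label', 'review', 'compound', 'prediction']].head())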