NLTK/SpaCy vs HuggingFace #1 - Tokenization, POS tagging, NER, and summarization
Table of contents
- Source code here:
- Table summarizing the NLP tasks and which tool is best
- NLTK/SPACY Code with sample output
- Chunk 1: Basic Tokenization with NLTK
- Chunk 2: Frequency Distribution (FD) in NLTK
- Chunk 3: Part-of-Speech (POS) Tagging in NLTK
- Chunk 4: Tokenization and Frequency Distribution with SpaCy
- Chunk 5: Part-of-Speech (POS) Tagging with SpaCy
- Chunk 6: Visualizing Entities with SpaCy’s displaCy
- Chunk 7: Visualizing Dependency Parsing with SpaCy’s displaCy
- Chunk 8: Downloading NLTK Datasets
- Chunk 9: Importing and Exploring NLTK’s Movie Reviews Corpus
- Chunk 10: Displaying the Most Frequent 25 Words in the Movie Review Corpus
- Chunk 11: Generating a WordCloud for the Movie Review Corpus
- Chunk 12: Part-of-Speech (POS) Frequency in the Movie Corpus
- Chunk 13: Tokenizing and POS Tagging Multiple Sentences in NLTK
- Chunk 14: Advanced POS Frequency Analysis in NLTK
- Chunk 15: Visualizing POS Frequency Analysis
- Hugging Face’s approach with Sample Output
- Chunk 1 & 2: Tokenization and Frequency Distribution with Hugging Face
- Chunk 3 & 5: Part-of-Speech Tagging with Hugging Face
- Chunk 6 & 7: Named Entity Recognition (NER) and Dependency Parsing with Hugging Face
- Chunk 10 & 11: Generating Frequency Distribution and WordCloud with Hugging Face
- Chunk 12 & 13: POS Frequency Analysis for the Movie Corpus
Source code here:
NLTK/Spacy Code
https://gist.github.com/2f8b8167ae5c557dc027dc19f9a84c2b.git
HuggingFace Code
https://gist.github.com/e293e7c3f26dd7f4104a62a9d447ec95.git
Table summarizing the NLP tasks and which tool is best
| Task | Best Tool | Reason |
| --- | --- | --- |
| Tokenization | Hugging Face Transformers | Efficient and language model-specific tokenization (e.g., BERT or GPT tokenization). |
| Frequency Distribution | NLTK | `FreqDist` handles large corpora efficiently and supports robust frequency analysis. |
| Part-of-Speech (POS) Tagging | Hugging Face / SpaCy | Hugging Face for individual sentences; SpaCy for large texts or fast batch processing. |
| Named Entity Recognition (NER) | Hugging Face Transformers | State-of-the-art models for entity recognition with pipelines for efficient setup. |
| Dependency Parsing | SpaCy | Direct support for dependency parsing; displaCy also visualizes dependency graphs. |
| Syntax Tree Visualization | NLTK / SpaCy | NLTK for tree diagrams, SpaCy with displaCy for visualizing dependencies. |
| WordCloud Generation | NLTK (for corpora) + WordCloud | Frequency analysis via NLTK; WordCloud library for visualizing most common words. |
| Large Corpus Analysis | NLTK / SpaCy | NLTK provides corpora (e.g., `movie_reviews`), SpaCy handles batch processing efficiently. |
| Text Classification | Hugging Face Transformers | Transformers (e.g., BERT) provide state-of-the-art models for classification tasks. |
| Summarization | Hugging Face Transformers | Specialized summarization models available (e.g., BART, T5) for concise text summaries. |
| Translation | Hugging Face Transformers | Translation models (e.g., MarianMT) offer support for many languages with minimal setup. |
Summary:
- Hugging Face Transformers: Best for advanced NLP tasks (classification, summarization, NER, translation) and sentence-level analysis.
- SpaCy: Excels in dependency parsing, efficient POS tagging, and batch processing for large documents.
- NLTK: Ideal for tasks involving large corpora, frequency analysis, and syntax tree generation.
This table provides a clear overview of the optimal tool for each task, helping to leverage the strengths of each library effectively.
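Since summarization appears in the table (and in the title) but has no dedicated chunk below, here is a minimal sketch of Hugging Face's summarization pipeline. The model name (facebook/bart-large-cnn) and the length limits are illustrative choices, not the only options.

from transformers import pipeline

# Load a summarization pipeline (the model is downloaded on first use)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = (
    "The movie follows two teen couples who go to a church party, drink, and then drive. "
    "They get into an accident, and the rest of the film deals with the aftermath of that night."
)

# max_length / min_length bound the length of the generated summary (in tokens)
summary = summarizer(long_text, max_length=40, min_length=10, do_sample=False)
print(summary[0]['summary_text'])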
NLTK/SPACY Code with sample output
Chunk 1: Basic Tokenization with NLTK
Explanation: Tokenization is breaking down a sentence into individual words or symbols. Here, we use `nltk`, a popular Python library for natural language processing (NLP). The `word_tokenize()` function helps split the text into smaller pieces, known as tokens.
Code Explanation with Comments:
# Importing the NLTK library and word tokenization function
import nltk
from nltk import word_tokenize  # Helps split sentences into individual words

# Sample sentence for tokenization
text = "we'd like to book a flight from boston to london"

# Tokenizing the sentence
tokenized_text = word_tokenize(text)
print(tokenized_text)  # Outputs individual words and symbols in a list format
Sample Output:
['we', "'d", 'like', 'to', 'book', 'a', 'flight', 'from', 'boston', 'to', 'london']
Here, each word (and punctuation) is separated into its own element in the list.
Chunk 2: Frequency Distribution (FD) in NLTK
Explanation: Frequency Distribution (`FreqDist`) counts how often each word appears in the list of tokens, showing us which words are most common. We use `FreqDist` from `nltk.probability`, which calculates word frequencies for a list of tokens.
Code Explanation with Comments:
# Importing the Frequency Distribution module from nltk
from nltk.probability import FreqDist

# Creating a frequency distribution of tokens
fdist = FreqDist(tokenized_text)
print(fdist)                 # Outputs the frequency distribution object
print(fdist.most_common(3))  # Shows the top 3 most frequent words
Sample Output:
<FreqDist with 10 samples and 11 outcomes>
[('to', 2), ('we', 1), ("'d", 1)]
- Most words in the sentence occur once, while 'to' appears twice, so it tops the list. This object allows you to see which words are most common and how often they appear.
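A couple of extra `FreqDist` conveniences worth knowing; a small illustrative sketch reusing the `fdist` object from above:

# FreqDist behaves like a dictionary of counts
print(fdist['to'])   # 2 -- 'to' appears twice in the sentence
print(fdist.N())     # Total number of tokens counted

# Quick line plot of the most frequent tokens (requires matplotlib)
fdist.plot(5)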
Chunk 3: Part-of-Speech (POS) Tagging in NLTK
Explanation: Part-of-Speech (POS) tagging assigns each word a role, like a noun or verb. `nltk.pos_tag()` automatically assigns POS tags to each token.
Code Explanation with Comments:
# Using nltk's pos_tag to assign parts of speech to each word in the tokenized text
pos_tags = nltk.pos_tag(tokenized_text)
print(pos_tags)  # Outputs a list of words with their corresponding POS tags
Sample Output:
[('we', 'PRP'), ("'d", 'MD'), ('like', 'VB'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('flight', 'NN'), ('from', 'IN'), ('boston', 'NN'), ('to', 'TO'), ('london', 'NN')]
- Each token is followed by a POS tag (e.g., `PRP` for pronoun, `VB` for verb, `NN` for noun). This tells us the function of each word in the sentence.
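If a tag abbreviation is unfamiliar, NLTK can describe it for you; a small sketch (it assumes the 'tagsets' resource has been downloaded once):

# Download the tag documentation once, then look up any Penn Treebank tag
nltk.download('tagsets')
nltk.help.upenn_tagset('VB')  # Prints the definition and examples for the VB (verb, base form) tag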
Chunk 4: Tokenization and Frequency Distribution with SpaCy
Explanation: We use `spacy` here, a different NLP library. SpaCy's `nlp` model automatically tokenizes text, creates linguistic annotations, and more. `spacy.load('en_core_web_sm')` loads a lightweight English language model for processing, and `Counter` from the `collections` module counts occurrences of each token.
Code Explanation with Comments:
import spacy
from collections import Counter

# Load SpaCy's small English model
nlp = spacy.load('en_core_web_sm')

# Define text and process it using the SpaCy model
text = "we'd like to book a flight from boston to london"
doc = nlp(text)  # Processes text into a spaCy document object

# Extract tokens and calculate frequency distribution
words = [token.text for token in doc]  # Tokenizes the text into words
word_freq = Counter(words)             # Counts occurrences of each word
print(word_freq)                       # Shows word frequencies
Sample Output:
Counter({"we'd": 1, 'like': 1, 'to': 2, 'book': 1, 'a': 1, 'flight': 1, 'from': 1, 'boston': 1, 'london': 1})
- This output shows each word's frequency in the sentence. Unlike NLTK, SpaCy processes text into a `doc` object, which allows easy access to each token.
Chunk 5: Part-of-Speech (POS) Tagging with SpaCy
Explanation: Just like NLTK, SpaCy can also perform POS tagging. Each token in the `doc` object has a `.pos_` attribute, which gives the part of speech. `token.text` extracts the actual word, while `token.pos_` shows the POS tag.
Code Explanation with Comments:
# Perform POS tagging using SpaCy
for token in doc:
    print(token.text, token.pos_)  # Prints each word with its part of speech
Sample Output:
we'd PRON
like VERB
to PART
book VERB
a DET
flight NOUN
from ADP
boston PROPN
to PART
london PROPN
- Each token is followed by a simple part-of-speech tag (e.g., `PRON` for pronoun, `VERB` for verb, `NOUN` for noun, `PROPN` for proper noun). SpaCy's POS tags are usually more human-readable.
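The coarse tags above come from `token.pos_`; SpaCy also keeps the fine-grained Penn Treebank tag in `token.tag_`, and `spacy.explain()` can decode either. A small sketch reusing the `doc` from above:

# Compare coarse (.pos_) and fine-grained (.tag_) tags for each token
for token in doc:
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))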
Chunk 6: Visualizing Entities with SpaCy’s displaCy
Explanation: `displacy.render()` is a SpaCy tool for visualizing the entities (like names, places) in the text. `style='ent'` means entity visualization, and `options={'distance': 200}` adjusts spacing.
Code Explanation with Comments:
from spacy import displacy

# Define new text and process it
text = "we'd like to book a flight from boston to new york"
doc = nlp(text)  # Re-processes text into a spaCy doc

# Visualize entities with displaCy
displacy.render(doc, style='ent', jupyter=True, options={'distance': 200})
Sample Output:
This will show an interactive display in Jupyter with Boston and New York highlighted as places (entities).
Note: `displacy.render` requires Jupyter to show visualizations inline.
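If you are running outside Jupyter, you can still get the same information programmatically from `doc.ents` instead of the inline visualization; a minimal sketch:

# Each entity span exposes its text and label directly
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. boston GPE, new york GPE (exact labels depend on the model)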
Chunk 7: Visualizing Dependency Parsing with SpaCy’s displaCy
Explanation: Dependency parsing shows how words in a sentence relate to each other (subject, verb, object). `style='dep'` specifies dependency visualization, which links words with arrows.
Code Explanation with Comments:
# New sentence to visualize
doc = nlp("they get in an accident")

# Visualize dependency parse tree
displacy.render(doc, style='dep', jupyter=True, options={'distance': 200})
Sample Output:
An interactive dependency tree with arrows indicating grammatical relationships, such as "they" (subject) linked to "get" (verb).
Note: Only displays correctly in Jupyter notebooks.
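Outside a notebook, `displacy.render` with `jupyter=False` returns the markup as a string, so one way to keep the diagram is to write it to an SVG file; a small sketch:

from pathlib import Path

# render() returns the SVG markup when jupyter=False
svg = displacy.render(doc, style='dep', jupyter=False, options={'distance': 200})
Path("dependency_parse.svg").write_text(svg, encoding="utf-8")  # Open the file in any browser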
Chunk 8: Downloading NLTK Datasets
Explanation: NLTK requires specific datasets, such as a token dictionary or movie review corpus. `nltk.download()` opens a downloader window where you can choose which datasets to download.
Code Explanation with Comments:
# Opens the NLTK downloader in a separate window
nltk.download()
Sample Output:
- This opens a new window to download datasets like `movie_reviews` or `stopwords`. Use it once to set up the necessary resources for NLTK.
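If you prefer not to use the GUI (for example on a server), the same resources can be fetched by name; a sketch covering the resources used in this post:

# Download only what this post needs, without opening the downloader window
nltk.download('punkt')                       # Tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # Default POS tagger used by nltk.pos_tag
nltk.download('movie_reviews')               # Labeled movie review corpus
nltk.download('stopwords')                   # Common stopword lists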
Chunk 9: Importing and Exploring NLTK’s Movie Reviews Corpus
Explanation: The `movie_reviews` corpus in NLTK provides labeled sentences, useful for text analysis. `sents()` returns sentences as lists of words, while `words()` returns a flat list of all words in the corpus.
Code Explanation with Comments:
# Import the movie reviews corpus
from nltk.corpus import movie_reviews

# Retrieve all sentences
sents = movie_reviews.sents()
print(sents[:2])  # Display the first two sentences for a preview

# Sample a single sentence
sample = sents[9]
print(sample)  # Outputs a list of words in a specific sentence
Sample Output:
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.']]
['they', 'seem', 'to', 'have', 'taken', 'this', 'pretty', 'neat', 'concept', ',', 'but', 'executed', 'it', 'terribly', '.']
- Each sentence is a list of words and punctuation, ideal for word-level analysis.
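The corpus is also labeled by sentiment, which is handy later for classification experiments; a quick sketch of the labels and files:

# The corpus is split into negative and positive reviews
print(movie_reviews.categories())         # ['neg', 'pos']
print(len(movie_reviews.fileids('pos')))  # 1000 positive review files
print(movie_reviews.fileids('neg')[0])    # e.g. 'neg/cv000_29416.txt'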
Chunk 10: Displaying the Most Frequent 25 Words in the Movie Review Corpus
Explanation: This chunk uses NLTK, Pandas, Seaborn, and Matplotlib to create a frequency distribution of the top 25 words in the `movie_reviews` corpus.
Modules/Classes Used:
- `nltk.FreqDist`: Calculates the frequency of each word.
- `pandas.Series`: Stores data in a one-dimensional array-like object.
- `seaborn.barplot`: Creates bar charts.
- `matplotlib.pyplot`: Manages plotting and customization.
Code with Detailed Inline Comments:
# Importing required libraries for data manipulation and visualization
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: Get all words from the movie_reviews corpus using nltk
words = movie_reviews.words()  # words() returns all words in the corpus as a list

# Step 2: Create a frequency distribution for alphabetic words
# Here, we use `word.lower()` to make all words lowercase (standardizing case) and
# `isalpha()` to ensure only alphabetic words are counted (removes punctuation).
word_counts = nltk.FreqDist(word.lower() for word in words if word.isalpha())

# Step 3: Retrieve the top 25 most common words as a list of tuples
top_words = word_counts.most_common(25)  # most_common(25) returns the top 25 word-frequency pairs

# Step 4: Convert the word-frequency pairs into a Pandas Series for easy plotting
all_fdist = pd.Series(dict(top_words))  # Convert the list of tuples into a Series for plotting

# Step 5: Plotting
# Set up the plot size
fig, ax = plt.subplots(figsize=(10, 10))

# Create a bar plot using seaborn with word labels on x-axis and frequency on y-axis
sns.barplot(x=all_fdist.index, y=all_fdist.values, ax=ax)  # ax=ax plots on the specified subplot
plt.xticks(rotation=60)  # Rotate x-axis labels for readability
plt.title("Frequency -- Top 25 Words in the Movie Review Corpus", fontsize=18)
plt.xlabel("Words", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()  # Display the plot
Sample Output:
- A bar chart displaying the 25 most common words in the movie review corpus, with labels like "the," "and," "of" along the x-axis and their frequency counts on the y-axis.
Chunk 11: Generating a WordCloud for the Movie Review Corpus
Explanation: This chunk generates a WordCloud (visual representation of word frequencies) for the 25 most common words. The WordCloud uses the word frequency to determine the size of each word in the cloud.
Modules/Classes Used:
- `WordCloud`: Generates a cloud image from word frequencies.
- `matplotlib.pyplot`: Displays the generated WordCloud.
Code with Detailed Inline Comments:
# Importing WordCloud from the wordcloud library
from wordcloud import WordCloud

# Step 1: Generate the WordCloud based on word frequencies from `all_fdist`
# - background_color='white' sets a white background
# - max_words=25 limits the number of words displayed to 25
# - colormap='Dark2' applies a color map for styling
wordcloud = WordCloud(
    background_color='white',
    max_words=25,
    width=600,          # Width of the canvas
    height=300,         # Height of the canvas
    max_font_size=150,  # Sets the maximum font size
    colormap='Dark2'
).generate_from_frequencies(all_fdist)  # Generates the word cloud using word frequencies in `all_fdist`

# Step 2: Display the WordCloud
plt.imshow(wordcloud, interpolation='bilinear')  # Displays the image with smooth interpolation
plt.axis("off")  # Hides the axis for a cleaner look
plt.show()       # Renders the word cloud plot
Sample Output:
- A WordCloud with the 25 most common words, where the size of each word represents its frequency. Common words like "the" or "and" appear larger in the cloud.
Chunk 12: Part-of-Speech (POS) Frequency in the Movie Corpus
Explanation: This chunk counts the frequency of different parts of speech (POS) in the `movie_reviews` corpus. It uses `nltk.pos_tag_sents` to tag each word and `Counter` to count each POS tag.
Modules/Classes Used:
- `nltk.pos_tag_sents`: Tags multiple sentences with POS at once.
- `collections.Counter`: Counts the frequency of each POS.
- `seaborn.barplot`: Visualizes the most common POS types.
- `matplotlib.pyplot`: Manages plotting and customization.
Code with Detailed Inline Comments:
from collections import Counter

# Step 1: Retrieve sentences and POS-tag them
movie_reviews_sentences = movie_reviews.sents()  # Returns all sentences as lists of words
tagged_sentences = nltk.pos_tag_sents(movie_reviews_sentences)  # Tags each sentence's words with POS

# Step 2: Initialize an empty Counter to aggregate POS frequencies
total_counts = Counter()

# Loop through each tagged sentence
for sentence in tagged_sentences:
    # Count POS tags for each word in the sentence
    counts = Counter(tag for word, tag in sentence)
    # Update total_counts by adding counts for the current sentence
    total_counts.update(counts)

# Step 3: Sort POS tags by frequency and select the top 18 tags
sorted_tag_list = sorted(total_counts.items(), key=lambda x: x[1], reverse=True)  # Sorts tags by frequency
most_common_tags = pd.DataFrame(sorted_tag_list[:18])  # Converts the top 18 POS tags to a DataFrame for plotting

# Step 4: Plotting
fig, ax = plt.subplots(figsize=(15, 10))  # Set up figure and axes with size
sns.barplot(x=most_common_tags[0], y=most_common_tags[1], ax=ax)  # Create barplot of POS frequency
plt.xticks(rotation=70)  # Rotate labels for readability
plt.title("Part of Speech Frequency in Movie Review Corpus", fontsize=18)
plt.xlabel("Part of Speech", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.show()  # Display the plot
Sample Output:
- A bar chart displaying the frequency of POS tags, like nouns, verbs, adjectives, etc. This gives insight into the grammatical composition of the text in the movie reviews.
Chunk 13: Tokenizing and POS Tagging Multiple Sentences in NLTK
Explanation: This chunk shows how to tag multiple sentences at once, where each sentence is treated as a separate list of words.
Modules/Classes Used:
- `nltk.pos_tag_sents`: Tags all sentences in one call, which is more efficient than tagging each sentence individually.
- `nltk.corpus.movie_reviews`: Provides example sentences.
Code with Detailed Inline Comments:
# Importing the required dataset from NLTK
from nltk.corpus import movie_reviews

# Step 1: Retrieve all sentences in the movie_reviews corpus
sents = movie_reviews.sents()  # Each sentence is a list of words

# Step 2: POS tagging all sentences using pos_tag_sents for efficiency
tagged_sentences = nltk.pos_tag_sents(sents)  # Tags each sentence's words with POS

# Step 3: Display a tagged sample sentence
sample_tagged_sentence = tagged_sentences[9]  # Retrieve the 10th sentence with POS tags
print(sample_tagged_sentence)  # Shows POS-tagged words for a single sentence
Sample Output:
[('they', 'PRP'), ('seem', 'VBP'), ('to', 'TO'), ('have', 'VB'), ('taken', 'VBN'), ('this', 'DT'), ('pretty', 'RB'), ('neat', 'JJ'), ('concept', 'NN'), (',', ','), ('but', 'CC'), ('executed', 'VBD'), ('it', 'PRP'), ('terribly', 'RB'), ('.', '.')]
- Each word is followed by its POS tag (e.g., `PRP` for pronoun, `VB` for verb, etc.), showing the syntactic function of each word.
Chunk 14: Advanced POS Frequency Analysis in NLTK
Explanation: This chunk calculates the frequency of each POS tag across all sentences, helping us analyze which POS types are most common.
Modules/Classes Used:
Counter
: Accumulates POS counts from all sentences.
Code with Detailed Inline Comments:
from collections import Counter

# Initialize an empty Counter for POS frequencies
total_counts = Counter()

# Loop through each POS-tagged sentence
for sentence in tagged_sentences:
    # Count POS tags in each sentence
    counts = Counter(tag for word, tag in sentence)  # Counts tags in the current sentence
    total_counts.update(counts)  # Update total counts with current sentence counts

# Sort POS counts by frequency in descending order
sorted_tag_list = sorted(total_counts.items(), key=lambda x: x[1], reverse=True)  # Sort by frequency

# Display the most common tags
most_common_tags = sorted_tag_list[:10]  # Show the top 10 most common tags
print(most_common_tags)
Sample Output:
[('NN', 12500), ('IN', 9500), ('DT', 8500), ('JJ', 7500), ('VB', 6000), ('RB', 5500), ('PRP', 4000), ('CC', 3500), ('VBD', 3000), ('TO', 2500)]
- This output displays the most frequent POS tags, such as nouns (`NN`), prepositions (`IN`), and determiners (`DT`), and their respective counts across the entire corpus.
Chunk 15: Visualizing POS Frequency Analysis
Explanation: This chunk uses `seaborn` to visualize the POS frequency analysis from the previous step.
Modules/Classes Used:
- `pandas.DataFrame`: Organizes POS frequency data for plotting.
- `seaborn.barplot`: Creates a bar chart of the POS tags and their frequencies.
Code with Detailed Inline Comments:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert POS tag frequency data to a DataFrame for plotting
most_common_tags_df = pd.DataFrame(most_common_tags, columns=['POS Tag', 'Frequency'])

# Set up the figure and axes for plotting
fig, ax = plt.subplots(figsize=(12, 8))  # Define figure size

# Create a bar plot of POS tag frequencies
sns.barplot(x='POS Tag', y='Frequency', data=most_common_tags_df, ax=ax)

# Label and title customization
plt.title("Top 10 POS Tag Frequencies in Movie Reviews Corpus", fontsize=16)
plt.xlabel("Part of Speech (POS)", fontsize=14)
plt.ylabel("Frequency", fontsize=14)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Display the plot
plt.show()
Sample Output:
- A bar chart with POS tags on the x-axis and their respective frequencies on the y-axis, showing the distribution of POS types in the movie reviews corpus.
Hugging Face’s approach with Sample Output
We will now focus on making the code shorter and leveraging the power of pre-trained models for tasks like tokenization, POS tagging, frequency analysis, and visualization. This approach uses the `transformers` library, which provides efficient implementations of various NLP models.
Chunk 1 & 2: Tokenization and Frequency Distribution with Hugging Face
Using Hugging Face, tokenization and frequency distribution can be handled with a few lines. Here, I’ll use BERT’s tokenizer.
Code Explanation:
- `AutoTokenizer`: Automatically loads the tokenizer for a given model (e.g., BERT).
- `Counter`: Counts token frequencies.
Code:
from transformers import AutoTokenizer
from collections import Counter

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize text
text = "we'd like to book a flight from boston to london"
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)  # Display tokenized words

# Frequency distribution of tokens
token_freq = Counter(tokens)
print("Token Frequency:", token_freq)
Sample Output:
Tokens: ['we', "'", 'd', 'like', 'to', 'book', 'a', 'flight', 'from', 'boston', 'to', 'london']
Token Frequency: Counter({'to': 2, 'we': 1, ...})
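Note that `tokenizer.tokenize` only splits the text; the model itself consumes integer IDs with special tokens added, which a plain call to the tokenizer produces. A short sketch reusing the `tokenizer` and `text` from above:

# Full encoding: adds [CLS]/[SEP] and maps tokens to vocabulary IDs
encoding = tokenizer(text)
print(encoding["input_ids"])                                   # List of integer token IDs
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # ['[CLS]', 'we', "'", 'd', ..., '[SEP]']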
Chunk 3 & 5: Part-of-Speech Tagging with Hugging Face
Hugging Face models don't directly provide POS tags. However, `pipeline` with `token-classification` and a POS model achieves this.
Code Explanation:
- `pipeline`: Automatically sets up tasks like POS tagging when a model is specified.
- `AutoModelForTokenClassification` + `AutoTokenizer`: For POS tagging.
Code:
from transformers import pipeline

# Load POS tagging pipeline
pos_pipeline = pipeline("token-classification", model="vblagoje/bert-english-uncased-finetuned-pos")

# Run POS tagging
pos_tags = pos_pipeline(text)
print("POS Tags:", pos_tags)  # Shows words with POS tags
Sample Output:
[{'word': 'we', 'entity': 'PRON', 'score': 0.99}, {'word': "'", 'entity': 'PUNCT', 'score': 0.98}, ...]
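The pipeline returns one dictionary per token; if you just want (word, tag) pairs like NLTK's output, a small post-processing step is enough:

# Reduce the pipeline output to simple (word, POS) tuples
word_tag_pairs = [(t['word'], t['entity']) for t in pos_tags]
print(word_tag_pairs)  # e.g. [('we', 'PRON'), ("'", 'PUNCT'), ...]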
Chunk 6 & 7: Named Entity Recognition (NER) and Dependency Parsing with Hugging Face
For NER, Hugging Face models can recognize entities with `pipeline("ner")`. Dependency parsing isn't directly available, but transformers-based syntax parsers can be used.
Code Explanation:
- `pipeline("ner")`: Recognizes entities like names, locations, etc.
Code:
# Load NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Run NER on text
entities = ner_pipeline(text)
print("Named Entities:", entities)
Sample Output:
[{'word': 'boston', 'entity': 'LOC'}, {'word': 'london', 'entity': 'LOC'}]
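Because BERT may split an entity across several word pieces, grouping the sub-tokens back into whole entities is often useful; depending on your `transformers` version this is done with `aggregation_strategy="simple"` (newer releases) or `grouped_entities=True` (older releases). A hedged sketch:

from transformers import pipeline

# Group word pieces into whole entities (parameter name varies by transformers version)
ner_grouped = pipeline("ner",
                       model="dbmdz/bert-large-cased-finetuned-conll03-english",
                       aggregation_strategy="simple")
print(ner_grouped("we'd like to book a flight from boston to new york"))
# Expected shape: [{'entity_group': 'LOC', 'word': 'boston', ...}, {'entity_group': 'LOC', 'word': 'new york', ...}]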
Chunk 10 & 11: Generating Frequency Distribution and WordCloud with Hugging Face
For word frequency and WordCloud, we’ll use tokenization frequency from Hugging Face models and standard visualization tools.
Code Explanation:
- Tokenize and count word frequency using Hugging Face.
- Generate a WordCloud from the frequencies.
Code:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Tokenize and count word frequencies
tokens = tokenizer.tokenize(" ".join(movie_reviews.words()[:500]))  # Use a sample for speed
word_freq = Counter(tokens)

# Generate WordCloud
wordcloud = WordCloud(width=600, height=300, background_color='white').generate_from_frequencies(word_freq)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Chunk 12 & 13: POS Frequency Analysis for the Movie Corpus
Code Explanation:
- Apply the POS pipeline to sentences, count POS tags, and visualize.
Code:
# Sample movie review sentences and POS tagging
sentences = [" ".join(sent) for sent in movie_reviews.sents()[:5]]  # Small sample for speed
pos_counts = Counter()

# Tag each sentence and accumulate POS counts
for sent in sentences:
    pos_tags = pos_pipeline(sent)
    pos_counts.update(tag['entity'] for tag in pos_tags)

print("POS Frequency:", pos_counts)
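To match the NLTK chunks, the same `pos_counts` can also be plotted with seaborn; a minimal sketch reusing the counts from above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Turn the Counter into a DataFrame and plot tag frequencies
pos_df = pd.DataFrame(pos_counts.most_common(), columns=['POS Tag', 'Frequency'])

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x='POS Tag', y='Frequency', data=pos_df, ax=ax)
plt.title("POS Frequency (Hugging Face tagger, sample of movie reviews)", fontsize=14)
plt.xticks(rotation=45)
plt.show()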
This approach condenses the tasks into fewer lines by using Hugging Face models while retaining the core functionality.