TF-IDF with N-Grams: Comparing Scikit-Learn and a Custom Approach


When computing TF-IDF, Scikit-Learn applies certain adjustments that differ from the standard textbook approach. While the traditional calculation computes raw term frequency (TF) and inverse document frequency (IDF) separately before multiplying them, Scikit-Learn incorporates smoothing: by default it adds 1 to both the numerator and denominator inside the IDF logarithm (as if one extra document containing every term existed) and adds 1 to the result, so no term ever receives a zero or undefined weight. Additionally, it applies L2 normalization by default, ensuring that each document vector has unit norm. These modifications can cause variations in TF-IDF values compared to a purely theoretical implementation, making it essential to understand these nuances when comparing Scikit-Learn's results with a custom implementation.
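Concretely, with its default settings (smooth_idf=True, norm='l2'), Scikit-Learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A minimal sketch of that formula, with illustrative numbers:
import numpy as np
n = 4     # total number of documents (illustrative value)
df_t = 2  # number of documents containing term t (illustrative value)
idf_t = np.log((1 + n) / (1 + df_t)) + 1  # Scikit-Learn's smoothed IDF
print(round(idf_t, 4))  # 1.5108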
In this work, we compare Scikit-Learn’s TF-IDF computation with a custom implementation to understand their differences, particularly when using n-grams of range (3,4). We present both approaches side by side, ensuring that the custom implementation follows the standard TF-IDF formula, while also capturing the specific preprocessing steps and adjustments made by Scikit-Learn, such as IDF smoothing and L2 normalization. By structuring both implementations into DataFrame outputs, we provide a clear, data-driven comparison to highlight any variations in term weighting and document representation.
[1] Prep dataset
import nltk
nltk.download('punkt_tab')
import numpy as np
import re
# Sample corpus (normalized below)
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
# Function to normalize text
def normalize_document(doc):
    doc = re.sub(r'[^a-zA-Z\s]', '', doc.lower(), flags=re.I | re.A)  # Lowercase and remove special characters
    tokens = nltk.word_tokenize(doc)
    return ' '.join(tokens)
# Normalize corpus
norm_corpus = np.array([normalize_document(doc) for doc in corpus])
The common preprocessing code prepares the text corpus for both the Scikit-Learn and custom TF-IDF implementations by ensuring uniform text normalization. It starts by importing NLTK, NumPy, and the re module for regular expressions, followed by downloading the necessary tokenizer. The corpus consists of four short text samples, which are cleaned using the normalize_document() function. This function converts text to lowercase, removes special characters and punctuation, and tokenizes it with NLTK to ensure consistency. Finally, the cleaned corpus is stored in a NumPy array, ready for TF-IDF computation. This preprocessing step is crucial for removing noise, standardizing text, and enabling meaningful feature extraction for both implementations.
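To see what the cleaning produces, printing norm_corpus should yield the four documents in lowercase with punctuation stripped (a quick check, assuming the cell above has been run):
print(norm_corpus)
# ['this is the first document' 'this document is the second document'
#  'and this is the third one' 'is this the first document']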
[2] SKLearn Approach
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
### Scikit-Learn Implementation
# Initialize TfidfVectorizer with n-gram range (3,4)
tv_ngram = TfidfVectorizer(ngram_range=(3,4))
# Fit and transform the corpus
tv_matrix_ngram = tv_ngram.fit_transform(norm_corpus)
# Extract feature names (terms)
tv_corpus_term_ngram = tv_ngram.get_feature_names_out()
# Convert TF-IDF matrix to DataFrame
df_tfidf_sklearn_ngram = pd.DataFrame(tv_matrix_ngram.toarray(), columns=tv_corpus_term_ngram)
display(df_tfidf_sklearn_ngram)
The Scikit-Learn TF-IDF approach leverages the TfidfVectorizer to compute TF-IDF scores with an n-gram range of (3,4). First, it initializes TfidfVectorizer(ngram_range=(3,4)), which ensures that the model captures word sequences of 3-grams and 4-grams instead of single words. The normalized corpus is then fitted and transformed into a TF-IDF matrix, where each row represents a document and each column corresponds to an n-gram feature extracted from the text. The feature names, i.e., the n-grams, are obtained using get_feature_names_out(), and the resulting TF-IDF matrix is converted into a Pandas DataFrame for structured representation. Finally, display(df_tfidf_sklearn_ngram) presents the output, allowing for easy interpretation of term importance across different documents. This approach provides an efficient and automated way to extract meaningful n-gram-based TF-IDF features using Scikit-Learn.
Output:
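Beyond the displayed DataFrame, the fitted vectorizer also exposes the learned IDF weights through its idf_ attribute, which is handy for spot-checking individual n-grams (a small optional inspection, not part of the original notebook):
# Pair each n-gram feature with its learned IDF weight
idf_inspect = pd.DataFrame({'ngram': tv_corpus_term_ngram, 'idf': tv_ngram.idf_})
display(idf_inspect.head())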
[3] Manual Approach
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
### Custom Implementation
# Create Count Vectorizer matrix with n-gram range (3,4)
cv_ngram = CountVectorizer(ngram_range=(3,4))
cv_matrix_ngram = cv_ngram.fit_transform(norm_corpus)
corpus_term_ngram = cv_ngram.get_feature_names_out()
# Compute Document Frequency (DF): the number of documents containing each n-gram
N = len(norm_corpus)
corpus_term_document_count_ngram = {
    term: sum(1 for i in range(cv_matrix_ngram.shape[0]) if cv_matrix_ngram[i, idx] > 0)
    for idx, term in enumerate(corpus_term_ngram)
}
# Compute Inverse Document Frequency (IDF) with Scikit-Learn's smoothing: ln((1+N)/(1+df)) + 1
corpus_term_idf_ngram = {
    term: np.log((1 + N) / (1 + corpus_term_document_count_ngram[term])) + 1
    for term in corpus_term_ngram
}
# Compute Term Frequency-Inverse Document Frequency (TF-IDF): raw count times IDF
corpus_document_term_tfidf_ngram = []
for i in range(cv_matrix_ngram.shape[0]):
    doc_vector = {term: cv_matrix_ngram[i, idx] * corpus_term_idf_ngram[term]
                  for idx, term in enumerate(corpus_term_ngram)}
    corpus_document_term_tfidf_ngram.append(doc_vector)
# Apply L2 normalization so each document vector has unit norm, matching Scikit-Learn's default
corpus_document_term_tfidf_normalized_ngram = []
for doc_tfidf in corpus_document_term_tfidf_ngram:
    norm_factor = np.sqrt(sum(v ** 2 for v in doc_tfidf.values()))
    normalized_vector = {term: (doc_tfidf[term] / norm_factor if norm_factor > 0 else 0)
                         for term in corpus_term_ngram}
    corpus_document_term_tfidf_normalized_ngram.append(normalized_vector)
# Convert to DataFrame
df_tfidf_custom_ngram = pd.DataFrame(corpus_document_term_tfidf_normalized_ngram)
# Display the DataFrames
display(df_tfidf_custom_ngram)
The custom TF-IDF approach follows a step-by-step manual computation of TF-IDF with n-grams (3,4), closely mirroring the Scikit-Learn implementation while maintaining flexibility for customization. It begins by using CountVectorizer(ngram_range=(3,4)) to extract n-gram features and generate a count matrix. Next, it computes the Document Frequency (DF) for each n-gram, counting how many documents contain each term. The Inverse Document Frequency (IDF) is then calculated using the logarithmic smoothing formula, ensuring that terms appearing in many documents receive lower weights. The TF-IDF score for each n-gram in a document is obtained by multiplying its term frequency (TF) by its IDF value. To match Scikit-Learn's normalization, L2 normalization is applied, ensuring that each document vector has a unit norm. Finally, the computed TF-IDF values are stored in a Pandas DataFrame for structured visualization, allowing easy comparison with the Scikit-Learn results. This approach provides greater control over TF-IDF computation, making it useful for fine-tuning and experimentation in text analytics and NLP tasks.
Output:
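As a final sanity check, the two DataFrames can be compared numerically; since the custom implementation reproduces Scikit-Learn's smoothed IDF and L2 normalization, the values should agree up to floating-point error (a quick verification sketch, assuming both cells above have been run):
# Align the custom columns to Scikit-Learn's column order, then compare values
df_custom_aligned = df_tfidf_custom_ngram[df_tfidf_sklearn_ngram.columns]
print(np.allclose(df_tfidf_sklearn_ngram.values, df_custom_aligned.values))  # True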
Colab Notebook:
https://colab.research.google.com/drive/15RJrF-Rhmslv3dkcffj92rJU0Qdohluq?usp=sharing
Written by Mohamad Mahmood
Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He studies at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).