Chunking Strategies
Chunking strategies are essential when processing large text documents for applications like Retrieval-Augmented Generation (RAG). They break extensive text into smaller, manageable pieces (chunks) that can be individually embedded and efficiently retrieved.
Below, I'll explain the chunking strategies commonly used in production environments, provide code examples for each, and demonstrate them on a small test dataset.
Test Dataset
Let's start with a small text dataset we'll use for demonstrating the chunking strategies:
test_text = """
Chapter 1: Introduction
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Chapter 2: Background
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language.

Chapter 3: Techniques
Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.

Chapter 4: Applications
NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology.

Conclusion
The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.
"""
Chunking Strategies
1. Fixed-size Chunking (By Tokens)
Description
Fixed-size chunking splits text into chunks of a fixed number of tokens (here, words produced by NLTK's word tokenizer). This keeps every chunk roughly the same size, which matters for models with hard input-size limits.
Pros: Simple to implement, consistent chunk sizes.
Cons: May split sentences or paragraphs unnaturally, potentially losing context.
Code Example
import nltk
nltk.download('punkt')  # tokenizer models required by word_tokenize
from nltk.tokenize import word_tokenize

def fixed_size_chunking(text, max_tokens=50):
    tokens = word_tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        # Take the next max_tokens tokens and rejoin them into one chunk
        chunk_tokens = tokens[i:i + max_tokens]
        chunk = ' '.join(chunk_tokens)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = fixed_size_chunking(test_text, max_tokens=50)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")
Output
Chunk 1:
Chapter 1 : Introduction Natural Language Processing ( NLP ) is a subfield of linguistics , computer science , and artificial intelligence concerned with the interactions between computers and human language . Chapter 2 : Background In NLP , there are many tasks , such as text classification ,
Chunk 2:
machine translation , and sentiment analysis . These tasks involve understanding and generating human language . Chapter 3 : Techniques Some common techniques in NLP include tokenization , stemming , lemmatization , and part - of - speech tagging . Advanced methods involve neural networks and deep learning .
Chunk 3:
Chapter 4 : Applications NLP is widely used in applications like chatbots , virtual assistants , and automated summarization . It plays a crucial role in modern technology . Conclusion The field of NLP is rapidly evolving , with new techniques and applications emerging regularly . It is an exciting area
Chunk 4:
of study with significant real - world impact .
Explanation
The text is tokenized into words.
Chunks are created by selecting max_tokens tokens sequentially.
This method may split sentences or paragraphs mid-way.
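Note that "tokens" here are NLTK word tokens; production RAG systems usually measure chunk size in model tokens instead. Below is a minimal sketch of the same strategy using a subword tokenizer, assuming the tiktoken package and the cl100k_base encoding (both are assumptions, not part of the example above):

import tiktoken

def fixed_size_chunking_model_tokens(text, max_tokens=50):
    # cl100k_base is an assumed choice; pick the encoding matching your model
    enc = tiktoken.get_encoding("cl100k_base")
    token_ids = enc.encode(text)
    # Decode each slice of token IDs back into text so chunks stay readable
    return [enc.decode(token_ids[i:i + max_tokens])
            for i in range(0, len(token_ids), max_tokens)]

Because decoding reverses encoding, this variant also avoids the spaced-out punctuation visible in the output above.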
2. Sentence-based Chunking
Description
Sentence-based chunking splits the text into sentences and then groups a fixed number of sentences into a chunk.
Pros: Preserves sentence boundaries, better context within chunks.
Cons: Chunk sizes may vary, which can be problematic for models with strict input size limits.
Code Example
from nltk.tokenize import sent_tokenize

def sentence_based_chunking(text, sentences_per_chunk=3):
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        # Group the next sentences_per_chunk sentences into one chunk
        chunk_sentences = sentences[i:i + sentences_per_chunk]
        chunk = ' '.join(chunk_sentences)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = sentence_based_chunking(test_text, sentences_per_chunk=3)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")
Output
Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background
Chunk 2:
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques
Chunk 3:
Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning. Chapter 4: Applications
Chunk 4:
NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology. Conclusion
Chunk 5:
The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.
Explanation
The text is split into sentences.
Chunks are formed by grouping sentences_per_chunk sentences together.
This method keeps sentences intact, improving coherence within chunks.
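A common production variant, sketched below, keeps sentence boundaries but fills each chunk up to a token budget instead of a fixed sentence count, which addresses the variable-size drawback (the function name and budget are illustrative, not from the example above):

def sentence_chunking_by_budget(text, max_tokens=50):
    sentences = sent_tokenize(text)
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        sentence_len = len(word_tokenize(sentence))
        # Flush the current chunk when this sentence would exceed the budget
        if current and current_len + sentence_len > max_tokens:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += sentence_len
    if current:
        chunks.append(' '.join(current))
    return chunks

A single sentence longer than the budget still becomes its own oversized chunk; handling that case requires falling back to token-level splitting.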
3. Semantic or Section-based Chunking
Description
Semantic or section-based chunking splits the text based on semantic boundaries like chapters, headings, or sections.
Pros: Maintains the logical structure and context, ideal for structured documents.
Cons: Chunk sizes can vary greatly, potentially exceeding model input limits.
Code Example
import re

def semantic_chunking(text):
    # Split on headings like "Chapter X: Title" or "Conclusion";
    # the capture group keeps the headings in the split result
    pattern = r"(Chapter \d+:.*|Conclusion)"
    sections = re.split(pattern, text)
    chunks = []
    current_chunk = ''
    for section in sections:
        if re.match(pattern, section):
            # A new heading starts a new chunk; flush the previous one
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = ''
        current_chunk += ' ' + section
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

# Apply the function
chunks = semantic_chunking(test_text)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")
Output
Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.
Chunk 2:
Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language.
Chunk 3:
Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.
Chunk 4:
Chapter 4: Applications NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology.
Chunk 5:
Conclusion The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.
Explanation
The text is split using a regex that matches headings such as "Chapter X:" or "Conclusion".
This method preserves the semantic structure of the document.
Ideal for documents with clear sections or headings.
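Real documents rarely use literal "Chapter N:" headings, so the split pattern has to match your corpus. As a hedged example, the variant below splits Markdown text immediately before each heading line (the regex is an assumption about the input format):

def markdown_semantic_chunking(text):
    # Split right before any line starting with 1-6 '#' characters
    pattern = r"(?m)^(?=#{1,6} )"
    return [chunk.strip() for chunk in re.split(pattern, text) if chunk.strip()]

Splitting on a zero-width lookahead keeps each heading attached to the section it introduces (splitting on empty matches requires Python 3.7+).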
4. Overlapping Windows
Description
Overlapping windows create chunks that share some content with adjacent chunks. This helps maintain context between chunks.
Pros: Preserves context across chunks, reduces information loss at boundaries.
Cons: Increases the number of chunks, leading to potential redundancy.
Code Example
def overlapping_chunking(text, max_tokens=50, overlap_tokens=10):
    tokens = word_tokenize(text)
    chunks = []
    i = 0
    # overlap_tokens must be smaller than max_tokens, or the loop never advances
    while i < len(tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk = ' '.join(chunk_tokens)
        chunks.append(chunk)
        i += max_tokens - overlap_tokens  # Move forward with overlap
    return chunks

# Apply the function
chunks = overlapping_chunking(test_text, max_tokens=50, overlap_tokens=10)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")
Output
Chunk 1:
Chapter 1 : Introduction Natural Language Processing ( NLP ) is a subfield of linguistics , computer science , and artificial intelligence concerned with the interactions between computers and human language . Chapter 2 : Background In NLP , there are many tasks , such as text classification ,
Chunk 2:
classification , machine translation , and sentiment analysis . These tasks involve understanding and generating human language . Chapter 3 : Techniques Some common techniques in NLP include tokenization , stemming , lemmatization , and part - of - speech tagging . Advanced methods involve neural
Chunk 3:
involve neural networks and deep learning . Chapter 4 : Applications NLP is widely used in applications like chatbots , virtual assistants , and automated summarization . It plays a crucial role in modern technology . Conclusion The field of NLP is rapidly evolving , with new techniques and
Chunk 4:
techniques and applications emerging regularly . It is an exciting area of study with significant real - world impact .
Explanation
Chunks are created with a specified number of tokens, and each subsequent chunk starts max_tokens - overlap_tokens tokens ahead.
This creates overlapping regions between chunks, helping to preserve context.
Useful when context around chunk boundaries is important.
5. Recursive Text Splitting
Description
Recursive text splitting uses a hierarchical approach to split text based on various delimiters, such as sections, paragraphs, and sentences.
Pros: Flexible, adapts to the structure of the text, aims to keep chunks within size limits.
Cons: Implementation complexity, may still produce variable chunk sizes.
Code Example
def recursive_split(text, max_tokens=50, delimiters=['\n\n', '. ', ' ']):
    for delimiter in delimiters:
        parts = text.split(delimiter)
        chunks = []
        current_chunk = ''
        for part in parts:
            # Greedily merge parts while the merged chunk stays within the budget
            candidate = current_chunk + delimiter + part if current_chunk else part
            if len(word_tokenize(candidate)) <= max_tokens:
                current_chunk = candidate
            else:
                if current_chunk:
                    chunks.append(current_chunk.strip())
                current_chunk = part
        if current_chunk:
            chunks.append(current_chunk.strip())
        if all(len(word_tokenize(chunk)) <= max_tokens for chunk in chunks):
            return chunks
        # Otherwise, recurse into each chunk with the next, finer delimiter
        return [sub_chunk for chunk in chunks
                for sub_chunk in recursive_split(chunk, max_tokens, delimiters[1:])]
    # No delimiters left: fall back to fixed-size token chunking
    return fixed_size_chunking(text, max_tokens)

# Apply the function
chunks = recursive_split(test_text, max_tokens=50)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\nToken count: {len(word_tokenize(chunk))}\n")
Output
Chunk 1:
Chapter 1: Introduction
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.
Token count: 37
Chunk 2:
Chapter 2: Background
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis.
Token count: 24
Chunk 3:
These tasks involve understanding and generating human language.
Chapter 3: Techniques
Token count: 15
Chunk 4:
Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging.
Token count: 17
Chunk 5:
Advanced methods involve neural networks and deep learning.
Chapter 4: Applications
Token count: 13
Chunk 6:
NLP is widely used in applications like chatbots, virtual assistants, and automated summarization.
Token count: 16
Chunk 7:
It plays a crucial role in modern technology.
Conclusion
Token count: 10
Chunk 8:
The field of NLP is rapidly evolving, with new techniques and applications emerging regularly.
Token count: 15
Chunk 9:
It is an exciting area of study with significant real-world impact.
Token count: 13
Explanation
The function first attempts to split the text using the first delimiter (\n\n, for paragraphs).
If the resulting chunks are within the max_tokens limit, it returns them.
If not, it recursively splits the chunks further using the next delimiter (sentences, then words).
This approach aims to keep chunks semantically coherent while respecting the token limit.
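In practice you rarely hand-roll this: LangChain's RecursiveCharacterTextSplitter implements the same hierarchical idea. A minimal usage sketch follows; note that chunk_size is measured in characters by default (pass a custom length_function to count tokens), and the exact import path varies across LangChain versions:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", ". ", " "],  # same delimiter hierarchy as above
    chunk_size=200,                  # characters by default, not tokens
    chunk_overlap=0,
)
chunks = splitter.split_text(test_text)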
6. Sliding Window Chunking
Description
Sliding window chunking creates chunks by moving a window over the text incrementally, often one sentence at a time.
Pros: Maintains context, useful for sequence prediction tasks.
Cons: Can generate many chunks, increasing computational load.
Code Example
def sliding_window_chunking(text, window_size=3, step_size=1):
    sentences = sent_tokenize(text)
    chunks = []
    # Slide a window of window_size sentences forward by step_size each time
    for i in range(0, len(sentences) - window_size + 1, step_size):
        chunk_sentences = sentences[i:i + window_size]
        chunk = ' '.join(chunk_sentences)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = sliding_window_chunking(test_text, window_size=3, step_size=1)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")
Output
Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background
Chunk 2:
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis.
Chunk 3:
Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques
Chunk 4:
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging.
Chunk 5:
These tasks involve understanding and generating human language. Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.
... (and so on)
Explanation
The window moves over the sentences with a specified window_size and step_size.
Each chunk contains window_size sentences.
Useful for capturing context that spans across sentences.
Conclusion
In production environments, the choice of chunking strategy depends on the specific requirements of the application, such as:
Model Limitations: Maximum input size constraints.
Context Preservation: Need to maintain semantic coherence.
Performance Considerations: Computational resources and processing time.
Data Structure: Nature of the text data (structured vs. unstructured).
Commonly Used Strategies in Production:
Recursive Text Splitting: Preferred for its flexibility and ability to adapt to text structure while respecting token limits.
Overlapping Windows: Used when maintaining context across chunks is crucial.
Semantic Chunking: Ideal for structured documents with clear sections or headings.
Best Practices:
Combine Strategies: Combining multiple strategies often yields the best results; see the sketch after this list.
Experimentation: Test different chunk sizes and overlap amounts to find the optimal balance.
Dynamic Chunking: Implement logic to adjust chunking based on the content and model feedback.
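As an illustration of combining strategies, the sketch below (the wrapper is my own, built from the functions defined earlier) chunks semantically first, then falls back to recursive splitting only for sections that exceed the token budget:

def combined_chunking(text, max_tokens=50):
    final_chunks = []
    for section in semantic_chunking(text):
        if len(word_tokenize(section)) <= max_tokens:
            final_chunks.append(section)
        else:
            # Oversized sections are split further while respecting the budget
            final_chunks.extend(recursive_split(section, max_tokens))
    return final_chunks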