Chunking Strategies

Chunking strategies are essential when processing large text documents for applications like Retrieval-Augmented Generation (RAG). They break extensive text into smaller, manageable pieces (chunks) that can be individually embedded and efficiently retrieved.

Below, I'll explain chunking strategies commonly used in production environments, provide a code example for each, and demonstrate them on a small test dataset.


Table of Contents

  1. Test Dataset

  2. Chunking Strategies

  3. Conclusion



Test Dataset

Let's start with a small text dataset we'll use for demonstrating the chunking strategies:

test_text = """
Chapter 1: Introduction

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Chapter 2: Background

In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language.

Chapter 3: Techniques

Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.

Chapter 4: Applications

NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology.

Conclusion

The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.
"""

Chunking Strategies

1. Fixed-size Chunking (By Tokens)

Description

  • Fixed-size chunking splits text into chunks of a fixed number of tokens (here, words produced by a word tokenizer). This keeps every chunk roughly the same size, which matters for models with strict input-size limits.

  • Pros: Simple to implement, consistent chunk sizes.

  • Cons: May split sentences or paragraphs unnaturally, potentially losing context.

Code Example

import nltk
nltk.download('punkt')  # tokenizer models; needed once per environment
from nltk.tokenize import word_tokenize

def fixed_size_chunking(text, max_tokens=50):
    tokens = word_tokenize(text)
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk_tokens = tokens[i:i + max_tokens]
        chunk = ' '.join(chunk_tokens)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = fixed_size_chunking(test_text, max_tokens=50)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")

Output

Chunk 1:
Chapter 1 : Introduction Natural Language Processing ( NLP ) is a subfield of linguistics , computer science , and artificial intelligence concerned with the interactions between computers and human language . Chapter 2 : Background In NLP , there are many tasks , such as text classification ,

Chunk 2:
machine translation , and sentiment analysis . These tasks involve understanding and generating human language . Chapter 3 : Techniques Some common techniques in NLP include tokenization , stemming , lemmatization , and part - of - speech tagging . Advanced methods involve neural networks and deep learning .

Chunk 3:
Chapter 4 : Applications NLP is widely used in applications like chatbots , virtual assistants , and automated summarization . It plays a crucial role in modern technology . Conclusion The field of NLP is rapidly evolving , with new techniques and applications emerging regularly . It is an exciting area

Chunk 4:
of study with significant real - world impact .

Explanation

  • The text is tokenized into words.

  • Chunks are created by selecting max_tokens number of tokens sequentially.

  • This method may split sentences or paragraphs midway, losing local context.


2. Sentence-based Chunking

Description

  • Sentence-based chunking splits the text into sentences and then groups a fixed number of sentences into a chunk.

  • Pros: Preserves sentence boundaries, better context within chunks.

  • Cons: Chunk sizes may vary, which can be problematic for models with strict input size limits.

Code Example

from nltk.tokenize import sent_tokenize

def sentence_based_chunking(text, sentences_per_chunk=3):
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk_sentences = sentences[i:i + sentences_per_chunk]
        chunk = ' '.join(chunk_sentences)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = sentence_based_chunking(test_text, sentences_per_chunk=3)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")

Output

Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background

Chunk 2:
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques

Chunk 3:
Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning. Chapter 4: Applications

Chunk 4:
NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology. Conclusion

Chunk 5:
The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.

Explanation

  • The text is split into sentences.

  • Chunks are formed by grouping sentences_per_chunk sentences together.

  • This method keeps sentences intact, improving coherence within chunks.


3. Semantic or Section-based Chunking

Description

  • Semantic or section-based chunking splits the text based on semantic boundaries like chapters, headings, or sections.

  • Pros: Maintains the logical structure and context, ideal for structured documents.

  • Cons: Chunk sizes can vary greatly, potentially exceeding model input limits.

Code Example

import re

def semantic_chunking(text):
    # Use regex to split based on headings (e.g., "Chapter X: Title")
    pattern = r"(Chapter \d+:.*|Conclusion)"
    sections = re.split(pattern, text)
    chunks = []
    current_chunk = ''
    for section in sections:
        if re.match(pattern, section):
            # A new heading starts a new chunk; flush the previous one
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = section
        else:
            current_chunk += ' ' + section
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

# Apply the function
chunks = semantic_chunking(test_text)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")

Output

Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Chunk 2:
Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language.

Chunk 3:
Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.

Chunk 4:
Chapter 4: Applications NLP is widely used in applications like chatbots, virtual assistants, and automated summarization. It plays a crucial role in modern technology.

Chunk 5:
Conclusion The field of NLP is rapidly evolving, with new techniques and applications emerging regularly. It is an exciting area of study with significant real-world impact.

Explanation

  • The text is split wherever the regex matches a heading such as "Chapter X:" or "Conclusion".

  • This method preserves the semantic structure of the document.

  • Ideal for documents with clear sections or headings; a Markdown variant is sketched below.
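
For Markdown-style documents, the same idea can be applied by splitting just before heading lines. A minimal sketch, assuming headings start with one or more '#' characters at the start of a line:

import re

def markdown_section_chunking(text):
    # Split immediately before each heading line ('#', '##', ...), so the
    # heading stays attached to the body that follows it
    sections = re.split(r"(?m)^(?=#+ )", text)
    return [s.strip() for s in sections if s.strip()]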


4. Overlapping Windows

Description

  • Overlapping windows create chunks that share some content with adjacent chunks. This helps maintain context between chunks.

  • Pros: Preserves context across chunks, reduces information loss at boundaries.

  • Cons: Increases the number of chunks, leading to potential redundancy.

Code Example

def overlapping_chunking(text, max_tokens=50, overlap_tokens=10):
    # Guard against an infinite loop when the overlap swallows the stride
    assert overlap_tokens < max_tokens, "overlap must be smaller than max_tokens"
    tokens = word_tokenize(text)
    chunks = []
    step = max_tokens - overlap_tokens  # forward stride between chunk starts
    for i in range(0, len(tokens), step):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(' '.join(chunk_tokens))
    return chunks

# Apply the function
chunks = overlapping_chunking(test_text, max_tokens=50, overlap_tokens=10)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")

Output

Chunk 1:
Chapter 1 : Introduction Natural Language Processing ( NLP ) is a subfield of linguistics , computer science , and artificial intelligence concerned with the interactions between computers and human language . Chapter 2 : Background In NLP , there are many tasks , such as text classification ,

Chunk 2:
classification , machine translation , and sentiment analysis . These tasks involve understanding and generating human language . Chapter 3 : Techniques Some common techniques in NLP include tokenization , stemming , lemmatization , and part - of - speech tagging . Advanced methods involve neural

Chunk 3:
involve neural networks and deep learning . Chapter 4 : Applications NLP is widely used in applications like chatbots , virtual assistants , and automated summarization . It plays a crucial role in modern technology . Conclusion The field of NLP is rapidly evolving , with new techniques and

Chunk 4:
techniques and applications emerging regularly . It is an exciting area of study with significant real - world impact .

Explanation

  • Chunks are created with a specified number of tokens, and each subsequent chunk starts max_tokens - overlap_tokens tokens ahead.

  • This creates overlapping regions between chunks, helping to preserve context.

  • Useful when context around chunk boundaries is important.


5. Recursive Text Splitting

Description

  • Recursive text splitting uses a hierarchical approach to split text based on various delimiters, such as sections, paragraphs, and sentences.

  • Pros: Flexible, adapts to the structure of the text, aims to keep chunks within size limits.

  • Cons: Implementation complexity, may still produce variable chunk sizes.

Code Example

def recursive_split(text, max_tokens=50, delimiters=['\n\n', '. ', ' ']):
    # Base case: no delimiters left, so fall back to fixed-size token chunking
    if not delimiters:
        return fixed_size_chunking(text, max_tokens)
    delimiter = delimiters[0]
    chunks = []
    current_chunk = ''
    for part in text.split(delimiter):
        # Greedily merge parts while the merged chunk stays within the limit
        candidate = current_chunk + delimiter + part if current_chunk else part
        if len(word_tokenize(candidate)) <= max_tokens:
            current_chunk = candidate
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = part
    if current_chunk:
        chunks.append(current_chunk.strip())
    # Re-split only the chunks that still exceed the limit, using the next delimiter
    return [sub_chunk
            for chunk in chunks
            for sub_chunk in (recursive_split(chunk, max_tokens, delimiters[1:])
                              if len(word_tokenize(chunk)) > max_tokens
                              else [chunk])]

# Apply the function
chunks = recursive_split(test_text, max_tokens=50)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\nToken count: {len(word_tokenize(chunk))}\n")

Output

Chunk 1:
Chapter 1: Introduction

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language.

Token count: 37

Chunk 2:
Chapter 2: Background

In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis.

Token count: 24

Chunk 3:
These tasks involve understanding and generating human language.

Chapter 3: Techniques

Token count: 15

Chunk 4:
Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging.

Token count: 17

Chunk 5:
Advanced methods involve neural networks and deep learning.

Chapter 4: Applications

Token count: 13

Chunk 6:
NLP is widely used in applications like chatbots, virtual assistants, and automated summarization.

Token count: 16

Chunk 7:
It plays a crucial role in modern technology.

Conclusion

Token count: 10

Chunk 8:
The field of NLP is rapidly evolving, with new techniques and applications emerging regularly.

Token count: 15

Chunk 9:
It is an exciting area of study with significant real-world impact.

Token count: 13

Explanation

  • The function attempts to split the text using the first delimiter (\n\n for paragraphs).

  • If resulting chunks are within the max_tokens limit, it returns them.

  • If not, oversized chunks are recursively split further using the next delimiter (sentences, then words).

  • This approach aims to keep chunks semantically coherent while respecting the token limit.
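
In production, this pattern is usually delegated to a library rather than hand-rolled. As a sketch (assuming LangChain's langchain_text_splitters package is installed; defaults and APIs can shift between versions), RecursiveCharacterTextSplitter implements the same fallback idea, measuring chunk size in characters by default:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Separators are tried in order; oversized pieces fall back to finer ones
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # measured in characters by default, not tokens
    chunk_overlap=20,
    separators=["\n\n", ". ", " "],
)
chunks = splitter.split_text(test_text)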


6. Sliding Window Chunking

Description

  • Sliding window chunking creates chunks by moving a window over the text incrementally, often one sentence at a time.

  • Pros: Maintains context, useful for sequence prediction tasks.

  • Cons: Can generate many chunks, increasing computational load.

Code Example

def sliding_window_chunking(text, window_size=3, step_size=1):
    sentences = sent_tokenize(text)
    chunks = []
    for i in range(0, len(sentences) - window_size + 1, step_size):
        chunk_sentences = sentences[i:i + window_size]
        chunk = ' '.join(chunk_sentences)
        chunks.append(chunk)
    return chunks

# Apply the function
chunks = sliding_window_chunking(test_text, window_size=3, step_size=1)

# Display the chunks
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}:\n{chunk}\n")

Output

Chunk 1:
Chapter 1: Introduction Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background

Chunk 2:
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis.

Chunk 3:
Chapter 2: Background In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques

Chunk 4:
In NLP, there are many tasks, such as text classification, machine translation, and sentiment analysis. These tasks involve understanding and generating human language. Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging.

Chunk 5:
These tasks involve understanding and generating human language. Chapter 3: Techniques Some common techniques in NLP include tokenization, stemming, lemmatization, and part-of-speech tagging. Advanced methods involve neural networks and deep learning.

... (and so on)

Explanation

  • The window moves over the sentences with a specified window_size and step_size.

  • Each chunk contains window_size sentences.

  • Useful for capturing context that spans across sentences.


Conclusion

In production environments, the choice of chunking strategy depends on the specific requirements of the application, such as:

  • Model Limitations: Maximum input size constraints (see the token-counting sketch after this list).

  • Context Preservation: Need to maintain semantic coherence.

  • Performance Considerations: Computational resources and processing time.

  • Data Structure: Nature of the text data (structured vs. unstructured).
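
When validating chunks against a model's input limit, count tokens with the model's own tokenizer rather than whitespace words. A minimal sketch, assuming the tiktoken package and its cl100k_base encoding (used by several OpenAI models):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Report each chunk's size as the target model would see it
for idx, chunk in enumerate(chunks):
    print(f"Chunk {idx+1}: {len(enc.encode(chunk))} tokens")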

Commonly Used Strategies in Production:

  • Recursive Text Splitting: Preferred for its flexibility and ability to adapt to text structure while respecting token limits.

  • Overlapping Windows: Used when maintaining context across chunks is crucial.

  • Semantic Chunking: Ideal for structured documents with clear sections or headings.

Best Practices:

  • Combine Strategies: Sometimes, combining multiple strategies yields the best results (a sketch follows this list).

  • Experimentation: Test different chunk sizes and overlap amounts to find the optimal balance.

  • Dynamic Chunking: Implement logic to adjust chunking based on the content and model feedback.
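
As a hedged illustration of combining strategies, the sketch below first splits on section headings with semantic_chunking, then applies recursive_split (both defined above) only to sections that exceed the token budget:

def combined_chunking(text, max_tokens=50):
    final_chunks = []
    for section in semantic_chunking(text):
        # Keep sections that fit; recursively re-split the rest
        if len(word_tokenize(section)) > max_tokens:
            final_chunks.extend(recursive_split(section, max_tokens))
        else:
            final_chunks.append(section)
    return final_chunks

chunks = combined_chunking(test_text, max_tokens=50)

In practice, start with the simplest strategy that satisfies your model's limits, and iterate from there.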

