Building NeuroStash - VI

Farhan Khoja

Raw thoughts from the trenches of building better retrieval systems

The Problem That Kept Me Up at Night

You know that feeling when your RAG system returns technically correct but completely useless chunks? Yeah, I was there. My traditional text splitters were butchering documents like a blunt knife through steak - creating arbitrary 512-token chunks that split sentences mid-thought, separated context from conclusions, and generally made my retrieval system feel like it had amnesia.

I was getting chunks like this:

Chunk 1: "...the economic impact was severe. Inflation rose to"
Chunk 2: "15% by the end of the quarter, causing widespread"
Chunk 3: "concern among investors and policymakers who..."

Useless. Absolutely useless.

Enter Semantic Chunking

While diving deep into LangChain's experimental modules, I stumbled upon something that made me question everything I thought I knew about text splitting. The SemanticChunker - a text splitter that actually understands what it's reading.

Instead of counting tokens like a robot, it uses embeddings to measure semantic similarity between sentences. Mind = blown.

How This Beast Actually Works

Let me break down the magic happening under the hood:

Step 1: Sentence Buffering (The Context Window Trick)

from typing import List

def combine_sentences(sentences: List[dict], buffer_size: int = 1) -> List[dict]:
    # Attach a sliding window of neighboring sentences to each sentence
    for i in range(len(sentences)):
        combined_sentence = ""

        # Add context BEFORE current sentence
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]["sentence"] + " "

        # Add current sentence
        combined_sentence += sentences[i]["sentence"]

        # Add context AFTER current sentence
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += " " + sentences[j]["sentence"]

        sentences[i]["combined_sentence"] = combined_sentence

    return sentences

This is brilliant. Instead of treating each sentence in isolation, it creates a sliding window of context. So instead of just embedding "The results were unexpected.", it embeds "The experiment ran for 6 months. The results were unexpected. Further analysis was needed."

Context matters. Always.
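
To make that concrete, here's a toy run through the combine_sentences function above (the sentences are made up, purely to show what the buffered output looks like):

sentences = [
    {"sentence": "The experiment ran for 6 months."},
    {"sentence": "The results were unexpected."},
    {"sentence": "Further analysis was needed."},
]

buffered = combine_sentences(sentences, buffer_size=1)
print(buffered[1]["combined_sentence"])
# "The experiment ran for 6 months. The results were unexpected. Further analysis was needed."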

Step 2: Semantic Distance Calculation (Where the Magic Happens)

from typing import List, Tuple

from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_distances(sentences: List[dict]) -> Tuple[List[float], List[dict]]:
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]["combined_sentence_embedding"]
        embedding_next = sentences[i + 1]["combined_sentence_embedding"]

        # Calculate semantic similarity
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]

        # Convert to distance (higher = more different)
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]["distance_to_next"] = distance

    return distances, sentences

Here's where it gets interesting. The system calculates how semantically different consecutive sentences are. High distance = topic shift. Low distance = same topic.
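
One step I'm glossing over between those two snippets: the combined sentences need to be embedded before you can measure distances. Here's a minimal sketch of how that could look - the helper name embed_combined_sentences is mine, and I'm assuming the langchain_openai package for OpenAIEmbeddings:

from typing import List

from langchain_openai import OpenAIEmbeddings

def embed_combined_sentences(sentences: List[dict]) -> List[dict]:
    # Embed all buffered sentences in one batch call
    embeddings = OpenAIEmbeddings()
    vectors = embeddings.embed_documents(
        [s["combined_sentence"] for s in sentences]
    )
    for sentence, vector in zip(sentences, vectors):
        sentence["combined_sentence_embedding"] = vector
    return sentences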

Step 3: Smart Breakpoint Detection

The system offers multiple strategies to decide where to split:

import numpy as np

# Percentile approach - split at the top 5% most different transitions
breakpoint_threshold = np.percentile(distances, 95)

# Standard deviation - split when distance is 3 std devs above mean
breakpoint_threshold = np.mean(distances) + 3 * np.std(distances)

# Gradient approach - find where the rate of change spikes
distance_gradient = np.gradient(distances)
breakpoint_threshold = np.percentile(distance_gradient, 95)

I've been experimenting with all three, and honestly? Percentile at 95% has been my sweet spot for technical documentation.
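
To close the loop: once you have a threshold, every index whose distance crosses it becomes a split point, and the sentences between split points get joined back into chunks. A simplified sketch of that grouping step (not the exact LangChain code, just the idea):

import numpy as np
from typing import List

def group_into_chunks(sentences: List[dict], distances: List[float]) -> List[str]:
    # Split wherever the distance to the next sentence lands in the top 5%
    threshold = np.percentile(distances, 95)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]

    chunks = []
    start = 0
    for bp in breakpoints:
        # Sentences start..bp stay together as one topic
        chunks.append(" ".join(s["sentence"] for s in sentences[start:bp + 1]))
        start = bp + 1

    # Everything after the last breakpoint becomes the final chunk
    chunks.append(" ".join(s["sentence"] for s in sentences[start:]))
    return chunks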

Real-World Impact on My RAG Pipeline

Here's what changed when I swapped out my old chunker:

Before (Traditional Chunking):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Old approach - dumb but fast
text_splitter = RecursiveCharacterTextSplitter(  # or any fixed-size splitter
    chunk_size=512,
    chunk_overlap=50,
)
chunks = text_splitter.split_text(document)

After (Semantic Chunking):

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# New approach - smart and contextual
semantic_chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
    buffer_size=1,
    min_chunk_size=100
)

chunks = semantic_chunker.split_text(document)
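
If you want to sanity-check what comes out, a quick loop over the chunks shows where the boundaries landed:

# Eyeball chunk sizes and the start of each chunk
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {len(chunk)} chars | {chunk[:60]!r}...")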

The Performance Trade-off (Because Nothing's Free)

Let's be real - this isn't free. Semantic chunking requires:

  • Embedding generation for every sentence group

  • Cosine similarity calculations

  • Statistical analysis for breakpoint detection

For a 10k word document:

  • Traditional chunking: ~50ms

  • Semantic chunking: ~2.3s

But here's the thing - those extra 2 seconds at indexing time save me hours of debugging why my RAG system returns garbage.
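
Your numbers will vary with document length and embedding latency, so it's worth timing it on your own corpus. Something as simple as this works:

import time

start = time.perf_counter()
chunks = semantic_chunker.split_text(document)
print(f"semantic chunking: {time.perf_counter() - start:.2f}s for {len(chunks)} chunks")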

The Bottom Line

Semantic chunking isn't a silver bullet, but it's damn close to solving one of RAG's biggest problems. If you're building anything where context matters, you need to try this.

The code is all there in LangChain's experimental repo. Greg Kamradt deserves all the credit for the original implementation - I'm just the guy who got obsessed with making it work in production.


Building RAG systems that actually work, one semantic chunk at a time. Follow my journey at @DEVunderdog

Connect With Me: X
