Building NeuroStash - VI

Raw thoughts from the trenches of building better retrieval systems
The Problem That Kept Me Up at Night
You know that feeling when your RAG system returns technically correct but completely useless chunks? Yeah, I was there. My traditional text splitters were butchering documents like a blunt knife through steak - creating arbitrary 512-token chunks that split sentences mid-thought, separated context from conclusions, and generally made my retrieval system feel like it had amnesia.
I was getting chunks like this:
Chunk 1: "...the economic impact was severe. Inflation rose to"
Chunk 2: "15% by the end of the quarter, causing widespread"
Chunk 3: "concern among investors and policymakers who..."
Useless. Absolutely useless.
Enter Semantic Chunking
While diving deep into LangChain's experimental modules, I stumbled upon something that made me question everything I thought I knew about text splitting. The SemanticChunker - a text splitter that actually understands what it's reading.
Instead of counting tokens like a robot, it uses embeddings to measure semantic similarity between sentences. Mind = blown.
How This Beast Actually Works
Let me break down the magic happening under the hood:
Step 1: Sentence Buffering (The Context Window Trick)
from typing import List

def combine_sentences(sentences: List[dict], buffer_size: int = 1) -> List[dict]:
    for i in range(len(sentences)):
        combined_sentence = ""
        # Add context BEFORE the current sentence
        for j in range(i - buffer_size, i):
            if j >= 0:
                combined_sentence += sentences[j]["sentence"] + " "
        # Add the current sentence itself
        combined_sentence += sentences[i]["sentence"]
        # Add context AFTER the current sentence
        for j in range(i + 1, i + 1 + buffer_size):
            if j < len(sentences):
                combined_sentence += " " + sentences[j]["sentence"]
        sentences[i]["combined_sentence"] = combined_sentence
    return sentences
This is brilliant. Instead of treating each sentence in isolation, it creates a sliding window of context. So instead of just embedding "The results were unexpected.", it embeds "The experiment ran for 6 months. The results were unexpected. Further analysis was needed."
Context matters. Always.
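Here's a quick sketch of what that buffering actually produces, using a few made-up sentences and buffer_size=1:

    # Hypothetical input - three bare sentences in the dict shape the function expects
    sentences = [
        {"sentence": "The experiment ran for 6 months."},
        {"sentence": "The results were unexpected."},
        {"sentence": "Further analysis was needed."},
    ]

    buffered = combine_sentences(sentences, buffer_size=1)
    # The middle entry now carries its neighbors as context:
    # "The experiment ran for 6 months. The results were unexpected. Further analysis was needed."
    print(buffered[1]["combined_sentence"])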
Step 2: Semantic Distance Calculation (Where the Magic Happens)
from typing import List, Tuple

from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_distances(sentences: List[dict]) -> Tuple[List[float], List[dict]]:
    distances = []
    for i in range(len(sentences) - 1):
        embedding_current = sentences[i]["combined_sentence_embedding"]
        embedding_next = sentences[i + 1]["combined_sentence_embedding"]
        # Cosine similarity between consecutive context-window embeddings
        similarity = cosine_similarity([embedding_current], [embedding_next])[0][0]
        # Convert to distance (higher = more different)
        distance = 1 - similarity
        distances.append(distance)
        sentences[i]["distance_to_next"] = distance
    return distances, sentences
Here's where it gets interesting. The system calculates how semantically different consecutive sentences are. High distance = topic shift. Low distance = same topic.
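One piece of glue the snippets above assume: each combined sentence needs an embedding attached before the distance pass can run. A minimal sketch of that step, assuming OpenAIEmbeddings (the helper name is mine, swap in whatever embedding model you actually use):

    from typing import List

    from langchain_openai import OpenAIEmbeddings

    # Hypothetical helper - not part of LangChain, just the glue between Step 1 and Step 2
    def embed_combined_sentences(sentences: List[dict]) -> List[dict]:
        embeddings = OpenAIEmbeddings()
        # Embed every context window in one batched call
        vectors = embeddings.embed_documents(
            [s["combined_sentence"] for s in sentences]
        )
        for sentence, vector in zip(sentences, vectors):
            sentence["combined_sentence_embedding"] = vector
        return sentences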
Step 3: Smart Breakpoint Detection
The system offers multiple strategies to decide where to split:
import numpy as np

# Percentile approach - split at the top 5% most different transitions
breakpoint_threshold = np.percentile(distances, 95)

# Standard deviation - split when distance is 3 std devs above the mean
breakpoint_threshold = np.mean(distances) + 3 * np.std(distances)

# Gradient approach - find where the rate of change spikes
distance_gradient = np.gradient(distances)
breakpoint_threshold = np.percentile(distance_gradient, 95)
I've been experimenting with all three, and honestly? Percentile at 95% has been my sweet spot for technical documentation.
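For completeness, here's roughly how a threshold turns into actual chunks - a simplified sketch of the assembly step, not LangChain's exact code:

    from typing import List

    # Simplified assembly logic; the real implementation handles more edge cases
    def build_chunks(sentences: List[dict], distances: List[float], threshold: float) -> List[str]:
        # Indices where the semantic distance jumps above the threshold
        breakpoints = [i for i, d in enumerate(distances) if d > threshold]

        chunks = []
        start = 0
        for index in breakpoints:
            # Group everything up to (and including) the breakpoint sentence
            group = sentences[start : index + 1]
            chunks.append(" ".join(s["sentence"] for s in group))
            start = index + 1
        # Whatever is left after the last breakpoint becomes the final chunk
        if start < len(sentences):
            chunks.append(" ".join(s["sentence"] for s in sentences[start:]))
        return chunks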
Real-World Impact on My RAG Pipeline
Here's what changed when I swapped out my old chunker:
Before (Traditional Chunking):
# Old approach - dumb but fast
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
)
chunks = text_splitter.split_text(document)
After (Semantic Chunking):
# New approach - smart and contextual
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

semantic_chunker = SemanticChunker(
    embeddings=OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
    buffer_size=1,
    min_chunk_size=100,
)
chunks = semantic_chunker.split_text(document)
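From there the chunks drop straight into the rest of the pipeline. A sketch of what that looks like on my end, assuming a FAISS index (any vector store works the same way):

    from langchain_community.vectorstores import FAISS
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings()
    # Index the semantic chunks, then pull back the closest ones at query time
    vector_store = FAISS.from_texts(chunks, embeddings)
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    results = retriever.invoke("What happened to inflation by the end of the quarter?")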
The Performance Trade-off (Because Nothing's Free)
Let's be real - this isn't free. Semantic chunking requires:
Embedding generation for every sentence group
Cosine similarity calculations
Statistical analysis for breakpoint detection
For a 10k word document:
Traditional chunking: ~50ms
Semantic chunking: ~2.3s
But here's the thing - those extra 2 seconds at indexing time save me hours of debugging why my RAG system returns garbage.
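If you want to sanity-check the trade-off on your own corpus, a rough timing harness is all it takes (hypothetical helper that reuses the text_splitter, semantic_chunker, and document from above; your numbers will vary with document size and embedding model):

    import time

    # Rough harness - wall-clock time for a single split call
    def time_chunker(split_fn, document: str) -> float:
        start = time.perf_counter()
        split_fn(document)
        return time.perf_counter() - start

    print(f"traditional: {time_chunker(text_splitter.split_text, document):.3f}s")
    print(f"semantic:    {time_chunker(semantic_chunker.split_text, document):.3f}s")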
The Bottom Line
Semantic chunking isn't a silver bullet, but it's damn close to solving one of RAG's biggest problems. If you're building anything where context matters, you need to try this.
The code is all there in LangChain's experimental repo. Greg Kamradt deserves all the credit for the original implementation - I'm just the guy who got obsessed with making it work in production.
Building RAG systems that actually work, one semantic chunk at a time. Follow my journey at @DEVunderdog