Building a Semantic Highlighter: Understanding Search Result Presentation Through Machine Learning - Part 1


Part 1: Why Highlighting Matters and How Machines Learn Relevance
The Problem: Finding Needles in Haystacks
Imagine you're searching for "how to prevent memory leaks in Python" and you get back a 5,000-word article about Python programming. Somewhere in that article are two crucial paragraphs about memory management, but they're buried among sections about syntax, data structures, and web frameworks.
This is the fundamental problem search engines face: retrieval gets you to the right document, but presentation gets you to the right information.
Traditional search highlighting is like using Ctrl+F - it finds exact matches:
Search: "memory leaks Python"
Highlights: "...avoid memory leaks in Python..."
But what if the relevant section says "garbage collection fails to free unused objects"? No keyword matches, yet this is exactly what you need to know.
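To make the limitation concrete, here is a minimal sketch of a Ctrl+F-style highlighter (the helper name and the <em> markup are illustrative):

import re

def keyword_highlight(query, text):
    """Wrap exact query-term matches in <em> tags, Ctrl+F style."""
    terms = [re.escape(term) for term in query.split()]
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", text)

print(keyword_highlight("memory leaks Python",
                        "Garbage collection fails to free unused objects."))
# Prints the sentence unchanged: no query term appears, so nothing gets highlighted.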
The Evolution of Search Understanding
Let's trace how search has evolved:
Keyword Matching Era: Find documents containing exact words
Semantic Search Era: Find documents with similar meaning
Semantic Presentation Era: Show the parts that actually answer your question
We've largely solved #1 and #2. This series is about solving #3.
Why is Semantic Highlighting Hard?
Consider this query and document:
Query: "What causes fatigue?"
Document:
"Regular exercise improves energy levels throughout the day. Poor sleep patterns disrupt your body's natural rhythm. Nutritional deficiencies, particularly iron and B12, can leave you feeling drained. Chronic stress triggers hormonal changes that affect energy metabolism. Dehydration reduces blood volume, making your heart work harder."
A keyword highlighter would find nothing - the word "fatigue" doesn't appear. But a human instantly recognizes that several sentences directly answer the question. How do we teach a machine this kind of understanding?
The Machine Learning Approach: Teaching Computers to Read Like Humans
The key insight is that relevance is not about word matching - it's about meaning alignment. We need a system that understands:
What the query is asking for (intent)
What each sentence is talking about (content)
Whether they align (relevance)
Here's our strategy:
Query + Document → Understanding → Sentence Relevance Scores
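A simple way to approximate meaning alignment is to embed the query and each sentence into the same vector space and compare them. Here is a sketch using the sentence-transformers library, applied to the fatigue example (the checkpoint is one common choice, not a requirement):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

query = "What causes fatigue?"
sentences = [
    "Regular exercise improves energy levels throughout the day.",
    "Nutritional deficiencies, particularly iron and B12, can leave you feeling drained.",
]

query_emb = model.encode(query)
sentence_embs = model.encode(sentences)
scores = util.cos_sim(query_emb, sentence_embs)  # cosine similarity per sentence

for sentence, score in zip(sentences, scores[0]):
    print(f"{score.item():.2f}  {sentence}")
# Expect the "feeling drained" sentence to score higher, despite sharing no keywords.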
But there's a crucial design decision: how do we process the document?
The Architectural Choice: Why Context Matters
We could score each sentence independently:
# Naive approach: score each sentence with no surrounding context
for sentence in document:
    score = is_relevant(query, sentence)  # hypothetical scoring function
But this misses crucial context. Consider:
"The process requires three steps. First, you initialize the connection. Second, you send the authentication token. Third, you handle the response."
The final sentence, "Third, you handle the response," tells us nothing on its own. Its relevance depends entirely on what "the process" refers to, which comes from the earlier sentences.
Our Solution: Hierarchical Understanding
Instead of processing sentences in isolation, we follow five steps (a code sketch follows the list):
Read the entire query+document together (like a human would)
Build understanding of each token in context
Aggregate tokens into sentence representations
Classify each sentence as relevant/not relevant
Apply smart fallbacks to ensure quality results
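Here is a minimal, untrained sketch of steps 1-4 using Hugging Face transformers. The checkpoint, mean pooling, and linear head are illustrative assumptions, not a finished design:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
encoder = AutoModel.from_pretrained("bert-base-uncased")

query = "What causes fatigue?"
sentences = [
    "Regular exercise improves energy levels throughout the day.",
    "Nutritional deficiencies can leave you feeling drained.",
]
document = " ".join(sentences)

# Step 1: read the entire query+document together as one input pair
enc = tokenizer(query, document, return_tensors="pt",
                return_offsets_mapping=True, truncation=True)
offsets = enc.pop("offset_mapping")[0].tolist()
seq_ids = enc.sequence_ids(0)  # None = special token, 0 = query, 1 = document

# Step 2: build an understanding of each token in context
with torch.no_grad():
    token_states = encoder(**enc).last_hidden_state[0]

# Step 3: aggregate tokens into sentence representations (mean pooling)
sentence_vectors, start = [], 0
for sent in sentences:
    end = start + len(sent)
    idx = [i for i, (sid, (a, b)) in enumerate(zip(seq_ids, offsets))
           if sid == 1 and a < end and b > start]
    sentence_vectors.append(token_states[idx].mean(dim=0))
    start = end + 1  # +1 for the joining space

# Step 4: classify each sentence as relevant / not relevant (untrained head)
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)
logits = classifier(torch.stack(sentence_vectors))  # shape: [num_sentences, 2]

Step 5, the fallback logic, is covered below under "Confidence-Based Backoff."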
This mirrors how humans read:
We don't evaluate sentences in isolation
We build understanding progressively
We use context to resolve ambiguity
The BERT Revolution: Context-Aware Reading
BERT (Bidirectional Encoder Representations from Transformers) gave us the ability to understand text in context. Think of it as teaching a computer to read every word while being aware of every other word.
Traditional processing reads left-to-right:
"The bank is steep" → bank (financial?) → is → steep (expensive?)
BERT reads with full context:
"The [bank] is steep" → bank (riverbank, because steep describes physical gradient)
For our highlighter, this means when BERT sees "Third, you handle the response," it knows what process is being discussed because it's simultaneously attending to all previous sentences.
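You can observe this context sensitivity directly: the same word gets a different vector depending on its neighbors. A small experiment, loading the same illustrative checkpoint as above:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    """Contextual embedding of the first occurrence of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        states = encoder(**enc).last_hidden_state[0]
    position = enc["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
    return states[position]

river = word_vector("The bank of the river is steep.", "bank")
loan = word_vector("The bank approved my loan.", "bank")
rates = word_vector("The bank raised its interest rates.", "bank")

cosine = torch.nn.functional.cosine_similarity
print(cosine(river, loan, dim=0).item())  # cross-sense: typically lower
print(cosine(loan, rates, dim=0).item())  # same sense: typically higher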
The Training Process: Learning from Examples
How do we teach this system? Through examples:
Training Example 1:
Query: "symptoms of dehydration"
Document: "Common signs include dark urine, dizziness, and dry mouth..."
Labels: [Sentence 1: Relevant]
Training Example 2:
Query: "symptoms of dehydration"
Document: "Water is essential for life. The human body is 60% water..."
Labels: [Sentence 1: Not relevant, Sentence 2: Not relevant]
The model learns patterns:
Medical queries + symptom lists = relevant
General facts without specific symptoms = not relevant
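As a sketch, one training step could look like the following. It assumes a hypothetical sentence_vectors(query, sentences) helper that wraps the pooling logic from the pipeline sketch above (without the no_grad block, since training needs gradients); the optimizer and learning rate are illustrative:

import torch

# Toy labeled data in the shape shown above: (query, sentences, per-sentence labels)
examples = [
    ("symptoms of dehydration",
     ["Common signs include dark urine, dizziness, and dry mouth."],
     [1]),
    ("symptoms of dehydration",
     ["Water is essential for life.", "The human body is 60% water."],
     [0, 0]),
]

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()), lr=2e-5)

for query, sentences, labels in examples:
    logits = classifier(sentence_vectors(query, sentences))  # [num_sentences, 2]
    loss = loss_fn(logits, torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()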
The Intelligence in the Details
Two clever design choices make this system robust:
1. Confidence-Based Backoff
Sometimes no sentence clears the high-confidence threshold. Rather than showing nothing (a bad user experience), we fall back to the single best sentence, provided it meets a lower minimum threshold. It's like saying "I'm not certain, but this seems most relevant."
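In code, the backoff could look like this (both thresholds are illustrative placeholders):

def pick_highlights(scores, confident=0.5, floor=0.2):
    """Return indices of sentences to highlight, with a best-effort fallback."""
    confident_hits = [i for i, s in enumerate(scores) if s >= confident]
    if confident_hits:
        return confident_hits
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [best] if scores[best] >= floor else []

print(pick_highlights([0.10, 0.30, 0.15]))  # no confident hit -> falls back to [1]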
2. Sliding Window Processing
Documents can be longer than BERT's 512-token limit. We use overlapping windows:
Window 1: [Sentences 1-5]
Window 2: [Sentences 4-8] # Overlap ensures context preservation
Window 3: [Sentences 7-10]
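A sketch of the windowing; the size and stride here reproduce the example above, and when a sentence appears in multiple windows, one reasonable merge is to keep its maximum score:

def sliding_windows(sentences, size=5, stride=3):
    """Return overlapping chunks so no sentence loses its surrounding context."""
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(sentences[start:start + size])
        if start + size >= len(sentences):
            break
    return windows

print(sliding_windows(list(range(1, 11)), size=5, stride=3))
# [[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [7, 8, 9, 10]]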
Putting It All Together
Our semantic highlighter represents a fundamental shift in how we think about search results:
Old way: Find documents, highlight keyword matches
New way: Find documents, understand content, show what answers the query
This isn't just about better highlighting - it's about bridging the gap between finding information and understanding information.
We'll continue our discussion with more code examples in Part 2.