Inverted Indexing: The Core Pattern Behind Fast Search

Atharv Singh
14 min read
💡
TL;DR: GitHub's old code search was slow and expensive because it used Elasticsearch, a tool built for human language, not code. It mangled syntax and took months to re-index. GitHub replaced it with a custom-built search engine using a code-specific inverted index and tokenizer, enabling faster, more precise search and far cheaper re-indexing.

A Quick Look Back

In our last post, we discussed the seemingly bizarre decision by GitHub to skip debouncing on their new code search input. We ended on a cliffhanger: the only way that choice makes sense is if the backend is so absurdly fast that it can handle the firehose of queries without breaking a sweat.

But that just raises a bigger question. In a world of instant everything, why did searching for code on the world's largest code host feel broken for over a decade?

Searching for a common English word like "turbulence" across the entire internet is trivial. Yet searching for a piece of syntax like -> or a variable named _ across a codebase (a much smaller dataset) was an exercise in futility. This wasn't a bug; it was a fundamental mismatch between the tool and the job. To understand how GitHub built a search engine that could finally keep up, we first need to perform an autopsy on the one that couldn't.

The Naïve LIKE '%...%' Dream and Its Collapse

We're all familiar with SQL queries. When we need to search for something, what do we usually do?

We’ve got a table of files, we need to find some text inside them. Easy, right?

SELECT * FROM code_files WHERE content LIKE '%my_function%';
-- This is a full table scan. It reads every single byte of every row.
-- Now imagine this table is 115 Terabytes. Go make some coffee. A lot of coffee.

This is a full table scan. It’s the digital equivalent of reading every book in a library, from cover to cover, every single time someone asks for a word. It works for your first 1,000 blog posts. At GitHub’s scale, with a corpus that grew to 115 TB of code, it’s a declaration of “bankruptcy”.

This brute-force approach is obviously a non-starter. So, the GitHub engineers did what any sane team would do. They reached for a real, industrial-strength search engine. They tried Solr. They tried Elasticsearch. And for years, it was a disaster.

Yeah, about that.

For nearly 14 years, if you used GitHub search, you probably had some complaints. The journey from early solutions like Apache Solr to a massive Elasticsearch cluster was paved with good intentions and crippling technical debt. The core problem wasn't just that it was slow; it was the staggering cost of change.

When the team first deployed Elasticsearch, it took months to index the ~8 million repositories on the platform at the time. This wasn't a one-time setup fee. Any significant schema change, any software upgrade, any tweak to the indexing logic meant paying that price all over again.

This created what the engineers called a "negatively reinforcing feedback cycle". The system was so expensive and slow to change (requiring "hundreds of physical servers" and months of compute time for a full re-index) that engineers became terrified of touching it. Innovation ground to a halt. Experimentation was a death sentence. In a talk at Strange Loop 2023, GitHub engineer Luke Francl mentioned that the only time they could justify a full re-index was during a catastrophic event, like a planned data center evacuation. Think about that: the cost of improving the system was so high that it was only feasible when they were already moving house due to an impending disaster.

Faced with a system too brittle to improve and too expensive to run, the team made a painful but rational choice: they started indexing less data. They created a waitlist for developers to get their code indexed. The core promise of the product—search all your code—was being deliberately degraded just to keep the lights on. This wasn't a technical optimization; it was a surrender. It was the clearest possible signal that the existing architecture had fundamentally failed.

But it didn't fail because Elasticsearch is a bad product. It's a phenomenal tool for what it was designed for: full-text search of human-readable documents. It failed because code isn't a human-readable document. To understand why, we need to go back to basics and look at how search engines actually think.

Search 101: The Librarian and the Inverted Index

Let's stick with our librarian analogy. The brute-force grep or LIKE query is asking the librarian to read every book in the library for every query. Insane.

In 2025, you'll find every major search engine built on the same core data structure: the inverted index.

To understand why, we first need to look at how searching works in the first place.

The basic problem with naive search is that it's a linear scan: every query reads the entire corpus from start to finish, which is hopelessly slow at scale. An inverted index flips the mapping around. Instead of going from each document to the words inside it, it maps each term to the list of documents that contain it.

In its simplest form, it looks like a JSON object. To find documents containing "turbulence," you just grab the list for that key.

JSON

{
  "turbulence": [1, 5],
  "engine": [1],
  "rust": [5, 12]
}
// To find "turbulence AND rust", you intersect [1, 5] and [5, 12] -> [5].
// Why it matters: This is set logic, which is computationally cheap, instead of scanning terabytes of text.

The magic here is that we've transformed a text-scanning problem into a set-logic problem. Intersecting two lists of integers is orders of magnitude faster than reading terabytes of source code. This is the core trade-off of all search engines: you do a massive amount of work upfront (at index time) to make every subsequent query (at query time) incredibly cheap. For a platform like GitHub, where code is read far more often than it's written, this is the only bet that makes sense.

💁🏽‍♂️ PS: Some IRE (Information Retrieval & Extraction) terms

Term → a unique word/token extracted from the text

Corpus → the set of documents to be searched by the engine

Dictionary → the list of all unique terms

Posting List → the list of document IDs in which a given term appears

Stopwords → extremely common words (the, is, on) that appear in almost every doc

How Do You Build These Terms? 🧠

  1. Break the text into words/tokens.

  2. Normalize: lowercase, strip punctuation, apply stemming/lemmatization.

  3. Remove stopwords (they act as noise when matching).

  4. Keep updating the posting list for each term as you go (see the sketch below).
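To make that pipeline concrete, here's a minimal sketch in Python. The stopword list and the sample documents are made up for illustration; real engines use far more sophisticated normalization.

Python

from collections import defaultdict

STOPWORDS = {"the", "is", "on", "a", "of"}  # tiny illustrative list

def tokenize(text: str) -> list[str]:
    # Steps 1 + 2: split on whitespace, strip punctuation, lowercase.
    return [t.strip(".,();=").lower() for t in text.split() if t.strip(".,();=")]

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(tokenize(text))):
            if term in STOPWORDS:       # Step 3: drop stopwords
                continue
            index[term].append(doc_id)  # Step 4: update the posting list
    return index

docs = {
    1: "The turbulence shook the engine",
    5: "turbulence in the rust belt, turbulence everywhere",
    12: "rust is a systems language",
}
print(build_inverted_index(docs)["turbulence"])  # -> [1, 5]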

What's Inside a Posting List?

For code search, however, just knowing that a term is in a file isn't enough. We need to know where it is, how many times it appears, and other metadata to rank results and show helpful snippets. This leads to a richer posting list.

📒 On the surface level → just a list of doc IDs.

A slightly more advanced one also stores, for each document: how many times the term appears, its word positions, and its character offsets.

What's Inside a Dead Simple Lookup?

Query: "fish AND wall"

  1. Fetch the posting lists for "fish" and "wall".

  2. Take the set intersection to get the docs that contain both.

From there you can apply Boolean algebra (OR, NOT, etc.) or relax the constraints (ranking partial matches, for instance). A sketch of the AND case follows below.
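As a rough sketch (reusing the toy index from above, and assuming posting lists are kept sorted), the AND part is just a linear merge of two sorted integer lists:

Python

def intersect(a: list[int], b: list[int]) -> list[int]:
    # Merge-style intersection of two sorted posting lists.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# "turbulence AND rust" on the toy index from earlier
print(intersect([1, 5], [5, 12]))  # -> [5]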

Now, What Optimization Techniques Are Used?

  • Keep posting lists sorted → faster merges (intersecting two sorted lists is a single linear pass, as in the sketch above).

  • Compression → delta encoding and varints, so lists store small gaps between doc IDs instead of large absolute IDs.

  • Tiered index → keep the top-n most important docs for each term in memory, and the rest on disk.

And many others. (A quick sketch of delta encoding follows below.)
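Here's a minimal sketch of the delta-encoding idea (the varint byte packing is left out; the point is that storing gaps turns big doc IDs into small, highly compressible numbers):

Python

def delta_encode(postings: list[int]) -> list[int]:
    # Store the gap from the previous doc ID instead of the ID itself.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def delta_decode(gaps: list[int]) -> list[int]:
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

postings = [1000, 1003, 1010, 1500, 1502]
gaps = delta_encode(postings)  # [1000, 3, 7, 490, 2] -- small numbers, few varint bytes
assert delta_decode(gaps) == postings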

After that, finding a word is a simple, fast lookup.

Take the term "turbulence": a rich posting list can record that it appears in document 1 once, as the 6th word, at character offset 28, and that it appears in document 5 twice. This pre-computed information is what allows a search engine to not just find results, but to find the best results, quickly.
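A minimal sketch of what one of these richer posting entries might look like in memory (the field names and the document 5 values are illustrative, not GitHub's actual schema):

Python

from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: int
    term_frequency: int   # how many times the term appears in this doc
    positions: list[int]  # word positions within the doc
    offsets: list[int]    # character offsets, for highlighting snippets

# "turbulence": once in doc 1 (6th word, offset 28), twice in doc 5
postings_for_turbulence = [
    Posting(doc_id=1, term_frequency=1, positions=[6], offsets=[28]),
    Posting(doc_id=5, term_frequency=2, positions=[2, 9], offsets=[10, 47]),
]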

So, we have a map from "terms" to "documents." Simple enough. But... what exactly is a "term" when the document is source code?

Understanding Terms in Search: Tokenization and N-grams

Here’s where things get sticky for general-purpose search engines. The process of breaking down a document into a list of "terms" is called tokenization. A standard tokenizer, like the one in Elasticsearch, is built for prose. It splits text on whitespace and punctuation, lowercases everything, and throws away common "stop words" like "the" or "is".

For an English sentence, this is great. For a line of code, it's a massacre.

const new_user = new User();

A standard tokenizer might see this and produce the tokens: const, new, user. It throws away the = and (). It might even stem user to use. The original syntax, the very structure that gives the code its meaning, is annihilated.
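To see the damage, here's a rough stand-in for a prose-style analyzer (a deliberately simplified sketch, not Elasticsearch's actual tokenizer) applied to that line:

Python

import re

line = "const new_user = new User();"

def prose_tokenize(text: str) -> list[str]:
    # Split on anything that isn't a letter or digit, then lowercase.
    return [t for t in re.split(r"[^a-zA-Z0-9]+", text.lower()) if t]

print(prose_tokenize(line))
# ['const', 'new', 'user', 'new', 'user']
# The '=', '(', ')' and even the '_' are gone: the identifier new_user
# no longer exists as a searchable term.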

This is the central reason Elasticsearch was the wrong tool. Code search has unique requirements:

  • You want to search for punctuation like . or (.

  • You don't want stemming. running and run are likely two different variables.

  • You don't want stop words removed. is could be part of a function name.

So, how do you index code while preserving its structure for substring searches, like finding lim inside of limits? The answer is n-grams.

An n-gram is simply a sequence of n characters. For code search (Google's old Code Search popularized the approach), the sweet spot is typically trigrams (n = 3).

Instead of breaking limits into one token, you break it into a sliding window of all its 3-character substrings.

The word "limits" is tokenized into four overlapping trigrams: "lim", "imi", "mit", and "its".

To find the exact string "limits", the search engine doesn't look for a single term. It executes a query for documents that contain "lim" AND "imi" AND "mit" AND "its". By intersecting the posting lists for all four trigrams, it gets a list of candidate documents that are highly likely to contain the full string.
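A minimal sketch of trigram extraction and the resulting AND query (reusing the posting-list intersection idea from earlier; the index here is assumed to map each trigram to a set of doc IDs):

Python

def trigrams(s: str) -> list[str]:
    # Sliding window of all 3-character substrings (assumes len(s) >= 3).
    return [s[i:i + 3] for i in range(len(s) - 2)]

print(trigrams("limits"))  # -> ['lim', 'imi', 'mit', 'its']

def candidate_docs(query: str, index: dict[str, set[int]]) -> set[int]:
    # Docs containing every trigram of the query -- may include false positives,
    # so the exact match still has to be verified on this smaller set.
    grams = trigrams(query)
    result = set(index.get(grams[0], set()))
    for g in grams[1:]:
        result &= index.get(g, set())
    return result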

This approach has a crucial side effect: it introduces the possibility of false positives. A file could contain the words "limousine," "imitate," "transmit," and "spits." This file would contain all the necessary trigrams and would be returned as a candidate match. The system accepts this trade-off. The trigram index acts as a massive, fast filter to reduce a corpus of billions of documents down to a few thousand candidates. The final, more expensive step of verifying the exact match is then only performed on this much smaller set. It's a classic engineering compromise: sacrifice perfect precision in the cheap filtering stage to make the overall query feasible at scale.

The 64,000x Latency Flip

Now that we have the core concepts—the inverted index and trigram tokenization—we can truly appreciate the night-and-day difference between a brute-force approach and an indexed one. Let's put some hard numbers to it.

Based on napkin math shared by GitHub's engineers, running a grep-like search across their 115 TB corpus on a 32-machine cluster would saturate 2,048 CPU cores for 96 seconds just to serve a single query. This works out to a miserable 0.01 queries per second (QPS). Everyone else has to get in line.

Contrast that with an indexed system like their new engine, Blackbird. A typical query ties up just one CPU core for about 100 milliseconds. A single 64-core host can therefore handle an upper bound of around 640 QPS.

Let's put that in a table to let the numbers sink in.

Method              | Time to First Result | Queries Per Second (QPS) | Resource Cost (per query)
Brute-Force (grep)  | 96 seconds           | ~0.01                    | 2,048 cores saturated
Indexed Search      | ~100 milliseconds    | ~640 (per host)          | 1 core for 100 ms

This isn't an incremental improvement; it's a phase change. The difference in throughput is a factor of over 64,000. One approach is fundamentally, mathematically broken at scale. The other is the only viable path forward. This is why indexing isn't optional; it's the physics of the problem.
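A quick sanity check of that factor, using only the figures above (napkin math, nothing more):

Python

brute_force_qps = 1 / 96             # one query monopolizes the whole cluster for 96 s
indexed_qps_per_host = 64 / 0.1      # 64 cores, each serving ~1 query per 100 ms = 640 QPS

print(indexed_qps_per_host / 0.01)             # 64,000x using the rounded ~0.01 QPS figure
print(indexed_qps_per_host / brute_force_qps)  # ~61,440x with the unrounded figure -- same ballpark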

The Three Cons of Elasticsearch-for-Code

So if indexing is the answer, why was GitHub's Elasticsearch-based indexed search so problematic? It comes down to three fatal flaws when applying a general-purpose text engine to the specialized domain of source code.

  • 1→ It's a Lexicographer, Not a Compiler

    Elasticsearch is a brilliant lexicographer. It's designed to understand the nuances of human language. But its features become bugs when applied to code. As one engineer put it, when searching code, "you don't want stemming or for stop words to be stripped from your query". Elasticsearch's default behavior does exactly that.

    • The old search flat-out ignored symbols. You couldn't use characters like . , : ; / \ ` ' " = * ! ? # $ & + ^ | ~ < > ( ) { } [ ] @ as part of your search query; the search would simply drop them.

    • It might reduce running to run, destroying the distinction between two valid variable names. It might discard is or a, which could be critical parts of a function name. And most damningly, it strips punctuation like . or (, the very syntactic lifeblood of code. It was fundamentally misinterpreting the documents it was trying to index.
  • 2→ The Re-indexing Treadmill from Hell

    As we covered, the cost of change was astronomical. The "months to re-index" statistic wasn't just an inconvenience; it was an anchor chained to the entire engineering organization. This created the "negatively reinforcing feedback cycle" where the system was too expensive and fragile to improve, so it never did. The team wasn't just fighting a slow system; they were fighting a system that actively resisted improvement.

  • 3→ Resource Wars and Strategic Retreat

    The old system was, in a word, expensive. It was "very very expensive to host" on hundreds of physical servers and was constantly competing for resources with other critical background tasks on GitHub's infrastructure. The cost and complexity became so overwhelming that the team was forced into a strategic retreat. They couldn't scale the system, so they scaled back the problem. They began indexing less code and put developers on a waitlist, a painful compromise that directly degraded the core product promise just to stay afloat.

The Core Problem: A Mountain of Redundant Data

First, let's reset the scene. The engineers were facing a monstrous task: indexing 115 terabytes (TB) of source code. But they realized a huge portion of that data was just the same stuff copied over and over again.

Think about it:

  • How many thousands of repositories contain the exact same mit-license.txt file?

  • How many projects have a node_modules folder with identical copies of popular libraries?

  • How many forks of a project are 99.9% the same as the original?

This is the redundancy problem. Indexing the same file thousands of times is a massive waste of time, storage, and money.

The Solution: How Git Thinks

The brilliant insight was that the solution wasn't some new, complex technology they had to invent. It was right there in the fundamental design of Git itself.

GitHub's engineers call this "content-addressable storage." That sounds complex, but the concept is simple. Git doesn't identify a file by its name (like my_file.js) or its location (like /src/components/). Instead, it calculates a unique fingerprint for the content of the file. This fingerprint is a SHA-1 hash.

Imagine you have two files:

  1. project-A/src/LICENSE

  2. project-B/docs/LICENSE-mit.txt

They have different names and live in different places. But if their content is the exact same text of the MIT license, Git gives them the exact same SHA-1 hash. The hash only cares about the bytes inside the file, not its name or location.
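You can verify this yourself. Git hashes a blob as SHA-1 over a tiny header ("blob <size>\0") followed by the raw bytes of the file, so identical content always yields an identical object ID, no matter where the file lives. A minimal sketch:

Python

import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Reproduces the object ID that `git hash-object` computes for a blob.
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

license_text = b"MIT License\n\nPermission is hereby granted, free of charge, ..."

# Same bytes -> same hash, regardless of file name or path.
assert git_blob_sha1(license_text) == git_blob_sha1(license_text)
print(git_blob_sha1(license_text))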

The "Aha!" Moment: Indexing Blobs, Not Files

This leads directly to the breakthrough. The engineers realized: "Why are we trying to index files when we could be indexing unique content (blobs)?"

By shifting their focus from the file path to the content hash, they could attack the redundancy problem head-on.

The impact was staggering:

  • Total code on GitHub: 115 TB

  • Unique code (blobs) after deduplication: 28 TB

That's a 75% reduction in the size of the problem. They made the task four times smaller before writing a single line of the new search engine code. It's a massive win.

A Perfect Way to Split the Work (Sharding)

This insight also solved another huge problem: how to split the work across many servers. This process is called sharding.

  • The Bad Way (sharding by repository): If you split the work by repository, some servers would get almost no work, while the server responsible for a massive, popular repository (like the Linux kernel) would be constantly overloaded. This creates "hot shards" or bottlenecks.

  • The Elegant Way (sharding by blob SHA): Cryptographic hashes (like SHA-1) are designed to be random and uniformly distributed. If you use the hash to decide which server a piece of content goes on, the work gets spread out evenly across the entire cluster. No single server gets overloaded.
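A minimal sketch of the idea (the shard count and the simple modulo assignment are illustrative; real systems often use fancier schemes so they can reshard later):

Python

NUM_SHARDS = 32  # illustrative cluster size

def shard_for_blob(blob_sha1_hex: str) -> int:
    # Uniformly distributed hashes -> uniformly distributed load across shards.
    return int(blob_sha1_hex, 16) % NUM_SHARDS

# Two identical LICENSE blobs in different repos share a SHA, land on the
# same shard, and only need to be indexed once.
print(shard_for_blob("a94a8fe5ccb19ba61c4c0873d391e987982fbbd3"))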


Bridge to Blackbird: The Stage is Set

We’ve now seen why the old system was doomed to fail and have laid out the three core primitives of a modern code search engine: the inverted index for speed, n-grams for substring matching, and content-based sharding for deduplication and load balancing.

These are the necessary building blocks, but they aren't sufficient. They solve the problem of searching a massive, static collection of code. But GitHub is anything but static.

The next post will introduce Blackbird, the bespoke engine GitHub built from scratch to solve the even harder dynamic problems. How do you ingest a constant firehose of pushes and update the index in near real-time? How do you guarantee that search results are commit-consistent, ensuring you never see a Frankenstein's monster of old and new files from a single repo? And how did choosing Rust as the implementation language provide the "fearless concurrency" and raw performance needed to build a system this ambitious?

Stay tuned for Post 3, where we finally meet the bird.


👇 This post is part of a short 3-part series:

🧵 Teaser: The Tweet That Triggered Everything

  1. ✅ Post 1: I Didn’t Know What Debouncing Was ~ Until GitHub Made Me Question Reality

  2. Post 2: You’re reading it!

  3. 🔮 Post 3: Inside Blackbird: GitHub’s Rust-Powered Escape from Elasticsearch (Coming soon)
