Inside Blackbird: GitHub’s Rust-Powered Escape from Elasticsearch

Table of contents
- The Wrong Tool for the Job
- Rube Goldberg Machine » Old Architecture
- The Strategic Pivot: Why Rust?
- The Blackbird Architecture
- Blueprints for Core Innovations
- Okay, we solved the theory and the scale! But how do you keep evolving a system like this without shooting yourself in the foot?
- The Payoff: Victory by the Numbers
- The Bird Has Flown

The Wrong Tool for the Job
Imagine trying to find a single, specific grain of sand on every beach on Earth, simultaneously. That’s code search at GitHub scale, and GitHub’s old approach was basically to hire a million people with buckets and wish them luck.
We all use off-the-shelf tools like Elasticsearch. They’re great, until they’re not. At GitHub's scale, treating code like web text created a cascade of failures: spiraling costs, degraded performance, and an architectural bottleneck that blocked innovation. The pain forced the team to go back to first principles and build a custom engine in Rust, a move that not only solved the problem but gave GitHub a strategic advantage in engineering velocity.
So, in early 2020, they kicked off a very promising research prototype search engine, internally code-named Blackbird.
Rube Goldberg Machine » Old Architecture
This is an example of a Rube Goldberg machine: a deliberately over-complicated contraption designed to perform a simple task in a very indirect way.
GitHub’s old system was a beautifully complex, yet fundamentally wrong, Rube Goldberg machine. When a developer pushed code, the change would trigger an event that landed in a Redis queue. From there, a massive fleet of workers would pick up the job, pull the data, process it, and shove it into an Elasticsearch cluster.
This is a classic "brute force" scaling pattern. It's simple to conceptualize but operationally nightmarish. The architecture's fatal flaw was treating code as flat text and sharding by repository. This created massive hot shards for popular projects like the Linux kernel or Rails, leading to severe load imbalances and contention. Elasticsearch, a phenomenal general-purpose tool, was being asked to perform a job it was never designed for, leading to performance challenges and operational strain.
The core failure wasn't just technical debt; it was a philosophical mismatch. Code isn't just text. It’s a deeply structured, highly duplicated medium with its own rules.
- Punctuation is paramount. Searching for `.` or `(` isn't noise; it's critical syntax.
- Stemming is destructive. We need to find the word `walking`, not its root `walk`.
- Stop words are significant. Words like `for` or `if` can't be stripped from queries.
Trying to configure a generic engine to respect these domain-specific rules is like trying to teach a fish to climb a tree. You can build a very elaborate ladder, but you're fighting gravity the whole way. The architectural problems (the hot shards, the spiraling costs) were not independent failures. They were symptoms of this deeper, underlying domain mismatch. The decision to shard by repository, for example, makes perfect sense for generic documents but is catastrophic for code, where repository popularity follows a power-law distribution. This causal link explains why the old system was doomed to fail at scale.
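To make the mismatch concrete, here is a toy Rust sketch of what a web-style text analyzer does to a code query. The analyzer below is a deliberate caricature (hard-coded stop words, crude suffix stripping), not any real Elasticsearch configuration, but it shows how punctuation, stop words, and suffixes vanish before the query ever reaches the index.

```rust
// A caricature of a generic text analyzer applied to a code query.
fn generic_analyze(query: &str) -> Vec<String> {
    let stop_words = ["for", "the", "if", "or"];
    query
        .split(|c: char| !c.is_alphanumeric())          // punctuation is discarded
        .filter(|t| !t.is_empty() && !stop_words.contains(t))
        .map(|t| t.trim_end_matches("ing").to_string()) // crude "stemming"
        .collect()
}

fn main() {
    // The code-specific meaning is gone: `(`, `for`, and the `-ing`
    // suffix have all been stripped before the query reaches the index.
    println!("{:?}", generic_analyze("for (walking)")); // prints ["walk"]
}
```

A code search engine needs to keep exactly the information this pipeline throws away.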
The Strategic Pivot: Why Rust?
The decision to write Blackbird in Rust was not taken lightly. It was a deliberate choice rooted in the harsh lessons learned from operating systems at scale. When you are building a foundational piece of infrastructure that will serve millions of developers, you need a language that offers a trifecta of guarantees:
- Performance: Rust is blazingly fast and memory-efficient. With no runtime or garbage collector, it can power performance-critical services, giving the team the low-level control needed to squeeze every ounce of performance from the hardware.
- Reliability: This was the killer feature. Rust's rich type system and ownership model guarantee memory safety and thread safety at compile time. This eliminates entire categories of bugs (null pointer dereferences, data races, buffer overflows) that are a constant source of instability and security vulnerabilities in systems-level code.
- Productivity: Despite its power, Rust provides top-notch tooling, a friendly compiler with useful error messages, and an integrated package manager (Cargo) that makes development a joy, not a chore.
This was the bedrock upon which Blackbird was built.
The Blackbird Architecture
Blackbird's design can be understood as two distinct, highly optimized systems: one for ingesting and indexing the world's code, and another for executing queries against that index with millisecond latency.
Ingesting & Indexing the World’s Code
The ingest side of the system is an event-driven pipeline designed for massive throughput, efficiency, and consistency. It's a multi-stage, decoupled architecture that turns a chaotic stream of code changes into a perfectly ordered, searchable index.
- Event Sourcing with Kafka: The journey begins when an event, such as a git push, occurs on GitHub.com. This event is published to a Kafka topic. Using Kafka as the entry point provides a durable, scalable, and ordered log of all changes, acting as a crucial buffer between the main GitHub platform and the search infrastructure.
- The Crawlers: A fleet of Blackbird Ingest Crawlers consumes these events. Their job is to interact with the underlying Git storage and a dedicated Analysis service. This service is responsible for the heavy lifting of language detection, parsing code to extract symbols, and other semantic analyses.
- Document Creation and Buffering: The crawlers take the raw code and the extracted metadata and produce structured documents. These documents are then published to a second Kafka topic. This is a critical architectural decision: it decouples the crawling stage (which is often I/O-bound) from the indexing stage (which is CPU-bound), allowing each to scale independently and proceed at its own pace.
- The Indexer Shards: The final Blackbird Search Engine is composed of many individual shards. Each shard is a self-contained Rust process that consumes a dedicated partition from the second Kafka topic.
Note: The use of two separate Kafka topics is a key design choice. It separates the I/O-bound work of crawling (interacting with Git and other services) from the CPU-bound work of indexing. This allows each stage to scale independently and proceed at its own pace. Furthermore, the strict ordering guarantee of Kafka partitions is what allows Blackbird to provide commit-level consistency for queries: a search will not see partial results from a new commit until it has been fully processed.
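To make the shape of this pipeline concrete, here is a minimal, self-contained Rust sketch. It is an illustration only: std::sync::mpsc channels stand in for the two Kafka topics, single threads stand in for fleets of crawlers and indexer shards, and the PushEvent/Document types and helper functions are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical event and document types, for illustration only.
struct PushEvent {
    blob_oid: [u8; 20],
    content: String,
}

struct Document {
    blob_oid: [u8; 20],
    language: String,
    symbols: Vec<String>,
}

fn detect_language(_content: &str) -> String {
    "Ruby".to_string() // stand-in for the real Analysis service
}

fn extract_symbols(content: &str) -> Vec<String> {
    // Stand-in for real symbol extraction: grab Ruby-style method definitions.
    content
        .lines()
        .filter(|line| line.trim_start().starts_with("def "))
        .map(|line| line.trim().to_string())
        .collect()
}

fn index_document(doc: &Document) {
    println!(
        "indexed blob {:02x?} ({} symbols, {})",
        &doc.blob_oid[..4],
        doc.symbols.len(),
        doc.language
    );
}

fn main() {
    // "Topic 1": raw push events. "Topic 2": processed documents.
    let (event_tx, event_rx) = mpsc::channel::<PushEvent>();
    let (doc_tx, doc_rx) = mpsc::channel::<Document>();

    // Crawler stage: I/O-bound work (fetch content, call analysis).
    let crawler = thread::spawn(move || {
        for event in event_rx {
            let doc = Document {
                blob_oid: event.blob_oid,
                language: detect_language(&event.content),
                symbols: extract_symbols(&event.content),
            };
            doc_tx.send(doc).unwrap();
        }
        // Dropping doc_tx here closes "topic 2" for the indexer.
    });

    // Indexer shard stage: CPU-bound work (build the searchable index).
    let indexer = thread::spawn(move || {
        for doc in doc_rx {
            index_document(&doc);
        }
    });

    // Simulate a single push event entering the pipeline.
    event_tx
        .send(PushEvent {
            blob_oid: [0xAB; 20],
            content: "def arguments?\nend\n".to_string(),
        })
        .unwrap();
    drop(event_tx); // no more events: let both stages drain and exit

    crawler.join().unwrap();
    indexer.join().unwrap();
}
```

Even in this toy version the important property survives: the crawler and indexer stages only communicate through the intermediate "topic", so either side can fall behind or scale out without the other noticing.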
The Life of a Query: From Search Box to Results Page
The query path mirrors the ingest path's decoupled design but is optimized for a different goal: minimal latency rather than raw throughput. It's designed to be screaming fast while rigorously enforcing permissions.
Let's trace a regular expression query like `/arguments?/ org:rails lang:Ruby` through the system:
- The Query Service: The query first hits the stateless Blackbird Query Service. This service acts as the central coordinator for the entire search operation.
- Parsing and Rewriting: The service parses the query string into an Abstract Syntax Tree (AST). It then rewrites this tree, enriching it with crucial information. `lang:Ruby` is resolved to its canonical Linguist language ID. More importantly, it injects clauses to enforce user permissions, ensuring a user only sees results from public repositories or private ones they have access to.
- Caching and Rate Limiting: On its way, the service consults a Redis cluster to cache repository permissions and manage rate-limiting quotas, preventing abuse and reducing load on the primary databases.
- Fan-Out: The rewritten query is then broadcast to every single shard in the search cluster. This fan-out is a direct consequence of the sharding strategy. Since data is *sharded by content, not repository*, any shard could potentially hold a file that matches the query.
- On-Shard Execution: Inside each shard, the real work begins. The AST is converted into a plan of execution against the local index files. The regex `/arguments?/` is translated into a series of substring queries against the n-gram indices (e.g., `content_grams_iter("arg")`, `content_grams_iter("rgu")`). These queries produce lazy iterators over sorted lists of document IDs. The engine can then perform intersections (AND) and unions (OR) on these iterators, only reading as far as necessary to satisfy the request.
- Aggregation & Finalization: The Query Service gathers the top results from all shards, aggregates them, re-sorts them by a global relevance score, and performs a final permission check. Only then are the top 100 results sent to the GitHub front end for syntax highlighting and rendering.
This entire, complex dance happens in the blink of an eye. The p99 response time from an individual shard is on the order of 100 milliseconds, and a single 64-core host can sustain around 640 queries per second, a mind-boggling improvement over brute-force methods.
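At the heart of the on-shard step is the classic merge-style intersection of sorted posting lists. The sketch below is a simplified Rust illustration: the doc IDs are made up, and while Blackbird keeps these iterators lazy end to end, this toy version simply collects the result.

```rust
/// Merge-style intersection of two sorted streams of document IDs.
fn intersect(a: impl IntoIterator<Item = u64>, b: impl IntoIterator<Item = u64>) -> Vec<u64> {
    let mut a = a.into_iter().peekable();
    let mut b = b.into_iter().peekable();
    let mut out = Vec::new();
    while let (Some(&x), Some(&y)) = (a.peek(), b.peek()) {
        if x == y {
            out.push(x); // document contains both grams: keep it
            a.next();
            b.next();
        } else if x < y {
            a.next(); // advance whichever side is behind
        } else {
            b.next();
        }
    }
    out
}

fn main() {
    // Pretend these are the sorted doc-ID lists behind
    // content_grams_iter("arg") and content_grams_iter("rgu").
    let arg_docs = vec![3u64, 8, 21, 42, 99];
    let rgu_docs = vec![8u64, 21, 57, 99];
    // Only documents containing both trigrams survive the AND.
    assert_eq!(intersect(arg_docs, rgu_docs), vec![8, 21, 99]);
    println!("intersection ok");
}
```

Because both lists are sorted, the AND never has to materialize either side in full; it only advances as far as the results demand.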
How Are Ranking and Relevance Calculated?
A crucial realization was that for a global index, result scoring and ranking are absolutely critical. Blackbird implements a number of code-specific heuristics to ensure you find useful documents first:
- Definitions are ranked up, while test code is penalized.
- Complete matches are ranked higher than partial matches (a search for `thread` will rank the identifier `thread` above `thread_id`).
- The popularity of the repository also influences ranking, showing results from popular open-source projects before a random match in a long-forgotten test repository.
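As a rough illustration of how such heuristics might combine, here is a hypothetical scoring function. The signals, weights, and formula are invented for this sketch; GitHub has not published Blackbird's actual ranking function.

```rust
// Hypothetical result features; none of these names come from Blackbird.
struct Candidate {
    is_definition: bool, // matched a symbol definition, not just a usage
    in_test_code: bool,  // file lives under a test directory
    exact_match: bool,   // whole identifier matched, not a substring
    repo_stars: u32,     // crude popularity proxy
}

/// Toy scoring: boost definitions and exact matches, penalize tests,
/// and nudge popular repositories upward with a dampened (log) factor.
fn score(c: &Candidate) -> f64 {
    let mut s = 1.0;
    if c.is_definition { s *= 2.0; }
    if c.in_test_code { s *= 0.5; }
    if c.exact_match { s *= 1.5; }
    s * (1.0 + (c.repo_stars as f64).ln_1p() / 10.0)
}

fn main() {
    let definition_in_rails = Candidate { is_definition: true, in_test_code: false, exact_match: true, repo_stars: 55_000 };
    let usage_in_old_test = Candidate { is_definition: false, in_test_code: true, exact_match: false, repo_stars: 2 };
    assert!(score(&definition_in_rails) > score(&usage_in_old_test));
    println!("{:.2} vs {:.2}", score(&definition_in_rails), score(&usage_in_old_test));
}
```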
Blueprints for Core Innovations
Shard by Content, Not Container (De-Duplication):
The foundational architectural decision in Blackbird was to stop thinking about repositories as the unit of work and instead focus on the code itself. Blackbird shards the entire index by the Git `blob_oid`, which is the SHA-1 hash of a file's content. This is the cornerstone of the entire system, and it immediately solves two critical problems.
First, it provides automatic deduplication at a global scale. Any file, whether it's a `README.md` or a core Linux kernel header, is indexed exactly once, no matter if it appears in one repository or one million. This single decision shrank the 115 TB corpus of raw code down to a much more manageable 28 TB of unique content to be indexed.
Second, sharding by a cryptographic hash of the content guarantees a near-perfectly uniform distribution of data and load across the cluster. This completely eliminates the "hot shard" problem that plagued the old system.
This is like a global librarian organizing a library not by book title or author, but by the unique content of each page. If the same page appears in a million different books, the librarian stores and indexes it only once, with a million pointers back to its various locations.
Concrete Code Snippet
While the actual cluster management is complex, the core logic for assigning content to a shard is beautifully simple: a *hash determines the destination*. This Rust snippet illustrates the principle.
```rust
// A simplified example of determining a shard index from a blob_oid.
// In reality, this would use a more robust hashing algorithm and
// handle cluster topology changes.
const NUM_SHARDS: u32 = 32;

/// Represents a unique blob of content in Git.
struct BlobOid([u8; 20]); // SHA-1 hash is 20 bytes.

impl BlobOid {
    /// Determines which shard this blob belongs to.
    fn get_shard_index(&self) -> u32 {
        // Use the first 4 bytes of the SHA-1 hash as a u32 integer.
        // A cryptographic hash provides excellent distribution properties.
        let hash_prefix = u32::from_be_bytes(self.0[0..4].try_into().unwrap());
        // The modulo operator maps the hash to a specific shard index.
        hash_prefix % NUM_SHARDS
    }
}
```
The Metadata Menace: Taming a Trillion Edges
Solving content duplication, however, created a new, more insidious problem: metadata bloat. A single indexed blob for a popular file like `LICENSE` could be associated with millions of (repository, path) locations. When a query hit this blob, the engine would have to scan this enormous list of locations, making the search painfully slow.
The solution was Delta Compression, a brilliant application of graph theory to infrastructure. Instead of indexing every repository from scratch, Blackbird models the relationships between repositories as a massive graph.
It then finds the Minimum Spanning Tree (MST) of this graph, which represents the most "efficient" path to index everything. Each repository only needs to store its difference from a parent in the tree, massively reducing the amount of metadata that needs to be crawled and stored. In other words: "Since the new version is 95% the same as the old one, just store the changes."
To build the MST, the team needed to calculate the symmetric difference (the set of files unique to each repository) between every pair of repositories.
A naive approach using hash sets would require terabytes of memory, so they first reached for a well-known, powerful probabilistic tool: HyperLogLog. It's great for estimating the cardinality of large sets, such as the size of a symmetric difference, with very little memory.
However, HyperLogLog failed. Its error is relative to the total size of the sets, making it useless for finding a tiny difference between two huge sets. For example, the standard error for HLL in Redis is 0.81%; for a set of 1 million items, the error margin could be over 8,000 items. If the true difference between two nearly identical repositories (the most common case for forks) is only 100 files, HLL is mathematically incapable of detecting it accurately.
This failure precisely defined the requirements for a new data structure: one whose accuracy is relative to the size of the difference. This domain constraint drove the invention of the Geometric XOR Filter by GitHub engineer Alexander Neubeck. It's a novel probabilistic data structure that solves this exact problem, making the MST-based delta compression feasible.
Concrete Code Snippet
```rust
// This is the actual code shown in the presentation.
// It demonstrates how the complex problem of estimating the symmetric
// difference is reduced to a simple, expressive API call.
let mut repoA = GeometricFilter::new();
let mut repoB = GeometricFilter::new();

// Populate the filters with (content_hash, path) tuples for each repo.
repoA.toggle(Location::new(blob_oid_1, "readme.txt"));
repoB.toggle(Location::new(blob_oid_2, "license.txt"));

// The magic happens here: XOR the two filters to find the items
// that exist in one but not the other, then estimate the size.
let diffs = repoA.xor(repoB).estimate_size();
```
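With cheap pairwise difference estimates in hand, building the delta-compression tree reduces to a standard minimum spanning tree computation. Here is a toy Kruskal sketch over four imaginary repositories; the edge weights stand in for estimated symmetric-difference sizes and are entirely made up.

```rust
// Union-find lookup with path compression, for Kruskal's algorithm.
fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    let p = parent[x];
    if p == x {
        return x;
    }
    let root = find(parent, p);
    parent[x] = root;
    root
}

fn main() {
    // (repo_a, repo_b, estimated symmetric difference in files), all invented.
    let mut edges = vec![
        (0, 1, 120),   // upstream vs. a fork with some drift
        (0, 2, 15),    // a nearly identical fork
        (1, 2, 130),
        (2, 3, 5_000), // an unrelated repository
        (1, 3, 4_800),
    ];
    edges.sort_by_key(|&(_, _, w)| w); // cheapest deltas first

    let mut parent: Vec<usize> = (0..4).collect();
    let mut tree = Vec::new();
    for (a, b, w) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb;
            tree.push((a, b, w)); // index repo a as a delta against repo b
        }
    }
    // Each edge says "store only the difference from this parent".
    println!("delta-compression tree edges: {:?}", tree);
}
```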
Why "for" Is a Four-Letter Word in Search: Sparse Grams
With content and metadata under control, the final battle was with the index itself. For fast substring and regex search, Blackbird uses an n-gram index (as discussed previously in the second blog of this series).
The problem is that some n-grams are ridiculously common in code. Trigrams like `for`, `est`, or `ion` appear in millions of documents. Their posting lists (the lists of documents containing them) are enormous. Queries containing them generate a storm of false positives, grinding the system to a halt. This is a classic information retrieval problem, amplified by the unique vocabulary of source code.
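For readers who skipped the earlier post, here is a minimal trigram inverted index in Rust. It illustrates the general n-gram indexing idea, not Blackbird's on-disk format, and it makes the problem visible: even two tiny, unrelated snippets both feed the posting list for `for`.

```rust
use std::collections::HashMap;

// A minimal trigram inverted index: content -> trigrams -> posting lists.
fn index_document(index: &mut HashMap<String, Vec<u64>>, doc_id: u64, content: &str) {
    for window in content.as_bytes().windows(3) {
        let gram = String::from_utf8_lossy(window).to_string();
        let postings = index.entry(gram).or_default();
        if postings.last() != Some(&doc_id) {
            postings.push(doc_id); // keep each doc at most once, in sorted order
        }
    }
}

fn main() {
    let mut index: HashMap<String, Vec<u64>> = HashMap::new();
    index_document(&mut index, 1, "for x in xs { }");
    index_document(&mut index, 2, "fn format(x: u32)");
    // "for" appears in both documents; at global scale this list is "hot".
    println!("postings for \"for\": {:?}", index.get("for"));
}
```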
GitHub’s solution is a custom innovation they call Sparse Grams. It's a clever way to prune the index of these low-signal, high-noise tokens.
Consider searching for the phrase "for the win". A naive approach would pull every single document containing the word "for" and every document containing "the", an impossible amount of work at this scale. Sparse grams recognize that "for the" is common noise and instead anchor the search on the more unique gram "the win", drastically reducing the search space.
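Here is a simplified sketch of that query-side intuition. The document frequencies and covering grams are invented, and this is not GitHub's actual gram-selection algorithm; it only shows why driving the search with rare grams wins.

```rust
fn main() {
    // Candidate covering grams for the query "for the win", paired with a
    // completely made-up estimate of how many documents contain each gram.
    let candidates: Vec<(&str, u64)> = vec![
        ("for", 90_000_000),     // near-useless: appears all over source code
        ("for the", 40_000_000), // still common noise
        ("the win", 350_000),    // a far more selective anchor
    ];

    // Drive the posting-list work with the rarest gram first; the common
    // grams only need to be verified against that much smaller candidate set.
    let mut plan = candidates.clone();
    plan.sort_by_key(|&(_, df)| df);
    for (gram, df) in &plan {
        println!("scan ~{:>10} docs for gram {:?}", df, gram);
    }
}
```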
Okay, we solved the theory and the scale! But how do you keep evolving a system like this without shooting yourself in the foot?
Every time the team wanted to improve the index format or add a new feature like symbol extraction, they had to re-crawl and re-process all 100 million repositories from scratch. This was a weeks-long, high-risk, and morale-sapping process.
The solution was to build a highly specialized, Rust-based Cache Server. This service's job is simple: it stores the final, processed Document objects from previous indexing runs. The new workflow became vastly more efficient. The crawler, instead of re-fetching and re-processing content, simply asks the cache server if it already has a document for a given blob ID.
The results were transformative. The cache achieved a 99.9% hit rate for re-indexing and over 50% for incremental indexing. This turned a multi-week ordeal into a routine, 18-hour process. This investment in infrastructure wasn't just a performance optimization; it was an investment in the team's ability to innovate.
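The core idea fits in a few lines. The sketch below uses an in-memory map keyed by blob ID as a stand-in for the Rust cache server; the Document type and the `expensive_analysis` helper are hypothetical placeholders for the real crawl-and-analyze step.

```rust
use std::collections::HashMap;

type BlobOid = [u8; 20];

// Hypothetical processed-document type; a stand-in for the real thing.
struct Document {
    language: String,
    symbols: Vec<String>,
}

struct DocumentCache {
    entries: HashMap<BlobOid, Document>,
}

impl DocumentCache {
    /// Return the cached document for this blob, or run the expensive
    /// crawl-and-analyze step exactly once and remember the result.
    fn get_or_insert_with<F: FnOnce() -> Document>(&mut self, oid: BlobOid, analyze: F) -> &Document {
        self.entries.entry(oid).or_insert_with(analyze)
    }
}

fn expensive_analysis(content: &str) -> Document {
    Document {
        language: "Rust".to_string(),
        symbols: content.split_whitespace().map(String::from).collect(),
    }
}

fn main() {
    let mut cache = DocumentCache { entries: HashMap::new() };
    let oid: BlobOid = [0x42; 20];

    // The first index build pays the analysis cost for this blob...
    cache.get_or_insert_with(oid, || expensive_analysis("fn main() {}"));

    // ...every later rebuild hits the cache and skips the work entirely.
    let doc = cache.get_or_insert_with(oid, || unreachable!("expected a cache hit"));
    println!("cache hit: {} symbols, language {}", doc.symbols.len(), doc.language);
}
```

A 99.9% hit rate means the expensive branch almost never runs during a full rebuild, which is exactly what turned those weeks into hours.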
The Payoff: Victory by the Numbers
The multi-year odyssey to build Blackbird from the ground up paid off spectacularly. The new system is not just more capable; it is vastly more efficient, resilient, and agile than the system it replaced.
| Metric | Legacy Code Search (Elasticsearch) | Blackbird (Rust-Powered) | Improvement |
| --- | --- | --- | --- |
| Server Count | 312 Servers | 130 Servers (across 3 clusters) | ~58% Reduction |
| Total CPU Cores | ~24,000 Cores | ~8,000 Cores | ~67% Reduction |
| Redundancy | Single Fragile Cluster | 3 Redundant Clusters + Caches | High Availability |
| Indexing Throughput | N/A | ~120,000 documents/sec | Orders of Magnitude |
| Query Throughput | N/A | ~640 queries/sec (per host) | Orders of Magnitude |
| Index Size vs. Content | ~2-3x Corpus Size | ~25 TB index for 115 TB content (~0.22x) | ~90% Reduction |
| Agility (Full Rebuilds) | Infrequent / High-Risk (Weeks) | 50 times in 3 years (18 hours) | Radical Malleability |
The most important metric is the last one » Agility.
The ability to fearlessly and frequently rebuild the entire index from scratch unlocked a pace of innovation that was previously unimaginable.
The Bird Has Flown
From “pray Elasticsearch doesn’t melt” to “rebuild 100M repos in 18 hours.”
All so that when you type `usestate`, results pop up before you finish the word.
Blackbird isn’t just fast — it’s GitHub’s way of flexing that they solved a problem most of us will never even have.