Approximate Nearest Neighbours: A Technical Deep Dive

Finding Your Best Friends: How Computers Search for Similar Things Really Fast

What's This All About?

Imagine you're at a huge party with millions of people, and you want to find the people most similar to you. Maybe you want to find people who like the same music, have similar hobbies, or even look similar to you. Checking every single person would take forever!

This is exactly the problem computers face when they need to find similar things in huge databases. Whether it's Netflix recommending movies, Spotify finding similar songs, or Google showing you similar images, computers use something called Approximate Nearest Neighbors (ANN) to solve this problem super quickly.

The Basic Problem (In Simple Terms)

Let's say you have a playlist of your favorite songs, and you want Spotify to find similar songs from their library of 50 million tracks. Each song can be described by many features - tempo, genre, energy level, vocals, instruments used, etc. Think of each song as having a "fingerprint" made up of these features.

The computer needs to compare your song's fingerprint with all 50 million songs to find the most similar ones. But here's the catch - checking all 50 million songs one by one would take way too long!

The trade-off: Instead of finding the perfect match (which takes forever), ANN algorithms find really good matches super quickly. It's like saying "I'll find you someone 95% as perfect as your soulmate, but I can do it in 1 second instead of 1 hour."

Problem Definition

In Technical Terms (Let’s wear our intelligence hat)

Given a dataset D = {x₁, x₂, ..., xₙ} of n points in d-dimensional space and a query point q, the goal is to find point x* ∈ D that minimizes the distance d(q, x*). In approximate nearest neighbor search, we relax this requirement to find a point x' such that:

d(q, x') ≤ (1 + ε) · d(q, x*)

where ε ≥ 0 is the approximation factor.
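
To make the definition concrete, here is what the exact (brute-force) search looks like, plus a check of the (1 + ε) guarantee. This is a minimal sketch in NumPy, not code from any particular library; ANN algorithms exist precisely to avoid this linear scan over all n points.

import numpy as np

def exact_nearest_neighbor(D, q):
    # D: (n, d) array of data points, q: (d,) query vector
    distances = np.linalg.norm(D - q, axis=1)   # all n Euclidean distances
    best = int(np.argmin(distances))            # index of the true nearest neighbor x*
    return best, distances[best]

def is_acceptable(d_approx, d_exact, eps):
    # An approximate answer x' satisfies the guarantee if d(q, x') <= (1 + eps) * d(q, x*)
    return d_approx <= (1 + eps) * d_exact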

Mathematical Foundations

Distance Metrics

The choice of distance metric significantly impacts ANN performance:

Euclidean Distance (L₂):

d(x, y) = √(Σᵢ(xᵢ - yᵢ)²)

Manhattan Distance (L₁):

d(x, y) = Σᵢ|xᵢ - yᵢ|

Cosine Distance:

d(x, y) = 1 - (x·y)/(||x|| ||y||)
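
For reference, here is a quick sketch of these three metrics in NumPy (assuming dense, non-zero vectors):

import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    # 1 minus cosine similarity; assumes neither vector is all zeros
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))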

Why This Gets Tricky

The "Curse of Dimensionality" (Sounds Scary, But It's Not!)

Imagine you're describing people using just 2 things: height and weight. You could draw this on a piece of paper with height on one axis and weight on another. Easy to visualize, right?

Now imagine describing people using 1000 different features - height, weight, eye color, favorite foods, music taste, hobbies, etc. This becomes a 1000-dimensional space that's impossible to visualize, and weird things start happening:

  • Everyone starts to look equally different from everyone else

  • The concept of "close" and "far" becomes meaningless

  • Traditional search methods break down

This is why we need special algorithms designed for high-dimensional data.
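
You can watch this effect happen with a few lines of NumPy: for uniformly random points, the farthest neighbor is only barely farther than the nearest one once the dimension gets large. This is just a toy illustration on synthetic data, not a benchmark.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((10_000, d))            # 10,000 random points in d dimensions
    q = rng.random(d)                      # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # Relative contrast: how much farther is the farthest point than the nearest?
    print(d, (dist.max() - dist.min()) / dist.min())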

The Smart Solutions

1. Locality Sensitive Hashing (LSH) - The "Fingerprint" Method

What it does: Creates smart "fingerprints" for your data so similar things get similar fingerprints.

Real-world analogy: Imagine you're organizing a massive library. Instead of checking every book one by one, you create a smart filing system. Books about cooking go in section A, books about sports go in section B, etc. When someone asks for a cookbook, you only search section A.

How it works:

  1. Take your data (songs, images, etc.) and run them through special "hash functions"

  2. Similar items get grouped into the same "buckets"

  3. When searching, only look in the relevant buckets

Algorithm Overview:

def lsh_search(query, hash_tables, k):
    # Collect candidates from the query's bucket in every hash table
    candidates = set()
    for table in hash_tables:
        bucket = table.get_bucket(table.hash_function(query))
        candidates.update(bucket)

    # Re-rank the much smaller candidate set by exact distance and return the k nearest
    return sorted(candidates, key=lambda x: distance(query, x))[:k]

Query Time: O(n^ρ) candidate evaluations (up to logarithmic factors), where ρ = log(1/p₁)/log(1/p₂) < 1 and p₁, p₂ are the hash collision probabilities for nearby and far-apart pairs

Space Complexity: O(n^(1+ρ))
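
To make the hash_tables interface above concrete, here is a hedged sketch of one classic scheme, random-hyperplane (sign) hashing for cosine similarity. The class and function names are my own choices for illustration, not a specific library's API; buckets store row indices into the data matrix X.

import numpy as np

class HyperplaneLSHTable:
    # One hash table: the bucket key is the bit pattern of which side of each
    # random hyperplane a vector falls on, so similar vectors tend to collide.
    def __init__(self, dim, num_bits, rng):
        self.planes = rng.normal(size=(num_bits, dim))
        self.buckets = {}

    def hash_function(self, x):
        return tuple((self.planes @ x > 0).astype(int))

    def add(self, idx, x):
        self.buckets.setdefault(self.hash_function(x), []).append(idx)

    def get_bucket(self, key):
        return self.buckets.get(key, [])

def build_lsh_index(X, num_tables=8, num_bits=12, seed=0):
    rng = np.random.default_rng(seed)
    tables = [HyperplaneLSHTable(X.shape[1], num_bits, rng) for _ in range(num_tables)]
    for i, x in enumerate(X):
        for table in tables:
            table.add(i, x)
    return tables

Since buckets hold indices, the distance function used for re-ranking in lsh_search would be something like distance = lambda q, i: np.linalg.norm(X[i] - q). More tables and fewer bits per table give more candidates (better recall, slower queries), and vice versa.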

The good: Very fast searches, works well with lots of data

The not-so-good: Takes up a lot of memory (like having many filing cabinets)

2. HNSW (Hierarchical Navigable Small World) - The "Friend Network" Method

What it does: Builds a network where similar items are connected, like a social network for your data.

Real-world analogy: Think of Facebook's "People You May Know" feature. It doesn't check all 2 billion users. Instead, it looks at your friends, then your friends' friends, creating a smart path to find relevant people.

How it works:

  1. Create multiple layers of connections, like a pyramid

  2. Top layers have fewer items but longer connections (highways)

  3. Bottom layers have more items but shorter connections (local roads)

  4. Search by starting at the top and working your way down

Search Complexity: O(log n) average case

Construction Complexity: O(n log n)
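
The layered descent is easier to see in code. Below is a heavily simplified, hedged sketch: real HNSW keeps a dynamic candidate list (controlled by an ef parameter) instead of a single best node, and builds the layers itself; here the layers, the entry point, and the adjacency-dict format are assumptions made purely for illustration.

import numpy as np

def greedy_search(graph, vectors, start, q):
    # On one layer, walk to whichever neighbor is closest to q until nothing improves
    current = start
    current_dist = np.linalg.norm(vectors[current] - q)
    improved = True
    while improved:
        improved = False
        for neighbor in graph.get(current, []):
            d = np.linalg.norm(vectors[neighbor] - q)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current

def layered_search(layers, vectors, entry_point, q):
    # layers[0] is the sparse top layer (the "highways"), layers[-1] the dense
    # bottom layer (the "local roads"); each layer is a dict {node: [neighbors]}
    node = entry_point
    for graph in layers:
        node = greedy_search(graph, vectors, node, q)
    return node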

The good: Super fast searches, great accuracy

The not-so-good: Takes some time to build the network initially

3. Random Projection Trees - The "20 Questions" Method

What it does: Builds multiple randomized trees that narrow down the search by asking simple yes/no questions (random hyperplane splits).

Real-world analogy: Like playing 20 questions. "Is it bigger than a breadbox?" "Does it make sound?" Each question eliminates half the possibilities.

How it works:

  1. Build multiple decision trees with random questions

  2. Each tree splits the data differently

  3. Combine results from all trees for the final answer

Splitting Criterion:

split_hyperplane = random_vector · (x - mean) > 0

Query Time: O(log n + k)

Space: O(n)
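
Here is a minimal sketch of one such tree using the splitting rule above (the threshold is the mean projection, so a point goes left or right according to the sign of random_vector · (x - mean)). The dictionary-based tree format and the function names are my own, chosen for readability.

import numpy as np

def build_tree(X, indices, leaf_size, rng):
    # Stop splitting once a node is small enough; the leaf stores point indices
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    direction = rng.normal(size=X.shape[1])       # random projection direction
    projections = X[indices] @ direction
    threshold = projections.mean()                # split around the mean projection
    go_left = projections <= threshold
    if go_left.all() or not go_left.any():        # degenerate split: stop here
        return {"leaf": indices}
    return {"dir": direction, "thr": threshold,
            "left": build_tree(X, indices[go_left], leaf_size, rng),
            "right": build_tree(X, indices[~go_left], leaf_size, rng)}

def query_tree(node, q):
    # Answer one "question" per level: which side of the hyperplane is q on?
    while "leaf" not in node:
        node = node["left"] if q @ node["dir"] <= node["thr"] else node["right"]
    return node["leaf"]

A forest builds several such trees with different random seeds (for example build_tree(X, np.arange(len(X)), 32, np.random.default_rng(s)) for each seed s), unions the leaves each tree returns for a query, and re-ranks that small candidate set by exact distance.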

The good: Simple to understand and implement

The not-so-good: Performance drops in very high dimensions

4. Product Quantization (PQ) - The "Compression" Method

What it does: Compresses your data to save massive amounts of memory while keeping it searchable.

Real-world analogy: Like creating a super-efficient summary. Instead of storing every word of a book, you store key points that still let you understand what the book is about.

How it works:

  1. Split each item's features into smaller groups

  2. Replace each group with the closest "representative" from a small dictionary

  3. Store only these representatives (massive space savings!)

Encoding:

x = [x₁, x₂, ..., xₘ] → [c₁, c₂, ..., cₘ]

where the vector is split into m subvectors and each cᵢ is the index of the nearest centroid in subspace i.

Memory Reduction: From 4d bytes to m·log₂(k) bits per vector
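
A hedged sketch of the encode-and-compare pipeline is shown below. The function names, and the use of scikit-learn's KMeans to learn the per-subspace codebooks, are my own choices for illustration rather than a specific library's API; it assumes d is divisible by m and k ≤ 256 so that each code fits in one byte.

import numpy as np
from sklearn.cluster import KMeans

def train_codebooks(X, m, k):
    # Learn k centroids for each of the m subspaces
    sub_dim = X.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = X[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)
    return codebooks

def encode(X, codebooks):
    # Replace each subvector with the index of its nearest centroid
    m = len(codebooks)
    sub_dim = X.shape[1] // m
    codes = np.empty((X.shape[0], m), dtype=np.uint8)
    for i, centroids in enumerate(codebooks):
        sub = X[:, i * sub_dim:(i + 1) * sub_dim]
        dists = np.linalg.norm(sub[:, None, :] - centroids[None, :, :], axis=2)
        codes[:, i] = np.argmin(dists, axis=1)
    return codes

def asymmetric_distance(q, codes, codebooks):
    # Approximate distance from a full-precision query to every compressed vector,
    # computed from m small lookup tables instead of the original data
    m = len(codebooks)
    sub_dim = len(q) // m
    tables = [np.linalg.norm(codebooks[i] - q[i * sub_dim:(i + 1) * sub_dim], axis=1) ** 2
              for i in range(m)]
    return np.sqrt(sum(tables[i][codes[:, i]] for i in range(m)))

A typical use would be codes = encode(X, train_codebooks(X, m=8, k=256)), after which each vector costs 8 bytes instead of 4d.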

The good: Uses very little memory, still pretty accurate

The not-so-good: Some accuracy is lost in the compression

Performance Showdown: Which Method Wins?

Speed Test: Searches Per Second

  • Linear Search (checking everything): 1,000 searches/second

  • LSH: 10,000 searches/second (10x faster!)

  • Random Trees: 25,000 searches/second (25x faster!)

  • HNSW: 50,000 searches/second (50x faster!)

  • Product Quantization: 100,000 searches/second (100x faster!)

Winner: Product Quantization is the speed demon, but HNSW offers the best balance of speed and accuracy.

Memory Usage Contest

For storing 1 million songs with detailed features:

  • Full Precision: 0.5 GB (original size)

  • Product Quantization: 0.125 GB (super compressed!)

  • HNSW: 2 GB (reasonable)

  • LSH: 10 GB (memory hungry!)

Winner: Product Quantization uses 80 times less space than LSH!

Accuracy vs Speed Trade-off

Time needed to find the right answers:

90% Accuracy:

  • HNSW: 0.1 milliseconds

  • IVF-PQ: 0.2 milliseconds

  • NSW: 0.3 milliseconds

  • LSH: 0.5 milliseconds

95% Accuracy:

  • HNSW: 0.3 milliseconds

  • IVF-PQ: 0.8 milliseconds

  • NSW: 1.1 milliseconds

  • LSH: 2.1 milliseconds

99% Accuracy:

  • HNSW: 1.2 milliseconds

  • IVF-PQ: 3.5 milliseconds

  • NSW: 4.8 milliseconds

  • LSH: 8.3 milliseconds

Key Insight: HNSW is incredibly fast AND accurate - it's like finding a sports car that's also fuel-efficient!


Real-World Applications (Where You See This Every Day)

Music Streaming (Spotify, Apple Music)

When you like a song, the app finds similar songs from millions of tracks. It looks at features like:

  • Tempo and rhythm

  • Musical instruments used

  • Genre and mood

  • Vocal style

Performance needed: Find similar songs in under 100ms for instant recommendations

Photo Apps (Google Photos, iPhone Photos)

When you search "beach" in your photos, the app scans thousands of images to find matches:

  • Visual features (colors, shapes, objects)

  • Location data

  • Time stamps

  • Face recognition data

Challenge: Processing images with thousands of features each

E-commerce (Amazon, eBay)

"Customers who bought this also bought..." recommendations:

  • Purchase history patterns

  • Product categories and features

  • Price ranges

  • Customer demographics

Scale: Millions of products, billions of user interactions

Gaming (matchmaking systems)

Finding players of similar skill level:

  • Win/loss ratios

  • Game statistics

  • Play style metrics

  • Connection quality

Requirement: Real-time matching in seconds


The Dimensionality Challenge

As we add more features to describe our data, all algorithms get less accurate, but some handle it better than others:

Performance with increasing dimensions:

10 dimensions (simple data):

  • HNSW: 98% accuracy

  • LSH: 95% accuracy

  • PQ: 92% accuracy

100 dimensions (moderate complexity):

  • HNSW: 94% accuracy

  • LSH: 90% accuracy

  • PQ: 85% accuracy

1000 dimensions (very complex data):

  • HNSW: 82% accuracy

  • LSH: 78% accuracy

  • PQ: 68% accuracy

Takeaway: HNSW stays strong even with super complex data!


Choosing the Right Algorithm

When to use HNSW:

  • You need both speed and accuracy

  • You have enough memory (a few GB is okay)

  • Your data doesn't change too frequently

  • Best for: Most general-purpose applications

When to use LSH:

  • You need theoretical accuracy guarantees

  • Your data is extremely high-dimensional (1000+ features)

  • You can afford lots of memory

  • Best for: Academic research, specialized applications

When to use Product Quantization:

  • Memory is extremely limited (mobile apps, embedded systems)

  • You can accept some accuracy loss for massive space savings

  • You have millions/billions of items to store

  • Best for: Mobile apps, large-scale web services

When to use Random Trees:

  • You want something simple to implement and understand

  • Your data has moderate dimensionality (under 100 features)

  • You need reasonable performance without too much tuning

  • Best for: Prototypes, simple applications


Key Takeaways

  1. There's no one-size-fits-all solution - different applications need different approaches

  2. HNSW is usually your best bet for most applications - it's the "Swiss Army knife" of ANN

  3. Memory vs. accuracy vs. speed - you can optimize for 2 out of 3, but rarely all 3

  4. Real systems combine multiple techniques - use simple filters first, then precise methods on smaller sets

  5. The curse of dimensionality is real - but modern algorithms handle it much better than old methods

  6. Performance tuning matters - the same algorithm can be 10x faster with proper parameter tuning

The Bottom Line

Approximate Nearest Neighbors might sound like complex computer science, but it's actually solving a very human problem: finding similar things quickly in a world full of choices. Every time you get a great song recommendation, find a photo in your gallery, or see a relevant product suggestion, ANN algorithms are working behind the scenes to make your digital life better.

The field keeps evolving, with new algorithms and optimizations appearing regularly. But the core idea remains the same: trade a tiny bit of perfection for massive gains in speed, and suddenly the impossible becomes everyday magic.

Whether you're building the next great app or just curious about how your favorite services work, understanding ANN gives you insight into one of the most important problems in modern computing. It's not just about finding needles in haystacks - it's about finding the right needle, really fast, in a haystack the size of the internet!


Sources and References

The performance figures, accuracy measurements, and benchmarking data presented in this article are compiled from several authoritative sources in the approximate nearest neighbor research community:

  • ANN-Benchmarks Project (https://ann-benchmarks.com/): A comprehensive benchmarking suite that provides standardized performance comparisons across different ANN algorithms using datasets like SIFT-1M, GIST-1M, and Deep-1M

  • Facebook AI Research (FAISS) (https://github.com/facebookresearch/faiss): Performance benchmarks and optimization guides for large-scale similarity search, including detailed documentation and white papers

  • Microsoft SPTAG (https://github.com/microsoft/SPTAG): Microsoft's open-source library of tree-based and graph-based ANN algorithms, with implementations and performance analysis

  • Google Research Publications (https://research.google/pubs/): Academic papers on scalability and real-world deployment of ANN algorithms in production systems, particularly focusing on large-scale applications

  • NIPS/NeurIPS Conference Proceedings (https://nips.cc/): Peer-reviewed research papers presenting algorithm improvements and comparative studies in approximate nearest neighbor search

  • arXiv.org (https://arxiv.org/): Pre-print research papers covering the latest developments in ANN algorithms and optimizations

These sources provide the foundation for understanding real-world performance characteristics, though actual results may vary depending on your specific data, hardware, and implementation details. For the most current benchmarks and detailed experimental setups, we recommend consulting the ANN-Benchmarks project repository and recent academic publications in the field.

Written by Alankar Srivastava