Crawling/Ingestion

Indexing

The fundamental data structure involved in search is the inverted index.

Retrieval

Keyword Search

Vector Search

There are 2 common approaches for performing vector search in a vector store. Hierarchical Navigable Small World (HNSW) and Inverted File Index (IVF). These approaches allow for quick nearest neighbour search of vectors to identify those that are similar based on cosine similarity or

Re-ranking and Rank Fusion

Personalisation

Collaborative Filtering

Collaborative filtering works on the principle of finding what users like the target user prefer, and recommending those content

Content-based Filtering

Content-based filtering

Evaluation

Search is generally ranked along recall, which is whether the right content is returned in the top K results, and relevance, which has to do with the position of the right content in the top K results.

Recall can be measured via a simple ratio:

$$\frac{\sum \text{Results returned}}{\sum \text{Queries made}} \times 100\%$$

Relevance can be measured with Mean Average Rank (MAR), which looks at the position of relevant results in the returned top K results and calculates a weighted average based on position. This penalises irrelevant results

A better metric is Normalized Discounted Cumulative Gain (NDCG), which accounts for both recall and relevance, and penalises relevant results that are not ranked the top more heavily.

References

How search works by Google

How Does Search Work?

Table of contents