Beyond Keyword Search
One of my favorite books is The Heart by Maylis de Kerangal. It's a book about a devastating loss and, at the same time, about hope. What genre does this book belong to? I am not sure. I loved reading this book, and I would like to read more books like it. Genres are not important to me, but books are.
Recently, I wrote an article exploring the possibility of using Gen AI to produce book recommendations based on genre. Since then, I have kept thinking about the limitations of recommendation techniques based on classification (like genre), or on looking at what other people like to read. These approaches sometimes miss recommendations badly because they don't consider context. Current search and recommendation techniques are limited to keyword search and predefined classifications, without considering the constantly evolving context.
Semantic Search
If I search for books with "loss love hope", what results do I expect to see from a keyword search? Probably not many relevant ones, because the algorithms search engines use for keyword search don't consider context or the actual interpretation (sentiment) of the words. This is where Semantic Search comes to the rescue. Semantic Search adds context to the search: it finds books that are close to the sentiments of "loss love hope".
Vector Search
How do search engines achieve Semantic Search? The answer is Vector Search. Vector Search uses natural language processing and machine learning techniques to convert documents and queries into vector representations, which capture the semantic meaning and context of the text. Why a vector? We will discuss that next.
Vector and Vector Search
In mathematics and physics, a vector is a value with both magnitude (amount) and direction. It can be visualized as an arrow pointing from one point to another, and it can live in a multi-dimensional space. So, if we can convert content to numbers based on love, sadness, hope, etc., with a direction (happiness is opposite to sadness), we get a vector. Once we have vectors based on content, we can measure their distance from each other. In fact, from the search query I mentioned before, "loss love hope", we can generate a vector, calculate its distance from the content vectors, and find the ones closest to my query using an algorithm like Cosine Similarity.
Cosine Similarity
Let's consider that we have a vector based on our search query "loss love hope" that is Vector A: [0.5, 0.2, 0.8] and we have a book with a sentiment vector represented by
Vector B: [0.2, 0.5, 0.1]
We calculate the cosine similarity between them:
Calculate the dot product (A ⋅ B): 0.5 × 0.2 + 0.2 × 0.5 + 0.8 × 0.1 = 0.28
Calculate the magnitude (length) of both vectors: |A| = sqrt(0.5^2 + 0.2^2 + 0.8^2) = sqrt(0.93) ≈ 0.96 and |B| = sqrt(0.2^2 + 0.5^2 + 0.1^2) = sqrt(0.30) ≈ 0.55
Calculate the cosine similarity (θ): cos(θ) = dot product / (|A| × |B|) = 0.28 / (0.96 × 0.55) ≈ 0.53
The cosine similarity between Vector A and Vector B is approximately 0.53, indicating a moderate similarity. The closer the cosine similarity is to 1, the more similar the vectors are. A similarity of 0 means they're perpendicular (no similarity), while a similarity of -1 means they're opposite.
In our case, the cosine similarity suggests that Vector B (the book) is somewhat similar to Vector A (search query), but not extremely so.
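If you want to check the math yourself, here is a minimal sketch in Python using NumPy. The vectors are the illustrative ones from above, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the two magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.5, 0.2, 0.8])  # vector for "loss love hope"
book = np.array([0.2, 0.5, 0.1])   # the book's sentiment vector

print(round(cosine_similarity(query, book), 2))  # 0.53
```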
Embeddings
Embeddings, or Word Embeddings, are a technique for converting words to vectors. In the context of search, embeddings don't have to be only for words; they can be generated for other types of content like audio or video. Two popular algorithms for Word Embeddings are Word2Vec and GloVe. Search engines use machine learning algorithms like these to generate vectors for content so that semantic search can be performed through vector search.
Similar embedding concepts have been used to generate vectors for other types of content so that semantic search can be done on them. Here is an article on how Elasticsearch generates vectors for audio and achieves semantic search for audio.
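As a concrete illustration, here is a minimal sketch of generating embeddings, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model. Any embedding model would work, and this isn't necessarily what a given search engine uses internally:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model

texts = [
    "loss love hope",                                           # the search query
    "A story of devastating loss and, at the same time, hope.", # candidate book
    "A field guide to North American birds.",                   # candidate book
]
vectors = model.encode(texts)  # one 384-dimensional vector per text

# Cosine similarity of the query against each candidate description:
# the first book should score noticeably higher than the field guide.
print(util.cos_sim(vectors[0], vectors[1:]))
```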
k-NN
k-NN, or k-Nearest Neighbors, is a machine learning algorithm traditionally used for supervised classification and regression tasks. We discussed how Cosine Similarity calculates the similarity of two vectors. k-NN search approaches the problem from the retrieval side: it places the query vector among the indexed vectors and returns the k nearest neighbors, i.e., the vectors with the most similar sentiments.
Some search engines expose proximity search as k-NN search for efficiency and scalability reasons: approximate k-NN algorithms organize the indexed vectors so that finding the k closest ones doesn't require comparing the query against every document. The two concepts are complementary rather than competing: k-NN decides how many neighbors to return, while a metric such as Cosine Similarity defines what "close" means. Elasticsearch exposes a k-NN search API, whereas Pinecone lets you choose Cosine Similarity as the distance metric for its index.
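To make the mechanics concrete, here is a toy k-NN retrieval sketch using scikit-learn's NearestNeighbors with a cosine distance metric. The catalog vectors are made up for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 3-dimensional sentiment vectors for a tiny book catalog.
catalog = np.array([
    [0.2, 0.5, 0.1],
    [0.6, 0.1, 0.7],
    [0.4, 0.3, 0.9],
    [0.1, 0.9, 0.2],
])

# metric="cosine" makes the k-NN distance equal to 1 - cosine similarity.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(catalog)

query = np.array([[0.5, 0.2, 0.8]])  # vector for "loss love hope"
distances, neighbors = index.kneighbors(query)
print(neighbors[0])      # indices of the 2 nearest books
print(1 - distances[0])  # their cosine similarities
```

This sketch scans every vector; production engines replace the brute-force scan with approximate structures such as HNSW graphs so the search scales to millions of vectors.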
Hybrid Search
Semantic search is useful when the query is ambiguous or when context matters, but sometimes we know what we are looking for; we have a pretty good idea about the keywords. In that case, a keyword search returns more useful results than a semantic search. Search engines that support hybrid search combine keyword search with semantic search to produce meaningful results. Both Elasticsearch and Pinecone support hybrid search.
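As a sketch of what this can look like in practice, here is a hypothetical hybrid query using the official Elasticsearch Python client. It assumes an Elasticsearch 8.x cluster running locally; the index name, field names, boosts, and query vector are made up:

```python
# pip install elasticsearch  -- assumes an Elasticsearch 8.x cluster at localhost.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One request that blends a keyword (BM25) clause with a k-NN vector clause.
response = es.search(
    index="books",
    query={"match": {"description": {"query": "loss love hope", "boost": 0.3}}},
    knn={
        "field": "description_vector",
        "query_vector": [0.5, 0.2, 0.8],  # from the same embedding model as the index
        "k": 10,
        "num_candidates": 100,
        "boost": 0.7,
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```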
Content Recommendation
Recommendation engines (for content or products) have traditionally shipped as separate applications that take content and use Collaborative Filtering to recommend items. With semantic search, there is less need for a separate recommendation engine, because semantic search is already based on context. Even then, Collaborative Filtering can be useful, especially for product recommendations, where we can look at a user's purchase history, compare it with what similar users like, and recommend products.
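For contrast with the context-based approach, here is a toy sketch of user-based Collaborative Filtering, with a made-up user-item ratings matrix:

```python
import numpy as np

# Toy user-item matrix: rows are users, columns are books,
# values are ratings (0 = unread).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0  # recommend for the first user

# How similar is every user to the target user?
similarity = np.array([cosine(ratings[target], row) for row in ratings])
similarity[target] = 0.0  # ignore the user's similarity to themselves

# Score each book by similarity-weighted ratings from the other users,
# then mask out books the target user has already read.
scores = similarity @ ratings
scores[ratings[target] > 0] = -np.inf
print("recommend book index:", int(np.argmax(scores)))  # book 2
```

Note that this recommends book 2 because the most similar user rated it, with no notion of what the book is about; that is exactly the context gap semantic search fills.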
Role of Gen AI
We can't finish our discussion without considering the role of Gen AI in Semantic Search. Pinecone and Elasticsearch offer integrations with generative AI models through their APIs.
Pinecone has built-in integration with the Hugging Face Transformers library and LangChain, which allows users to easily integrate generative AI models into their search workflows.
Elasticsearch also provides integration with Hugging Face Transformers and LangChain through its machine learning features, which enable users to leverage generative AI models for tasks like text classification, sentiment analysis, and more.
In addition to refining queries for vector search accuracy, Gen AI can enrich documents with additional context, entities, or summaries, making them more discoverable and relevant in search results. Generative AI can also create personalized content based on search queries, such as product descriptions, articles, or social media posts, which can be indexed and searched.
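As one hypothetical example of that enrichment step, the sketch below asks an LLM to summarize a book description before indexing. It assumes the OpenAI Python client with an API key configured; the model name and prompt are assumptions, and any LLM would do:

```python
# pip install openai  -- assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

def summarize_for_index(description: str) -> str:
    # Ask an LLM for a one-sentence summary to store alongside the document.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this book description in one sentence."},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content

summary = summarize_for_index(
    "A twenty-four-hour account of a heart transplant, moving from grief to hope."
)
# The summary (and/or extracted entities) can be embedded and indexed
# next to the original description to make it more discoverable.
print(summary)
```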
By combining generative AI with vector search, we can create more intelligent and personalized search experiences, improve content discovery, and automate content creation tasks.
Architect's Perspective
For an architect of Digital Experience Platform (DXP) and e-commerce implementations, choosing an appropriate search engine is extremely important. Based on the customer's requirements, we need to identify whether Semantic Search will play an important role in conversion, and consider the type of search engine carefully when choosing a platform for the implementation.