What is a Vector Database?


Imagine you are at a library with hundreds of thousands of books. Instead of searching by exact title or keyword, you want to find all the books that are about the same idea as your question.
Traditional databases can answer: “Find me books where the title contains ‘blockchain’.”
But what if you ask: “Show me books that explain how money moves without banks”? A keyword match might miss it.

This is where vectors come in.

AI models can convert text, images, or audio into vectors, which are basically long lists of numbers that capture meaning. Two pieces of content with similar meaning will have vectors that are close together in this high-dimensional number space.

A vector database is simply a database built to store these vectors and quickly find the ones most similar to your query.

The best way to think about it: Google Maps for ideas → instead of the distance between cafés, it’s the distance between meanings.

Embedding Models

At the heart of every vector database is an embedding model. An embedding model takes an input like text, an image, or audio and converts it into a vector: a long list of numbers that captures its meaning.

Think of it as translation: just like Google Translate converts English into German, an embedding model converts human language into the “language of vectors.”

There are different kinds of embedding models based on the use case of the particular app. To name a few:

  • Text Embeddings —> semantic search, chatbots, classification, RAG.

  • Image Embeddings —> “find similar images,” cross-modal search.

  • Multimodal Embeddings —> “find images that match a text description,” “align video with captions.”

  • Domain-Specific Embeddings —> specialized for legal, medical, financial, or code data.

There are many open-source models, and choosing one is among the most important decisions to make early in the product development lifecycle.
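To make this concrete, here is a minimal sketch of turning text into vectors. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model (one example choice, which produces 384-dimensional vectors); any embedding model follows the same pattern.

from sentence_transformers import SentenceTransformer

# Load an open-source embedding model (an example choice, not the only one)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Texts with similar meaning get vectors that are close together
vectors = model.encode([
    "How does money move without banks?",
    "A book about blockchain and decentralized payments",
])
print(vectors.shape)  # (2, 384): two texts, 384 numbers each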

Embedding Dimensions

The length of the vector depends on the model you choose. For example, some models produce a 1536-dimensional vector, which means:

  • Every “piece of text” you pass in is turned into an array of 1536 numbers.

  • Each number captures a tiny piece of information about the meaning of that text (kind of like “semantic ingredients”).

  • Together, these 1536 numbers form a position in a 1536-dimensional space.

Imagine describing a fruit:

  • In real life, you might use 3 dimensions (color, size, sweetness).

  • So “banana” might be (yellow=0.9, size=0.6, sweetness=0.8).

Now, instead of 3 traits, an embedding model uses 1536 traits. They are not as intuitive as “color” or “size”; they are abstract semantic features learned from data. But the idea is the same: each text becomes a long numeric fingerprint.
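As a toy illustration of this idea (the trait values below are made up), here is the fruit example in code:

import numpy as np

# Toy 3-dimensional "embeddings": (yellow, size, sweetness)
banana = np.array([0.9, 0.6, 0.8])
mango  = np.array([0.7, 0.7, 0.9])
apple  = np.array([0.2, 0.5, 0.6])

# Fruits with similar traits end up closer together in this space
print(np.linalg.norm(banana - mango))  # ~0.24 (close)
print(np.linalg.norm(banana - apple))  # ~0.73 (farther)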

Isn’t 1536 a very long array for a piece of text? Yes, but more dimensions also mean a richer representation. With 1536 numbers, the model can capture subtle differences in meaning (e.g., “bank” as a financial institution vs. the “bank” of a river).

Distance Metrics

Okay, so we have our embeddings, but how does the database actually know which ones are closest to your question? Think of a vector database like Google Maps for ideas: to find the “nearest” concepts, it needs a way to measure distance, not in kilometers but in terms of meaning. These measurement rules are called distance metrics.

There are a few common distance metrics that vector DBs use to find relevant results.

  • Cosine Similarity

    Cosine similarity cares about the direction of meaning, not the length of the vector.

  • Euclidean Distance

    Euclidean distance measures how far apart two points are in the vector space.

Let’s say there are three users. The 1st user has watched 2 comedy and 2 action movies, the 2nd user has watched 20 comedy and 20 action movies, and the 3rd user has watched 10 comedy and 2 action movies.

If you use cosine similarity, the 1st and 2nd users are more similar compared to the 1st and 3rd. If you use Euclidean distance, the 1st and 3rd users are more similar. Cosine similarity cares about taste proportions; Euclidean distance cares about how many movies were watched.
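Here is that example in code, a small sketch using NumPy with each user represented as a (comedy, action) watch-count vector:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# (comedy, action) watch counts for the three users above
u1 = np.array([2, 2])
u2 = np.array([20, 20])
u3 = np.array([10, 2])

print(cosine_similarity(u1, u2))   # 1.0  -> identical taste proportions
print(cosine_similarity(u1, u3))   # ~0.83
print(euclidean_distance(u1, u2))  # ~25.5
print(euclidean_distance(u1, u3))  # 8.0  -> closer in raw counts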

Metadata Filtering

When you store embeddings in a vector database, you don’t just store the vector. You usually also attach metadata: extra information about where the text came from, who owns it, when it was added, etc.

Metadata filtering means when searching for similar vectors, you can also apply conditions on this extra information. It’s like saying: “Find me the most relevant results, but only from project X, written in English, and after 2023.”

Without metadata filters, you would risk retrieving semantically similar but irrelevant results.

Why It Matters

  • Keeps results contextual and secure (e.g., tenant-based access).

  • Reduces noise (e.g., don’t show outdated info).

  • Saves latency (search smaller candidate pool).

Let’s consider an e-commerce example where a record stored in the vector DB looks like this:

{
  "id": "p_902",
  "vector": [...],
  "text": "Leather sneakers with memory foam soles",
  "metadata": {
    "category": "shoes",
    "price": 120,
    "brand": "Nike",
    "gender": "men"
  }
}

A user queries “comfortable men’s shoes”. The vector search finds the sneakers as semantically relevant, and the filter gender = “men” is applied on top. The user sees only men’s shoes, not irrelevant women’s high heels.
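As a sketch of how this looks in practice, here is the example using ChromaDB (one vector DB chosen for illustration; most vector databases expose a similar “query plus where-filter” API):

import chromadb

client = chromadb.Client()
collection = client.create_collection("products")

# Store the document; Chroma embeds the text with its default model
collection.add(
    ids=["p_902"],
    documents=["Leather sneakers with memory foam soles"],
    metadatas=[{"category": "shoes", "price": 120, "brand": "Nike", "gender": "men"}],
)

# Semantic query plus a metadata filter: only men's shoes are candidates
results = collection.query(
    query_texts=["comfortable men's shoes"],
    n_results=5,
    where={"gender": "men"},
)
print(results["documents"])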

Hybrid Search

When you search in a vector database, you can do it in two ways:

  1. Keyword search (sparse search)

    • Looks for exact words that match.

    • Great for things like names, IDs, code, or very specific terms.

    • Example: If you search for “iPhone 16 Pro”, keyword search will catch documents with the exact phrase.

  2. Vector search (dense search)

    • Looks for meaning, not exact words.

    • Great for natural language queries.

    • Example: If you search for “latest Apple smartphone”, vector search can still match documents about “iPhone 16 Pro” even if those words aren’t used.

The problem: keyword search alone misses documents when the wording differs, and vector search alone sometimes misses exact matches (like product codes or rare keywords). Hybrid search combines both methods. It looks at exact keywords and semantic meaning, then merges the results.
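One common way to merge the two result lists is Reciprocal Rank Fusion (RRF). The sketch below uses made-up document IDs; documents that rank well in both lists rise to the top:

def rrf_merge(rankings, k=60):
    """Merge ranked lists of doc IDs (best first) with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_7", "doc_2", "doc_9"]  # e.g., from keyword (BM25) search
vector_hits  = ["doc_2", "doc_5", "doc_7"]  # e.g., from vector search

print(rrf_merge([keyword_hits, vector_hits]))
# ['doc_2', 'doc_7', 'doc_5', 'doc_9'] -- both methods agree on doc_2 and doc_7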

How Vector Databases Stay Fast

If you had 10 documents, you could brute-force compare your query embedding to each one; not efficient, but still pretty fast. But what if you had 100 million vectors?

Doing “compare with every vector” would be too slow and too expensive. That’s why vector databases use Approximate Nearest Neighbor (ANN) search. Instead of scanning everything, they use clever shortcuts to jump directly to the “neighborhood” where the answer probably is. It’s like finding the nearest Starbucks by checking your neighborhood first instead of the whole city. You sacrifice a tiny bit of accuracy for huge speed gains.
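As an illustration, here is a minimal ANN index built with Faiss and an HNSW graph (Faiss is one example library; hnswlib, Annoy, and others work similarly), using random vectors to stand in for real embeddings:

import numpy as np
import faiss

d = 128  # embedding dimension
xb = np.random.random((100_000, d)).astype("float32")  # 100k stand-in embeddings

index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph index, 32 neighbors per node
index.add(xb)

query = np.random.random((1, d)).astype("float32")
distances, ids = index.search(query, 5)  # approximate top-5, no full scan of all 100k
print(ids)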
