Introduction to Vector Database

Cenz Wong

In a world flooded with text, images, videos, and user interactions, we’re moving beyond just searching with keywords. Today, many systems — from AI chatbots to recommendation engines — rely on something deeper: searching by meaning.

Enter the vector database.

While it sounds like an advanced AI concept, it’s rooted in something surprisingly old-school: Word2Vec — a model introduced in 2013. In this post, we’ll explore how vector databases work, why they matter, and how they’re useful even beyond AI.

🧠 From Word2Vec to Vector Thinking

Back in 2013, Google introduced Word2Vec, a model that transformed how we process and compare words. Instead of relying on text matching, it represented each word as a vector — a list of numbers — based on the context in which it appears.

This had a profound result: words with similar meanings ended up close together in the vector space.

Example:

    import gensim.downloader

    wv = gensim.downloader.load('word2vec-google-news-300')

  • "king" - "man" + "woman" ≈ "queen"

        wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
        # [('queen', 0.7118193507194519)]

  • "cat" and "dog" were closer in the vector space than "cat" and "car"

        wv.similarity("cat", "dog")
        # 0.76094574
        wv.similarity("cat", "car")
        # 0.21528184


These vectors captured semantic relationships, which opened the door for search by similarity — not just by keywords.
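
For intuition, the similarity scores in the example are cosine similarities: the cosine of the angle between two vectors. Here is a minimal sketch in plain Python. The 3-dimensional vectors are made-up values chosen only for illustration (real Word2Vec vectors from this model have 300 dimensions), and `cosine_similarity` is a helper written for this post, not a gensim function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not real Word2Vec output.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these toy values, "cat" scores higher against "dog" than against "car", which mirrors the pattern in the real model.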


🧮 What Is a Vector Database?

A vector database is a system built to store and search high-dimensional vectors efficiently.

These vectors might come from:

  • Text embeddings (e.g., Word2Vec, BERT, OpenAI)

  • Image features

  • Audio fingerprints

  • Structured numeric features (e.g., coordinates or behavioral scores)

In short, a vector database helps you find "what's most similar" rather than just "what matches".

To do this at scale, it typically relies on Approximate Nearest Neighbor (ANN) search, which trades a little accuracy for speed so that the most similar vectors can be found quickly, even among millions of items.
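
To make the idea of approximate search concrete, here is a small sketch of one simple ANN technique: locality-sensitive hashing with random hyperplanes. This is an illustration only, not how any particular vector database is implemented. The data is random, and all function names (`make_hyperplanes`, `lsh_key`, `build_index`, `candidates`) are hypothetical helpers.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplanes (as normal vectors) through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_key(vector, hyperplanes):
    """Hash a vector to a tuple of bits: which side of each hyperplane it is on."""
    return tuple(
        1 if sum(v * p for v, p in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets that share the same hash key."""
    buckets = {}
    for vec_id, vec in vectors.items():
        buckets.setdefault(lsh_key(vec, hyperplanes), []).append(vec_id)
    return buckets

def candidates(query, buckets, hyperplanes):
    """Return only the ids in the query's bucket, instead of scanning everything."""
    return buckets.get(lsh_key(query, hyperplanes), [])

# Hypothetical data: 1,000 random 8-dimensional vectors.
rng = random.Random(42)
vectors = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(1000)}

planes = make_hyperplanes(num_planes=6, dim=8)
index = build_index(vectors, planes)

query = vectors[7]
shortlist = candidates(query, index, planes)
print(f"scanning {len(shortlist)} of {len(vectors)} vectors")
```

Vectors pointing in similar directions tend to land in the same bucket, so only the shortlist needs an exact similarity check. The search is approximate because a true neighbor can fall into a different bucket. Production systems generally use more refined index structures, but the trade-off is the same.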

🔧 How Vector Search Works (Simplified)

flowchart LR
    %% Step 1: Data Encoding
    A["Input Data (Text, Image, etc.)"]
    B["Encode Data into Vectors (using embedding model)"]

    %% Step 2: Vector Storage
    C["Store Vectors with Metadata (e.g. name, link)"]

    %% Step 3: User Query Encoding
    D["User Query (Text or Image)"]
    E["Encode Query into Vector"]

    %% Step 4: Similarity Search
    F["Compute Similarity (based on cosine similarity, Euclidean distance)"]
    G["Find Nearest Vectors"]

    %% Step 5: Return Results
    H["Return Most Relevant Results to User"]

    %% Flow connections
    A --> B --> C
    D --> E --> F --> G --> H

  1. Convert data into vectors
    Use a model (like Word2Vec or an image encoder) to generate a vector

  2. Store the vectors
    Each vector is stored with its associated metadata (e.g. product name, link, image)

  3. Embed the query
    A user’s input is also converted into a vector using the same model

  4. Search by similarity
    The database finds the “nearest” vectors — based on cosine similarity, Euclidean distance, etc.

  5. Return results
    You retrieve the most relevant results based on the similarity scores
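
The five steps above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: `embed` is a stand-in for a real embedding model (it only counts words from a small, made-up vocabulary), the "database" is a list in memory, and the search is brute force rather than an ANN index.

```python
import math

VOCAB = ["red", "blue", "shoes", "shirt", "running", "cotton"]  # hypothetical vocabulary

def embed(text):
    """Toy stand-in for an embedding model: word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Steps 1-2: encode each item and store the vector with its metadata.
catalog = ["red running shoes", "blue cotton shirt", "blue running shoes"]
store = [{"name": name, "vector": embed(name)} for name in catalog]

def search(query, top_k=2):
    # Step 3: embed the query with the same model.
    query_vector = embed(query)
    # Step 4: score every stored vector (brute force, no ANN index).
    scored = [(cosine_similarity(query_vector, item["vector"]), item["name"])
              for item in store]
    # Step 5: return the best matches first.
    scored.sort(reverse=True)
    return scored[:top_k]

results = search("red shoes")
for score, name in results:
    print(f"{score:.3f}  {name}")
```

Here the query "red shoes" ranks "red running shoes" first and "blue running shoes" second, even though neither name matches the query exactly.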

📍 What About Coordinates and Numeric Vectors?

Not all vectors come from AI models. In fact, vector search works just as well with structured numerical features — like coordinates, sensor readings, or user profile data.

Examples:

  • [latitude, longitude] — for nearby location search

  • [age, income, spending_score] — for customer segmentation

  • [temp, humidity, pressure] — for anomaly detection in IoT

These vectors aren’t “semantic,” but they’re still searchable by similarity. This makes vector databases useful in fields like logistics, retail analytics, health tech, and mobility.

| Use Case | Vector Example | What It Does |
| --- | --- | --- |
| Store locator | [lat, lon] | Finds closest store to a customer |
| Customer clustering | [age, income, loyalty] | Groups similar shoppers |
| Sensor anomaly detection | [temp, humidity, voltage] | Flags abnormal behavior in IoT data |
| Driving pattern analysis | [lat, lon, speed] | Compares vehicle routes and habits |
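
One practical caveat with numeric vectors is scale: a feature measured in tens of thousands (income) can swamp one measured in tens (age) when distances are computed. The sketch below, with made-up customers and a hypothetical `min_max_scale` helper, shows how rescaling each feature to [0, 1] changes which customer counts as "most similar".

```python
import math

# Hypothetical customers: [age, income, spending_score]
customers = {
    "alice": [25, 40_000, 80],
    "bob":   [27, 42_000, 75],
    "carol": [60, 41_000, 20],
}

def min_max_scale(rows):
    """Rescale each column to the range [0, 1] so no feature dominates."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(value - lo) / (hi - lo) if hi > lo else 0.0
         for value, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

names = list(customers)
scaled = dict(zip(names, min_max_scale([customers[n] for n in names])))

def nearest(name, table):
    """Return the other customer with the smallest Euclidean distance."""
    others = [n for n in table if n != name]
    return min(others, key=lambda n: math.dist(table[name], table[n]))

print("raw features:   ", nearest("alice", customers))
print("scaled features:", nearest("alice", scaled))
```

On the raw numbers, the income gap dominates and Carol looks closest to Alice. After scaling, Bob, who is close in age and spending score, becomes the nearest neighbor.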

🧰 Complementing Traditional Databases

It’s important to note: vector databases aren’t here to replace your current databases.

Instead, they work alongside relational (SQL), NoSQL, and search systems like Elasticsearch.

Example:

  • Store product info (ID, name, price) in PostgreSQL

  • Store image or description embeddings in a vector DB

  • Use both together for a smarter, hybrid search system

You get the best of both worlds: structured filtering + fuzzy semantic matching.
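
A rough sketch of this hybrid pattern, assuming a relational table for product info and a separate embedding store keyed by product ID. SQLite (from the Python standard library) stands in for PostgreSQL here, a plain dictionary stands in for the vector DB, and the embedding values and the `hybrid_search` function are hypothetical.

```python
import math
import sqlite3

# Structured data lives in a relational table (SQLite stands in for PostgreSQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
db.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "trail shoes", 120.0), (2, "road shoes", 60.0), (3, "rain jacket", 55.0)],
)

# Embeddings live in a separate store keyed by product id (hypothetical values).
embeddings = {
    1: [0.9, 0.1],
    2: [0.8, 0.2],
    3: [0.1, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def hybrid_search(query_vector, max_price):
    # Structured filter: the exact condition is handled by SQL.
    rows = db.execute(
        "SELECT id, name FROM products WHERE price <= ?", (max_price,)
    ).fetchall()
    # Semantic ranking: order the surviving rows by vector similarity.
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, embeddings[row[0]]),
        reverse=True,
    )
    return [name for _, name in ranked]

print(hybrid_search(query_vector=[1.0, 0.0], max_price=100.0))
```

The price filter removes "trail shoes" even though its embedding is the closest to the query, and the remaining products are ranked by similarity. Filtering first and ranking second is only one possible design; some systems apply the filter inside the vector index instead.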

💡 Real-World Applications

Vector databases are already powering many tools and features you use today:

| Application Area | Example Use Case |
| --- | --- |
| Semantic search | Search documents by meaning, not keywords |
| Product recommendations | Suggest similar items based on user clicks |
| Chatbot memory | Retrieve relevant past conversations |
| Image similarity search | Find clothes or objects that look alike |
| Music/audio matching | Group songs with similar tone or mood |
| Location-based filtering | Find nearest neighbors with [lat, lon] |

🧱 Tools You Can Use

There are a variety of tools designed for vector storage and similarity search:

| Tool | Highlights |
| --- | --- |
| LanceDB | Developer-friendly, fast, open-source |
| Weaviate | Hybrid search support, built-in ML features |
| Pinecone | Fully managed, scalable, production-grade |
| Milvus | High-performance, supports billions of vectors |
| FAISS | Meta’s open-source library for vector search |
| ChromaDB | Open-source search and retrieval database for AI applications |
| Qdrant | Open-source, optimized for neural search |

Some general-purpose databases, such as Elasticsearch and PostgreSQL (through the pgvector extension), also support vector search, and the three main cloud providers offer vector search services as well.

🚀 Final Thoughts

Since the early days of Word2Vec, we’ve seen that similarity can matter as much as exact matches. Today, powerful models generate embeddings for text, images, and audio, and numeric features such as coordinates can be used as vectors directly. Vector databases provide the infrastructure to search all of them efficiently.

Whether you're building:

  • A semantic search engine

  • A recommendation system

  • A chatbot with memory

  • Or simply grouping users or locations by similarity

...vector databases are becoming a core part of the modern AI-powered stack.

The future of search is not just about finding exact matches — it’s about understanding what’s truly similar.


If you have any questions or feedback, feel free to leave a comment below or send me a message on LinkedIn! 📩 Happy coding! 😊


Written by Cenz Wong

Data Engineer @ ASDA | MSc Big Data Technology @ HKUST | Technology Enthusiast