How Vector Databases Power AI Assistants

Shivi Mishra
8 min read

1. Introduction: Why Should You Care?

Ever wondered how AI assistants like ChatGPT give context-aware answers, as if they “remember” what you asked?

The magic behind it? Vector Databases.

I’m currently building an AI assistant for users on [TUF+], our Ed-Tech platform. Its job? To help users solve coding problems better by understanding their queries and pulling the right chunk of content from our 400+ problems.

In this guide, I’ll explain the entire flow, from converting plain text into “semantic fingerprints” to storing and retrieving them efficiently.


2. From Text to Meaning: Why Structure Matters

Okay, so here’s the deal.

Raw text, like the editorial you see for that DP problem, is just... well, text.
Humans get it. Machines? Not so much.

Computers don’t “read” like we do. They don’t feel the meaning behind “brute force sucks.”
They only understand numbers. Like this:

"Brute force approach..." → [0.12, -0.38, 0.44, ..., 0.03] // Total: 1536 numbers (dimensions)

This magical number-list-thing is called a vector.

And this whole process of turning text into vectors?
That’s called embedding. (Remember this word, it’s everywhere.)

“Vector is just a semantic fingerprint.” (Okay, I know “buzzword alert” but I’ll break it down for you.)

But before that, you might think:

😮‍💨 Why Go Through This Mathy Pain?

Because once you convert content into vectors, you can search by meaning and not keywords.

Imagine typing:

“Can you explain the optimized part again?”

Instead of keyword-matching, your assistant understands the intent and pulls up the exact part of your editorial that explains the optimized logic.

That’s called semantic search. And it powers modern AI assistants.
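Under the hood, “searching by meaning” boils down to comparing vectors: vectors that point in similar directions carry similar meanings. Here’s a tiny sketch of the usual scoring function (cosine similarity), with made-up 3-dimensional vectors standing in for the real 1536-dimensional ones:

// Cosine similarity: how "aligned" two vectors are (1 = same direction, ~0 = unrelated)
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 3-dimensional vectors (real embeddings have 1536 dimensions)
const queryVector = [0.9, 0.1, 0.0];       // "explain the optimized part again"
const bruteForceChunk = [0.2, 0.8, 0.1];   // vector for the brute-force section
const optimizedChunk = [0.85, 0.15, 0.05]; // vector for the optimized section

console.log(cosineSimilarity(queryVector, bruteForceChunk)); // lower score
console.log(cosineSimilarity(queryVector, optimizedChunk));  // higher score → best match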


3. How Vector Embeddings Work (Simplified)

Now, for our use case, I have a set of 400+ problems (editorials) that I want to store in the vector DB so I can later retrieve context related to a specific problem by its ID.

But before I explain how this works, let’s go over a few questions I had (and you probably should too) before asking an AI to build projects for you, haha!!

🧠 What happens when you store an editorial in Pinecone?

  • You are not storing the editorial text itself.

  • You’re storing its embedding → a fixed-length vector that represents its meaning.

  • For example, with OpenAI’s text-embedding-ada-002 model:

    • Input: "This is a 2000-word editorial about segment trees."

    • Output: A vector of 1536 float values → e.g. [0.021, -0.013, 0.076, ...]

    • This vector captures the semantic meaning of the editorial, but not the full content.

Therefore:

  • 📦 What Pinecone stores =

    • The vector

    • Any optional metadata you add → e.g., { id: 123, title: "Segment Tree", description: "..." }

    •       {
              "id": "problem_123",
              "values": [0.12, -0.07, 0.91, ..., 0.04],
              "metadata": {
                "title": "Optimized DP Approach",
                "difficulty": "Medium"
              }
            }
      
    • It does not store the full editorial text unless you explicitly add it as metadata.
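For the curious, here’s roughly what producing that embedding looks like with OpenAI’s Node SDK; a minimal sketch, with a placeholder input and no error handling:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const editorialChunk = "This is a 2000-word editorial about segment trees...";

const response = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: editorialChunk,
});

const embedding = response.data[0].embedding; // array of 1536 floats
console.log(embedding.length); // 1536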

📝 Note: We chop big editorials into small chunks (think ~300–500 words max) so each chunk fits neatly within token limits.

But bro, what even are chunks? (Exactly how I felt when GPT urged me to do so!) For now, think of chunking as splitting the text into smaller blocks so you don’t send the complete data at once. Don’t worry, I’ll explain it in detail in the next section!

🔁 What happens when a user sends a query?

user: "What is the time complexity of building a segment tree?"

Here’s what happens internally:

  1. You create a vector embedding of this query using the same OpenAI model (text-embedding-ada-002)

  2. You use this query vector to search Pinecone (e.g., topK: 3) for the most similar stored vectors.

  3. Pinecone returns:

    • The topK most similar vectors

    • Their associated metadata, which is where you can include the actual editorial/problem text

📌 At this point, the original query vector and the problem vectors do not go to OpenAI. Only the retrieved metadata is passed.
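Put together, the retrieval step looks roughly like this. It’s a sketch that assumes the current Pinecone Node SDK, a made-up index name, and the openai client from the earlier snippet:

import { Pinecone } from "@pinecone-database/pinecone";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const index = pinecone.index("tuf-editorials"); // hypothetical index name

// 1. Embed the user's query with the same model used for the editorials
const queryEmbedding = await openai.embeddings.create({
  model: "text-embedding-ada-002",
  input: "What is the time complexity of building a segment tree?",
});

// 2. Ask Pinecone for the 3 most similar stored vectors
const results = await index.query({
  vector: queryEmbedding.data[0].embedding,
  topK: 3,
  includeMetadata: true, // so the editorial text/title comes back too
});

// results.matches → [{ id, score, metadata: { title, text, ... } }, ...]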

🤔 Then what goes to OpenAI?

Let’s introduce another powerful concept here: Retrieval-Augmented Generation (RAG).

Don’t worry, it’s simpler than it sounds, especially if you’ve made it this far →
In essence, it works by adding the retrieved information (called a chunk) as context for OpenAI to process.

This allows the model to generate responses that are tailored to specific queries, even if they weren’t part of its original training data.

  • Example prompt format:
Context: <retrieved chunk>
User: Can you explain the brute force approach?

Now that you have the top 3 editorial/problem contexts (from Pinecone), you:

  • Format them into a prompt like:
Context 1: Segment Tree editorial - explains how to build and query with examples.
Context 2: Explains how prefix sums differ from segment trees.
Context 3: A problem where segment tree is used for range max queries.

User query: What is the time complexity of building a segment tree?
  • You then send this full string as a prompt to OpenAI’s chat/completion API.
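Here’s a rough sketch of that final call, continuing from the results returned by Pinecone above; the model name and system prompt are just illustrative:

// Build the prompt from the retrieved chunks (stored in metadata.text) plus the user's query
const context = results.matches
  .map((match, i) => `Context ${i + 1}: ${match.metadata.text}`)
  .join("\n");

const completion = await openai.chat.completions.create({
  model: "gpt-4o", // any chat model works; use whatever you run in production
  messages: [
    { role: "system", content: "Answer using only the provided context." },
    {
      role: "user",
      content: `${context}\n\nUser query: What is the time complexity of building a segment tree?`,
    },
  ],
});

console.log(completion.choices[0].message.content);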

4. Why We Use Chunking

It’s question time again!

🧠 What Does a “Chunk” Mean in Vector Embeddings?

LLMs like GPT-4 and embedding models have token limits (roughly 8K input tokens for text-embedding-ada-002 and text-embedding-3-large alike; the larger model differs in output dimensions, not input length).

Feeding a huge editorial (2,000+ words) into one embedding call can:

  • Reduce vector quality (less semantic clarity).

  • Exceed token limits or become very costly.

  • Make semantic search worse because large blobs are hard to match accurately → which defeats the whole purpose of semantic search!

🧩 “Chunking” = Breaking long text into smaller sections

Instead of embedding this full editorial:

<p>This problem is about finding the longest palindromic substring...</p>
<p>Brute-force approach: Try every substring...</p>
<p>Optimized approach uses expand around center...</p>

You break it into clean text blocks like this:

const chunks = [
  "This problem is about finding the longest palindromic substring...",
  "Brute-force approach: Try every substring and check if it's a palindrome...",
  "Optimized approach uses 'expand around center' with O(n^2) time...",
  "Dynamic Programming approach: Use a 2D dp table to mark palindromes..."
];

Each chunk becomes an individual embedding and is stored in Pinecone with metadata like problem ID, tag, and type (brute, optimized, etc.).

Why Chunk?

Without Chunking | With Chunking
Entire editorial stored as one vector | 3–5 vectors for better retrieval granularity
Harder to match small queries | Easier to return only the most relevant parts
More tokens to OpenAI later | Fewer tokens = less cost, faster reply

How to Chunk Text in Practice?

You can chunk (see the sketch after this list):

  • Based on headings (like “Brute Force”, “Optimized Approach”)

  • Every 2–4 sentences

  • Limit per chunk: ~300–400 tokens (~150–200 words)
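Here’s a rough sketch of a simple chunker along those lines: it splits on blank lines (so headed sections stay together) and caps chunk size, using a character limit as a stand-in for a proper token count:

// ~400 tokens is roughly 1,600 characters; tune this for your content
const MAX_CHARS = 1600;

function chunkEditorial(plainText) {
  const sections = plainText.split(/\n\s*\n/); // paragraphs / headed sections
  const chunks = [];
  let current = "";

  for (const section of sections) {
    // Start a new chunk once adding this section would blow past the limit
    if (current.length > 0 && (current + section).length > MAX_CHARS) {
      chunks.push(current.trim());
      current = "";
    }
    current += section + "\n\n";
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}

// Usage: const chunks = chunkEditorial(editorialText);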

📦 How to Store in Pinecone?

For each chunk, create an embedding and store it like this:

await index.upsert([
  {
    id: `problem-1234-chunk-1`,
    values: embedding1, // Float32Array of the embedding
    metadata: {
      problemId: '1234',
      section: 'Brute Force',
      chunk: 1,
      text: chunk1,
    }
  },
  {
    id: `problem-1234-chunk-2`,
    values: embedding2,
    metadata: {
      problemId: '1234',
      section: 'Optimized',
      chunk: 2,
      text: chunk2,
    }
  }
]);

5. What’s a Vector’s Size?

Pinecone charges based on two key resources:

Resource | Description
Vector Storage | Number of vectors × dimension × size (typically float32)
Index RAM | Amount of RAM needed to load the index into memory

What’s an index?
Think of it as a folder where your vectors live. It helps Pinecone group and search stuff efficiently.

Let’s get nerdy for a second:

  • Embedding model: text-embedding-ada-002

  • Vector dimension: 1536

  • Each number: 4 bytes (float32)

So:

1 vector = 1536 × 4 = 6,144 bytes ≈ 6 KB  
1000 vectors ≈ 6 MB  
100,000 vectors ≈ 600 MB

So vector storage is very lightweight → Pinecone is optimized for storing millions of vectors efficiently.

Conclusion: You’re unlikely to run into storage issues unless you're storing millions of problems.

⌛ How Long Is Data Stored in Pinecone?

As long as you want it there.
It’s persistent; unless you delete it, it stays.

💸 About Tokens (And Why Vectors Help)

Let’s say your editorial is 2000 tokens long.

  • Without vectors, you’d need to send everything to OpenAI every time. Super expensive.

  • With Pinecone, you only send the most relevant chunk (say ~300 tokens).

That’s 5–6× less usage per query. Much cheaper, faster, smarter.

Just to clarify → the vector doesn’t “compress” the full content. It’s just a shortcut for finding meaning, not a storage format.


✅ Conclusion

The vector is not a compression or summary of the editorial.
It’s just a semantic fingerprint.

  • You cannot reconstruct the full editorial from the vector.

  • The vector is only used for similarity search.

  • You need to store the actual editorial in metadata (or elsewhere) and fetch it along with the vector.

“Vectors are the bridge between human language and machine understanding.

They don’t just store text; they store meaning. And that’s the secret to building smarter AI apps.”

💡 Final Thoughts: Building for Scale

You’re building a system that will grow; so consider:

  • Storage costs (millions of chunks?)

  • Metadata design (problem ID, tags, section, user’s preferred coding language)

  • Retrieval quality (chunk size, summary accuracy)

  • Token limits for OpenAI (stay under 4K–8K)

Your vector DB (e.g., Pinecone) becomes the brain of your AI assistant: it helps the model “know” what matters without reading everything every time.

Aghhh, enough writing, gotta go work on improving the assistant’s accuracy now!!! 🫡
