Hypothetical Document Embeddings (HyDE RAG)

Suraj Patel

Introduction

HyDE, or Hypothetical Document Embeddings, is a method in Retrieval‑Augmented Generation (RAG) that boosts the relevance of search results by having the model “think aloud” before retrieving documents.

How it works:

  1. Rather than querying the Vector Database with the user’s original question, the LLM first drafts a hypothetical answer—a concise, imagined response to the query.

  2. That hypothetical answer is transformed into embeddings and used to search the database.

  3. Because the imagined answer naturally contains terms and phrases more closely aligned with the stored content, the retrieved documents tend to be more accurate and contextually on‑point.

Note: HyDE requires a powerful, up‑to‑date LLM capable of drafting meaningful hypothetical answers—if the model doesn’t know the topic, it can’t generate a useful hypothesis, and the retrieval step will fail.
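
The three steps above fit in a few lines of Python. Here is a minimal conceptual sketch; the three helper functions are placeholders I made up to stand in for the real LLM, embedding model, and vector database that get wired up later in this post:

#hyde_sketch.py -- conceptual sketch only; the helpers are placeholders
def generate_hypothetical_answer(query: str) -> str:
    # placeholder: in the real pipeline an LLM drafts this answer
    return f"A detailed, imagined answer to: {query}"

def embed(text: str) -> list[float]:
    # placeholder: in the real pipeline an embedding model produces this vector
    return [float(len(word)) for word in text.split()]

def vector_search(vector: list[float], top_k: int = 4) -> list[str]:
    # placeholder: in the real pipeline the vector DB returns the closest chunks
    return [f"chunk_{i}" for i in range(top_k)]

def hyde_retrieve(user_query: str) -> list[str]:
    hypothetical_answer = generate_hypothetical_answer(user_query)  # step 1
    query_vector = embed(hypothetical_answer)                       # step 2
    return vector_search(query_vector)                              # step 3

print(hyde_retrieve("What is HyDE?"))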

Pipeline Overview

  1. Before You Begin

  2. 📥 Ingest Data and ✂️ Chunk Text

  3. 🔢 Generate Embeddings and 💾 Store in Vector DB

  4. 🤖 Generate Hypothetical Answer (LLM)

  5. 🔍 Retrieve Relevant Chunks based on Hypothetical Answer

  1. Before You Begin

Before installing any packages, create a virtual environment:

# 1. Create a virtual environment named .venv
python -m venv .venv

# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat

  2. 📥 Ingest Data and ✂️ Chunk Text

  • Gather all your source materials—PDFs, text documents, websites, and other knowledge repositories.

  • Break the content into manageable segments (around 500–1,000 tokens each).

  • This chunking boosts retrieval efficiency and keeps the model’s context window from being overloaded.

  • chunk_size = 1000 – each slice of text will be at most 1,000 characters long (RecursiveCharacterTextSplitter counts characters by default, not tokens). chunk_overlap = 200 – each new slice repeats the last 200 characters of the previous slice so context flows smoothly across chunks. A short demo after the loader code below shows this overlap in action.

To do this, we need to install the packages langchain_community, pypdf, and langchain-text-splitters.
Run the following command in the terminal:

pip install langchain_community pypdf langchain-text-splitters

#loader.py
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Loading process: read the PDF that sits next to this script
pdf_path = Path(__file__).parent / "file_name.extension_type"

loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()

# Chunk process: split into ~1,000-character pieces with 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
split_doc = text_splitter.split_documents(documents=docs)
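
To see chunk_size and chunk_overlap in action before pointing the splitter at a real PDF, you can split a short toy string (a quick sketch; the sample text and the tiny sizes are made up for illustration):

#chunk_demo.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Tiny sizes so the overlap is easy to see in the output
demo_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)

sample = "HyDE first drafts a hypothetical answer, then embeds it and searches the vector database."
for i, chunk in enumerate(demo_splitter.split_text(sample)):
    print(i, repr(chunk))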

  3. 🔢 Generate Embeddings and 💾 Store in Vector DB

Each chunk is passed through an embedding model (e.g., text-embedding-ada-002) that turns it into a fixed-length vector in semantic space.

I'm using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can see all the available embedding integrations on the LangChain Embeddings page.

Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.

To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration package langchain-google-genai and the python-dotenv package.

pip install langchain-google-genai
pip install python-dotenv

#loader.py
import os

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load GOOGLE_API_KEY from the .env file into the environment
load_dotenv()

if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY not found. Add it to your .env file.")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
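
A quick sanity check (assuming your .env file contains a valid GOOGLE_API_KEY): embed a short string and look at the vector length.

#loader.py (optional check)
vector = embeddings.embed_query("What is HyDE?")
print(len(vector))  # typically 768 dimensions for models/text-embedding-004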

Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).

Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
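
"Nearest neighbor" here simply means the stored vectors whose direction is closest to the query vector, usually measured with cosine similarity. A tiny, self-contained illustration with made-up 3-dimensional vectors:

#cosine_demo.py -- toy illustration; real embeddings have hundreds of dimensions
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_about_hyde": [0.8, 0.2, 0.1],
    "doc_about_cooking": [0.0, 0.1, 0.9],
}

# The document whose vector points in almost the same direction scores highest
for name, vec in doc_vecs.items():
    print(name, round(cosine_similarity(query_vec, vec), 3))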

Here we're using the Qdrant vector database.
You can either install it directly on your system or run it in Docker; I'm using Docker in this example. Create a docker-compose.yml file with the following content:

services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"

To run this Docker Compose file, execute the following in the terminal:

docker compose -f docker-compose.yml up

Once the container is running, you can connect to Qdrant at http://localhost:6333.
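
Once langchain-qdrant is installed in the next step (it pulls in the qdrant-client package as a dependency), you can also verify the connection from Python. This check is optional:

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.get_collections())  # lists no collections on a fresh instance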

To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant

pip install langchain-qdrant

#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Passing documents=[] creates an empty collection first;
# add_documents then uploads the embedded chunks in a separate call
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain"
)
vector_store.add_documents(documents=split_doc)
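
Once add_documents finishes, a small optional check with the QdrantClient imported above confirms that the chunks actually landed in the collection:

#loader.py (optional check)
client = QdrantClient(url="http://localhost:6333")
print(client.count(collection_name="learning_langchain", exact=True))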

  4. 🤖 Generate Hypothetical Answer (LLM)

Here we pass the user query to the LLM to generate a detailed hypothetical answer.

To use the OpenAI client, you first need to install the openai package:

pip install openai

#main.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# OpenAI client pointed at Google's OpenAI-compatible endpoint,
# so the same client interface can call a Gemini model
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

def ai(message):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
    )
    return response.choices[0].message.content

system_prompt = """
    You are a helpful AI assistant who is specialized in resolving user queries.

    Answer in detail.
    You receive a question and you give an answer.
"""

query = input("> ")
message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query},
]
llm_answer = ai(message)

print("\nLLM Answer: ")
print(llm_answer)

  5. 🔍 Retrieve Relevant Chunks based on Hypothetical Answer

In this step, we leverage the LLM‑generated hypothetical answer to identify and retrieve the most relevant document chunks.

Python File main.py

#main.py
from retrieval import retrieve
relevant_chunk = retrieve(llm_answer)

Passing the LLM answer to retrieval.py:

#retrieval.py
import os

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore

def retrieve(query) -> list:
    # Load GOOGLE_API_KEY from the .env file into the environment
    load_dotenv()
    if "GOOGLE_API_KEY" not in os.environ:
        raise EnvironmentError("GOOGLE_API_KEY not found. Add it to your .env file.")

    # Must be the same embedding model used at ingestion time
    embedding = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004"
    )

    # Connect to the collection created in loader.py
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",
        embedding=embedding,
        url="http://localhost:6333",
    )

    relevant_chunks = retriever.similarity_search(
        query=query,
    )

    for doc in relevant_chunks:
        print("-------------------------")
        print("Page Content: ", doc.page_content)
        print("Page Number: ", doc.metadata.get("page"))
        print("-------------------------")

    return relevant_chunks

Executing the Code

Run the main.py file from the terminal (make sure you have run loader.py once so the Qdrant collection exists).
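
Assuming the virtual environment is active and the Qdrant container is up, the commands look like this (file names as used throughout this post):

python loader.py
python main.py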

Full Source Code

Grab everything from the link below:

https://github.com/SurajPatel04/genAI/tree/main/cohort/day5class/HyDe_Hypothetical_Document_Embeddings
