Hypothetical Document Embeddings (HyDE RAG)

Introduction
HyDE, or Hypothetical Document Embeddings, is a method in Retrieval‑Augmented Generation (RAG) that boosts the relevance of search results by having the model “think aloud” before retrieving documents.
How it works:
Rather than querying the Vector Database with the user’s original question, the LLM first drafts a hypothetical answer—a concise, imagined response to the query.
That hypothetical answer is transformed into embeddings and used to search the database.
Because the imagined answer naturally contains terms and phrases more closely aligned with the stored content, the retrieved documents tend to be more accurate and contextually on‑point.
Note: HyDE requires a capable, up-to-date LLM that can draft a meaningful hypothetical answer; if the model doesn't know the topic, it can't generate a useful hypothesis, and retrieval quality suffers.
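To make the flow concrete before walking through the full pipeline, here is a minimal sketch of the HyDE retrieval step in Python. The callables draft_answer, embed, and search are hypothetical placeholders for your LLM, embedding model, and vector database; they are not real library functions.
from typing import Callable, List, Sequence

def hyde_retrieve(
    user_query: str,
    draft_answer: Callable[[str], str],              # LLM call that writes a hypothetical answer
    embed: Callable[[str], Sequence[float]],         # embedding model
    search: Callable[[Sequence[float]], List[str]],  # vector DB nearest-neighbor lookup
) -> List[str]:
    # 1. Ask the LLM to draft a hypothetical answer to the query
    hypothetical_answer = draft_answer(user_query)
    # 2. Embed the hypothetical answer instead of the raw query
    answer_vector = embed(hypothetical_answer)
    # 3. Search the vector DB with that embedding
    return search(answer_vector)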
Pipeline Overview
Before You Begin
📥 Ingest Data and ✂️ Chunk Text
🔢 Generate Embeddings and 💾 Store in Vector DB
🤖 Generate Hypothetical Answer (LLM)
🔍 Retrieve Relevant Chunks based on Hypothetical Answer
Before You Begin
Before installing any packages, create and activate a virtual environment:
# 1. Create a virtual environment named .venv
python -m venv .venv
# 2. Activate it
# On macOS / Linux:
source .venv/bin/activate
# On Windows (PowerShell):
.venv\Scripts\Activate.ps1
# On Windows (Command Prompt):
.venv\Scripts\activate.bat
📥 Ingest Data and ✂️ Chunk Text
Gather all your source materials—PDFs, text documents, websites, and other knowledge repositories.
Break the content into manageable segments (around 500–1,000 tokens each).
This chunking boosts retrieval efficiency and keeps the model’s context window from being overloaded.
chunk_size = 1000 – each chunk will be at most 1,000 characters long (the splitter counts characters by default, not tokens).
chunk_overlap = 200 – each new chunk repeats the last 200 characters of the previous one so context flows smoothly across chunks.
To do this, we need to install the langchain_community and pypdf packages. Run the following command in the terminal:
pip install langchain_community pypdf
#loader.py
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Loading process: read the PDF into a list of Document objects (one per page)
pdf_path = Path(__file__).parent / "file_name.extension_type"  # replace with your PDF file
loader = PyPDFLoader(file_path=pdf_path)
docs = loader.load()

# Chunking process: split the pages into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
split_doc = text_splitter.split_documents(documents=docs)
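As a quick sanity check (my addition, not part of the original script), you can print how many pages were loaded and how many chunks the splitter produced:
# Optional: confirm the load and split worked as expected
print(f"Loaded {len(docs)} pages, produced {len(split_doc)} chunks")
print(split_doc[0].page_content[:200])  # preview the first chunk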
🔢 Generate Embeddings and 💾 Store in Vector DB
Each chunk is passed through an embedding model (e.g. text-embedding-ada-002) that turns it into a fixed-length vector in semantic space.
I'm using Google AI embeddings for this example, but you can use OpenAI embeddings instead. You can see all the available integrations on the LangChain Embeddings page.
Note: Create a .env file to store your Google API key, and use python-dotenv to load it into your Python script.
To use GoogleGenerativeAIEmbeddings and load_dotenv, you first need to install the integration packages langchain-google-genai and python-dotenv:
pip install langchain-google-genai
pip install python-dotenv
#loader.py
import os

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Load GOOGLE_API_KEY from the .env file into the environment
load_dotenv()
if "GOOGLE_API_KEY" not in os.environ:
    raise EnvironmentError("GOOGLE_API_KEY is missing; add it to your .env file")

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/text-embedding-004",
)
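If you want to verify the embedding model before ingesting everything (this check is an addition, not part of the original script), embed a test string and inspect the vector size:
# Optional: sanity-check the embedding model
test_vector = embeddings.embed_query("What is HyDE?")
print(len(test_vector))  # text-embedding-004 typically returns a 768-dimensional vector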
Those vectors, plus your chunk text and metadata, go into a specialized index (Pinecone, Qdrant, FAISS, etc.).
Why use a vector DB? It lets you do ultra‑fast approximate nearest‑neighbor searches over millions of vectors, usually in milliseconds.
Here we’re using the Qdrant vector database. You can either install it directly on your system or run it in Docker; I’m using Docker in this example:
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
To run this Docker Compose file, enter the following in the terminal:
docker compose -f docker-compose.yml up
Once the container is running, you can connect to Qdrant at http://localhost:6333.
To use QdrantVectorStore and QdrantClient, you first need to install the integration package langchain-qdrant:
pip install langchain-qdrant
#loader.py
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Create (or connect to) the collection, then add the chunks to it
vector_store = QdrantVectorStore.from_documents(
    documents=[],
    url="http://localhost:6333",
    embedding=embeddings,
    collection_name="learning_langchain",
)
vector_store.add_documents(documents=split_doc)
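As an optional check (not in the original walkthrough), you can use the QdrantClient imported above to confirm the collection exists and count the stored vectors:
# Optional: verify the collection and count the stored points
qdrant = QdrantClient(url="http://localhost:6333")
print(qdrant.get_collections())
print(qdrant.count(collection_name="learning_langchain"))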
🤖 Generate Hypothetical Answer (LLM)
Here we pass the user query to the LLM to generate a hypothetical answer.
To use the OpenAI client library, you first need to install the openai package:
pip install openai
#main.py
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

# Use the OpenAI SDK against Google's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.getenv("GOOGLE_API_KEY"),
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

def ai(message):
    response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=message,
    )
    return response.choices[0].message.content

system_prompt = """
You are a helpful AI assistant specialized in resolving user queries.
Answer in detail.
You receive a question and you give an answer.
"""

query = input("> ")
message = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": query},
]

llm_answer = ai(message)
print("\nLLM Answer: ")
print(llm_answer)
🔍 Retrieve Relevant Chunks based on Hypothetical Answer
In this step, we leverage the LLM‑generated hypothetical answer to identify and retrieve the most relevant document chunks.
In main.py, import the retrieve helper and pass the LLM's hypothetical answer to it:
#main.py
from retrieval import retrieve
relevant_chunk = retrieve(llm_answer)
retrieval.py embeds the hypothetical answer and searches the Qdrant collection:
#retrieval.py
import os

from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_qdrant import QdrantVectorStore

load_dotenv()

def retrieve(query) -> str:
    if "GOOGLE_API_KEY" not in os.environ:
        raise EnvironmentError("GOOGLE_API_KEY is missing; add it to your .env file")

    embedding = GoogleGenerativeAIEmbeddings(
        model="models/text-embedding-004"
    )

    # Connect to the same collection the loader script populated
    retriever = QdrantVectorStore.from_existing_collection(
        collection_name="learning_langchain",
        embedding=embedding,
        url="http://localhost:6333",
    )

    # Search with the hypothetical answer instead of the raw user query
    relevant_chunks = retriever.similarity_search(query=query)

    for doc in relevant_chunks:
        print("-------------------------")
        print("Page Content: ", doc.page_content)
        print("Page Number: ", doc.metadata.get("page"))
        print("-------------------------")

    return "\n\n".join(doc.page_content for doc in relevant_chunks)
Executing the Code
Run loader.py once to ingest and store the PDF, then execute main.py to generate the hypothetical answer and retrieve the matching chunks.
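Assuming the file names used above, the run order from the activated virtual environment looks like this:
python loader.py   # ingest, chunk, embed, and store the PDF in Qdrant
python main.py     # enter a query, get the hypothetical answer, and retrieve chunks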
Full Source Code
Grab everything from the link below.