HyDE (Hypothetical Document Embeddings)


Previous Context
We saw how “Parallel Query Retrieval” and “Query Decomposition” work. Here is a quick recap:
Parallel Query Retrieval:
We asked this question, “What is fs?” and we got some questions like this from the LLM:
What is fs?
What is the file system?
What is a file in Node.js?
How to create a file in Node.js?
Really a straightforward question and some variants of it, right?
Query Decomposition:
We had a complex question: “What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications?“ and our LLM generated these questions:
Compare React and Vue.js for large-scale projects
What are the pros and cons of using React for building large applications?
Is React or Vue.js better for developing complex web applications?
What are the benefits of using Vue.js over React in large-scale projects?
Complex multi-topic queries made simple.
Now, let’s think about something complex…
A new scenario:
Some experimentation 🧑🔬:
Suppose we have a big “Academic Research Paper” on “LLM models - Transformers and NLPs“ in our hands, and we want to ask questions about it. Sample question:
"How does transformer architecture improve natural language understanding?"
Run Through Parallel Query Retrieval:
"How do transformers help with NLP?"
"What makes transformer architecture better for language?"
"Why are transformers good for natural language processing?"
We got these questions. But “Research Papers” do not contain questions. They contain research findings, methodologies, and numerical results.
Run through Query Decomposition:
"What is transformer architecture?"
"How does the attention mechanism work?"
"What are transformer benefits for NLP?"
We got some broken-down or “decomposed” queries. But these are still questions, and we need reference terms so that we can search through the paper and pull out the relevant information.
Let’s try a new approach:
So, what we will do is generate a pseudo-answer based on the question, and then use the reference terms/keywords from that answer to search our vector store. Suppose the LLM generates this response:
“Transformer architecture revolutionizes natural language understanding through self-attention mechanisms that capture long-range dependencies more effectively than recurrent neural networks. The multi-head attention allows the model to focus on different representation subspaces simultaneously, enabling better contextual understanding…“
We have got keywords like:
"self-attention mechanism"
"long-range dependencies"
"recurrent neural networks"
"subspace" and
"contextual understanding."
This gives us more relevant data to search through, which increases our chances of finding the information we need, right?
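To see why this helps, here is a minimal sketch (assuming the sentence-transformers package; the model name, pseudo-answer, and paper chunk are all illustrative) that embeds the raw question and the hypothetical answer and compares each against a passage:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

question = "How does transformer architecture improve natural language understanding?"
pseudo_answer = (
    "Transformer architecture revolutionizes natural language understanding through "
    "self-attention mechanisms that capture long-range dependencies more effectively "
    "than recurrent neural networks."
)
# Illustrative stand-in for a chunk from the research paper
paper_chunk = (
    "Self-attention computes pairwise interactions between all tokens, letting the model "
    "represent long-range dependencies without the sequential bottleneck of RNNs."
)

q_emb, a_emb, c_emb = model.encode([question, pseudo_answer, paper_chunk])
print("question   vs chunk:", util.cos_sim(q_emb, c_emb).item())
print("pseudo-ans vs chunk:", util.cos_sim(a_emb, c_emb).item())
# The intuition behind HyDE: answer-shaped text tends to land closer to
# answer-shaped passages than the bare question does.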
This method is called “HyDE (Hypothetical Document Embeddings)”.
Definition:
HyDE stands for Hypothetical Document Embedding. Instead of directly searching with the user's question, HyDE generates a hypothetical answer first, then uses that generated answer to search for relevant documents.
Why HyDE?
We have:
Parallel Query Retrieval
Reciprocal Rank Fusion and
Query Decomposition
These are powerful techniques. So why do we need HyDE at all?
HyDE adds something extra powerful: contextual richness.
Why does that matter?
Many technical documents spread information across sections.
Questions might miss key terms that appear only in answers.
Generating a pseudo-answer lets us "guess" what a good answer might look like, then search backwards to locate supporting material.
This is the core idea behind HyDE — generating hypothetical documents (or answer embeddings) to find real ones.
This is the basic workflow of HyDE: take the user's question, have the LLM write a hypothetical answer, embed that answer, and use that embedding to retrieve the real chunks from the vector store.
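Here is a minimal sketch of that workflow in code (assuming generic generate_answer and embed callables plus a list of pre-embedded chunks; these names are illustrative, not taken from the full code):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def hyde_retrieve(question, generate_answer, embed, chunks, chunk_embeddings, k=3):
    # 1. Ask the LLM for a hypothetical, answer-shaped paragraph
    hypothetical = generate_answer(f"Write one expert paragraph answering: {question}")
    # 2. Embed the hypothetical answer instead of the raw question
    query_vec = embed(hypothetical)
    # 3. Rank the real document chunks by similarity to that embedding
    scored = sorted(
        zip(chunks, chunk_embeddings),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]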
When not to HyDE?
HyDE is only worth it when the answer is scattered throughout the document and no direct match for the question exists. For simpler queries, HyDE is overkill: every pseudo-response you generate burns more tokens than generating question variants. So it only makes sense when precision matters more than cost.
Plus, there is a drawback to this approach. Remember when we generated a pseudo-response to the question about the “Research Paper”? The LLM that generates that answer must know at least the general context of the question to produce useful keywords, right?
Ok, enough with the ideas and intuitions. Let’s discuss the implementation:
Let’s code 🥰:
Sequence:
1. Write a solid system prompt.
2. Send the request and get a “Paragraph” back.
3. Decompose that paragraph for more clarification.
4. Generate some “Parallel Queries” for context-matching.
5. Run the vector search and get your answer.
Here, steps 1 and 2 are the compulsory part, and steps 3 to 5 are just optimizations so that we do not miss any context. A rough sketch of how these steps chain together follows right after this list.
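This sketch is only illustrative: decompose, parallel_queries, and vector_search are stand-in names for whatever the full code calls those steps, and the actual HyDE method is shown under “Code:” below.

def run_pipeline(self, query):
    # Steps 1-2: system prompt + pseudo-paragraph generation (the HyDE method below)
    hyde_result = self.HyDE(query)
    if not hyde_result:
        return None
    paragraph = hyde_result["generated"]
    # Step 3: decompose the paragraph for clarification (method name assumed)
    sub_statements = self.decompose(paragraph)
    # Step 4: generate parallel query variants for context-matching (method name assumed)
    variants = self.parallel_queries(query)
    # Step 5: vector search over everything we have gathered (method name assumed)
    return self.vector_search([paragraph] + sub_statements + variants)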
Code:
Steps 1 and 2:
# Assumes `import json` and a `filter_response` helper at module level
def HyDE(self, query):
    print("HyDE Running 🧑🔬")
    try:
        # Ask the model for an expert-level pseudo-answer, returned as JSON
        system_prompt = f"""
        Generate a comprehensive, expert-level answer to this query as if you're writing documentation or academic content.
        Query: "{query}"
        REQUIREMENTS:
        1. Write in professional, authoritative tone (like a domain expert)
        2. Generate exactly one well-structured paragraph (4-6 sentences)
        3. Include technical terminology and key concepts relevant to the field
        4. Cover the main topic plus 2-3 closely related subtopics
        5. Use declarative statements, not questions
        6. Write as if explaining to a knowledgeable audience
        RETURN FORMAT:
        {{
            "original": "{query}",
            "generated": "your expert paragraph here"
        }}
        Return ONLY valid JSON, no additional text.
        """
        response = self.model.generate_content(system_prompt)
        if not response:
            print("No response was generated.")
            return None
        # Strip any wrapping (e.g. markdown fences) before parsing
        filtered_response = filter_response(response)
        try:
            parsed_response = json.loads(filtered_response)
            return parsed_response
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
            return None
    except Exception as e:
        print(f"Failed to run HyDE: {e}")
        return None
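The filter_response helper is not shown here; a minimal, assumed version that pulls the text out of the model response and strips markdown fences before JSON parsing could look like this (the real helper in the full code may differ):

import re

def filter_response(response):
    # Extract plain text from the model response object and drop ```json ... ``` fences
    text = getattr(response, "text", str(response))
    text = re.sub(r"^```(?:json)?\s*", "", text.strip())
    text = re.sub(r"\s*```$", "", text)
    return text.strip()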
Steps 3 to 5:
See the full code.
Full Code:
See the full code. (Mainly see the main function for clarification.)
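For step 5, one way to combine everything into a single retrieval pass is to search the vector store with the HyDE paragraph, the decomposed statements, and the parallel queries, then merge the unique chunks. This is a sketch only; similarity_search and page_content assume a LangChain-style store API, which may differ from the one used in the full code:

def search_all(self, hyde_result, sub_statements, parallel_queries, k=3):
    queries = [hyde_result["generated"]] + sub_statements + parallel_queries
    seen, merged = set(), []
    for q in queries:
        # Assumed LangChain-style vector store interface
        for chunk in self.vector_store.similarity_search(q, k=k):
            if chunk.page_content not in seen:
                seen.add(chunk.page_content)
                merged.append(chunk)
    return merged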
Input and Output Testing:
Input:
How does transformer architecture improve natural language understanding?
Output:
HyDE Running 🧑🔬
Transformer architecture significantly enhances natural language understanding (NLU) by leveraging self-attention mechanisms. Unlike recurrent neural networks (RNNs), transformers process all input tokens simultaneously, enabling them to capture long-range dependencies and contextual information. This allows for better representation of semantic relationships between words and phrases, resulting in improved performance on tasks like machine translation, text summarization, and question answering. Furthermore, transformers' parallel processing capabilities facilitate efficient training and inference, making them a highly effective architecture for NLU tasks.
Decomposing Query 🧠
1: Transformers enhance natural language understanding using self-attention mechanisms.
2: Transformers process all input tokens simultaneously, unlike recurrent neural networks.
3: Transformers can capture long-range dependencies and contextual information.
4: Transformers improve performance on tasks like machine translation, text summarization, and question answering.
5: Transformers enable better representation of semantic relationships between words and phrases.
6: Transformers facilitate efficient training and inference due to parallel processing capabilities.
7: Transformers are a highly effective architecture for natural language understanding tasks.
8: Transformers are an alternative to recurrent neural networks for natural language processing.
9: Transformer architecture has advantages over recurrent neural networks for certain tasks.
10: The use of self-attention in transformers is crucial for their effectiveness.
# Parallel Query Generation also runs after this; its output is not shown here.
❓ When you follow the full code, you might not see the decomposed queries as single statements like these, but as questions. Can you tweak it to generate single statements like this? One possible approach is sketched right below.
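One possible tweak (a sketch only, reusing filter_response and the same model client as above) is a decomposition prompt that explicitly asks for declarative statements instead of questions:

def decompose_to_statements(self, paragraph):
    # Hypothetical variant of the decomposition step: single declarative sentences
    prompt = f"""
    Break the following paragraph into standalone, single-sentence statements.
    Paragraph: "{paragraph}"
    REQUIREMENTS:
    1. Each item must be exactly one declarative sentence, not a question
    2. Keep the original technical terminology
    3. Return ONLY a valid JSON list of strings, no additional text
    """
    response = self.model.generate_content(prompt)
    return json.loads(filter_response(response))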
Conclusion:
We have seen a bunch of Query Translation methods so far: Parallel Query (Fan-Out) Retrieval, Reciprocal Rank Fusion, Query Decomposition, and HyDE (Hypothetical Document Embeddings). All of them are used to structure the responses generated by LLMs. We are not doing anything fancy yet, just increasing our chances of finding the required data in our document so that we can do “further work” on it.
Before we had these LLMs, we had to do this similarity search manually, with code and conditionals. LLMs have made this part quite easy, though coding logic is still the most valuable skill in engineering. We just need to center our search on the responses obtained from the LLMs, and one part of the problem is solved.
In the later parts, we will see how we can optimize the whole process of automating such a system more intensely.