HyDE (Hypothetical Document Embeddings)

Pritom Biswas

Previous Context

We saw how “Parallel Query Retrieval” and “Query Decomposition” work. Here's a quick recap:

Parallel Query Retrieval:

We asked this question, “What is fs?” and we got some questions like this from the LLM:

  1. What is fs?

  2. What is the file system?

  3. What is a file in Node.js?

  4. How to create a file in Node.js?

Really a straightforward question and some variants of it, right?

Query Decomposition:

We had a complex question: “What are the advantages and disadvantages of React compared to Vue.js for building large-scale applications?“ and our LLM generated these questions:

  1. Compare React and Vue.js for large-scale projects

  2. What are the pros and cons of using React for building large applications?

  3. Is React or Vue.js better for developing complex web applications?

  4. What are the benefits of using Vue.js over React in large-scale projects?

Complex multi-topic queries made simple.
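
Both recapped techniques come down to a single LLM call with a different instruction. Here is a minimal sketch of that idea, assuming the same Gemini client the code later in this post uses; the model name and the prompt wording are illustrative, not taken from the earlier articles:

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # assumption: Gemini, as in the code below
model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name

def rewrite_query(query: str, mode: str = "parallel") -> list[str]:
    """Fan out (parallel) or break down (decompose) a query into sub-queries."""
    instruction = (
        "Rewrite the question into 3-4 semantically similar variants"
        if mode == "parallel"
        else "Break the question into 3-4 smaller, self-contained sub-questions"
    )
    prompt = f"{instruction}. Return one per line with no numbering.\n\nQuestion: {query}"
    response = model.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]

# rewrite_query("What is fs?")                                   # Parallel Query Retrieval
# rewrite_query("React vs Vue.js for large apps?", "decompose")  # Query Decomposition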

Now, let’s think about something complex…

A new scenario:

Some experimentation 🧑‍🔬:

Suppose we have a big “Academic Research Paper” on “LLMs - Transformers and NLP” in our hands, and we want to ask questions about it. Sample question:

"How does transformer architecture improve natural language understanding?"

  1. Run Through Parallel Query Retrieval:

    • "How do transformers help with NLP?"

    • "What makes transformer architecture better for language?"

    • "Why are transformers good for natural language processing?"

We got these questions. But “Research Papers” do not contain questions. They contain research findings, methodologies, and numerical results.

We can get some response, but will it be “contextual” to us?

  2. Run through Query Decomposition:

    • "What is transformer architecture?"

    • "How does the attention mechanism work?"

    • "What are transformer benefits for NLP?"

We got some broken-down or “decomposed” queries. But these are still questions, and we need reference terms so that we can search through the paper and pull out the relevant information.

Again, we will get some response, but will it be that helpful?

  3. Let’s try a new approach:

    So, what we will do is generate a pseudo-answer to the question with the LLM, and then use the reference terms/keywords from that answer to search our vector store. Suppose the LLM generates this response:

    “Transformer architecture revolutionizes natural language understanding through self-attention mechanisms that capture long-range dependencies more effectively than recurrent neural networks. The multi-head attention allows the model to focus on different representation subspaces simultaneously, enabling better contextual understanding…“

    We have got keywords like:

    • "self-attention mechanism"

    • "long-range dependencies"

    • "recurrent neural networks"

    • "subspace" and

    • "contextual understanding."

      This gives us more relevant data to search through, which increases our chances of finding the information we need, right?

💡
This example might feel overwhelming. Think of it like this: “We are generating a pseudo-answer before searching for our real answer in the document, instead of generating more questions.”

This method is called “HyDE (Hypothetical Document Embeddings)”.

Definition:

HyDE stands for Hypothetical Document Embeddings. Instead of directly searching with the user's question, HyDE generates a hypothetical answer first, then uses that generated answer to search for relevant documents.
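
In retrieval terms, the only thing that changes is what gets embedded before the similarity search. A minimal sketch of the idea; embed() and search() are placeholders for whatever embedding model and vector store you use, not a specific library, and model is a Gemini-style client as in the code further down:

def embed(text: str) -> list[float]:
    # Placeholder: swap in your real embedding model.
    raise NotImplementedError

def search(query_vector: list[float], top_k: int = 5) -> list[str]:
    # Placeholder: swap in your vector store's similarity search.
    raise NotImplementedError

def classic_retrieve(question: str) -> list[str]:
    # Classic retrieval: embed the question itself and search with it.
    return search(embed(question))

def hyde_retrieve(question: str, model) -> list[str]:
    # HyDE: ask the LLM for a hypothetical expert answer, embed THAT, then search.
    pseudo_answer = model.generate_content(
        f"Write one expert paragraph answering: {question}"
    ).text
    return search(embed(pseudo_answer))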

Why HyDE?

We have:

  • Parallel Query Retrieval

  • Reciprocal Rank Fusion and

  • Query Decomposition

    - These are powerful techniques. So, why do we need HyDE at all?

HyDE adds something extra powerful: contextual richness.

Why does that matter?

  • Many technical documents spread information across sections.

  • Questions might miss key terms that appear only in answers.

  • Generating a pseudo-answer lets us "guess" what a good answer might look like, then search backwards to locate supporting material.

This is the core idea behind HyDE — generating hypothetical documents (or answer embeddings) to find real ones.

This is the basic workflow of HyDE: take the user's question, ask the LLM to write a hypothetical expert answer, embed that pseudo-answer, run a similarity search over the vector store with it, and finally answer the question using the retrieved chunks.

There is a slight drawback in this approach. Can you identify that?

When not to HyDE?

HyDE is most useful when the answer is scattered throughout the document and no direct statement of it exists. For simpler queries, HyDE is overkill: every pseudo-response you generate burns more tokens than generating a handful of questions. So, reach for it only when precision matters more than cost.
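
One practical way to respect that trade-off is a tiny router that only falls back to HyDE when a plain search looks weak. A hedged sketch, reusing embed and hyde_retrieve from the earlier snippet; scored_search and the 0.5 threshold are illustrative assumptions, not part of the original code:

def retrieve(question: str, model, scored_search) -> list[str]:
    # Try cheap, direct retrieval first; scored_search returns [(chunk, similarity), ...].
    hits = scored_search(embed(question), top_k=5)
    if hits and max(score for _, score in hits) >= 0.5:  # illustrative threshold
        return [chunk for chunk, _ in hits]
    # Matches are weak: spend the extra tokens on the HyDE path.
    return hyde_retrieve(question, model)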

Plus, there is another drawback. Remember when we generated a pseudo-response to the question about the research paper? The LLM generating that answer has to know enough about the topic to produce useful keywords in the first place, right?

😶
HyDE does not work well with smaller language models. It really wants the largest, most up-to-date model you can get.

Ok, enough with the ideas and intuitions. Let’s discuss the implementation:

Let’s code 🥰:

Sequence:

  1. Nice system prompt

  2. Input requests, get a “Paragraph”

  3. Decompose that for more clarification

  4. Generate some “Parallel Query” for context-matching

  5. Vector Search and get your answer.

Here, 1 and 2 are the compulsory steps, and 3 to 5 are just optimizations so that we do not miss any context.

Code:

1 and 2:

# Assumes `import json` at module level, `self.model` is a configured Gemini
# GenerativeModel, and `filter_response` is the author's helper for cleaning the
# raw LLM output (a hedged sketch of it follows below).
def HyDE(self, query):
    print("HyDE Running 🧑‍🔬")
    try:
        system_prompt = f"""
            Generate a comprehensive, expert-level answer to this query as if you're writing documentation or academic content.

            Query: "{query}"

            REQUIREMENTS:
            1. Write in professional, authoritative tone (like a domain expert)
            2. Generate exactly one well-structured paragraph (4-6 sentences)
            3. Include technical terminology and key concepts relevant to the field
            4. Cover the main topic plus 2-3 closely related subtopics
            5. Use declarative statements, not questions
            6. Write as if explaining to a knowledgeable audience

            RETURN FORMAT:
            {{
                "original": "{query}",
                "generated": "your expert paragraph here"
            }}

            Return ONLY valid JSON, no additional text.
        """

        # Ask the LLM for the hypothetical (pseudo) answer.
        response = self.model.generate_content(system_prompt)

        if not response:
            print("No response was generated.")
            return None

        # Clean the raw output (strip code fences etc.) before parsing.
        filtered_response = filter_response(response)

        try:
            parsed_response = json.loads(filtered_response)
            return parsed_response
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
            return None

    except Exception as e:
        # Note the f-string prefix so {e} actually gets interpolated.
        print(f"Failed to run HyDE: {e}")
        return None
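
The method above calls a filter_response helper that isn't shown here; its real implementation lives in the full code. A hedged guess at what it might look like, plus a usage line, assuming the Gemini response object exposes .text and the model sometimes wraps its JSON in markdown fences:

def filter_response(response) -> str:
    # Hypothetical helper (not shown in the post): pull the text out of the Gemini
    # response and strip any ```json fences the model wrapped around it.
    text = response.text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else text  # drop the opening fence line
        text = text.rsplit("```", 1)[0]                          # drop the closing fence
    return text.strip()

# Usage, assuming HyDE lives on a retriever class with self.model configured:
# result = retriever.HyDE("How does transformer architecture improve natural language understanding?")
# print(result["generated"])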

3 to 5:

See the full code.

Full Code:

See the full code. (Mainly see the main function for clarification.)
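
The full code isn't reproduced here, but the main function roughly chains the steps from the sequence above. A hedged sketch of that orchestration; decompose_query, parallel_queries, and vector_store.search are stand-ins for the helpers in the full code, not their exact signatures:

def answer_with_hyde(retriever, vector_store, question: str):
    # 1-2. Generate the hypothetical expert paragraph.
    hyde_result = retriever.HyDE(question)
    if not hyde_result:
        return None
    pseudo_answer = hyde_result["generated"]

    # 3. Decompose the paragraph into single-line statements (optional).
    statements = retriever.decompose_query(pseudo_answer)    # assumed helper

    # 4. Fan out into parallel queries for extra coverage (optional).
    queries = retriever.parallel_queries(pseudo_answer)      # assumed helper

    # 5. Search the vector store with everything we have and collect unique chunks.
    chunks = []
    for text in [pseudo_answer, *statements, *queries]:
        chunks.extend(vector_store.search(text, top_k=3))    # assumed interface
    return list(dict.fromkeys(chunks))                       # de-duplicate, keep order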

Input and Output Testing:

  1. Input:

     How does transformer architecture improve natural language understanding?
    
  2. Output:

     HyDE Running 🧑‍🔬
     Transformer architecture significantly enhances natural language understanding (NLU) by leveraging
     self-attention mechanisms. Unlike recurrent neural networks (RNNs), transformers process all input 
     tokens simultaneously, enabling them to capture long-range dependencies and contextual information. 
     This allows for better representation of semantic relationships between words and phrases, resulting 
     in improved performance on tasks like machine translation, text summarization, and question answering. 
     Furthermore, transformers' parallel processing capabilities facilitate efficient training and inference, 
     making them a highly effective architecture for NLU tasks.
    
     Decomposing Query 🧠
     1: Transformers enhance natural language understanding using self-attention mechanisms.
     2: Transformers process all input tokens simultaneously, unlike recurrent neural networks.
     3: Transformers can capture long-range dependencies and contextual information.
     4: Transformers improve performance on tasks like machine translation, text summarization, and question answering.
     5: Transformers enable better representation of semantic relationships between words and phrases.
     6: Transformers facilitate efficient training and inference due to parallel processing capabilities.
     7: Transformers are a highly effective architecture for natural language understanding tasks.
     8: Transformers are an alternative to recurrent neural networks for natural language processing.
     9: Transformer architecture has advantages over recurrent neural networks for certain tasks.
     10: The use of self-attention in transformers is crucial for their effectiveness.
    
     # Parallel Query Generation also runs after this; not shown here.
    
    When you follow the full code, you might not see the decomposed queries as single-line statements like these, but as questions. Can you tweak the prompt to generate single-line statements instead? (See the sketch below.)
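
If your decomposition step returns questions, the fix is usually just in the prompt. A hedged sketch of an instruction that nudges the model toward declarative one-liners; the wording is illustrative and not taken from the full code:

def decompose_to_statements(model, pseudo_answer: str) -> list[str]:
    # Illustrative prompt: force declarative single-line statements, not questions.
    prompt = f"""
        Break the following paragraph into 8-10 standalone factual statements.

        RULES:
        1. Each statement must be a single declarative sentence, not a question.
        2. Each statement must make sense on its own.
        3. Return one statement per line, numbered "1:", "2:", and so on.

        Paragraph:
        {pseudo_answer}
    """
    response = model.generate_content(prompt)
    return [
        line.split(":", 1)[1].strip()
        for line in response.text.splitlines()
        if ":" in line
    ]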

Conclusion:

We have seen a bunch of Query Translation methods so far: Parallel Query (Fan Out) Retrieval, Reciprocal Rank Fusion, Query Decomposition, and HyDE (Hypothetical Document Embeddings). All of these use the LLM's output to reshape the search before it happens. We are not doing anything fancy yet, just increasing our chances of finding the required data in our document so that we can do “further work” on it.

Before LLMs, we had to build this kind of search by hand, with code and conditionals. LLMs have made that part much easier, though solid coding logic is still the most valuable skill in engineering. We just need to build the pipeline around the responses we get from the LLMs, and one of the problems is solved.

In the later parts, we will see how we can optimize the whole process of automating a system more intensely.
