Using HyDE for Improved Retrieval
Using Hypothetical Document Embeddings (HyDE) to Improve Retrieval
📚 This cookbook has an accompanying article with a complete walkthrough: "Optimizing Retrieval with HyDE".
In this cookbook, we build Haystack components that let us easily incorporate HyDE into our RAG pipelines to optimize retrieval.
To learn more about HyDE and when it's useful, check out our guide to Hypothetical Document Embeddings (HyDE).
Install Requirements
In the following sections, we will be using the OpenAIGenerator, so we need to provide our API key 👇
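One way to provide the key without hard-coding it in the notebook is sketched below. The helper name `ensure_api_key` is hypothetical. It assumes the key should live in the `OPENAI_API_KEY` environment variable, which is where OpenAIGenerator looks by default.

```python
import os
from getpass import getpass


def ensure_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in `env_var`, prompting for it only if it is missing."""
    if not os.environ.get(env_var):
        # getpass hides the typed value, so the key does not show up in cell output.
        os.environ[env_var] = getpass(f"Enter {env_var}: ")
    return os.environ[env_var]
```

Calling `ensure_api_key()` once, before creating the generator, is enough. If the variable is already set, nothing is prompted.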
Building a Pipeline for Hypothetical Document Embeddings
We will build a Haystack pipeline that generates 'fake' documents.
For this part, we are using the OpenAIGenerator with a PromptBuilder that instructs the model to generate paragraphs.
Next, we use the OutputAdapter to transform the generated paragraphs into a List of Documents. This way, we will be able to use the SentenceTransformersDocumentEmbedder to create embeddings, since this component expects List[Document].
Finally, we create a custom component, HypotheticalDocumentEmbedder, that expects documents and returns a single hypothetical embedding, which is the average of the embeddings of the "hypothetical" (fake) documents. To learn more about this technique and where it's useful, check out our Guide to HyDE.
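The averaging step at the heart of the custom component can be sketched without Haystack or NumPy. The function name `average_embeddings` and the toy vectors are illustrative assumptions, not the cookbook's own code. In the real pipeline the vectors would come from the embedded Documents.

```python
from typing import List, Sequence


def average_embeddings(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(vec) != dim for vec in embeddings):
        raise ValueError("all embeddings must have the same dimension")
    # zip(*embeddings) groups the i-th component of every vector together.
    return [sum(component) / len(embeddings) for component in zip(*embeddings)]


# Three toy 2-dimensional "hypothetical document" embeddings.
hyde_vector = average_embeddings([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(hyde_vector)  # [0.5, 0.5]
```

Averaging yields one vector with the same dimension as each document embedding, so it can be passed to a retriever in place of a plain query embedding.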
We add all of our components into a pipeline to generate a hypothetical document embedding 🚀👇
Build a HyDE Component That Encapsulates the Whole Logic
This section shows you how to create a HypotheticalDocumentEmbedder that instead encapsulates the entire logic and also allows us to provide the embedding model as an optional parameter.
This "mega" component does a few things:
- Allows the user to pick the LLM which generates the hypothetical documents
- Allows users to define how many documents should be created with nr_completions
- Allows users to define the embedding model they want to use to generate the HyDE embeddings
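The shape of such an all-in-one component can be sketched in a library-independent way. The class `HydeEmbedderSketch`, its `generate` and `embed` callables, the default of 5 completions, and the `hypothetical_embedding` output key are all assumptions made for illustration. A real Haystack component would wrap a generator and an embedder instead of plain functions.

```python
from typing import Callable, Dict, List


class HydeEmbedderSketch:
    """Hypothetical stand-in for a component that wraps the whole HyDE logic."""

    def __init__(
        self,
        generate: Callable[[str, int], List[str]],  # stands in for the chosen LLM
        embed: Callable[[List[str]], List[List[float]]],  # stands in for the embedding model
        nr_completions: int = 5,  # assumed default, not taken from the source
    ) -> None:
        self.generate = generate
        self.embed = embed
        self.nr_completions = nr_completions

    def run(self, query: str) -> Dict[str, List[float]]:
        # 1. Ask the LLM for several hypothetical answers to the query.
        paragraphs = self.generate(query, self.nr_completions)
        # 2. Embed every generated paragraph.
        vectors = self.embed(paragraphs)
        # 3. Average the vectors into one query-side embedding.
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        return {"hypothetical_embedding": mean}


# Stub callables so the sketch runs without any model or API key.
def fake_llm(query: str, n: int) -> List[str]:
    return [f"{query} answer {i}" for i in range(n)]


def fake_embedder(texts: List[str]) -> List[List[float]]:
    return [[float(len(text)), 1.0] for text in texts]


embedder = HydeEmbedderSketch(fake_llm, fake_embedder, nr_completions=3)
result = embedder.run("What is HyDE?")
```

Passing the LLM and the embedding model in as parameters is what makes each of the three options in the list above configurable from the outside.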
Use HyDE For Retrieval
Let's see how we can use this component in a full pipeline. First, let's index some documents into an InMemoryDocumentStore.
We can now run a retrieval pipeline that doesn't simply retrieve based on the query embedding. Instead, it uses the HypotheticalDocumentEmbedder to create hypothetical document embeddings based on our query and uses these new embeddings to retrieve documents.
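To make the retrieval step concrete, here is a minimal sketch of ranking stored documents against a HyDE vector by cosine similarity. The dictionary `toy_store`, the function names, and the example vectors are hypothetical. In the cookbook's pipeline, the document store and an embedding retriever take this role.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no direction to compare.
    return dot / norm if norm else 0.0


def retrieve(
    query_embedding: List[float],
    store: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, best match first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in store.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy store: document id -> precomputed document embedding.
toy_store = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}

# The averaged hypothetical-document embedding replaces the raw query embedding.
hyde_embedding = [0.9, 0.1]
top = retrieve(hyde_embedding, toy_store, top_k=2)
```

The only change from ordinary embedding retrieval is the input: the retriever receives the averaged hypothetical-document vector rather than an embedding of the query text itself.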