Optimizing Retrieval with DeepEval
In this tutorial, we'll use DeepEval to evaluate a RAG pipeline's retriever built with Elasticsearch in order to select the best hyperparameters—such as top-K, embedding model, and chunk size—to optimize retrieval performance.
More specifically, we will:
- Define DeepEval metrics to measure retrieval quality
- Build a simple RAG pipeline with Elasticsearch
- Run evaluations on the Elastic retriever using DeepEval metrics
- Optimize the hyperparameters based on evaluation results
DeepEval metrics work out of the box without any additional configuration. This example demonstrates the basics of using DeepEval. For more details on advanced usage, please visit the docs.
1. Install packages and dependencies
Begin by installing the necessary libraries.
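The exact install cell isn't preserved in this export; based on the libraries used throughout the tutorial, it likely resembles:

```shell
pip install deepeval elasticsearch sentence-transformers openai
```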
2. Define Retrieval Metrics
To optimize the Elasticsearch retriever, we need a way to assess retrieval quality. In this tutorial, we introduce three key metrics from DeepEval:
- Contextual Precision: Checks that the most relevant retrieved information is ranked higher than irrelevant information.
- Contextual Recall: Measures how well the retrieved information aligns with the expected LLM output.
- Contextual Relevancy: Checks how well the retrieved context aligns with the query.
DeepEval metrics are powered by LLMs (LLM-as-a-judge metrics). You can use any custom LLM for evaluation, but for this tutorial we'll be using gpt-4o. Begin by setting your OPENAI_API_KEY:
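A minimal sketch of this step (the `"sk-..."` value is a placeholder, not a real key):

```python
import os

# Paste your own OpenAI API key here; "sk-..." is a placeholder.
os.environ["OPENAI_API_KEY"] = "sk-..."
```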
After setting your OPENAI_API_KEY, DeepEval will automatically use gpt-4o as the default model for running these metrics. Now, let's define the metrics.
DeepEval metrics initialized successfully! 🚀
3. Defining the Elastic Retriever
With the metrics defined, we can start building our RAG pipeline. In this tutorial, we'll construct and evaluate a QA RAG system designed to answer questions about Elasticsearch. First, let's create our Elastic retriever by setting up an index and populating it with knowledge about Elastic.
We'll use all-MiniLM-L6-v2 from the sentence_transformers library to embed our text chunks. You can learn more about this model on Hugging Face.
Initializing the Elasticsearch retriever
Instantiate the Elasticsearch Python client, providing the Cloud ID and password from your deployment.
Before you continue, confirm that the client has connected with this test.
To store and retrieve embeddings efficiently, we need to create an index with the correct mappings. This index will store both the text data and its corresponding dense vector embeddings for semantic search.
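A sketch of such a mapping; the `dense_vector` dimension must match the embedding model's output (384 for all-MiniLM-L6-v2), and `"knowledge_base"` is a hypothetical index name:

```python
index_name = "knowledge_base"  # hypothetical index name

mappings = {
    "properties": {
        # Raw text of each knowledge chunk.
        "text": {"type": "text"},
        # Embedding sized for all-MiniLM-L6-v2, indexed for kNN search.
        "embedding": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
    }
}
```

Create the index with `client.indices.create(index=index_name, mappings=mappings)`.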
Finally, use the following command to upload the knowledge base information about Elastic. The model.encode function encodes each text into a vector using the model we initialized earlier.
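The upload cell isn't preserved here; a sketch, assuming `client`, `model`, and `index_name` are the objects created above, might look like:

```python
def index_documents(client, model, index_name, documents):
    """Encode each text chunk with model.encode and index it with its embedding."""
    for i, text in enumerate(documents):
        client.index(
            index=index_name,
            id=str(i),
            # .tolist() converts the numpy vector to a JSON-serializable list.
            document={"text": text, "embedding": model.encode(text).tolist()},
        )
```

Call `index_documents(client, model, index_name, chunks)` with your list of text chunks about Elastic.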
4. Define the RAG Pipeline
With the Elasticsearch database already initialized and populated, we can build our RAG pipeline.
Let's first define the search function, which serves as the Elastic retriever in our RAG pipeline. The search function:
- Encodes the input query using all-MiniLM-L6-v2
- Performs a kNN search on the Elasticsearch index to find semantically similar documents
- Returns the most relevant knowledge from the data
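The steps above can be sketched as follows, assuming `model`, `client`, and `index_name` are the objects created earlier (the `num_candidates` value is an assumption for illustration):

```python
def search(query: str, top_k: int = 3) -> list[str]:
    """Retrieve the top-K semantically similar chunks via kNN search."""
    # Encode the query with the same model used to index the documents.
    query_embedding = model.encode(query).tolist()
    response = client.search(
        index=index_name,
        knn={
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": 10,
        },
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```

For example, `search("What is Elasticsearch?")` returns the three best-matching chunks.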
Next, let's incorporate the search function into our overall RAG function. This RAG function:
- Calls the search function to retrieve the most relevant context from the Elasticsearch database
- Passes this context to the prompt for generating an answer with an LLM
5. Evaluating the Retriever
With the RAG pipeline, we can begin evaluating the retriever. Evaluation consists of two main steps:
- Test Case Preparation: Prepare an input query along with the expected LLM response. Then, use the input to generate a response from your RAG pipeline, creating an LLMTestCase that contains: input, actual_output, expected_output, and retrieval_context.
- Test Case Evaluation: Evaluate the test case using the selection of retrieval metrics we previously defined.
Test Case Preparation
Let's begin by revisiting the input we had earlier and preparing an expected_output for it.
Next, retrieve the actual_output and retrieval_context for this input and create an LLMTestCase from it.
Run Evaluations
To run evaluations, simply pass the test case and metrics into DeepEval's evaluate function.
6. Optimizing the Retriever
Finally, although we defined several hyperparameters, such as the embedding model and the number of candidates, let's iterate over top-K to find the best-performing value across these metrics. This is as simple as a for loop in DeepEval.
To optimize all hyperparameters, you'll want to iterate over all of them along with the metrics to find the best hyperparameter combination for your use case!