Optimizing Retrieval with DeepEval
In this tutorial, we'll use DeepEval to evaluate a RAG pipeline's retriever built with Elasticsearch in order to select the best hyperparameters—such as top-K, embedding model, and chunk size—to optimize retrieval performance.
More specifically, we will:
- Define DeepEval metrics to measure retrieval quality
- Build a simple RAG pipeline with Elasticsearch
- Run evaluations on the Elastic retriever using DeepEval metrics
- Optimize the hyperparameters based on evaluation results
DeepEval metrics work out of the box without any additional configuration. This example demonstrates the basics of using DeepEval. For more details on advanced usage, please visit the docs.
1. Install packages and dependencies
Begin by installing the necessary libraries.
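The exact install cell isn't preserved in this export; based on the libraries used throughout the tutorial, it likely resembles:

```shell
pip install deepeval elasticsearch sentence-transformers openai
```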
2. Define Retrieval Metrics
To optimize the Elasticsearch retriever, we need a way to assess retrieval quality. In this tutorial, we introduce three key metrics from DeepEval:
- Contextual Precision: Checks that the most relevant retrieved information is ranked higher than irrelevant information.
- Contextual Recall: Measures how well the retrieved information aligns with the expected LLM output.
- Contextual Relevancy: Checks how well the retrieved context aligns with the query.
DeepEval metrics are powered by LLMs (LLM-as-a-judge metrics). You can use any custom LLM for evaluation, but for this tutorial we'll be using gpt-4o. Begin by setting your OPENAI_API_KEY:
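A minimal sketch of this step (the `"sk-..."` value is a placeholder, not a real key):

```python
import os

# Paste your own OpenAI API key here; "sk-..." is a placeholder.
os.environ["OPENAI_API_KEY"] = "sk-..."
```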
After setting your OPENAI_API_KEY, DeepEval will automatically use gpt-4o as the default model for running these metrics. Now, let's define the metrics.
DeepEval metrics initialized successfully! 🚀
3. Defining the Elastic Retriever
With the metrics defined, we can start building our RAG pipeline. In this tutorial, we'll construct and evaluate a QA RAG system designed to answer questions about Elasticsearch. First, let's create our Elastic retriever by setting up an index and populating it with knowledge about Elastic.
We'll use all-MiniLM-L6-v2 from the sentence_transformers library to embed our text chunks. You can learn more about this model on Hugging Face.
Initializing the Elasticsearch retriever
Instantiate the Elasticsearch Python client, providing the Cloud ID and password from your deployment.
Before you continue, confirm that the client has connected with this test.
To store and retrieve embeddings efficiently, we need to create an index with the correct mappings. This index will store both the text data and its corresponding dense vector embeddings for semantic search.
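A sketch of such a mapping; the `dense_vector` dimension must match the embedding model's output (384 for all-MiniLM-L6-v2), and `"knowledge_base"` is a hypothetical index name:

```python
index_name = "knowledge_base"  # hypothetical index name

mappings = {
    "properties": {
        # Raw text of each knowledge chunk.
        "text": {"type": "text"},
        # Embedding sized for all-MiniLM-L6-v2, indexed for kNN search.
        "embedding": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        },
    }
}
```

Create the index with `client.indices.create(index=index_name, mappings=mappings)`.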
Finally, use the following command to upload the knowledge base information about Elastic. The model.encode function encodes each text into a vector using the model we initialized earlier.
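The upload cell isn't preserved here; a sketch, assuming `client`, `model`, and `index_name` are the objects created above, might look like:

```python
def index_documents(client, model, index_name, documents):
    """Encode each text chunk with model.encode and index it with its embedding."""
    for i, text in enumerate(documents):
        client.index(
            index=index_name,
            id=str(i),
            # .tolist() converts the numpy vector to a JSON-serializable list.
            document={"text": text, "embedding": model.encode(text).tolist()},
        )
```

Call `index_documents(client, model, index_name, chunks)` with your list of text chunks about Elastic.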
4. Define the RAG Pipeline
With the Elasticsearch database already initialized and populated, we can build our RAG pipeline.
Let's first define the search function, which serves as the Elastic retriever in our RAG pipeline. The search function:
- Encodes the input query using all-MiniLM-L6-v2
- Performs a kNN search on the Elasticsearch index to find semantically similar documents
- Returns the most relevant knowledge from the data
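The steps above can be sketched as follows, assuming `model`, `client`, and `index_name` are the objects created earlier (the `num_candidates` value is an assumption for illustration):

```python
def search(query: str, top_k: int = 3) -> list[str]:
    """Retrieve the top-K semantically similar chunks via kNN search."""
    # Encode the query with the same model used to index the documents.
    query_embedding = model.encode(query).tolist()
    response = client.search(
        index=index_name,
        knn={
            "field": "embedding",
            "query_vector": query_embedding,
            "k": top_k,
            "num_candidates": 10,
        },
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]
```

For example, `search("What is Elasticsearch?")` returns the three best-matching chunks.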
Next, let's incorporate the search function into our overall RAG function. This RAG function:
- Calls the search function to retrieve the most relevant context from the Elasticsearch database
- Passes this context to the prompt for generating an answer with an LLM
5. Evaluating the Retriever
With the RAG pipeline, we can begin evaluating the retriever. Evaluation consists of two main steps:
- Test Case Preparation: Prepare an input query along with the expected LLM response. Then, use the input to generate a response from your RAG pipeline, creating an LLMTestCase that contains: input, actual_output, expected_output, and retrieval_context.
- Test Case Evaluation: Evaluate the test case using the selection of retrieval metrics we previously defined.
Test Case Preparation
Let's begin by revisiting the input we had earlier and preparing an expected_output for it.
Next, retrieve the actual_output and retrieval_context for this input and create an LLMTestCase from it.
Run Evaluations
To run evaluations, simply pass the test case and metrics into DeepEval's evaluate function.
6. Optimizing the Retriever
Finally, although we defined several hyperparameters, such as the embedding model and the number of candidates, let's iterate over top-K to find the best-performing value across these metrics. This is as simple as a for loop in DeepEval.
To optimize all hyperparameters, you'll want to iterate over all of them along with the metrics to find the best hyperparameter combination for your use case!