
Evaluation with DeepEval

This guide demonstrates how to use DeepEval to evaluate a Retrieval-Augmented Generation (RAG) pipeline built upon Milvus.

A RAG system combines a retrieval system with a generative model: it first retrieves relevant documents from a corpus using Milvus, and then uses a generative model to generate new text grounded in the retrieved documents.

DeepEval is a framework that helps you evaluate RAG pipelines. Existing tools and frameworks make it easy to build these pipelines, but evaluating them and quantifying their performance can be hard. This is where DeepEval comes in.

Prerequisites

Before running this notebook, make sure you have the following dependencies installed:

[ ]
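The install cell is not preserved in this export; a minimal sketch, assuming the packages imported later in this guide:

```python
! pip install --upgrade pymilvus openai requests tqdm pandas deepeval
```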

If you are using Google Colab, you may need to restart the runtime to enable the dependencies you just installed (click the "Runtime" menu at the top of the screen, and select "Restart session" from the dropdown menu).

We will use OpenAI as the LLM in this example. You should prepare the API key OPENAI_API_KEY as an environment variable.

[1]
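A minimal sketch of that cell; the placeholder key value is hypothetical and should be replaced with your own:

```python
import os

# Replace the placeholder with your own OpenAI API key.
os.environ["OPENAI_API_KEY"] = "sk-***********"
```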

Define the RAG pipeline

We will define the RAG class that uses Milvus as the vector store and OpenAI as the LLM. The class contains the load method, which loads the text data into Milvus; the retrieve method, which retrieves the text data most similar to the given question; and the answer method, which answers the given question with the retrieved knowledge.

[2]
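The class definition cell is not preserved in this export. Below is a minimal sketch matching the description above; the embedding model, chat model, collection name, and prompt wording are all assumptions:

```python
from typing import List

from openai import OpenAI
from pymilvus import MilvusClient
from tqdm import tqdm


class RAG:
    """A simple RAG pipeline: Milvus as the vector store, OpenAI as the LLM."""

    def __init__(self, openai_client: OpenAI, milvus_client: MilvusClient):
        self.openai_client = openai_client
        # Model choices are assumptions, not specified by the original guide.
        self.embedding_model = "text-embedding-3-small"
        self.llm_model = "gpt-4o-mini"
        self.milvus_client = milvus_client
        self.collection_name = "rag_collection"  # hypothetical name
        if self.milvus_client.has_collection(self.collection_name):
            self.milvus_client.drop_collection(self.collection_name)
        embedding_dim = len(self._emb_text("demo"))
        self.milvus_client.create_collection(
            collection_name=self.collection_name,
            dimension=embedding_dim,
            metric_type="IP",  # inner-product similarity
            consistency_level="Strong",
        )

    def _emb_text(self, text: str) -> List[float]:
        # Embed a single text chunk with the OpenAI embeddings API.
        response = self.openai_client.embeddings.create(
            input=text, model=self.embedding_model
        )
        return response.data[0].embedding

    def load(self, texts: List[str]):
        """Embed each text chunk and insert it into Milvus."""
        data = []
        for i, text in enumerate(tqdm(texts, desc="Creating embeddings")):
            data.append({"id": i, "vector": self._emb_text(text), "text": text})
        self.milvus_client.insert(collection_name=self.collection_name, data=data)

    def retrieve(self, question: str, top_k: int = 3) -> List[str]:
        """Return the top_k text chunks most similar to the question."""
        results = self.milvus_client.search(
            collection_name=self.collection_name,
            data=[self._emb_text(question)],
            limit=top_k,
            output_fields=["text"],
        )
        return [hit["entity"]["text"] for hit in results[0]]

    def answer(self, question: str, return_retrieved_text: bool = True):
        """Answer the question using the retrieved chunks as context."""
        retrieved_texts = self.retrieve(question)
        context = "\n".join(retrieved_texts)
        response = self.openai_client.chat.completions.create(
            model=self.llm_model,
            messages=[
                {
                    "role": "system",
                    "content": "Answer the question using only the provided context.",
                },
                {
                    "role": "user",
                    "content": f"<context>\n{context}\n</context>\n<question>\n{question}\n</question>",
                },
            ],
        )
        answer = response.choices[0].message.content
        return (answer, retrieved_texts) if return_retrieved_text else answer
```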

Let's initialize the RAG class with OpenAI and Milvus clients.

[3]
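A sketch of the initialization, assuming the RAG class above and a local Milvus Lite file (the filename is an assumption):

```python
from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI()
# A local file URI makes MilvusClient use Milvus Lite; see the notes below.
milvus_client = MilvusClient(uri="./milvus.db")

my_rag = RAG(openai_client=openai_client, milvus_client=milvus_client)
```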

As for the argument of MilvusClient:

  • Setting the uri as a local file, e.g. ./milvus.db, is the most convenient method, as it automatically utilizes Milvus Lite to store all data in this file.
  • If you have a large scale of data, you can set up a more performant Milvus server on Docker or Kubernetes. In this setup, please use the server URI, e.g. http://localhost:19530, as your uri.
  • If you want to use Zilliz Cloud, the fully managed cloud service for Milvus, adjust the uri and token, which correspond to the Public Endpoint and API key in Zilliz Cloud.

Run the RAG pipeline and get results

We use the Milvus development guide as the private knowledge in our RAG, which is a good data source for a simple RAG pipeline.

Download it and load it into the RAG pipeline.

[4]
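A sketch of the download-and-load step; the raw GitHub URL and the heading-based chunking are assumptions (the 47 chunks in the output below suggest a split on markdown headings):

```python
import os
import urllib.request

# The Milvus development guide from the milvus-io/milvus repository.
url = "https://raw.githubusercontent.com/milvus-io/milvus/master/DEVELOPMENT.md"
file_path = "./Milvus_DEVELOPMENT.md"

if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)

with open(file_path) as f:
    file_text = f.read()

# Chunking strategy is an assumption: split the markdown on "# " headings.
text_lines = file_text.split("# ")
my_rag.load(text_lines)
```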
Creating embeddings: 100%|██████████| 47/47 [00:20<00:00,  2.26it/s]

Let's define a query question about the content of the development guide documentation, and then use the answer method to get the answer and the retrieved context texts.

[5]
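A sketch of the query cell; the question wording is inferred from the answer shown below:

```python
question = (
    "what is the hardware requirements specification if I want to build Milvus "
    "and run from source code?"
)
my_rag.answer(question, return_retrieved_text=True)
```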
('The hardware requirements specification to build and run Milvus from source code is as follows:\n\n- 8GB of RAM\n- 50GB of free disk space',
 ['Hardware Requirements\n\nThe following specification (either physical or virtual machine resources) is recommended for Milvus to build and run from source code.\n\n```\n- 8GB of RAM\n- 50GB of free disk space\n```\n\n##',
  'Building Milvus on a local OS/shell environment\n\nThe details below outline the hardware and software requirements for building on Linux and MacOS.\n\n##',
  "Software Requirements\n\nAll Linux distributions are available for Milvus development. However a majority of our contributor worked with Ubuntu or CentOS systems, with a small portion of Mac (both x86_64 and Apple Silicon) contributors. If you would like Milvus to build and run on other distributions, you are more than welcome to file an issue and contribute!\n\nHere's a list of verified OS types where Milvus can successfully build and run:\n\n- Debian/Ubuntu\n- Amazon Linux\n- MacOS (x86_64)\n- MacOS (Apple Silicon)\n\n##"])

Now let's prepare some questions with their corresponding ground truth answers, and get answers and contexts from our RAG pipeline.

[6]
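That cell is not preserved here; below is a sketch with three illustrative question/ground-truth pairs (the specific pairs are hypothetical examples, not taken from the source; the count matches the 3 test cases in the outputs below):

```python
import pandas as pd
from tqdm import tqdm

# Hypothetical question/ground-truth pairs about the development guide.
question_list = [
    "what is the hardware requirements specification if I want to build Milvus "
    "and run from source code?",
    "What is the programming language used to write Knowhere?",
    "What should be ensured before running code coverage?",
]
ground_truth_list = [
    "If you want to build Milvus and run from source code, the recommended "
    "hardware requirements specification is:\n\n- 8GB of RAM\n- 50GB of free disk space.",
    "The programming language used to write Knowhere is C++.",
    "Before running code coverage, you should make sure that your code changes "
    "are covered by unit tests.",
]

contexts_list = []
answer_list = []
for question in tqdm(question_list, desc="Answering questions"):
    answer, contexts = my_rag.answer(question, return_retrieved_text=True)
    contexts_list.append(contexts)
    answer_list.append(answer)

# Collect everything into a DataFrame for the evaluation steps below.
df = pd.DataFrame(
    {
        "question": question_list,
        "answer": answer_list,
        "contexts": contexts_list,
        "ground_truth": ground_truth_list,
    }
)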
/Users/eureka/miniconda3/envs/zilliz/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
Answering questions: 100%|██████████| 3/3 [00:03<00:00,  1.06s/it]

Evaluating Retriever

When evaluating a retriever in large language model (LLM) systems, it's crucial to assess the following:

  1. Ranking Relevance: How effectively the retriever prioritizes relevant information over irrelevant data.

  2. Contextual Retrieval: The ability to capture and retrieve contextually relevant information based on the input.

  3. Balance: How well the retriever manages text chunk size and retrieval scope to minimize irrelevancies.

Together, these factors provide a comprehensive understanding of how the retriever prioritizes, captures, and presents the most useful information.

[7]
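A sketch of the retriever evaluation, assuming the df built above; the three metrics map onto the three dimensions listed above (contextual precision for ranking relevance, contextual recall for contextual retrieval, contextual relevancy for balance):

```python
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

contextual_precision = ContextualPrecisionMetric()
contextual_recall = ContextualRecallMetric()
contextual_relevancy = ContextualRelevancyMetric()

# Build one test case per row of the RAG results.
test_cases = []
for _, row in df.iterrows():
    test_cases.append(
        LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],
            expected_output=row["ground_truth"],
            retrieval_context=row["contexts"],
        )
    )

evaluate(
    test_cases=test_cases,
    metrics=[contextual_precision, contextual_recall, contextual_relevancy],
    print_results=False,
)
```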
/Users/eureka/miniconda3/envs/zilliz/lib/python3.9/site-packages/deepeval/__init__.py:49: UserWarning: You are using deepeval version 1.1.6, however version 1.2.2 is available. You should consider upgrading via the "pip install --upgrade deepeval" command.
  warnings.warn(
Event loop is already running. Applying nest_asyncio patch to allow async execution...
Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:11,  3.91s/test case]

Evaluating Generation

To assess the quality of generated outputs in large language models (LLMs), it's important to focus on two key aspects:

  1. Relevance: Evaluate whether the prompt effectively guides the LLM to generate helpful and contextually appropriate responses.

  2. Faithfulness: Measure the accuracy of the output, ensuring the model produces information that is factually correct and free from hallucinations or contradictions. The generated content should align with the factual information provided in the retrieval context.

These factors together ensure that the outputs are both relevant and reliable.

[8]
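A sketch of the generation evaluation, reusing the same test-case construction; answer relevancy and faithfulness correspond to the two aspects above (note that neither metric needs an expected_output):

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()

test_cases = []
for _, row in df.iterrows():
    test_cases.append(
        LLMTestCase(
            input=row["question"],
            actual_output=row["answer"],
            retrieval_context=row["contexts"],
        )
    )

evaluate(
    test_cases=test_cases,
    metrics=[answer_relevancy, faithfulness],
    print_results=False,
)
```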
Event loop is already running. Applying nest_asyncio patch to allow async execution...
Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:11,  3.97s/test case]