RAG Evaluation with Ragas
RAG pipeline evaluation using Ragas
Ragas is an open-source, model-based evaluation framework for Retrieval-Augmented Generation (RAG) pipelines and LLM applications. It supports metrics such as correctness, tone, hallucination (faithfulness), fluency, and more.
For more information about evaluators, supported metrics, and usage, check out the Ragas documentation.
This notebook shows how to use the Ragas-Haystack integration to evaluate a RAG pipeline against various metrics.
Notebook by Anushree Bannadabhavi, Siddharth Sahu, Julian Risch
Prerequisites:
- Ragas calls the OpenAI API to compute some metrics, so we need an OpenAI API key.
Install dependencies
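The install cell is not shown above; a typical setup installs the `ragas-haystack` integration package (the package name follows the Ragas-Haystack integration; it should pull in compatible `ragas` and `haystack-ai` versions as dependencies):

```shell
pip install ragas-haystack
```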
Importing Required Libraries
Creating a Sample Dataset
In this section we create a sample dataset containing information about AI companies and their language models. This dataset serves as the context for retrieving relevant data during pipeline execution.
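A minimal sketch of such a dataset is shown below. The exact passages are illustrative placeholders (only the LLaMA snippet is grounded in the answer produced later in this notebook); any small collection of factual snippets works the same way.

```python
# Illustrative sample dataset: short passages about AI companies and their
# language models. These serve as the retrieval context for the pipeline.
dataset = [
    "OpenAI is an AI research company known for developing the GPT series "
    "of large language models, including GPT-4.",
    "Meta AI develops the LLaMA family of language models. LLaMA models are "
    "notable for being open source, giving researchers and developers free "
    "access to high-quality models.",
    "Anthropic is an AI safety company that develops the Claude family of "
    "large language models.",
]

print(f"{len(dataset)} context passages")
```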
Initializing RAG Pipeline Components
This section sets up the essential components required to build a Retrieval-Augmented Generation (RAG) pipeline. These components include a Document Store for managing and storing documents, an Embedder for generating embeddings to enable similarity-based retrieval, and a Retriever for fetching relevant documents. Additionally, a Prompt Template is designed to structure the pipeline's input, while a Chat Generator handles response generation.
Calculating embeddings: 1it [00:00, 1.67it/s]
Configuring RagasEvaluator Component
Pass all the Ragas metrics you want to use for evaluation, ensuring that all the necessary information to calculate each selected metric is provided.
For example:
- AnswerRelevancy: requires both the query and the response. It does not measure factual accuracy; instead, it assigns a lower score when the response lacks completeness or contains redundant details.
- ContextPrecision: requires the query, the retrieved documents, and the reference answer. It evaluates to what extent the retrieved documents contain only information that is relevant to answering the query.
- Faithfulness: requires the query, retrieved documents, and the response. The response is regarded as faithful if all the claims that are made in the response can be inferred from the retrieved documents.
Make sure to include all relevant data for each metric to ensure accurate evaluation.
Building and Connecting the RAG Pipeline
Here we add and connect the initialized components to form a RAG Haystack pipeline.
<haystack.core.pipeline.pipeline.Pipeline object at 0x16a0d1790>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
  - ragas_evaluator: RagasEvaluator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - retriever.documents -> ragas_evaluator.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])
  - llm.replies -> ragas_evaluator.response (List[ChatMessage])
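Running the pipeline makes real OpenAI API calls, so the run call is sketched here as a function rather than executed. The shape of the input dictionary (each component keyed by name) follows Haystack conventions; the assumption that `RagasEvaluator` takes `query` and `reference` inputs at run time should be checked against the integration docs.

```python
def run_rag_with_evaluation(rag_pipeline, question, reference):
    """Run the RAG pipeline plus Ragas evaluation for a single question.

    `reference` is the ground-truth answer required by ContextPrecision.
    Requires a valid OPENAI_API_KEY; makes real API calls.
    """
    return rag_pipeline.run(
        {
            "text_embedder": {"text": question},
            "prompt_builder": {"query": question},
            "answer_builder": {"query": question},
            "ragas_evaluator": {"query": question, "reference": reference},
        }
    )
```

The result would contain the generated answer under `answer_builder` and the metric scores under `ragas_evaluator`, as shown in the outputs below.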
Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:18<00:00, 6.33s/it]
Meta AI’s LLaMA models stand out due to their open-source nature, allowing researchers and developers to access high-quality models for free. This accessibility fosters innovation and experimentation, making it easier for individuals and organizations to collaborate and advance AI development without the constraints of expensive resources. Their strong performance further enhances their appeal in the AI community.
{'answer_relevancy': 0.9758, 'context_precision': 1.0000, 'faithfulness': 1.0000}
Standalone Evaluation of the RAG Pipeline
This section explores an alternative approach to evaluating a RAG pipeline without using the RagasEvaluator component. It emphasizes manual extraction of outputs and organizing them for evaluation.
You can use any existing Haystack pipeline for this purpose. For demonstration, we will create a simple RAG pipeline similar to the one described earlier, but without including the RagasEvaluator component.
Setting Up a Basic RAG Pipeline
We construct a simple RAG pipeline similar to the approach above but without the RagasEvaluator component.
Calculating embeddings: 1it [00:00, 3.14it/s]
<haystack.core.pipeline.pipeline.Pipeline object at 0x16a77bbd0>
🚅 Components
  - text_embedder: OpenAITextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
  - llm.replies -> answer_builder.replies (List[ChatMessage])
Extracting Outputs for Evaluation
After building the pipeline, we use it to generate the necessary outputs, such as retrieved documents and responses. These outputs are then structured into a dataset for evaluation.
When constructing the `evals_list`, it is important to align the keys in each `single_turn` dictionary with the attributes defined in the Ragas `SingleTurnSample`. This ensures compatibility with the Ragas evaluation framework. Use the retrieved documents and pipeline outputs to populate these fields accurately, as demonstrated in the provided code snippet.
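A sketch of that structure is below. The keys mirror the standard `SingleTurnSample` attribute names (`user_input`, `retrieved_contexts`, `response`, `reference`); the question, context, and answer strings are illustrative placeholders, since in practice they come from the pipeline's retriever and LLM outputs.

```python
# Illustrative inputs and outputs; in practice these come from running
# the RAG pipeline over a set of questions.
questions = ["Which company develops the LLaMA models?"]
references = ["Meta AI develops the LLaMA family of models."]
retrieved = [["Meta AI develops the LLaMA family of open-source language models."]]
responses = ["The LLaMA models are developed by Meta AI."]

# Each entry mirrors the attributes of Ragas' SingleTurnSample.
evals_list = [
    {
        "user_input": question,
        "retrieved_contexts": contexts,
        "response": response,
        "reference": reference,
    }
    for question, contexts, response, reference in zip(
        questions, retrieved, responses, references
    )
]

print(evals_list[0]["user_input"])
```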
Evaluating the pipeline using Ragas EvaluationDataset
The extracted dataset is converted into a Ragas EvaluationDataset so that Ragas can process it. We then initialize an LLM evaluator using the HaystackLLMWrapper. Finally, we call Ragas's evaluate() function with our evaluation dataset, three metrics, and the LLM evaluator.
Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:23<00:00, 2.57s/it]
{'answer_relevancy': 0.9679, 'context_precision': 1.0000, 'faithfulness': 1.0000}
Haystack Useful Sources