Tracing and Evaluating a Haystack RAG Application with Phoenix

Phoenix is a tool for tracing and evaluating LLM applications. In this tutorial, we'll trace a Haystack RAG pipeline and evaluate it with three types of evaluations:

  1. Relevance: Whether the retrieved documents are relevant to the question.
  2. Q&A Correctness: Whether the answer to the question is correct.
  3. Hallucination: Whether the answer contains hallucinations.

ℹ️ This notebook requires an OpenAI API key.


Set API Keys

[2]
🔑 Enter your OpenAI API key: ··········
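The prompt above is typically produced by a cell like the following (a sketch; the original code cell isn't included in this export):

```python
import os
from getpass import getpass

# Prompt for the OpenAI API key only if it isn't already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")
```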

Launch Phoenix and Enable Haystack Tracing

If you don't have a Phoenix API key, you can get one for free at phoenix.arize.com. Arize Phoenix also provides self-hosting options if you'd prefer to run the application yourself instead.

[3]
Enter your Phoenix API Key··········

The command below connects Phoenix to your Haystack application and instruments the Haystack library. Any calls to Haystack pipelines from this point forward will be traced and logged to the Phoenix UI.
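A minimal setup sketch, assuming the `arize-phoenix-otel` and `openinference-instrumentation-haystack` packages are installed; the project name is an illustrative choice, not part of the original notebook:

```python
import os
from getpass import getpass

from phoenix.otel import register

# Phoenix Cloud authenticates via the PHOENIX_API_KEY environment variable.
if not os.environ.get("PHOENIX_API_KEY"):
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API Key")

# Point the OTLP exporter at Phoenix Cloud (or at your self-hosted endpoint).
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "https://app.phoenix.arize.com")

# register() configures an OpenTelemetry tracer provider; with
# auto_instrument=True it activates the installed OpenInference
# instrumentors, including the Haystack one.
tracer_provider = register(project_name="haystack-rag", auto_instrument=True)
```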


Set up your Haystack app

For a step-by-step guide to create a RAG pipeline with Haystack, follow the Creating Your First QA Pipeline with Retrieval-Augmentation tutorial.
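The pipeline printed below can be assembled roughly like this (a sketch based on the components and connections shown in the output; the documents, prompt template, and model choice are illustrative):

```python
from haystack import Document, Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.dataclasses import ChatMessage
from haystack.document_stores.in_memory import InMemoryDocumentStore

# A tiny document store; a real pipeline would index a full dataset.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
])

# Chat prompt template rendering the retrieved documents and the question.
template = [ChatMessage.from_user(
    "Given these documents, answer the question.\n"
    "Documents:\n{% for doc in documents %}{{ doc.content }}\n{% endfor %}\n"
    "Question: {{ question }}\nAnswer:"
)]

rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
rag_pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")
```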

[17]
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5e1e4be390>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])

Run the pipeline with a query. It will automatically create a trace on Phoenix.
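A run call consistent with the output below (the query and variable names are assumptions following the standard Haystack tutorial):

```python
question = "Who lives in Paris?"

# Both the retriever and the prompt builder receive the question.
result = rag_pipeline.run({
    "retriever": {"query": question},
    "prompt_builder": {"question": question},
})
print(result["llm"]["replies"][0].text)  # e.g. "Jean lives in Paris."
```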

[26]
Jean lives in Paris.

Evaluating Retrieved Docs

Now that we've traced our pipeline, let's start by evaluating the retrieved documents.

All evaluations in Phoenix use the same general process:

  1. Query and download trace data from Phoenix
  2. Add evaluation labels to the trace data. Labels can come from Phoenix's built-in evaluators, Haystack evaluators, or your own custom evaluators.
  3. Log the evaluation labels to Phoenix
  4. View evaluations
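To make step 2 concrete: evaluation labels in Phoenix are typically a pandas DataFrame with `label`, `score`, and (optionally) `explanation` columns, and for document evaluations an index identifying the span and the document's position within it. A sketch of that shape, with illustrative values:

```python
import pandas as pd

# Illustrative shape of a document-evaluations dataframe: one row per
# retrieved document, indexed by the span that retrieved it.
relevance_evals = pd.DataFrame(
    {
        "label": ["relevant", "unrelated"],
        "score": [1, 0],
        "explanation": [
            "The document states where Jean lives.",
            "The document is about Berlin, not the question.",
        ],
    },
    index=pd.MultiIndex.from_tuples(
        [("span-1", 0), ("span-1", 1)],
        names=["context.span_id", "document_position"],
    ),
)
print(relevance_evals["label"].tolist())
```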

We'll use the get_retrieved_documents function to get the trace data for the retrieved documents.
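The query step likely looks like this sketch, assuming the `arize-phoenix` package and a reachable Phoenix instance:

```python
import phoenix as px
from phoenix.session.evaluation import get_retrieved_documents

# Pull all retriever spans from Phoenix into a dataframe with one row
# per (span, retrieved document) pair.
retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df.head()
```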

[27]
/usr/local/lib/python3.11/dist-packages/phoenix/utilities/client.py:60: UserWarning: The Phoenix server (10.9.1) and client (10.11.0) versions are mismatched and may have compatibility issues.
  warnings.warn(

Next we'll use Phoenix's RelevanceEvaluator to evaluate the relevance of the retrieved documents. This evaluator uses an LLM to judge whether the retrieved documents contain the answer to the question.
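A sketch of that evaluation call; the judge model is an illustrative choice:

```python
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o"))

# run_evals returns one dataframe per evaluator, row-aligned with the input.
retrieved_documents_relevance_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]
```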

[28]
run_evals |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s
[29]

Finally, we'll log the evaluation labels to Phoenix.
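A sketch of the logging step, assuming the relevance results from the previous cell; the `eval_name` is an illustrative label:

```python
import phoenix as px
from phoenix.trace import DocumentEvaluations

# Attach the relevance labels to the retrieval spans in Phoenix.
px.Client().log_evaluations(
    DocumentEvaluations(
        eval_name="Relevance",
        dataframe=retrieved_documents_relevance_df,
    )
)
```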


If you now click on your document retrieval span in Phoenix, you should see the evaluation labels.

arize-haystack-retrieval-eval.png

Evaluate Response

With HallucinationEvaluator and QAEvaluator, we can score the generated response for Q&A correctness and hallucinations.
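This stage likely resembles the following sketch: fetch one row per Q&A span, run both evaluators, and log the results back; the judge model and `eval_name` labels are illustrative:

```python
import phoenix as px
from phoenix.evals import HallucinationEvaluator, OpenAIModel, QAEvaluator, run_evals
from phoenix.session.evaluation import get_qa_with_reference
from phoenix.trace import SpanEvaluations

# One row per Q&A span: the question, the generated answer, and the
# retrieved context to judge it against.
qa_with_reference_df = get_qa_with_reference(px.Client())

eval_model = OpenAIModel(model="gpt-4o")
hallucination_df, qa_correctness_df = run_evals(
    dataframe=qa_with_reference_df,
    evaluators=[HallucinationEvaluator(eval_model), QAEvaluator(eval_model)],
    provide_explanation=True,
)

# Log both evaluations against the same spans.
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_df),
)
```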

[32]
[33]
run_evals |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s

You should now see the Q&A correctness and hallucination evaluations in Phoenix.

arize-haystack-gen-eval.png
