Lab 6: Evals


Arize Agent Mastery Course: Evaluating Your Agent

So far, we have built our agent, added tooling, and implemented a RAG system that allows it to access information. Now, we are ready to run the agent and evaluate its outputs. Evaluations can take different forms (LLM-as-a-judge, code-based, or human review) and can be applied at various scopes, including trace, span, and session.

In this lab, we will demonstrate how to run evaluations in code and log the results to the Arize UI. We will also show you how to set up and run evaluations directly within the Arize UI.

Setup

[ ]
[ ]

Note that here we are tracing our agent outputs to a different project than in the previous labs:

[ ]

Define Tools

[ ]
[ ]
[ ]
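The actual tool definitions live in the cells above. As a hypothetical sketch, a tool is typically a Python function paired with a JSON schema that the LLM uses for function calling; the `get_weather` tool below is an invented example, not the lab's real tool set.

```python
# Hypothetical sketch: pairing a Python function with a function-calling
# schema. The lab's real tools are defined in the notebook cells.
import json

def get_weather(city: str) -> str:
    """Stubbed weather lookup -- a real tool would call a weather API."""
    return json.dumps({"city": city, "forecast": "sunny", "high_f": 75})

# Schema in the OpenAI function-calling format.
weather_tool_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather forecast for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Registry mapping tool names to callables, consulted when the model
# requests a tool call.
TOOLS = {"get_weather": get_weather}
```

Keeping the registry separate from the schemas makes it easy to add or swap tools without touching the agent loop.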

Create RAG System for Local Flavor Tool

[ ]

Upload the local_flavor.json file

[ ]
[ ]
[ ]
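The cells above build the real RAG system over local_flavor.json, most likely with vector embeddings. The sketch below illustrates only the shape of the retrieval step, substituting simple keyword-overlap scoring and invented sample documents so it runs without an embedding model.

```python
# Hypothetical sketch of retrieval over documents like those in
# local_flavor.json. Keyword overlap stands in for embedding similarity.

docs = [
    "Lisbon locals love pastel de nata from bakeries in Belem.",
    "In Tokyo, tiny izakayas serve yakitori late into the night.",
    "New Orleans is famous for beignets and live jazz on Frenchmen Street.",
]

def retrieve(query: str, k: int = 2) -> list:
    """Rank documents by how many query words they contain, return top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The retrieved documents are what the span-level evaluation later judges for relevance, so it is worth keeping them attached to each trace.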

Define Agent

[ ]
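The agent itself is defined in the cell above. As a hypothetical sketch of the control flow, an agent loop asks the model whether a tool is needed, executes it if so, and otherwise answers directly; the `fake_llm` stub below stands in for the real model call.

```python
# Hypothetical sketch of an agent loop. A stubbed "model" routes queries
# so the control flow is visible without an API key.

def fake_llm(query: str) -> dict:
    """Stand-in for the LLM: decides whether a tool is needed."""
    if "weather" in query.lower():
        return {"tool": "get_weather", "args": {"city": "Lisbon"}}
    return {"tool": None, "answer": "Here is a general travel tip."}

def run_agent(query: str, tools: dict) -> str:
    decision = fake_llm(query)
    if decision["tool"]:
        # Execute the requested tool and return its result as the answer.
        return tools[decision["tool"]](**decision["args"])
    return decision["answer"]
```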

Evaluate the Agent

The first step before evaluating our agent is to generate multiple runs using different query types, so that our evaluation covers a wide range of cases. Running the two cells below will send requests to the agent and create traces that will appear in Arize. Unlike previous labs, these traces will be logged under a new project titled “evaluate-travel-agent”.

[ ]
[ ]

Span-Level Evaluation via Arize Python SDK

Arize supports evaluations at multiple levels of granularity. You can evaluate individual steps in an agent’s run (spans) or the full workflow (trace).

Here, we’ll perform span-level evaluations on retrieval steps to measure how relevant the retrieved documents are to each query.

First, navigate to your project and click "Export to Notebook". From there, copy the code snippet containing the export_model_to_df function to export your traces. Export Traces

[ ]

Next, we define the prompt template for our LLM Judge. Feel free to customize this!

[ ]
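The lab's actual template is in the cell above. As a hypothetical example, a relevance-judge template (similar in spirit to common RAG relevancy prompts) might look like the following; the `{query}` and `{reference}` variable names are assumptions and must match the column names of the dataframe passed to the evaluator.

```python
# Hypothetical LLM Judge prompt template for retrieval relevance.
# The template variables must line up with the eval dataframe's columns.
RELEVANCE_TEMPLATE = """You are comparing a reference document to a question.

[BEGIN DATA]
Question: {query}
Reference: {reference}
[END DATA]

Determine whether the reference contains information that can answer the
question. Respond with a single word: "relevant" or "irrelevant"."""

# The allowed output labels (rails) the judge must choose between.
RAILS = ["relevant", "irrelevant"]
```

Constraining the judge to a fixed set of rails makes its output easy to parse and aggregate.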

Then, we select the relevant columns from our spans dataframe and rename them to match the variables in the LLM Judge prompt.

[ ]
[ ]
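As a sketch of this step: exported spans carry attribute-style column names, which get renamed to the prompt's variables. The exact source column names below are assumptions (they follow OpenInference-style naming but may differ by SDK version); check your exported dataframe.

```python
# Hypothetical sketch: rename exported span columns to the {query} and
# {reference} variables used in the judge prompt. Source column names
# are assumed and may differ in your export.
import pandas as pd

def prepare_eval_df(spans_df: pd.DataFrame) -> pd.DataFrame:
    renamed = spans_df.rename(
        columns={
            "attributes.input.value": "query",
            "attributes.retrieval.documents": "reference",
        }
    )
    # Keep the span id so results can be joined back to traces later.
    return renamed[["context.span_id", "query", "reference"]]
```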

Finally, we define our evaluators and run the evaluation. Once the evaluation finishes, we log the results back to Arize.

[ ]
[ ]
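The cells above run the evaluator and call the Arize client. As a hypothetical sketch of the logging step, judge labels are shaped into a dataframe keyed by span id; the `eval.<name>.label` column naming follows Arize's convention for evaluation columns, but verify it against the SDK docs for your version.

```python
# Hypothetical sketch: shape judge outputs for logging back to Arize.
# Column naming ("eval.<name>.label") is an assumption to verify against
# the Arize SDK documentation.
import pandas as pd

def to_eval_log_df(span_ids, labels, eval_name="retrieval_relevance"):
    """Pair each span with its judge label in a loggable dataframe."""
    return pd.DataFrame({
        "context.span_id": span_ids,
        f"eval.{eval_name}.label": labels,
    })
```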

Click on the retriever spans within each trace to view detailed evaluation results. You can also filter by evaluation outcome to quickly identify which queries successfully retrieved the most relevant documents.

Eval Result

Trace-Level Evaluation in the Arize UI

In this section, we will walk you through how to set up and run evaluations in the Arize UI. Specifically, we will run a trace-level evaluation to assess the answer quality of our agent.

  1. In the project containing your traces, go to Eval Tasks and select LLM as a Judge.

  2. Name your task and schedule it to run on historical data. Each task can include multiple evaluators, but this walkthrough focuses on setting up one.

  3. Choose a trace-level evaluation.

  4. From the predefined templates, select Q&A or another template of your choice. You can also create a custom evaluation. If you define your own, ensure the variables align with your trace structure and specify the output labels (rails).

  5. Click Create Evals. Your evaluations will begin running and will appear on your existing traces. Look for the eval result on the top span for each trace.

Trace Level Eval