Lab 6: Evals
Arize Agent Mastery Course: Evaluating Your Agent
So far, we have built our agent, added tooling, and implemented a RAG system that allows it to access information. Now, we are ready to run the agent and evaluate its outputs. Evaluations can take different forms (LLM-as-judge, code-based, or human review) and can be applied at various scopes, including the trace, span, and session level.
In this lab, we will demonstrate how to run evaluations in code and log the results to the Arize UI. We will also show you how to set up and run evaluations directly within the Arize UI.
Setup
Note that here we are tracing our agent outputs to a different project than in previous labs:
Define Tools
Create RAG System for Local Flavor Tool
Upload local_flavor.json file
Define Agent
Evaluate the Agent
The first step before evaluating our agent is to generate multiple runs using different query types. This way, our evaluation will cover many different cases. Running the two cells below will send requests to the agent and create traces that will appear in Arize. Unlike previous labs, these traces will be logged under a new project titled “evaluate-travel-agent”.
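The two cells referred to above can be sketched as follows. This is a minimal illustration, not the lab's exact code: `run_agent` stands in for the agent entry point defined earlier, and the sample queries are assumptions chosen to span different query types.

```python
# Sample queries covering different agent capabilities (illustrative only).
queries = [
    "What are the best local restaurants in Lisbon?",   # RAG / local flavor
    "Plan a 3-day itinerary for Tokyo.",                # planning
    "What's the weather like in Paris in May?",         # tool call
    "Tell me a fun fact about Iceland.",                # general knowledge
]

def generate_runs(agent_fn, queries):
    """Send each query to the agent and collect (query, response) pairs.

    Each call produces a trace that is exported to the configured
    Arize project (here, "evaluate-travel-agent")."""
    results = []
    for q in queries:
        results.append((q, agent_fn(q)))
    return results

# Example with a stand-in agent function:
runs = generate_runs(lambda q: f"stub answer for: {q}", queries)
```

Varying the query types matters because span-level evals later in the lab need retrieval steps to score, while trace-level evals benefit from a mix of easy and hard requests.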
Span-Level Evaluation via Arize Python SDK
Arize supports evaluations at multiple levels of granularity. You can evaluate individual steps in an agent’s run (spans) or the full workflow (trace).
Here, we’ll perform span-level evaluations on retrieval steps to measure how relevant the retrieved documents are to each query.
First, navigate to your project and click "Export to Notebook". From here, copy the export_model_to_df function in the code snippet to export your traces.

Next, we define the prompt template for our LLM Judge. Feel free to customize this!
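A judge template for retrieval relevance might look like the sketch below. The wording and the variable names (`{query}`, `{document}`) are illustrative assumptions; whatever names you choose must match the columns of the dataframe you pass to the evaluator.

```python
# Illustrative LLM Judge template for retrieval relevance.
# The {query} and {document} placeholders are filled per row of the
# exported spans dataframe.
RELEVANCE_TEMPLATE = """You are comparing a document to a user query.

Query: {query}
Document: {document}

Respond with exactly one word: "relevant" if the document helps answer
the query, or "irrelevant" if it does not."""

def render_prompt(query: str, document: str) -> str:
    """Fill the template for a single (query, document) pair."""
    return RELEVANCE_TEMPLATE.format(query=query, document=document)
```

Constraining the judge to a fixed set of one-word answers makes its output easy to parse into labels later.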
Then, we select the relevant columns from our spans dataframe and rename them to match the variables in the LLM Judge prompt.
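The renaming step can be sketched with pandas as below. The span column names shown are assumptions about the export format; check your own exported dataframe for the exact attribute names.

```python
import pandas as pd

# Stand-in for the exported spans dataframe; the column names here are
# assumed and may differ in your export.
spans_df = pd.DataFrame({
    "attributes.input.value": ["best food in Rome"],
    "attributes.retrieval.documents": ["Rome dining guide"],
    "context.span_id": ["abc123"],
})

# Rename columns so they line up with the {query} and {document}
# variables in the judge template, keeping the span ID for logging.
eval_df = spans_df.rename(columns={
    "attributes.input.value": "query",
    "attributes.retrieval.documents": "document",
})[["context.span_id", "query", "document"]]
```

Keeping the span ID column is what lets the evaluation results be joined back onto the correct spans when logging to Arize.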
Finally, we define our evaluators and run the evaluation. When the evaluation is done running, we log the results back to Arize.
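The evaluator run itself is an LLM-as-judge call over `eval_df` (e.g. with Phoenix's `llm_classify`), so it is not reproduced here; the sketch below shows one supporting piece, a helper that snaps raw judge output onto the allowed labels ("rails") so that only clean labels get logged back to Arize. The rails and helper are illustrative assumptions.

```python
# Allowed output labels (rails) for the relevance judge.
RAILS = ["relevant", "irrelevant"]

def snap_to_rails(raw_output: str, rails=RAILS, default="unparseable"):
    """Map a raw LLM response onto one of the allowed rails.

    Accepts exact matches and responses that merely begin with a rail
    (e.g. "relevant."); anything else falls through to the default."""
    text = raw_output.strip().lower()
    for rail in rails:
        if text == rail or text.startswith(rail):
            return rail
    return default
```

Note the rail order matters with prefix matching: "irrelevant" is checked against "relevant" first, but since it does not start with "relevant" it correctly falls through to its own rail.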
Click on the retriever spans within each trace to view detailed evaluation results. You can also filter by evaluation outcome to quickly identify which queries successfully retrieved the most relevant documents.

Trace-Level Evaluation in the Arize UI
In this section, we will walk you through how to set up and run evaluations in the Arize UI. Specifically, we will be running a trace level evaluation to determine the answer quality of our agent.
1. In the project containing your traces, go to Eval Tasks and select LLM as a Judge.
2. Name your task and schedule it to run on historical data. Each task can include multiple evaluators, but this walkthrough focuses on setting up one.
3. Choose a trace-level evaluation.
4. From the predefined templates, select Q&A or another template of your choice. You can also create a custom evaluation; if you define your own, ensure the variables align with your trace structure and specify the output labels (rails).
5. Click Create Evals. Your evaluations will begin running and will appear on your existing traces. Look for the eval result on the top span of each trace.
