Trace-Level Evals for a Movie Recommendation Agent
This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. By analyzing individual traces, each representing a single user request, you can see how well the system performs on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying where end-to-end performance succeeds or fails.
In this notebook, you will:
- Build and capture interactions (traces) from your movie recommendation agent
- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage
- Format the evaluation outputs to match Arize’s schema and log them to the platform
- Learn a robust pipeline for assessing trace-level performance
✅ You will need a free Arize AX account and an OpenAI API key to run this notebook.
Set Up Keys & Dependencies
Configure Tracing
Build Movie Recommendation System
First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:
- Movie Selector: Based on the genre requested by the user, choose up to 5 recent movies available for streaming
- Reviewer: Find reviews for a movie. If given a list of movies, sort them from highest to lowest rating.
- Preview Summarizer: For each movie, return a 1-2 sentence description
In the ideal flow, the user simply tells the system what type of movie they are looking for and gets back a list of options with descriptions and reviews.
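To make the three tools and the ideal flow concrete, here is a minimal sketch in plain Python. It is not the notebook's actual implementation: the in-memory `CATALOG`, the placeholder movie titles and ratings, and the function names `movie_selector`, `reviewer`, `preview_summarizer`, and `recommend` are all assumptions made for illustration. In the real agent, an LLM would decide when to call each tool.

```python
from typing import Dict, List

# Hypothetical in-memory catalog standing in for a real streaming data source.
# Titles, ratings, and summaries are invented placeholders.
CATALOG: Dict[str, List[dict]] = {
    "comedy": [
        {"title": "Sample Comedy A", "rating": 6.8, "summary": "Two rivals share a food truck."},
        {"title": "Sample Comedy B", "rating": 8.2, "summary": "A wedding planner loses the rings."},
        {"title": "Sample Comedy C", "rating": 7.5, "summary": "Roommates adopt a very loud parrot."},
    ],
}

# Index by title so the reviewer and summarizer can look movies up directly.
_BY_TITLE = {m["title"]: m for movies in CATALOG.values() for m in movies}


def movie_selector(genre: str, limit: int = 5) -> List[str]:
    """Return up to `limit` titles for the requested genre."""
    return [m["title"] for m in CATALOG.get(genre.lower(), [])][:limit]


def reviewer(titles: List[str]) -> List[dict]:
    """Look up ratings and sort from highest to lowest."""
    rated = [{"title": t, "rating": _BY_TITLE[t]["rating"]} for t in titles if t in _BY_TITLE]
    return sorted(rated, key=lambda m: m["rating"], reverse=True)


def preview_summarizer(titles: List[str]) -> Dict[str, str]:
    """Return a short description for each known title."""
    return {t: _BY_TITLE[t]["summary"] for t in titles if t in _BY_TITLE}


def recommend(genre: str) -> List[dict]:
    """Ideal flow: select movies, rank them by rating, attach descriptions."""
    titles = movie_selector(genre)
    summaries = preview_summarizer(titles)
    return [{**m, "summary": summaries[m["title"]]} for m in reviewer(titles)]
```

Calling `recommend("comedy")` returns the three placeholder movies ordered by rating, each with its summary attached.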
Let's test our agent & view traces in Arize

Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.
Get Span Data from Arize
Before running our evaluations, we retrieve the span data from Arize, group the spans by trace, and separate the input and output values.
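The grouping step can be sketched as follows, using plain dictionaries rather than an exported dataframe. The field names (`trace_id`, `span_id`, `parent_id`, `kind`, `start_time`, `input`, `output`) are simplified placeholders and not the exact column names of an Arize export. The sketch also assumes that the root span of each trace (the one without a parent) carries the user request and the final answer.

```python
from collections import defaultdict
from typing import Dict, List


def group_spans_by_trace(spans: List[dict]) -> Dict[str, dict]:
    """Collapse a flat list of spans into one record per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    traces = {}
    for trace_id, trace_spans in by_trace.items():
        # Assumption: the root span has no parent and holds the
        # trace-level input (user request) and output (final answer).
        root = next((s for s in trace_spans if s.get("parent_id") is None), None)
        if root is None:
            continue  # skip traces whose root span was not exported
        ordered = sorted(trace_spans, key=lambda s: s["start_time"])
        traces[trace_id] = {
            "span_id": root["span_id"],
            "input": root.get("input"),
            "output": root.get("output"),
            # Tool names in call order, useful for a tool-usage evaluation.
            "tool_calls": [s["name"] for s in ordered if s.get("kind") == "TOOL"],
        }
    return traces
```

Keeping the root span's ID alongside the input and output is a deliberate choice here: it gives later steps a stable key for attaching evaluation results to the trace.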
Define and Run Evaluators
In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge.
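As a rough illustration of the LLM-as-judge pattern, the sketch below builds a relevance prompt for one trace and parses the judge's reply into a label, score, and explanation. The template wording, the labels `relevant` and `irrelevant`, and the `judge` callable are all assumptions. `judge` stands in for whatever function sends a prompt to your model and returns its text reply, and it is not part of any specific library.

```python
from typing import Callable, Dict

# Illustrative template only; the notebook's actual evaluation prompts may differ.
RELEVANCE_TEMPLATE = """You are evaluating a movie recommendation agent.

User request: {input}
Agent response: {output}

Are the recommendations relevant to the request?
Answer on the first line with exactly one word: "relevant" or "irrelevant".
Then explain your reasoning on the following lines."""


def parse_judgement(raw: str, positive: str, negative: str) -> Dict[str, object]:
    """Split a judge reply into label, score, and explanation."""
    lines = raw.strip().splitlines()
    first = lines[0].strip().lower().strip('."') if lines else ""
    explanation = "\n".join(lines[1:]).strip()
    if first == positive:
        label, score = positive, 1
    elif first == negative:
        label, score = negative, 0
    else:
        # Keep unexpected replies visible instead of guessing a label.
        label, score = "unparseable", None
    return {"label": label, "score": score, "explanation": explanation}


def evaluate_relevance(trace: Dict[str, str], judge: Callable[[str], str]) -> Dict[str, object]:
    """Run the relevance evaluation for one trace-level record."""
    prompt = RELEVANCE_TEMPLATE.format(input=trace["input"], output=trace["output"])
    return parse_judgement(judge(prompt), "relevant", "irrelevant")
```

A tool-usage evaluator can follow the same shape with a different template and label pair. Passing the judge in as a callable also lets you test the parsing logic with a stub before spending tokens on a real model.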
Log Results Back to Arize
The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations.
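Before logging, each evaluation result has to be flattened into one row per trace with a column for every label, score, and explanation. The sketch below shows that reshaping step only, not the logging call itself. The column pattern `trace_eval.<name>.<field>` and the `context.span_id` key are assumptions about the schema Arize expects, so confirm both against the current Arize documentation. The function name `to_eval_rows` is hypothetical.

```python
from typing import Dict, List


def to_eval_rows(
    results: Dict[str, Dict[str, dict]],
    root_span_ids: Dict[str, str],
    prefix: str = "trace_eval",  # assumed column prefix for trace-level evals
) -> List[dict]:
    """Flatten {trace_id: {eval_name: result}} into one row per trace."""
    rows = []
    for trace_id, evals in results.items():
        # Assumption: evaluations attach to the trace via its root span ID.
        row = {"context.span_id": root_span_ids[trace_id]}
        for eval_name, result in evals.items():
            for field in ("label", "score", "explanation"):
                row[f"{prefix}.{eval_name}.{field}"] = result.get(field)
        rows.append(row)
    return rows
```

The resulting list of dictionaries can be turned into a dataframe (for example with `pandas.DataFrame(rows)`) before it is handed to the logging client.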
