Arize AI Trace Level Evals

Trace-Level Evals for a Movie Recommendation Agent

This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. Each trace represents a single user request, so evaluating at the trace level shows how well the system performs on each interaction. Trace-level evaluations are particularly valuable for identifying end-to-end successes and failures.

In this notebook, you will:

Build and capture interactions (traces) from your movie recommendation agent
Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage
Format the evaluation outputs to match Arize’s schema and log them to the platform
Learn a robust pipeline for assessing trace-level performance

✅ You will need a free Arize AX account and an OpenAI API key to run this notebook.

Set Up Keys & Dependencies

[ ]

Configure Tracing

[ ]

Build Movie Recommendation System

First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:

Movie Selector: Based on the genre the user asks for, choose up to 5 recent movies available for streaming.
Reviewer: Find reviews for a movie. If given a list of movies, sort them from highest to lowest rating.
Preview Summarizer: For each movie, return a 1-2 sentence description.

In the ideal flow, the user simply tells the system what type of movie they are looking for and gets back a list of options with descriptions and reviews.
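To make the three tools concrete, here is a minimal sketch of them as plain Python functions. It is not the notebook's actual implementation: the `Movie` class, the `CATALOG` placeholder data, and the function names `movie_selector`, `reviewer`, and `preview_summarizer` are all assumptions for illustration, and a real agent would most likely back these tools with an LLM or a live data source and register them with an agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    title: str
    genre: str
    year: int
    rating: float  # hypothetical average review score, 0-10
    summary: str


# Hypothetical placeholder catalog; a real agent would query a live source.
CATALOG = [
    Movie("Sample Comedy One", "comedy", 2024, 7.2, "Placeholder summary one."),
    Movie("Sample Comedy Two", "comedy", 2023, 8.1, "Placeholder summary two."),
    Movie("Sample Comedy Three", "comedy", 2022, 6.4, "Placeholder summary three."),
    Movie("Sample Thriller One", "thriller", 2024, 7.9, "Placeholder summary four."),
]
BY_TITLE = {m.title: m for m in CATALOG}


def movie_selector(genre: str, limit: int = 5) -> list[str]:
    """Return up to `limit` titles in the genre, most recent first."""
    wanted = genre.strip().lower()
    matches = [m for m in CATALOG if m.genre == wanted]
    matches.sort(key=lambda m: m.year, reverse=True)
    return [m.title for m in matches[:limit]]


def reviewer(titles: list[str]) -> list[tuple[str, float]]:
    """Return (title, rating) pairs sorted from highest to lowest rating."""
    rated = [(t, BY_TITLE[t].rating) for t in titles if t in BY_TITLE]
    return sorted(rated, key=lambda pair: pair[1], reverse=True)


def preview_summarizer(titles: list[str]) -> dict[str, str]:
    """Return a short description for each known title."""
    return {t: BY_TITLE[t].summary for t in titles if t in BY_TITLE}
```

In the ideal flow, the agent would call `movie_selector` first, pass its result to `reviewer`, and then call `preview_summarizer` on the same titles before composing the final answer.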

Let's test our agent & view traces in Arize

[ ]

Results

Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.

[ ]

Get Span Data from Arize

Before running evaluations, we retrieve the span data from Arize, group the spans by trace, and separate the input and output values.
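The grouping step can be sketched with the standard library alone. This sketch assumes the exported spans are available as a list of dictionaries (for example, the rows of an exported table) and that the column names follow OpenInference-style conventions; every column name below is an assumption and should be checked against the actual export. Taking the input and output from the root span is also a design choice of this sketch, not necessarily what the notebook does.

```python
from collections import defaultdict

# Assumed column names; verify them against your actual span export.
TRACE_ID = "context.trace_id"
SPAN_ID = "context.span_id"
PARENT_ID = "parent_id"
SPAN_KIND = "attributes.openinference.span.kind"
INPUT = "attributes.input.value"
OUTPUT = "attributes.output.value"
START = "start_time"


def group_spans_by_trace(spans):
    """Collapse a flat list of span records into one record per trace."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span[TRACE_ID]].append(span)

    traces = []
    for trace_id, trace_spans in by_trace.items():
        trace_spans.sort(key=lambda s: s[START])
        # The root span has no parent; fall back to the earliest span.
        root = next(
            (s for s in trace_spans if s.get(PARENT_ID) is None),
            trace_spans[0],
        )
        traces.append({
            "trace_id": trace_id,
            "root_span_id": root[SPAN_ID],
            "input": root.get(INPUT),
            "output": root.get(OUTPUT),
            # Names of tool spans in call order, useful for tool-usage evals.
            "tool_calls": [
                s["name"] for s in trace_spans if s.get(SPAN_KIND) == "TOOL"
            ],
        })
    return traces
```

Each resulting record carries the user request, the final response, and the ordered tool calls for one trace, which is the unit the evaluators operate on.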

[ ]

Define and Run Evaluators

In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge.
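As a rough illustration of the LLM-as-judge pattern, the sketch below fills a prompt template with one trace's data, sends it to a judge, and parses the reply into a label, score, and explanation. The template wording, the `evaluate_trace` helper, and the `correct`/`incorrect` labels are assumptions, not the notebook's actual templates. The `judge` argument is any function that maps a prompt string to a reply string; in practice it would wrap a call to an OpenAI model.

```python
# Hypothetical tool-usage template; the notebook's real wording may differ.
TOOL_USAGE_TEMPLATE = """You are evaluating a movie recommendation agent.

[User request]: {input}
[Tools called, in order]: {tool_calls}
[Final response]: {output}

Did the agent call appropriate tools for this request?
Answer "correct" or "incorrect" on the first line,
then give a one-sentence explanation on the next line.
"""


def evaluate_trace(trace, template, judge, rails=("correct", "incorrect")):
    """Run one LLM-as-judge evaluation on a single trace record.

    `rails` lists the allowed labels; the first one counts as a pass.
    """
    prompt = template.format(
        input=trace["input"],
        output=trace["output"],
        tool_calls=", ".join(trace["tool_calls"]) or "none",
    )
    reply = judge(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    # Match the whole first line so "incorrect" is never read as "correct".
    label = first_line.strip().strip(".\"'").lower()
    if label not in rails:
        return {"label": "unparseable", "score": None, "explanation": reply}
    return {
        "label": label,
        "score": 1.0 if label == rails[0] else 0.0,
        "explanation": rest.strip(),
    }
```

Passing the judge in as a function keeps the parsing logic testable without an API key, and a relevance evaluator could reuse `evaluate_trace` with a different template and label set.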

[ ]

Log Results Back to Arize

The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations.
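Before logging, each evaluation result has to be reshaped into flat columns keyed by a span ID. The sketch below shows one way to do that. The `trace_eval.<name>.label` / `.score` / `.explanation` column pattern and the use of the root span's `context.span_id` as the key are assumptions here and should be confirmed against the current Arize documentation; the upload call itself is omitted because it depends on the Arize client version.

```python
def to_eval_rows(results, eval_name, prefix="trace_eval"):
    """Flatten evaluation results into one row per trace.

    `results` is a list of dicts with keys root_span_id, label, score,
    and explanation. The column naming is an assumed convention.
    """
    rows = []
    for r in results:
        rows.append({
            "context.span_id": r["root_span_id"],
            f"{prefix}.{eval_name}.label": r["label"],
            f"{prefix}.{eval_name}.score": r["score"],
            f"{prefix}.{eval_name}.explanation": r["explanation"],
        })
    return rows
```

The resulting rows can be loaded into a table (one per evaluator, or merged on `context.span_id`) and passed to the logging client.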

[ ]

Trace Evals in Arize