Arize AI

Session Level Evals
Session Level Evals for an AI Tutor

This tutorial demonstrates how to run session-level evaluations on conversations with an AI tutor. You'll log the results back to Phoenix for further monitoring and analysis. Session-level evaluations are valuable because they provide a holistic view of the entire interaction, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.

In this tutorial, you will:

  • Trace and aggregate multi-turn interactions into structured sessions
  • Evaluate sessions across multiple dimensions such as Correctness, Goal Completion, and Frustration
  • Format the evaluation outputs to match the Phoenix schema and log them to the platform

By the end, you’ll have a robust evaluation pipeline for analyzing and comparing session-level performance.

✅ You’ll need a free Phoenix Cloud account and an Anthropic API key to run this notebook.

Set up Dependencies & Keys

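The original notebook cells are not shown here; the sketch below is a minimal stand-in. It installs the packages this tutorial relies on and checks that the Phoenix Cloud and Anthropic credentials are in place. The exact package list and the `PHOENIX_COLLECTOR_ENDPOINT` / `PHOENIX_CLIENT_HEADERS` variable names follow the standard Phoenix Cloud setup; treat both as assumptions to verify against your Phoenix account page.

```python
# Install dependencies first (run in a notebook cell):
#   %pip install -q arize-phoenix anthropic openinference-instrumentation-anthropic

import os

# Phoenix Cloud endpoint; the default below is the hosted Phoenix URL.
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "https://app.phoenix.arize.com")

# Both keys must be set before running the rest of the notebook.
for key in ("PHOENIX_CLIENT_HEADERS", "ANTHROPIC_API_KEY"):
    if key not in os.environ:
        print(f"Set {key} before running the rest of the notebook.")
```

`PHOENIX_CLIENT_HEADERS` carries your Phoenix API key in the form `api_key=<your key>`.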

Configure Tracing

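A minimal tracing setup, assuming the environment variables from the previous step. `phoenix.otel.register` wires up an OpenTelemetry tracer provider pointed at Phoenix; the project name `"ai-tutor"` is an assumption, so substitute your own.

```python
from phoenix.otel import register

tracer_provider = register(
    project_name="ai-tutor",  # spans will appear under this project in Phoenix
    auto_instrument=True,     # enables any installed OpenInference instrumentors
)
```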

Build and Run AI Tutor

In this example, we demonstrate how to evaluate AI tutor sessions. The tutor begins by receiving a user ID, topic, and question. It then explains the topic to the student and engages them with follow-up questions in a multi-turn conversation, continuing until the student ends the session. Our goal is to assess the overall quality of this interaction from start to finish.

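Since the original tutor cells are not reproduced here, the following is a sketch of one way to build the multi-turn loop. The `build_tutor_messages` helper, the system prompt, the hard-coded student turns, and the model name are all illustrative assumptions; the key idea is wrapping the loop in `using_session` from OpenInference so every span carries the same `session.id` and Phoenix can stitch the turns into one session. The API calls only run when `ANTHROPIC_API_KEY` is set (and the `anthropic` / `openinference` packages are installed).

```python
import os
import uuid

def build_tutor_messages(history, user_text):
    """Append the student's latest message to the running conversation."""
    return history + [{"role": "user", "content": user_text}]

SYSTEM_PROMPT = (
    "You are a patient AI tutor. Explain the requested topic clearly, "
    "then ask one follow-up question to check the student's understanding."
)

if os.getenv("ANTHROPIC_API_KEY"):
    import anthropic
    from openinference.instrumentation import using_session

    client = anthropic.Anthropic()
    session_id = f"tutor-session-{uuid.uuid4().hex[:8]}"
    history = []

    # Every span created inside this block is tagged with the same
    # session.id, which is what session-level grouping relies on.
    with using_session(session_id):
        for turn in ["Explain Bayes' theorem", "Can you give an example?"]:
            history = build_tutor_messages(history, turn)
            reply = client.messages.create(
                model="claude-3-5-sonnet-20240620",  # assumed model name
                max_tokens=512,
                system=SYSTEM_PROMPT,
                messages=history,
            )
            history.append({"role": "assistant", "content": reply.content[0].text})
```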

Prepare Spans for Session-Level Evaluation

The following cells prepare the data for session-level evaluation. We start by loading all spans into a DataFrame, then sort them chronologically and group them by session ID. (You could also group the spans by user ID.)

Next, we separate user inputs from AI responses and store the structured results in a DataFrame, which we will use to run our evaluations.

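The loading step can be sketched as follows. In the notebook, the spans come straight from Phoenix via `px.Client().get_spans_dataframe(...)`; the tiny stand-in frame below only exists to make the sort-then-group step concrete, and the column names (`attributes.session.id`, `attributes.input.value`, etc.) are assumptions about the exported span schema.

```python
import pandas as pd

# In the notebook, replace the stand-in frame with real spans:
#     import phoenix as px
#     spans_df = px.Client().get_spans_dataframe(project_name="ai-tutor")
spans_df = pd.DataFrame({
    "context.span_id": ["s1", "s2", "s3"],
    "attributes.session.id": ["sess-a", "sess-a", "sess-b"],
    "start_time": pd.to_datetime(
        ["2024-01-01 10:02", "2024-01-01 10:00", "2024-01-01 10:01"]
    ),
    "attributes.input.value": ["Q2", "Q1", "Q1"],
    "attributes.output.value": ["A2", "A1", "A1"],
})

# Sort chronologically, then group spans that share a session ID.
spans_df = spans_df.sort_values("start_time")
sessions = spans_df.groupby("attributes.session.id", sort=False)
for session_id, session_spans in sessions:
    print(session_id, len(session_spans))
```

Grouping by `attributes.user.id` instead would give per-user rather than per-session slices.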

Here, we group our spans together to make a session dataframe. We also include logic to truncate part of the session messages if token limits are exceeded. This prevents context window issues for longer sessions.

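One way to implement the grouping and truncation is sketched below. It collapses one-row-per-span into one-row-per-session and keeps only the tail of over-long transcripts. The character budget standing in for a token limit, the column names, and the transcript format are all assumptions.

```python
import pandas as pd

MAX_CHARS = 4000  # crude character budget standing in for a token limit

def truncate_session(text, max_chars=MAX_CHARS):
    """Keep only the tail of an over-long transcript; the most recent
    turns usually carry the most signal for session-level judgments."""
    if len(text) <= max_chars:
        return text
    return "[...earlier turns truncated...]\n" + text[-max_chars:]

def build_session_df(spans_df):
    """Collapse one row per span into one row per session."""
    rows = []
    for session_id, g in spans_df.groupby("attributes.session.id", sort=False):
        g = g.sort_values("start_time")
        transcript = "\n".join(
            f"User: {q}\nTutor: {a}"
            for q, a in zip(g["attributes.input.value"], g["attributes.output.value"])
        )
        rows.append({
            "session_id": session_id,
            "root_span_id": g["context.span_id"].iloc[0],  # first span anchors the session
            "session": truncate_session(transcript),
        })
    return pd.DataFrame(rows)

# Demo on a stand-in spans frame (column names are assumptions):
spans_df = pd.DataFrame({
    "context.span_id": ["s1", "s2"],
    "attributes.session.id": ["sess-a", "sess-a"],
    "start_time": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 10:05"]),
    "attributes.input.value": ["What is gravity?", "Why 9.8?"],
    "attributes.output.value": ["A force...", "Earth's mass..."],
})
session_df = build_session_df(spans_df)
```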

Session Correctness Eval

We are ready to begin running our evals. Let's start with an eval that ensures the AI tutor is giving the student factual information:

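A sketch of the correctness eval, using Phoenix's `llm_classify`, which fills template variables like `{session}` from the matching DataFrame columns. The prompt wording, the stand-in `session_df`, and the Anthropic model name are assumptions; the real call only runs when `ANTHROPIC_API_KEY` is set and `phoenix.evals` is installed.

```python
import os
import pandas as pd

CORRECTNESS_TEMPLATE = """You are evaluating a tutoring session for factual accuracy.
Read the full transcript and decide whether the tutor's explanations were
factually correct throughout the session.

Transcript:
{session}

Respond with a single word: "correct" or "incorrect".
"""

# Stand-in; in the notebook this is the session DataFrame built above.
session_df = pd.DataFrame({"session": ["User: What is 2 + 2?\nTutor: 2 + 2 = 4."]})

try:
    from phoenix.evals import AnthropicModel, llm_classify
    HAVE_PHOENIX_EVALS = True
except ImportError:
    HAVE_PHOENIX_EVALS = False

if HAVE_PHOENIX_EVALS and os.getenv("ANTHROPIC_API_KEY"):
    model = AnthropicModel(model="claude-3-5-sonnet-20240620")  # assumed model name
    correctness_evals = llm_classify(
        dataframe=session_df,
        template=CORRECTNESS_TEMPLATE,
        model=model,
        rails=["correct", "incorrect"],   # constrain output to these labels
        provide_explanation=True,
    )
```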

Session Frustration Eval

This evaluation is used to make sure the student isn't getting frustrated with the tutor:

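The frustration eval follows the same `llm_classify` pattern as the correctness eval, just with a different prompt and rails. The wording below is an illustrative assumption, not the notebook's original prompt.

```python
FRUSTRATION_TEMPLATE = """You are judging whether the student became frustrated
during this tutoring session. Signs of frustration include repeatedly rephrasing
the same question, complaints about the answers, or abruptly ending the session.

Transcript:
{session}

Respond with a single word: "frustrated" or "not_frustrated".
"""

FRUSTRATION_RAILS = ["frustrated", "not_frustrated"]
```

Pass `template=FRUSTRATION_TEMPLATE, rails=FRUSTRATION_RAILS` to the same `llm_classify` call used for correctness.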

Session Goal Achievement Eval

Finally, we evaluate to ensure the tutor helped the student reach their learning goals:

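The goal-achievement eval is again the same pattern with its own prompt and rails; the wording here is an assumption standing in for the notebook's original prompt.

```python
GOAL_TEMPLATE = """You are judging whether the student achieved their learning goal
in this tutoring session. Consider whether their original question was answered
and whether their follow-up responses show understanding.

Transcript:
{session}

Respond with a single word: "achieved" or "not_achieved".
"""

GOAL_RAILS = ["achieved", "not_achieved"]
```

As before, plug these into `llm_classify` with `provide_explanation=True` to get a rationale alongside each label.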

Log Evaluations Back to Phoenix

Finally, we can log the evaluation results back to Phoenix. In the Sessions tab of your project, you will see the evaluation results populate for each session.

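Phoenix expects evaluation DataFrames indexed by span ID, with `label` and optionally `score` / `explanation` columns; anchoring each session's result to its root span is one way to surface session-level results, which is an assumption of this sketch. The stand-in results below take the place of real `llm_classify` output, and the upload only runs when Phoenix is installed and configured.

```python
import os
import pandas as pd

# Stand-in eval results; in the notebook these come from llm_classify.
# The index must be the span IDs the evals attach to (here, root spans).
correctness_evals = pd.DataFrame(
    {"label": ["correct"], "explanation": ["The tutor's arithmetic was right."]},
    index=pd.Index(["s1"], name="context.span_id"),
)
correctness_evals["score"] = (correctness_evals["label"] == "correct").astype(int)

try:
    import phoenix as px
    from phoenix.trace import SpanEvaluations
    HAVE_PHOENIX = True
except ImportError:
    HAVE_PHOENIX = False

if HAVE_PHOENIX and os.getenv("PHOENIX_CLIENT_HEADERS"):
    px.Client().log_evaluations(
        SpanEvaluations(eval_name="Session Correctness", dataframe=correctness_evals)
    )
```

Repeat the `log_evaluations` call for the frustration and goal-achievement DataFrames, giving each a distinct `eval_name`.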

Session Eval Results