Arize AI Datasets Experiments Quickstart Python

This tutorial demonstrates how to use AX Datasets & Experiments to systematically evaluate and improve AI agents. You'll learn how to create datasets, define task functions that run your agent on each example, and use both code-based and LLM-as-a-Judge evaluators to measure performance. By the end, you'll be able to run experiments that compare different agent versions and track improvements over time, enabling data-driven development and deployment decisions.

The notebook covers four main sections. Follow the documentation for the complete tutorial.

Define Agent: Set up a customer support agent with tools for ticket classification and policy retrieval, using the agno framework
Create a Dataset: Build a dataset of support ticket queries with ground truth labels, then upload it to Phoenix
Define an Experiment: Create task functions and evaluators (code-based and LLM judges), then run experiments to measure agent performance and compare different versions
Iterations with Experiments: Compare different agent versions using experiments to validate improvements before deployment
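
The workflow in the list above can be sketched in plain Python without any SDK. Everything in this sketch is a hypothetical stand-in: the `run_experiment` helper, the two toy agent versions, and the example tickets are assumptions made for illustration, not the Arize AX client API or the notebook's actual data.

```python
from statistics import mean

# Hypothetical dataset: each example pairs a ticket query with a ground truth label.
dataset = [
    {"query": "I was charged twice this month", "expected_category": "billing"},
    {"query": "The app crashes when I log in", "expected_category": "technical"},
    {"query": "How do I update my email address?", "expected_category": "account"},
]

def agent_v1(query: str) -> str:
    """Toy baseline: labels every ticket as 'other'."""
    return "other"

def agent_v2(query: str) -> str:
    """Toy improved version: a few keyword rules."""
    text = query.lower()
    if "charged" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "technical"
    if "email" in text or "password" in text:
        return "account"
    return "other"

def exact_match(output: str, example: dict) -> float:
    """Code-based evaluator: 1.0 when the output equals the ground truth."""
    return 1.0 if output == example["expected_category"] else 0.0

def run_experiment(task, evaluator, examples) -> float:
    """Run the task on every example, score each output, return the mean score."""
    scores = [evaluator(task(ex["query"]), ex) for ex in examples]
    return mean(scores)

baseline_score = run_experiment(agent_v1, exact_match, dataset)
candidate_score = run_experiment(agent_v2, exact_match, dataset)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

Keeping the task and the evaluator as separate callables means the same dataset and evaluator can be reused unchanged when a new agent version is tested.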

Define Support Agent

This agent is a customer support assistant that helps users resolve their issues by classifying tickets and retrieving relevant policies. The agent has two tools: classify_ticket, which categorizes support tickets into billing, technical, account, or other categories, and retrieve_policy, which fetches the appropriate internal support policy based on the ticket category.
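
To make the two tools concrete, here is a minimal sketch of what they might look like as plain Python functions. The function names and the four categories come from the description above; the keyword rules and the policy text are placeholders invented for illustration, and the notebook's real tools may well work differently (for example, by calling a model).

```python
# Hypothetical keyword rules; a stand-in for whatever logic the real tool uses.
_KEYWORDS = {
    "billing": ("invoice", "charge", "refund", "payment"),
    "technical": ("error", "crash", "bug", "timeout"),
    "account": ("password", "login", "email", "profile"),
}

# Placeholder policy text, not real support policies.
_POLICIES = {
    "billing": "Placeholder billing policy.",
    "technical": "Placeholder technical policy.",
    "account": "Placeholder account policy.",
    "other": "Placeholder general policy.",
}

def classify_ticket(ticket_text: str) -> str:
    """Return 'billing', 'technical', 'account', or 'other' for a ticket."""
    text = ticket_text.lower()
    for category, keywords in _KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return "other"

def retrieve_policy(category: str) -> str:
    """Return the policy for a category, falling back to the general policy."""
    return _POLICIES.get(category, _POLICIES["other"])

print(retrieve_policy(classify_ticket("I need a refund for my last invoice")))
```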

Section 1: Create a Dataset

Section 2: Define an Experiment

Run an Experiment to Check Tool Call Accuracy (Code-Based Evaluator)

This is our tool function from above:

Since our "baseline" examples have a ground truth field, we can use a code-based evaluator to check whether the task output matches what we expect.
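
A code-based evaluator of this kind can be as small as a normalized string comparison. The sketch below assumes the task output and the ground truth are both plain strings; the `EvalResult` class and the `matches_ground_truth` name are hypothetical and are not taken from the notebook or from any SDK.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float       # 1.0 for a match, 0.0 otherwise
    label: str         # "correct" or "incorrect"
    explanation: str   # human-readable reason, useful when reviewing failures

def matches_ground_truth(output: str, ground_truth: str) -> EvalResult:
    """Compare output and ground truth, ignoring case and surrounding whitespace."""
    got = output.strip().lower()
    want = ground_truth.strip().lower()
    correct = got == want
    return EvalResult(
        score=1.0 if correct else 0.0,
        label="correct" if correct else "incorrect",
        explanation=f"expected {want!r}, got {got!r}",
    )

print(matches_ground_truth(" Billing ", "billing"))
```

Returning a label and an explanation alongside the numeric score makes individual failures easier to inspect than a bare 0 or 1.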

Run an Experiment to Understand Overall Agent Performance (LLM-as-a-Judge Evaluator)
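
An LLM-as-a-Judge evaluator generally formats a prompt from the input and the agent's response, asks a model for a verdict, and maps that verdict to a score. The following is a minimal sketch under stated assumptions: the prompt wording, the "helpful"/"unhelpful" labels, and the `call_llm` callable are all hypothetical, and `fake_llm` is a stub that stands in for a real model call so the example runs offline.

```python
from typing import Callable

# Hypothetical judge prompt; the notebook's actual template is not shown here.
JUDGE_TEMPLATE = """You are evaluating a customer support agent.

Ticket: {query}
Agent response: {response}

Answer with exactly one word: helpful or unhelpful."""

def llm_judge(query: str, response: str, call_llm: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and convert it to a label and score."""
    prompt = JUDGE_TEMPLATE.format(query=query, response=response)
    verdict = call_llm(prompt).strip().lower()
    if verdict not in ("helpful", "unhelpful"):
        # Surface unexpected judge output instead of silently scoring it.
        return {"label": "unparseable", "score": None}
    return {"label": verdict, "score": 1.0 if verdict == "helpful" else 0.0}

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return "Helpful"

print(llm_judge("I was charged twice", "I have issued a refund.", fake_llm))
```

Passing the model call in as a parameter keeps the evaluator testable and lets the judge model be swapped without touching the scoring logic.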

CI/CD with Experiments
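
One common way to use experiment results in a CI/CD pipeline is as a gate: the pipeline fails when a candidate agent scores below a floor or regresses against the baseline. The sketch below is an assumption about how such a gate could look; the `gate` and `exit_code` helpers and the threshold values are hypothetical, not something prescribed by the notebook.

```python
def gate(candidate_score: float, baseline_score: float,
         min_score: float = 0.8, tolerance: float = 0.02) -> bool:
    """Pass only if the candidate meets the floor and does not regress."""
    meets_floor = candidate_score >= min_score
    no_regression = candidate_score >= baseline_score - tolerance
    return meets_floor and no_regression

def exit_code(candidate_score: float, baseline_score: float) -> int:
    """Map the gate result to a process exit code (0 = pass, 1 = fail)."""
    return 0 if gate(candidate_score, baseline_score) else 1

# In a CI script, the result would typically be passed to sys.exit().
print(exit_code(candidate_score=0.85, baseline_score=0.86))
```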
