Arize AI Agents Cookbook

Agents Cookbook

agentsarize-tutorialsLLMPython

alph-notebooks/arize-tutorials / agents-cookbook.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

Docs | GitHub | Slack Community

Using Arize with AI agents

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

Create a customer support agent using a router template
Trace the agent activity, including function calling
Create a dataset to benchmark performance
Evaluate agent performance using code, human annotation, and LLM as a judge
Experiment with different prompts and models

Initial setup

We'll setup our libraries, keys, and OpenAI tracing using Phoenix.

Install Libraries

[ ]

Setup Keys

[ ]

Setup Tracing

To follow with this tutorial, you'll need to sign up for Arize and get your API key. You can see the guide here.

[ ]

Create customer support agent

We'll be creating a customer support agent using function calling following the architecture below:

Setup functions and create customer support agent

We have 6 functions that we define below.

product_comparison
product_search
customer_support
track_package
product_details
apply_discount_code

[ ]

We define a function below called run_prompt, which uses the chat completion call from OpenAI with functions

[ ]

Let's test it and see if it returns the right function! Based on whether we set tool_choice to "auto" or "required", the router will have different behavior.

[ ]

Now we have a basic agent, let's generate a dataset of questions and run the prompt against this dataset!

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our customer support agent.

[ ]

Now let's use this dataset and run it against the router prompt above!

[ ]

Evaluating your agent

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

Here, we are defining our evaluation templates to judge whether the router selected a function correctly, whether it selected the right function, and whether it filled the arguments correctly.

[ ]

Let's run evaluations using Phoenix's llm_classify function for our responses dataframe we generated above!

[ ]

Let's look at and inspect the results of our evaluatiion!

[ ]

Create an experiment

With our dataset of questions we generated above, we can use our experiments feature to track changes across models, prompts, parameters for our agent.

Let's create this dataset and upload it into the platform.

[ ]