Notebooks
A
Arize AI
Agents Cookbook

Agents Cookbook

agentsarize-tutorialsLLMPython

arize logo
Docs | GitHub | Slack Community

Using Arize with AI agents

This guide shows you how to create and evaluate agents with Arize to improve performance. We'll go through the following steps:

  • Create a customer support agent using a router template

  • Trace the agent activity, including function calling

  • Create a dataset to benchmark performance

  • Evaluate agent performance using code, human annotation, and LLM as a judge

  • Experiment with different prompts and models

Initial setup

We'll setup our libraries, keys, and OpenAI tracing using Phoenix.

Install Libraries

[ ]

Setup Keys

[ ]

Setup Tracing

To follow with this tutorial, you'll need to sign up for Arize and get your API key. You can see the guide here.

[ ]

Create customer support agent

We'll be creating a customer support agent using function calling following the architecture below:

Setup functions and create customer support agent

We have 6 functions that we define below.

  1. product_comparison
  2. product_search
  3. customer_support
  4. track_package
  5. product_details
  6. apply_discount_code
[ ]

We define a function below called run_prompt, which uses the chat completion call from OpenAI with functions

[ ]

Let's test it and see if it returns the right function! Based on whether we set tool_choice to "auto" or "required", the router will have different behavior.

[ ]

Now we have a basic agent, let's generate a dataset of questions and run the prompt against this dataset!

Create synthetic dataset of questions

Using the template below, we're going to generate a dataframe of 25 questions we can use to test our customer support agent.

[ ]
[ ]
[ ]
[ ]

Now let's use this dataset and run it against the router prompt above!

[ ]
[ ]

Evaluating your agent

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

Here, we are defining our evaluation templates to judge whether the router selected a function correctly, whether it selected the right function, and whether it filled the arguments correctly.

[ ]

Let's run evaluations using Phoenix's llm_classify function for our responses dataframe we generated above!

[ ]

Let's look at and inspect the results of our evaluatiion!

[ ]
[ ]
[ ]

Create an experiment

With our dataset of questions we generated above, we can use our experiments feature to track changes across models, prompts, parameters for our agent.

Let's create this dataset and upload it into the platform.

[ ]
[ ]
[ ]