Arize AI Agent Trajectory

Agent Trajectory Evaluation

This notebook demonstrates how to evaluate whether an agent's tool-calling trajectory matches an expected pattern. An agent trajectory is the sequence of actions (tool calls) an agent takes to accomplish a task.

Why this matters: Evaluating agent trajectories helps you:

Understand if your agent follows expected problem-solving paths
Identify inefficient or incorrect tool usage patterns
Debug agent behavior

Setup

Configure your environment variables and import dependencies. You'll need to set up your Arize API key and import necessary libraries for data processing and evaluation.

[ ]
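The setup cell's code is not reproduced here. As a minimal sketch of the environment check it describes, the snippet below reads credentials from environment variables and fails early when one is missing. The variable names `ARIZE_API_KEY` and `ARIZE_SPACE_ID` and the helper `load_settings` are assumptions for illustration, not names confirmed by the notebook.

```python
import os

# Hypothetical variable names; substitute whatever your Arize setup expects.
REQUIRED_VARS = ("ARIZE_API_KEY", "ARIZE_SPACE_ID")


def load_settings(env=None):
    """Return the required settings, raising if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` with no argument reads from the process environment; passing a dictionary makes the check easy to exercise without touching real credentials.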

Data Extraction

Pull trace data from Arize and prepare it for analysis.

[ ]

Prompt Template Definition

The evaluation uses a prompt template that instructs the LLM to compare the agent's actual trajectory against the golden trajectory. You can customize this template to fit your specific evaluation criteria.

Prompt Variables

Variable	Description	Source
`{reference_outputs}`	The golden/expected trajectory	From your reference data
`{tool_calls}`	The actual trajectory executed by the agent	Extracted from trace data

Customizing the Prompt

You may want to adjust the evaluation criteria or output format based on your specific use case:

Add specific criteria relevant to your agent's domain
Include additional metadata

[ ]
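The notebook's actual template text is not shown here. The sketch below illustrates only the mechanics of a template with the two variables from the table above. The wording of `TRAJECTORY_TEMPLATE` and the helper `render_prompt` are assumptions, not the notebook's prompt.

```python
import json

# Illustrative wording only; the two placeholders come from the table above.
TRAJECTORY_TEMPLATE = """You are comparing an agent's tool-calling trajectory to a golden trajectory.

[Golden trajectory]
{reference_outputs}

[Actual trajectory]
{tool_calls}

Answer with a single word, "correct" or "incorrect"."""


def render_prompt(reference_outputs, tool_calls):
    """Fill both template variables with JSON so the judge sees structured input."""
    return TRAJECTORY_TEMPLATE.format(
        reference_outputs=json.dumps(reference_outputs),
        tool_calls=json.dumps(tool_calls),
    )
```

Domain-specific criteria or extra metadata would be added as further instructions or placeholders in the template string.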

Data Preparation

These functions filter and transform trace data into the format needed for evaluation.

Core concepts:

Trace filtering: Selecting which agent executions to evaluate
Span filtering: Selecting which parts of each execution to analyze
Tool call extraction: Identifying the sequence of actions taken

The filter_spans_by_trace_criteria function is particularly important as it allows you to:

Select relevant traces using trace-level filters (e.g., by user query type, duration)
Focus on specific spans within those traces (e.g., only LLM-generated tool calls)

This two-level filtering gives you fine-grained control over your evaluation data.

[ ]
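The body of `filter_spans_by_trace_criteria` is not shown here, so the following is only a sketch of the two-level idea, using plain dictionaries in place of dataframe rows. The function name `two_level_filter`, the `trace_id` key, and the rule that a trace is selected when any of its spans matches the trace filters are all assumptions.

```python
import operator

# Operators supported by this sketch; extend as needed.
OPS = {
    "==": operator.eq,
    "contains": lambda value, needle: needle in str(value),
}


def matches(row, filters):
    """True if the row satisfies every {column: {op: expected}} condition."""
    return all(
        column in row and OPS[op](row[column], expected)
        for column, conditions in filters.items()
        for op, expected in conditions.items()
    )


def two_level_filter(spans, trace_filters, span_filters, trace_id_key="trace_id"):
    """Keep spans passing span_filters whose trace has a span passing trace_filters."""
    selected = {s[trace_id_key] for s in spans if matches(s, trace_filters)}
    return [
        s for s in spans
        if s[trace_id_key] in selected and matches(s, span_filters)
    ]


# Made-up spans for illustration.
spans = [
    {"trace_id": "t1", "name": "searchrouter", "attributes.openinference.span.kind": "CHAIN"},
    {"trace_id": "t1", "name": "llm_call", "attributes.openinference.span.kind": "LLM"},
    {"trace_id": "t2", "name": "other", "attributes.openinference.span.kind": "LLM"},
]
kept = two_level_filter(
    spans,
    trace_filters={"name": {"contains": "searchrouter"}},
    span_filters={"attributes.openinference.span.kind": {"==": "LLM"}},
)
```

Here only the LLM span of trace `t1` survives: trace `t2` never matches the trace-level filter, and the `searchrouter` span itself is not an LLM span.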

Evaluation Configuration

Reference outputs define your golden path: which tools should be called and in what order. These represent your expectation of the ideal agent behavior for a given task.

Note: Comparing against a golden path is only meaningful when the task has a deterministic expected sequence of tool calls.

[ ]
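As a small illustration of how an ordered golden path might be represented, the sketch below assumes reference outputs are a mapping from step number (as a string) to tool name, which is one reading of the example `{"1": "get_llm_table_search"}` used later in this notebook. The helper `ordered_reference` and the tool name `run_query` are hypothetical.

```python
def ordered_reference(reference_outputs):
    """Return expected tool names sorted by their numeric step key."""
    return [reference_outputs[k] for k in sorted(reference_outputs, key=int)]


# "run_query" is a made-up second step, included only to show ordering.
reference_outputs = {"2": "run_query", "1": "get_llm_table_search"}
golden_path = ordered_reference(reference_outputs)
```

Sorting with `key=int` keeps step `"10"` after step `"2"`, which plain string sorting would not.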

Filter Data

Customize these parameters to match your specific evaluation needs:

Parameter	Description	Example
reference_outputs	Expected tool calls	`{"1": "get_llm_table_search"}`
trace_filters	Criteria for selecting traces	`{"name": {"contains": "searchrouter"}}`
span_filters	Criteria for selecting spans within traces	`{"attributes.openinference.span.kind": {"==": "LLM"}}`

Span filters determine which spans within the matched traces are used for the evaluation. For example, filtering on `"attributes.openinference.span.kind": {"==": "LLM"}` ensures that only LLM spans within the selected traces are analyzed.

Note: Update the trace_filters and span_filters to match your specific evaluation criteria.

[ ]

Next, extract the tool calls from each span's output messages so they can be used in the evaluation.

[ ]
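The extraction code itself is not shown here. The sketch below assumes each span's output messages arrive as a list of dictionaries with flattened OpenInference-style keys (`message.tool_calls`, `tool_call.function.name`). Those key names and the helper `extract_tool_calls` are assumptions, so check them against your exported data before relying on them.

```python
def extract_tool_calls(output_messages):
    """Collect tool names, in order, from a list of output-message dicts."""
    names = []
    for message in output_messages or []:
        # Assumed key layout; messages without tool calls are skipped.
        for call in message.get("message.tool_calls") or []:
            name = call.get("tool_call.function.name")
            if name:
                names.append(name)
    return names
```

Returning an empty list for spans with no tool calls keeps later grouping steps simple, since every span contributes a list.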

Prepare the data for the evaluation

This step groups the prompt variables by trace_id, extracts the required columns, and appends any additional data to the dataframe.

[ ]
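A minimal sketch of the grouping step, using plain dictionaries rather than a dataframe. It assumes each span row already carries a `tool_calls` list and that rows are ordered by start time. The helper `group_by_trace` and the row layout are hypothetical.

```python
from collections import defaultdict


def group_by_trace(span_rows, reference_outputs, trace_id_key="trace_id"):
    """Build one evaluation row per trace, concatenating tool calls in row order."""
    calls_by_trace = defaultdict(list)
    for row in span_rows:
        calls_by_trace[row[trace_id_key]].extend(row["tool_calls"])
    return [
        {
            trace_id_key: trace_id,
            "tool_calls": calls,
            "reference_outputs": reference_outputs,
        }
        for trace_id, calls in calls_by_trace.items()
    ]
```

Each resulting row holds both prompt variables, so it can be passed directly to whatever fills the template.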

Running the Evaluation

After preparing your traces and configuring the evaluation parameters, you can execute the LLM-based evaluation:

[ ]
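The evaluator call used by the notebook is not shown here. To illustrate the shape of the loop without depending on a particular LLM client, the sketch below takes a `judge` callable that returns a `(label, explanation)` pair. That interface, the helper `evaluate_rows`, and the handling of unexpected labels are assumptions; the label names `correct` and `incorrect` are the ones this notebook reports in its results.

```python
ALLOWED_LABELS = ("correct", "incorrect")


def evaluate_rows(rows, judge, template):
    """Ask the judge about each row and normalise its answer to an allowed label."""
    results = []
    for row in rows:
        prompt = template.format(
            reference_outputs=row["reference_outputs"],
            tool_calls=row["tool_calls"],
        )
        label, explanation = judge(prompt)
        label = label.strip().lower()
        results.append({
            "trace_id": row["trace_id"],
            # Anything outside the allowed labels is recorded as None.
            "label": label if label in ALLOWED_LABELS else None,
            "explanation": explanation,
        })
    return results
```

In practice `judge` would wrap a call to your LLM provider. Restricting the output to a fixed label set makes malformed responses visible instead of silently counting them.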

Analyzing Results

The evaluation results contain:

label: Overall trajectory assessment (correct/incorrect)
explanation: Detailed reasoning for the assessment

[ ]

The evaluation results can then be merged with your original data for analysis or to log back to Arize:

[ ]
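A minimal sketch of the merge, again with plain dictionaries standing in for dataframe rows. It assumes results and original rows share a `trace_id` key. The helpers `merge_results` and `label_counts` are hypothetical.

```python
from collections import Counter


def merge_results(rows, results, key="trace_id"):
    """Left-join results onto rows; rows without a result get None fields."""
    by_key = {r[key]: r for r in results}
    merged = []
    for row in rows:
        result = by_key.get(row[key], {})
        merged.append({
            **row,
            "label": result.get("label"),
            "explanation": result.get("explanation"),
        })
    return merged


def label_counts(merged):
    """Count how many traces received each label."""
    return Counter(row["label"] for row in merged)
```

A left join keeps traces that were never evaluated, so gaps show up as `None` labels rather than disappearing from the analysis.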

See your results in Arize