Agent Trajectory Evaluation
This notebook demonstrates how to evaluate whether an agent's tool calling trajectory matches expected patterns. Agent trajectories represent the sequence of actions (tool calls) an agent takes to accomplish a task.
Why this matters: Evaluating agent trajectories helps you:
- Understand if your agent follows expected problem-solving paths
- Identify inefficient or incorrect tool usage patterns
- Debug agent behavior
Setup
Configure your environment variables and import dependencies. You'll need to set up your Arize API key and import necessary libraries for data processing and evaluation.
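As a minimal sketch of the environment check, the snippet below reads credentials from environment variables and fails early when one is missing. The variable names `ARIZE_API_KEY` and `ARIZE_SPACE_ID` and the helper `load_arize_config` are assumptions for illustration; use whatever names your Arize setup expects.

```python
import os

# Assumed variable names; confirm the exact ones in your Arize account settings.
REQUIRED_VARS = ("ARIZE_API_KEY", "ARIZE_SPACE_ID")


def load_arize_config(env=os.environ):
    """Return the required settings, failing early if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.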
Data Extraction
Pull trace data from Arize and prepare it for analysis.
Prompt Template Definition
The evaluation uses a carefully designed prompt template that instructs the LLM how to compare actual agent trajectories against golden trajectories. You can customize this template to fit your specific evaluation criteria.
Prompt Variables
| Variable | Description | Source |
|---|---|---|
| {reference_outputs} | The golden/expected trajectory | From your reference data |
| {tool_calls} | The actual trajectory executed by the agent | Extracted from trace data |
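To make the two variables concrete, here is an illustrative template and a helper that fills it in. The wording of the template and the name `render_prompt` are assumptions for this sketch and do not reproduce the notebook's actual prompt.

```python
import json

# Illustrative template only; the notebook's real prompt wording is not shown here.
TRAJECTORY_EVAL_TEMPLATE = """You are evaluating an agent's tool calling trajectory.

[Golden trajectory]
{reference_outputs}

[Actual trajectory]
{tool_calls}

Respond with a single word, "correct" or "incorrect", followed by a brief explanation.
"""


def render_prompt(reference_outputs, tool_calls):
    """Fill both template variables, serializing each trajectory as JSON."""
    return TRAJECTORY_EVAL_TEMPLATE.format(
        reference_outputs=json.dumps(reference_outputs),
        tool_calls=json.dumps(tool_calls),
    )
```

Serializing both trajectories the same way keeps the comparison focused on tool names and order rather than on formatting differences.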
Customizing the Prompt
You may want to adjust the evaluation criteria or output format based on your specific use case:
- Add specific criteria relevant to your agent's domain
- Include additional metadata
Data Preparation
These functions filter and transform trace data into the format needed for evaluation.
Core concepts:
- Trace filtering: Selecting which agent executions to evaluate
- Span filtering: Selecting which parts of each execution to analyze
- Tool call extraction: Identifying the sequence of actions taken
The filter_spans_by_trace_criteria function is particularly important as it allows you to:
- Select relevant traces using trace-level filters (e.g., by user query type, duration)
- Focus on specific spans within those traces (e.g., only LLM-generated tool calls)
This two-level filtering gives you fine-grained control over your evaluation data.
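The two-level idea can be sketched in plain Python over a list of span dictionaries. This is a stand-in for the concept, not the notebook's `filter_spans_by_trace_criteria` implementation: the name `filter_spans_two_level`, the `context.trace_id` key, and the set of supported operators are all assumptions.

```python
# Assumed key holding the trace identifier on each span record.
TRACE_ID_KEY = "context.trace_id"


def matches(span, filters):
    """True when the span satisfies every {column: {operator: value}} filter."""
    for column, condition in filters.items():
        actual = span.get(column)
        for op, expected in condition.items():
            if op == "==":
                ok = actual == expected
            elif op == "contains":
                ok = isinstance(actual, str) and expected in actual
            else:
                raise ValueError(f"Unsupported operator: {op}")
            if not ok:
                return False
    return True


def filter_spans_two_level(spans, trace_filters, span_filters):
    """Keep spans that pass span_filters and belong to a trace in which
    at least one span passes trace_filters."""
    # Level 1: find the traces that qualify.
    trace_ids = {s[TRACE_ID_KEY] for s in spans if matches(s, trace_filters)}
    # Level 2: within those traces, keep only the spans of interest.
    return [
        s for s in spans
        if s[TRACE_ID_KEY] in trace_ids and matches(s, span_filters)
    ]
```

Note that the span matching the trace-level filter does not need to pass the span-level filter itself; it only selects the trace.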
Evaluation Configuration
Reference outputs define your golden path - what tools should be called and in what order. These represent your expectation of the ideal agent behavior for a given task.
Note: Comparing against a golden path only makes sense when the task has a deterministic path, meaning a single expected sequence of tool calls.
Filter Data
Customize these parameters to match your specific evaluation needs:
| Parameter | Description | Example |
|---|---|---|
| reference_outputs | Expected tool calls | {"1": "get_llm_table_search"} |
| trace_filters | Criteria for selecting traces | {"name": {"contains": "searchrouter"}} |
| span_filters | Criteria for selecting spans within traces | {"attributes.openinference.span.kind": {"==": "LLM"}} |
Span filters determine which spans within the matched traces are used for the evaluation. For example, filtering on attributes.openinference.span.kind == "LLM" ensures that only LLM spans within the selected traces are analyzed.
Note: Update the trace_filters and span_filters to match your specific evaluation criteria.
Next, extract the tool calls from the output messages so they can be used in the evaluation.
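A minimal sketch of this extraction, assuming each output message is a dictionary that follows the OpenInference flattened key style (`message.tool_calls`, `tool_call.function.name`). The helper name `extract_tool_calls` and the exact message shape are assumptions; inspect your exported data before relying on them.

```python
def extract_tool_calls(output_messages):
    """Return the tool names in call order as a numbered mapping,
    mirroring the {"1": "tool_name"} shape used for reference outputs."""
    names = []
    for message in output_messages or []:
        # Messages without tool calls contribute nothing.
        for call in message.get("message.tool_calls") or []:
            name = call.get("tool_call.function.name")
            if name:
                names.append(name)
    return {str(i): name for i, name in enumerate(names, start=1)}
```

Producing the same numbered shape as the reference outputs makes the actual and expected trajectories directly comparable in the prompt.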
Prepare the data for the evaluation
This step groups the prompt variables by trace_id, extracts the required columns, and appends any additional data to the dataframe.
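The grouping step can be sketched without a dataframe as follows. The record keys `trace_id` and `tool_names` and the helper `group_tool_calls_by_trace` are hypothetical, and the sketch assumes spans arrive in execution order.

```python
from collections import defaultdict


def group_tool_calls_by_trace(spans, reference_outputs):
    """Build one evaluation row per trace holding both prompt variables."""
    calls_by_trace = defaultdict(list)
    for span in spans:  # assumed to be sorted by start time already
        calls_by_trace[span["trace_id"]].extend(span["tool_names"])
    return [
        {
            "trace_id": trace_id,
            "tool_calls": {str(i): n for i, n in enumerate(names, start=1)},
            "reference_outputs": reference_outputs,
        }
        for trace_id, names in calls_by_trace.items()
    ]
```

Each resulting row carries everything the prompt template needs, so the evaluation can run once per trace rather than once per span.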
Running the Evaluation
After preparing your traces and configuring the evaluation parameters, you can execute the LLM-based evaluation:
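The overall flow can be sketched with the model call abstracted away. Here `llm` is any callable that takes a prompt string and returns text, and the reply is assumed to put the label on its first line; both are assumptions of this sketch, and the notebook's actual evaluation call may differ.

```python
LABELS = ("correct", "incorrect")


def parse_response(text):
    """Split a reply into label and explanation, assuming the label is on line one."""
    stripped = text.strip()
    first, _, rest = stripped.partition("\n")
    label = first.strip().lower().strip('."')
    if label not in LABELS:
        # Unparseable reply: keep the raw text so it can be inspected later.
        return {"label": None, "explanation": stripped}
    return {"label": label, "explanation": rest.strip()}


def run_evaluation(rows, template, llm):
    """Render the prompt for each row, call the model, and collect parsed results."""
    results = []
    for row in rows:
        prompt = template.format(
            reference_outputs=row["reference_outputs"],
            tool_calls=row["tool_calls"],
        )
        results.append({"trace_id": row["trace_id"], **parse_response(llm(prompt))})
    return results
```

Keeping the model behind a plain callable lets you dry-run the pipeline with a stub before spending tokens on a real model.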
Analyzing Results
The evaluation results contain:
- label: Overall trajectory assessment (correct/incorrect)
- explanation: Detailed reasoning for the assessment
The evaluation results can then be merged with your original data for analysis or to log back to Arize:
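A minimal sketch of the merge, written as a left join on `trace_id` over plain dictionaries. The helper name `merge_results` and the record shapes are assumptions for illustration; with dataframes the same join would typically be done on the trace identifier column.

```python
def merge_results(rows, eval_results):
    """Left-join evaluation results onto the original rows by trace_id."""
    by_trace = {r["trace_id"]: r for r in eval_results}
    merged = []
    for row in rows:
        result = by_trace.get(row["trace_id"], {})
        merged.append({
            **row,
            # Rows without a result keep None so gaps stay visible.
            "label": result.get("label"),
            "explanation": result.get("explanation"),
        })
    return merged
```

A left join keeps traces that were never evaluated in the output, which makes missing results easy to spot during analysis.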
See your results in Arize
