Tool Calling Eval Dataset
Tool Calling Evaluation — Dataset Preparation
This notebook is a companion resource to the How to Evaluate Tool-Calling Agents with Phoenix tutorial (link here).
It uploads the travel-assistant-tool-calling dataset and travel-assistant prompt to your Phoenix instance — the starting point for the full evaluation workflow covered in the tutorial.
Install Dependencies
Section 1: Define the Tool Set
Six tools define the capabilities of the travel planning assistant used in the tutorial.
| Tool | Description |
|---|---|
| search_flights | Search available flights between two cities on a given date |
| get_weather | Get current weather or forecast for a location |
| search_hotels | Find hotels in a city for given dates and guest count |
| get_directions | Get travel directions and estimated time between two locations |
| convert_currency | Convert an amount from one currency to another |
| search_restaurants | Find restaurants in a location by cuisine or criteria |
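As a rough illustration of what a tool definition in this set could look like, the sketch below writes two of the six tools in the OpenAI-style function-calling format. The helper `make_tool` and every parameter name (`origin`, `destination`, `date`, `location`) are assumptions made for this example; the schemas used in the tutorial may differ.

```python
def make_tool(name, description, properties, required):
    """Wrap a tool definition in an OpenAI-style function-calling envelope.

    Hypothetical helper for illustration; not part of Phoenix.
    """
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Parameter names below are assumed; only the tool names and
# descriptions come from the table above.
search_flights = make_tool(
    "search_flights",
    "Search available flights between two cities on a given date",
    {
        "origin": {"type": "string", "description": "Departure city"},
        "destination": {"type": "string", "description": "Arrival city"},
        "date": {"type": "string", "description": "Travel date, YYYY-MM-DD"},
    },
    ["origin", "destination", "date"],
)

get_weather = make_tool(
    "get_weather",
    "Get current weather or forecast for a location",
    {"location": {"type": "string", "description": "City or place name"}},
    ["location"],
)

TOOLS = [search_flights, get_weather]
```

The remaining four tools would follow the same shape, each with its own argument list.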
Section 2: Load the Evaluation Dataset
The evaluation dataset contains 30 travel assistant queries with ground truth tool calls, covering three scenarios:
| Pattern | Count | Description |
|---|---|---|
| Single-tool | 18 | One tool needed; tests parameter extraction, implicit dates, ambiguous phrasing |
| Parallel (2 tools) | 10 | Two tools needed simultaneously; all 10 two-tool combinations represented |
| No tool needed | 2 | General travel questions the assistant should answer directly |
Each query has an expected_tool_calls label with the full tool name and arguments.
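To make the three patterns concrete, here is a small sketch with one made-up example per pattern and a function that classifies an example by how many tool calls it expects. The queries and argument names are invented for illustration and are not taken from the actual dataset.

```python
from collections import Counter

# Illustrative examples only; not rows from the real dataset.
examples = [
    {
        "query": "What's the weather in Lisbon this weekend?",
        "expected_tool_calls": [
            {"name": "get_weather", "arguments": {"location": "Lisbon"}},
        ],
    },
    {
        "query": "Find flights from Boston to Rome on 2025-06-01 and hotels in Rome",
        "expected_tool_calls": [
            {
                "name": "search_flights",
                "arguments": {
                    "origin": "Boston",
                    "destination": "Rome",
                    "date": "2025-06-01",
                },
            },
            {"name": "search_hotels", "arguments": {"city": "Rome"}},
        ],
    },
    {
        "query": "What is the best time of year to visit Japan?",
        "expected_tool_calls": [],  # the assistant should answer directly
    },
]


def pattern(example):
    """Classify an example by the number of expected tool calls."""
    n = len(example["expected_tool_calls"])
    if n == 0:
        return "no_tool"
    if n == 1:
        return "single_tool"
    return "parallel"


counts = Counter(pattern(e) for e in examples)
```

Running the same tally over the full dataset is a quick way to confirm the 18 / 10 / 2 split in the table.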
Section 3: Build the Dataset DataFrame
The dataset has two columns:
| Column | Type | Purpose |
|---|---|---|
| query | string | User's travel query; mapped to {{query}} in the experiment prompt |
| expected_tool_calls | JSON string | Full name + arguments for each call; used for invocation alignment evaluation |
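The two-column layout can be sketched with the standard library alone: each row pairs the raw query with its expected calls serialized to a JSON string. The helper `to_row` and the sample query are assumptions for this example.

```python
import json


def to_row(query, expected_calls):
    """Build one dataset row; the expected calls are stored as a JSON string."""
    return {
        "query": query,
        "expected_tool_calls": json.dumps(expected_calls),
    }


# Made-up sample; argument names are illustrative.
rows = [
    to_row(
        "How much is 100 USD in euros?",
        [
            {
                "name": "convert_currency",
                "arguments": {"amount": 100, "from": "USD", "to": "EUR"},
            }
        ],
    ),
    to_row("What is the best time of year to visit Japan?", []),
]

# The JSON string round-trips back to the original structure.
decoded = json.loads(rows[0]["expected_tool_calls"])
```

Passing `rows` to `pandas.DataFrame(rows)` would yield a DataFrame with the same two columns, should you want to work with it in pandas.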
Section 4: Upload to Phoenix
The cell below launches an in-process Phoenix server — no additional setup required. Run it and open the printed URL to access the UI.
If you'd prefer to connect to an existing instance, skip that cell and set your connection details before running the upload cells:
- Phoenix Cloud: Set PHOENIX_COLLECTOR_ENDPOINT to your workspace URL and PHOENIX_API_KEY to your API key (both available at phoenix.arize.com under Settings → API Keys).
- Existing local server: If Phoenix is already running (e.g. via python -m phoenix.server.main serve), Client() connects to http://localhost:6006 automatically, so no environment variables are needed.
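For the Phoenix Cloud case, one way to set the connection details from inside the notebook is through `os.environ`, before any client is created. The helper name `configure_phoenix_cloud` and both values below are placeholders for illustration; substitute your own workspace URL and key.

```python
import os


def configure_phoenix_cloud(endpoint, api_key):
    """Export the two environment variables named above.

    Hypothetical helper; call it before running the upload cells.
    """
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
    os.environ["PHOENIX_API_KEY"] = api_key


# Placeholder values, not a real workspace or key.
configure_phoenix_cloud(
    "https://example.invalid/your-workspace",
    "your-api-key",
)
```

Avoid committing a real API key to the notebook; reading it from a secrets manager or an interactive prompt is a safer habit.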
Section 5: Create a Phoenix Prompt with the Tool Set
This creates a versioned travel-assistant prompt in Phoenix with all six tool schemas attached. Once pushed, it will be available in Phoenix UI → Prompts → travel-assistant and can be selected directly when creating an experiment.
Next Steps
The travel-assistant-tool-calling dataset and travel-assistant prompt are now in Phoenix. The full walkthrough is covered in the tutorial — here's a quick reference for the steps that happen in the UI.
1. Run an experiment
Open Phoenix → Datasets → travel-assistant-tool-calling → New Experiment.
Select the travel-assistant prompt in the playground and run the experiment.
2. Add evaluators
After the experiment completes, click Add Evaluator:
- Tool Selection: from the built-in template; map input to your dataset's input column
- Tool Invocation: same input mapping
- Matches Expected (optional): create a custom LLM evaluator to compare output tool calls against the labeled expected_tool_calls column
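The Matches Expected evaluator described above is LLM-based. As a point of comparison, the sketch below shows a deterministic alternative: an exact, order-insensitive match between the output tool calls and the labeled JSON string. It is a different technique from the LLM evaluator, and it assumes the output calls are already available as a list of dicts with `name` and `arguments` keys.

```python
import json


def normalize(calls):
    """Canonicalize calls to a sorted list of (name, arguments-as-JSON) pairs."""
    return sorted(
        (c["name"], json.dumps(c.get("arguments", {}), sort_keys=True))
        for c in calls
    )


def matches_expected(output_calls, expected_tool_calls):
    """Return True if the output calls equal the labeled calls, ignoring order.

    expected_tool_calls is the JSON string stored in the dataset column.
    """
    expected = json.loads(expected_tool_calls)
    return normalize(output_calls) == normalize(expected)


# Made-up example: same two calls, listed in a different order.
output_calls = [
    {"name": "get_weather", "arguments": {"location": "Rome"}},
    {"name": "search_hotels", "arguments": {"city": "Rome", "guests": 2}},
]
label = json.dumps(
    [
        {"name": "search_hotels", "arguments": {"guests": 2, "city": "Rome"}},
        {"name": "get_weather", "arguments": {"location": "Rome"}},
    ]
)
result = matches_expected(output_calls, label)
```

Exact matching is stricter than an LLM judge: it would reject "NYC" where the label says "New York", so it works best as a complement to the LLM evaluator rather than a replacement.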
3. Inspect and iterate
Review per-example explanations to identify failure patterns. Look for:
- Systematic issues (like date assumptions) → fix the system prompt
- Evaluator over-strictness → adjust the evaluator prompt
- Missing capabilities (like "current date") → extend the tool set
Rerun the experiment and compare versions side by side.