AI Evals HW2 Solution
HW 2: Recipe Bot Error Analysis
Assignment Overview
This notebook helps you perform error analysis for your Recipe Bot by:
- Part 1: Generate Test Queries - Create diverse queries using key dimensions
- Part 2: Run & Annotate - Test your bot and identify failure patterns
- Part 3: Create Taxonomy - Build structured failure mode categories
Goal: Systematically identify what goes wrong with your bot and why.
Part 1: Define Dimensions & Generate Initial Queries
Step 1.1: Identify Key Dimensions
Identify 3-4 key dimensions relevant to your Recipe Bot's functionality. For each dimension, list at least 3 example values.
Step 1.2: Generate Unique Combinations (Tuples)
Generate 15-20 unique combinations of these dimension values using programmatic sampling.
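The programmatic sampling above can be sketched with `itertools.product` plus `random.sample`; the dimension names and values below are illustrative placeholders, not required choices — substitute your own from Step 1.1.

```python
import itertools
import random

# Hypothetical Recipe Bot dimensions; replace with your own from Step 1.1.
dimensions = {
    "cuisine": ["Italian", "Thai", "Mexican"],
    "dietary_restriction": ["vegan", "gluten-free", "none"],
    "skill_level": ["beginner", "intermediate", "advanced"],
    "time_budget": ["under 15 min", "30-60 min", "no limit"],
}

# Build every possible combination, then sample 15-20 unique tuples.
all_combos = list(itertools.product(*dimensions.values()))
random.seed(42)  # fixed seed so the sample is reproducible
sampled = random.sample(all_combos, 18)

for combo in sampled[:3]:
    print(dict(zip(dimensions.keys(), combo)))
```

Sampling from the full cross-product (rather than picking values ad hoc) guarantees the tuples are unique and spread across all dimensions.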
Step 1.3: Generate Natural Language User Queries
Take 5-7 of the generated tuples and write a natural language user query for each one. Review these queries to ensure they are realistic and representative of how a user might actually interact with your bot.
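One lightweight way to turn a tuple into a query is a template function; in practice you might prompt an LLM instead. This is a minimal sketch with hypothetical dimension names:

```python
# Sketch: convert a sampled dimension tuple into a user-style query.
# Dimension names and phrasing are illustrative assumptions.
def tuple_to_query(cuisine, dietary_restriction, skill_level, time_budget):
    restriction = "" if dietary_restriction == "none" else f" {dietary_restriction}"
    return (
        f"I'm a {skill_level} cook looking for a{restriction} "
        f"{cuisine} recipe I can make in {time_budget}."
    )

query = tuple_to_query("Thai", "vegan", "beginner", "under 15 min")
print(query)
```

Template output tends to sound uniform, which is exactly why the quality review below matters: vary the wording so queries read like real users.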
Quality Review
Review the generated queries to make sure they're diverse and realistic:
Save Dataset
Save the dataset for testing:
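A minimal pandas sketch for the save step; the column names (and the single example row) are assumptions — keep whichever columns your notebook actually produced, ideally the query plus its source dimensions:

```python
import pandas as pd

# Illustrative row; in the notebook this would be all generated queries.
rows = [
    {
        "query": "I'm a beginner cook looking for a vegan Thai recipe "
                 "I can make in under 15 min.",
        "cuisine": "Thai",
        "dietary_restriction": "vegan",
        "skill_level": "beginner",
        "time_budget": "under 15 min",
    },
]
df = pd.DataFrame(rows)
# index=False keeps the CSV clean for upload to Phoenix.
df.to_csv("generated_synthetic_queries.csv", index=False)
```

Keeping the dimension columns alongside each query lets you later slice failure rates by dimension, not just overall.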
Upload to Phoenix
You can either:
- Option A: Manually upload the CSV file to Phoenix UI
- Option B: Use the SDK upload below
Part 1 Complete
What you now have:
- Your generated test queries, saved as CSV
- Dataset uploaded to Phoenix (ready for testing)
- Systematic coverage across key user dimensions
Next steps:
- Go to Phoenix UI
- Run your Recipe Bot on these queries
- Annotate problems you find
- Come back to this notebook for analysis
Part 2: Initial Error Analysis
Step 2.1: Run Bot on Synthetic Queries
- Upload Dataset: Load your synthetic queries into Phoenix playground
- Configure Bot: Import your Recipe Bot prompt
- Run Tests: Execute all queries through your bot
- Record Results: Save the interaction traces
Step 2.2: Open Coding
Review the recorded traces and perform open coding to identify themes, patterns, and potential errors in your bot's responses.
What to look for:
- Factual errors or incorrect recommendations
- Confusing or unhelpful responses
- Inconsistent behavior across similar queries
- Format and communication issues
How to annotate:
- Be specific about what went wrong
- Note why something is problematic for users
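Open-coding notes can be kept as lightweight records alongside each trace; the field names below are an illustrative sketch, not a Phoenix schema:

```python
# Sketch of an open-coding annotation record.
# Field names and values are hypothetical examples.
annotation = {
    "trace_id": "trace-001",  # placeholder ID
    "query": "Quick vegan Thai recipe, 15 minutes",
    "observation": "Suggested fish sauce despite the vegan constraint",  # what went wrong
    "why_it_matters": "Violates a stated dietary restriction",  # impact on the user
}
print(annotation["observation"])
```

Separating the *what* (`observation`) from the *why* (`why_it_matters`) mirrors the two annotation guidelines above and makes the later axial-coding step easier.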
Part 3: Axial Coding & Taxonomy Definition
Step 3.1: Export Annotated Traces
Export your traces, along with their annotations, from Phoenix.
Step 3.2: Axial Coding & Taxonomy Definition
Group your observations from open coding into broader categories or failure modes. We'll use an LLM to make this easier!
What the LLM will do:
- Find Patterns: Analyze all your annotations to identify common themes
- Create Categories: Generate 4-6 systematic failure mode labels
- Apply Labels: Classify each trace using the discovered failure modes
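The three LLM tasks above can be driven by a single prompt assembled from your annotations. The wording below is an illustrative sketch (no actual model call is shown); send the resulting `prompt` to your LLM of choice:

```python
# Sketch: assemble an axial-coding prompt from open-coding annotations.
# Annotation strings are placeholders; prompt wording is an assumption.
annotations = [
    "Suggested fish sauce despite the vegan constraint",
    "Recipe skipped the baking temperature",
]
prompt = (
    "You are analyzing failure annotations for a Recipe Bot.\n"
    "1. Find common themes across the annotations.\n"
    "2. Create 4-6 failure mode categories, each with a clear title "
    "and a one-sentence definition.\n"
    "3. Assign each annotation to exactly one category.\n\n"
    "Annotations:\n" + "\n".join(f"- {a}" for a in annotations)
)
print(prompt)
```

Asking for a fixed number of categories (4-6) keeps the taxonomy small enough to be actionable.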
What you'll get:
- Clear Title for each failure mode
- One-sentence Definition explaining the failure
- 1-2 Examples from your actual bot traces
- Labeled dataset with each trace classified
Example failure modes:
- "Dietary Mismatch" - Bot suggests food that violates stated dietary restrictions
- "Missing Steps" - Recipe instructions are incomplete or unclear
- "Wrong Context" - Bot misunderstands what the user is asking for
Summary & Expected Outputs
What You'll Create
Files you'll generate:
- generated_synthetic_queries.csv - Your test dataset
- labeled_synthetic_data.csv - Your final analysis with failure mode labels
Steps to Complete
- Run Part 1 code - Generate test queries and upload to Phoenix
- Part 2 (Phoenix UI) - Run your prompt on queries, annotate problems with open coding
- Run Part 3 code - Export traces, use LLM to discover patterns and create taxonomy
What Part 3 Creates
The LLM analysis will automatically generate:
- Failure mode categories discovered from your annotations
- Systematic classification of each trace
- Complete taxonomy with definitions and examples
- Analysis spreadsheet with binary failure mode columns
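The binary failure mode columns mentioned above can be built by expanding per-trace label lists, one column per failure mode. A minimal pandas sketch with illustrative data:

```python
import pandas as pd

# Illustrative labeled traces; each trace may carry zero or more failure modes.
labeled = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3"],
    "failure_modes": [["Dietary Mismatch"], [], ["Missing Steps", "Wrong Context"]],
})

# One binary (0/1) column per failure mode, for easy counting and filtering.
all_modes = ["Dietary Mismatch", "Missing Steps", "Wrong Context"]
for mode in all_modes:
    labeled[mode] = labeled["failure_modes"].apply(lambda ms: int(mode in ms))

print(labeled[["trace_id"] + all_modes])
```

With this layout, `labeled[mode].mean()` gives the failure rate for each mode directly.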