AI Evals HW2 Solution
HW 2: Recipe Bot Error Analysis
Assignment Overview
This notebook helps you perform error analysis for your Recipe Bot by:
- Part 1: Generate Test Queries - Create diverse queries using key dimensions
- Part 2: Run & Annotate - Test your bot and identify failure patterns
- Part 3: Create Taxonomy - Build structured failure mode categories
Goal: Systematically identify what goes wrong with your bot and why.
Part 1: Define Dimensions & Generate Initial Queries
Step 1.1: Identify Key Dimensions
Identify 3-4 key dimensions relevant to your Recipe Bot's functionality. For each dimension, list at least 3 example values.
Step 1.2: Generate Unique Combinations (Tuples)
Generate 15-20 unique combinations of these dimension values using programmatic sampling.
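The programmatic sampling above can be sketched with `itertools.product` plus `random.sample`; the dimension names and values below are illustrative placeholders, not required choices — substitute your own from Step 1.1.

```python
import itertools
import random

# Hypothetical Recipe Bot dimensions; replace with your own from Step 1.1.
dimensions = {
    "cuisine": ["Italian", "Thai", "Mexican"],
    "dietary_restriction": ["vegan", "gluten-free", "none"],
    "skill_level": ["beginner", "intermediate", "advanced"],
    "time_budget": ["under 15 min", "30-60 min", "no limit"],
}

# Build every possible combination, then sample 15-20 unique tuples.
all_combos = list(itertools.product(*dimensions.values()))
random.seed(42)  # fixed seed so the sample is reproducible
sampled = random.sample(all_combos, 18)

for combo in sampled[:3]:
    print(dict(zip(dimensions.keys(), combo)))
```

Sampling from the full cross-product (rather than picking values ad hoc) guarantees the tuples are unique and spread across all dimensions.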
Step 1.3: Generate Natural Language User Queries
Take 5-7 of the generated tuples and write a natural language user query for each one. Review these queries to ensure they are realistic and representative of how a user might actually interact with your bot.
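One lightweight way to turn a tuple into a query is a template function; in practice you might prompt an LLM instead. This is a minimal sketch with hypothetical dimension names:

```python
# Sketch: convert a sampled dimension tuple into a user-style query.
# Dimension names and phrasing are illustrative assumptions.
def tuple_to_query(cuisine, dietary_restriction, skill_level, time_budget):
    restriction = "" if dietary_restriction == "none" else f" {dietary_restriction}"
    return (
        f"I'm a {skill_level} cook looking for a{restriction} "
        f"{cuisine} recipe I can make in {time_budget}."
    )

query = tuple_to_query("Thai", "vegan", "beginner", "under 15 min")
print(query)
```

Template output tends to sound uniform, which is exactly why the quality review below matters: vary the wording so queries read like real users.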
Quality Review
Review the generated queries to make sure they're diverse and realistic:
Save Dataset
Save the dataset for testing:
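A minimal pandas sketch for the save step; the column names (and the single example row) are assumptions — keep whichever columns your notebook actually produced, ideally the query plus its source dimensions:

```python
import pandas as pd

# Illustrative row; in the notebook this would be all generated queries.
rows = [
    {
        "query": "I'm a beginner cook looking for a vegan Thai recipe "
                 "I can make in under 15 min.",
        "cuisine": "Thai",
        "dietary_restriction": "vegan",
        "skill_level": "beginner",
        "time_budget": "under 15 min",
    },
]
df = pd.DataFrame(rows)
# index=False keeps the CSV clean for upload to Phoenix.
df.to_csv("generated_synthetic_queries.csv", index=False)
```

Keeping the dimension columns alongside each query lets you later slice failure rates by dimension, not just overall.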
Upload to Phoenix
You can either:
- Option A: Manually upload the CSV file to Phoenix UI
- Option B: Use the SDK upload below
Part 1 Complete
What you now have:
- Your generated test queries, saved as CSV
- Dataset uploaded to Phoenix (ready for testing)
- Systematic coverage across key user dimensions
Next steps:
- Go to Phoenix UI
- Run your Recipe Bot on these queries
- Annotate problems you find
- Come back to this notebook for analysis
Part 2: Initial Error Analysis
Step 2.1: Run Bot on Synthetic Queries
- Upload Dataset: Load your synthetic queries into Phoenix playground
- Configure Bot: Import your Recipe Bot prompt
- Run Tests: Execute all queries through your bot
- Record Results: Save the interaction traces
Step 2.2: Open Coding
Review the recorded traces and perform open coding to identify themes, patterns, and potential errors in your bot's responses.
What to look for:
- Factual errors or incorrect recommendations
- Confusing or unhelpful responses
- Inconsistent behavior across similar queries
- Format and communication issues
How to annotate:
- Be specific about what went wrong
- Note why something is problematic for users
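Open-coding notes can be kept as lightweight records alongside each trace; the field names below are an illustrative sketch, not a Phoenix schema:

```python
# Sketch of an open-coding annotation record.
# Field names and values are hypothetical examples.
annotation = {
    "trace_id": "trace-001",  # placeholder ID
    "query": "Quick vegan Thai recipe, 15 minutes",
    "observation": "Suggested fish sauce despite the vegan constraint",  # what went wrong
    "why_it_matters": "Violates a stated dietary restriction",  # impact on the user
}
print(annotation["observation"])
```

Separating the *what* (`observation`) from the *why* (`why_it_matters`) mirrors the two annotation guidelines above and makes the later axial-coding step easier.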
Part 3: Axial Coding & Taxonomy Definition
Step 3.1: Export Annotated Traces
Export your traces, along with their annotations, from Phoenix.
Step 3.2: Axial Coding & Taxonomy Definition
Group your observations from open coding into broader categories or failure modes. We'll use an LLM to make this easier!
What the LLM will do:
- Find Patterns: Analyze all your annotations to identify common themes
- Create Categories: Generate 4-6 systematic failure mode labels
- Apply Labels: Classify each trace using the discovered failure modes
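The three LLM tasks above can be driven by a single prompt assembled from your annotations. The wording below is an illustrative sketch (no actual model call is shown); send the resulting `prompt` to your LLM of choice:

```python
# Sketch: assemble an axial-coding prompt from open-coding annotations.
# Annotation strings are placeholders; prompt wording is an assumption.
annotations = [
    "Suggested fish sauce despite the vegan constraint",
    "Recipe skipped the baking temperature",
]
prompt = (
    "You are analyzing failure annotations for a Recipe Bot.\n"
    "1. Find common themes across the annotations.\n"
    "2. Create 4-6 failure mode categories, each with a clear title "
    "and a one-sentence definition.\n"
    "3. Assign each annotation to exactly one category.\n\n"
    "Annotations:\n" + "\n".join(f"- {a}" for a in annotations)
)
print(prompt)
```

Asking for a fixed number of categories (4-6) keeps the taxonomy small enough to be actionable.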
What you'll get:
- Clear Title for each failure mode
- One-sentence Definition explaining the failure
- 1-2 Examples from your actual bot traces
- Labeled dataset with each trace classified
Example failure modes:
- "Dietary Mismatch" - Bot suggests food that violates stated dietary restrictions
- "Missing Steps" - Recipe instructions are incomplete or unclear
- "Wrong Context" - Bot misunderstands what the user is asking for
Summary & Expected Outputs
What You'll Create
Files you'll generate:
- generated_synthetic_queries.csv - Your test dataset
- labeled_synthetic_data.csv - Your final analysis with failure mode labels
Steps to Complete
- Run Part 1 code - Generate test queries and upload to Phoenix
- Part 2 (Phoenix UI) - Run your prompt on queries, annotate problems with open coding
- Run Part 3 code - Export traces, use LLM to discover patterns and create taxonomy
What Part 3 Creates
The LLM analysis will automatically generate:
- Failure mode categories discovered from your annotations
- Systematic classification of each trace
- Complete taxonomy with definitions and examples
- Analysis spreadsheet with binary failure mode columns
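The binary failure mode columns mentioned above can be built by expanding per-trace label lists, one column per failure mode. A minimal pandas sketch with illustrative data:

```python
import pandas as pd

# Illustrative labeled traces; each trace may carry zero or more failure modes.
labeled = pd.DataFrame({
    "trace_id": ["t1", "t2", "t3"],
    "failure_modes": [["Dietary Mismatch"], [], ["Missing Steps", "Wrong Context"]],
})

# One binary (0/1) column per failure mode, for easy counting and filtering.
all_modes = ["Dietary Mismatch", "Missing Steps", "Wrong Context"]
for mode in all_modes:
    labeled[mode] = labeled["failure_modes"].apply(lambda ms: int(mode in ms))

print(labeled[["trace_id"] + all_modes])
```

With this layout, `labeled[mode].mean()` gives the failure rate for each mode directly.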