HW 2: Recipe Bot Error Analysis

🎯 Assignment Overview

This notebook helps you perform error analysis for your Recipe Bot by:

  1. Part 1: Generate Test Queries - Create diverse queries using key dimensions
  2. Part 2: Run & Annotate - Test your bot and identify failure patterns
  3. Part 3: Create Taxonomy - Build structured failure mode categories

Goal: Systematically identify what goes wrong with your bot and why.


Part 1: Define Dimensions & Generate Initial Queries

Step 1.1: Identify Key Dimensions

Identify 3-4 key dimensions relevant to your Recipe Bot's functionality. For each dimension, list at least 3 example values.

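As a starting point, the dimensions can be captured in a plain dict mapping each dimension name to its example values. The specific names and values below are illustrative assumptions — swap in whatever fits your own Recipe Bot:

```python
# Illustrative dimensions for a Recipe Bot; replace these with 3-4
# dimensions (each with at least 3 values) that fit your own bot.
DIMENSIONS = {
    "dietary_restriction": ["vegan", "gluten-free", "nut allergy", "none"],
    "cuisine": ["Italian", "Thai", "Mexican", "Indian"],
    "skill_level": ["beginner", "intermediate", "advanced"],
    "time_budget": ["under 15 minutes", "30-60 minutes", "no limit"],
}

for name, values in DIMENSIONS.items():
    print(f"{name}: {', '.join(values)}")
```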

Step 1.2: Generate Unique Combinations (Tuples)

Generate 15-20 unique combinations of these dimension values using programmatic sampling.

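One way to do the programmatic sampling is `itertools.product` over the dimension values followed by a seeded `random.sample`, which guarantees the tuples are unique. The dimension dict here is the same illustrative assumption as before:

```python
import itertools
import random

# Illustrative dimensions (an assumption -- use your own).
DIMENSIONS = {
    "dietary_restriction": ["vegan", "gluten-free", "nut allergy", "none"],
    "cuisine": ["Italian", "Thai", "Mexican", "Indian"],
    "skill_level": ["beginner", "intermediate", "advanced"],
    "time_budget": ["under 15 minutes", "30-60 minutes", "no limit"],
}

random.seed(42)  # fixed seed so the sample is reproducible

# Enumerate every possible combination, then sample 20 unique tuples.
all_combos = list(itertools.product(*DIMENSIONS.values()))
sampled_tuples = random.sample(all_combos, 20)

for t in sampled_tuples[:3]:
    print(t)
```

`random.sample` draws without replacement, so no tuple appears twice.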

Step 1.3: Generate Natural Language User Queries

Take 5-7 of the generated tuples and write a natural language user query for your Recipe Bot for each one. Review the generated queries to make sure they are realistic and representative of how a real user would interact with your bot.

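A sketch of turning one tuple into a user query with an LLM. The prompt wording and the `gpt-4o-mini` model name are assumptions, not requirements; the call uses the `openai>=1.0` Python SDK:

```python
def build_query_prompt(combo, dimension_names):
    """Describe one dimension tuple as an instruction for the LLM."""
    traits = ", ".join(f"{n}={v}" for n, v in zip(dimension_names, combo))
    return (
        "Write one realistic user message to a recipe chatbot that reflects "
        f"these traits: {traits}. Return only the message text."
    )

def generate_query(combo, dimension_names, model="gpt-4o-mini"):
    # Lazy import so the rest of the notebook runs without the openai package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": build_query_prompt(combo, dimension_names)}],
    )
    return resp.choices[0].message.content.strip()

print(build_query_prompt(("vegan", "Thai"), ["dietary_restriction", "cuisine"]))
```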

Quality Review

Review the generated queries to make sure they're diverse and realistic:

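A quick programmatic pass can at least catch exact duplicates before you eyeball the queries. The example queries below are hypothetical stand-ins (one deliberate duplicate) for your generated list:

```python
# In practice `queries` comes from the generation step; hypothetical
# examples (with one deliberate duplicate) stand in for it here.
queries = [
    "I'm vegan and new to cooking -- any quick Thai dinner ideas?",
    "Need a gluten-free Italian dish I can make in under 15 minutes.",
    "I'm vegan and new to cooking -- any quick Thai dinner ideas?",
]

duplicates = len(queries) - len(set(queries))
print(f"{len(queries)} queries, {duplicates} duplicate(s)")
for q in sorted(set(queries)):
    print("-", q)
```

Diversity and realism still need a human read; this only flags verbatim repeats.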

Save Dataset

Save the dataset for testing:

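A minimal save using the standard-library `csv` module; the example rows and column names are illustrative assumptions, but the filename matches the one listed in the summary below:

```python
import csv

# Hypothetical rows pairing each generated query with its dimension values.
rows = [
    {"query": "Quick vegan Thai dinner ideas?",
     "dietary_restriction": "vegan", "cuisine": "Thai"},
    {"query": "Gluten-free Italian dish in under 15 minutes?",
     "dietary_restriction": "gluten-free", "cuisine": "Italian"},
]

with open("generated_synthetic_queries.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

print(f"Saved {len(rows)} queries to generated_synthetic_queries.csv")
```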

Upload to Phoenix

You can either:

  • Option A: Manually upload the CSV file to Phoenix UI
  • Option B: Use the SDK upload below
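A sketch of Option B using the `arize-phoenix` SDK. The `Client.upload_dataset` signature has changed across Phoenix releases, and the dataset name here is a hypothetical choice, so treat this as a starting point and check the docs for your installed version:

```python
def upload_to_phoenix(csv_path="generated_synthetic_queries.csv",
                      dataset_name="recipe-bot-hw2-queries"):
    """Upload the saved queries to a running Phoenix instance (Option B)."""
    # Lazy imports so this cell is skippable if you chose Option A.
    import pandas as pd
    import phoenix as px

    df = pd.read_csv(csv_path)
    client = px.Client()  # assumes a Phoenix server is already running
    # Signature may differ in your Phoenix version -- verify before relying on it.
    return client.upload_dataset(
        dataframe=df,
        dataset_name=dataset_name,
        input_keys=["query"],
    )
```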

Part 1 Complete ✅

What you now have:

  • A set of diverse test queries saved as CSV
  • Dataset uploaded to Phoenix (ready for testing)
  • Systematic coverage across key user dimensions

Next steps:

  1. Go to Phoenix UI
  2. Run your Recipe Bot on these queries
  3. Annotate problems you find
  4. Come back to this notebook for analysis

Part 2: Initial Error Analysis

Step 2.1: Run Bot on Synthetic Queries

  1. Upload Dataset: Load your synthetic queries into Phoenix playground
  2. Configure Bot: Import your Recipe Bot prompt
  3. Run Tests: Execute all queries through your bot
  4. Record Results: Save the interaction traces

Step 2.2: Open Coding

Review the recorded traces and perform open coding to identify themes, patterns, and potential errors in your bot's responses.

What to look for:

  • Factual errors or incorrect recommendations
  • Confusing or unhelpful responses
  • Inconsistent behavior across similar queries
  • Format and communication issues

How to annotate:

  • Be specific about what went wrong
  • Note why something is problematic for users

Part 3: Axial Coding & Taxonomy Definition

Step 3.1: Export Annotated Traces

Export your annotated traces and annotations from Phoenix.

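A sketch of the export using the Phoenix client's `get_spans_dataframe`. Depending on your Phoenix version, annotations may need a separate export from the UI, so this is an assumption-laden starting point rather than the official solution:

```python
def export_annotated_traces(save_path="annotated_traces.csv"):
    """Pull recorded spans out of a running Phoenix instance into a CSV."""
    import phoenix as px  # lazy import; requires arize-phoenix installed

    client = px.Client()
    # One row per recorded span; check your Phoenix version's docs for how
    # annotations are attached (they may require a separate export).
    df = client.get_spans_dataframe()
    df.to_csv(save_path, index=False)
    return df
```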

Step 3.2: Axial Coding & Taxonomy Definition

Group your observations from open coding into broader categories or failure modes. We'll use an LLM to make this easier!

What the LLM will do:

  1. Find Patterns: Analyze all your annotations to identify common themes
  2. Create Categories: Generate 4-6 systematic failure mode labels
  3. Apply Labels: Classify each trace using the discovered failure modes

What you'll get:

  • Clear Title for each failure mode
  • One-sentence Definition explaining the failure
  • 1-2 Examples from your actual bot traces
  • Labeled dataset with each trace classified

Example failure modes:

  • "Dietary Mismatch" - Bot suggests food that violates stated dietary restrictions
  • "Missing Steps" - Recipe instructions are incomplete or unclear
  • "Wrong Context" - Bot misunderstands what the user is asking for
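The clustering step boils down to packing all your open-coding notes into one prompt and asking the LLM for titled, defined failure modes. A sketch of the prompt builder (the wording is an assumption; send the result through whichever chat API you used in Part 1):

```python
def build_axial_coding_prompt(annotations):
    """Assemble open-coding notes into a single clustering request."""
    notes = "\n".join(f"- {note}" for note in annotations)
    return (
        "Below are open-coding notes from reviewing a recipe chatbot's "
        "responses. Group them into 4-6 recurring failure modes. For each "
        "failure mode give a short Title, a one-sentence Definition, and "
        "1-2 example notes.\n\nNotes:\n" + notes
    )

# Hypothetical notes standing in for your real annotations.
example_notes = [
    "Suggested a beef dish to a user who said they were vegetarian",
    "Recipe skipped the step where the dough has to rest",
]
print(build_axial_coding_prompt(example_notes))
```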

Summary & Expected Outputs

What You'll Create

Files you'll generate:

  • generated_synthetic_queries.csv - Your test dataset
  • labeled_synthetic_data.csv - Your final analysis with failure mode labels

Steps to Complete

  1. Run Part 1 code - Generate test queries and upload to Phoenix
  2. Part 2 (Phoenix UI) - Run your prompt on queries, annotate problems with open coding
  3. Run Part 3 code - Export traces, use LLM to discover patterns and create taxonomy

What Part 3 Creates

The LLM analysis will automatically generate:

  • Failure mode categories discovered from your annotations
  • Systematic classification of each trace
  • Complete taxonomy with definitions and examples
  • Analysis spreadsheet with binary failure mode columns
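The binary failure mode columns can be built with a small dict comprehension once each trace carries its list of labels. The failure mode names and traces below are hypothetical; in practice both come from the LLM classification step:

```python
# Hypothetical taxonomy and labeled traces (assumptions for illustration).
FAILURE_MODES = ["Dietary Mismatch", "Missing Steps", "Wrong Context"]
traces = [
    {"query": "Quick vegan dinner?", "labels": ["Dietary Mismatch"]},
    {"query": "How do I make pad thai?", "labels": ["Missing Steps"]},
]

# One row per trace, one 0/1 column per failure mode.
rows = [
    {"query": t["query"], **{m: int(m in t["labels"]) for m in FAILURE_MODES}}
    for t in traces
]
for r in rows:
    print(r)
```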