Phoenix Support Query Classification

Improving Classification with LLMs using Prompt Learning

In this notebook we will leverage the PromptLearningOptimizer developed here at Arize to improve the accuracy of LLMs on classification tasks. Specifically, we will classify support queries into 30 different classes, including:

Account Creation

Login Issues

Password Reset

Two-Factor Authentication

Profile Updates

Billing Inquiry

Refund Request

and 24 more.

You can view the dataset in datasets/support_queries.csv.

Note: This notebook, phoenix_support_query_classification.ipynb, complements support_query_classification.ipynb by using Phoenix datasets, experiments, and prompt management for Prompt Learning. It is a more end-to-end way to visualize your iterative prompt improvement, see how each prompt performs on the train/test sets, and leverage Phoenix methods for advanced features.

[ ]
[ ]
[ ]
[ ]

Setup

[ ]

Make train/test sets

We use an 80/20 train/test split to train our prompt. The optimizer uses the training set to analyze its errors and successes and makes prompt updates based on those results. We then evaluate on the test set to see how the optimized prompt performs on unseen data.

We export these datasets to Phoenix, where you will be able to view the experiments we run on the train and test sets.
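As a rough sketch of this step (the column names are assumptions, and a toy DataFrame stands in for datasets/support_queries.csv; the Phoenix upload requires a running Phoenix instance, so it is shown commented out):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for datasets/support_queries.csv (column names assumed)
df = pd.DataFrame({
    "query": [f"support query {i}" for i in range(10)],
    "label": ["Login Issues", "Refund Request"] * 5,
})

# 80/20 split; with enough examples per class you may also stratify on the label
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Exporting to Phoenix would look roughly like this (hypothetical usage;
# check the Phoenix datasets docs for the exact client API):
# import phoenix as px
# client = px.Client()
# train_dataset = client.upload_dataset(
#     dataset_name="support-queries-train",
#     dataframe=train_df,
#     input_keys=["query"],
#     output_keys=["label"],
# )
```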

[ ]

Base Prompt for Optimization

This is our base prompt - our 0th iteration. This is the prompt we will be optimizing for our task.

We also upload our prompt to Phoenix. The Phoenix Prompt Hub serves as a repository for your prompts, where you can view every iteration of your prompt as it's optimized, along with some metrics.
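A minimal base prompt might look like the sketch below. The wording and the {query} placeholder are illustrative, not the notebook's exact prompt, and the Prompt Hub upload is shown commented out since it requires a running Phoenix instance:

```python
# Illustrative base prompt (iteration 0) for the classification task
base_prompt = """You are a support query classifier.
Classify the user's support query into exactly one of the 30 classes
(for example: Account Creation, Login Issues, Password Reset,
Two-Factor Authentication, Profile Updates, Billing Inquiry, Refund Request).

Query: {query}

Respond with only the class name."""

# Uploading to the Phoenix Prompt Hub would look roughly like this
# (hypothetical sketch; see the Phoenix prompt-management docs):
# from phoenix.client import Client
# Client().prompts.create(name="support-query-classifier", ...)
```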

[ ]

Output Generator

This function calls OpenAI with our prompt on every row of our dataset to generate outputs. It leverages llm_generate, a Phoenix function, to call the LLM concurrently.

We return the output column, which contains one output for every support query in our dataset.
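Conceptually, the generation step maps the prompt over every row with concurrent LLM calls. A rough stand-in for what llm_generate handles for us, with the actual OpenAI call stubbed out (function names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def call_model(prompt: str) -> str:
    # Stub -- in the notebook this is an OpenAI chat completion call.
    return "Login Issues"

def generate_outputs(df: pd.DataFrame, template: str, concurrency: int = 8) -> pd.Series:
    # Format the prompt for each row, then fan the calls out over a thread pool,
    # which is roughly the concurrency pattern llm_generate provides.
    prompts = [template.format(query=q) for q in df["query"]]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outputs = list(pool.map(call_model, prompts))
    return pd.Series(outputs, index=df.index, name="output")

df = pd.DataFrame({"query": ["I can't log in", "Where is my refund?"]})
df["output"] = generate_outputs(df, "Classify: {query}")
```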

[ ]
[ ]

Evaluator

In this section we define our LLM-as-judge eval.

Prompt Learning works by generating natural language evaluations on your outputs. These evaluations help guide the prompt optimizer towards building an optimized prompt.

You should spend time thinking about how to write an informative eval. Your eval makes or breaks this prompt optimizer. With helpful feedback, the prompt optimizer can generate a much stronger optimized prompt than it can with sparse or unhelpful feedback.

Below is a great example for building a strong eval. You can see that we return many evaluations, including

  • correctness: correct/incorrect - whether the support query was classified correctly or incorrectly.

  • explanation: Brief explanation of why the predicted classification is correct or incorrect, referencing the correct label if relevant.

  • confusion_reason: If incorrect, explains why the model may have made this choice instead of the correct classification. Focuses on likely sources of confusion. If correct, 'no confusion'.

  • error_type: One of: 'broad_vs_specific', 'keyword_bias', 'multi_intent_confusion', 'ambiguous_query', 'off_topic', 'paraphrase_gap', 'other'. Use 'none' if correct. The definitions of these error types are passed into the evaluator's prompt.

  • evidence_span: Exact phrase(s) from the query that strongly indicate the correct classification.

  • prompt_fix_suggestion: One clear instruction to add to the classifier prompt to prevent this error.

Take a look at support_query_classification/evaluator_prompt.txt for the full prompt!

Our evaluator leverages llm_generate once again to build these LLM evals concurrently. We use an output parser to ensure that the eval is returned as valid JSON.
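An output parser for llm_generate receives each raw model response (plus the row index) and returns a dict whose keys become columns of the evals DataFrame. A minimal sketch, using the field names from the eval schema above and falling back gracefully on malformed JSON:

```python
import json

def output_parser(response: str, row_index: int) -> dict:
    # Returned keys become columns of the evals DataFrame.
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        # Keep the raw text so failures are debuggable rather than dropped
        return {"correctness": "unparseable", "explanation": response}
    return {
        "correctness": data.get("correctness", ""),
        "explanation": data.get("explanation", ""),
        "confusion_reason": data.get("confusion_reason", ""),
        "error_type": data.get("error_type", ""),
        "evidence_span": data.get("evidence_span", ""),
        "prompt_fix_suggestion": data.get("prompt_fix_suggestion", ""),
    }
```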

[ ]

Metrics

Below we define some metrics that we will compute on each iteration of prompt optimization. They help us measure how the classifier performs with the current iteration's prompt.

Specifically, we use scikit-learn for precision, recall, F1 score, and simple accuracy.
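A minimal version of these metrics might look like the following. Macro-averaging (treating all 30 classes equally) is an assumption here; the notebook may use weighted averaging instead:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred) -> dict:
    # zero_division=0 avoids warnings for classes never predicted in an iteration
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = compute_metrics(
    ["Login Issues", "Refund Request", "Login Issues"],
    ["Login Issues", "Refund Request", "Refund Request"],
)
```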

[ ]

Experiment Processor

This function pulls a Phoenix experiment and loads the data into a pandas dataframe so it can run through the optimizer.

Specifically it:

  • Pulls the experiment data from Phoenix
  • Adds the input column to the dataframe
  • Adds the evals to the dataframe
  • Adds the output to the dataframe
  • Returns the dataframe
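The shape of this processing, with the Phoenix fetch stubbed out as a list of run records (the record field names are illustrative, not the exact Phoenix experiment schema):

```python
import pandas as pd

def experiment_to_dataframe(runs: list[dict]) -> pd.DataFrame:
    # Each run carries the example input, the generated output, and the
    # evaluator's structured feedback; flatten each run into one row.
    rows = []
    for run in runs:
        row = {"query": run["input"]["query"], "output": run["output"]}
        row.update(run["evals"])  # correctness, explanation, ...
        rows.append(row)
    return pd.DataFrame(rows)

runs = [
    {
        "input": {"query": "I can't log in"},
        "output": "Login Issues",
        "evals": {"correctness": "correct", "error_type": "none"},
    }
]
df = experiment_to_dataframe(runs)
```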
[ ]

Prompt Optimization Loop with Phoenix Experiments

This code implements an iterative prompt optimization system that uses Phoenix experiments to evaluate and improve prompts based on feedback from LLM evaluators.

Overview

The optimize_loop function automates prompt engineering by:

  • Evaluating prompts using Phoenix experiments
  • Collecting detailed feedback from LLM evaluators
  • Optimizing prompts via a learning-based optimizer
  • Iterating until the performance threshold is met or the loop limit is reached

Step-by-Step Breakdown

Each of these numbered steps appears as a comment in the code.

1. Initialization

  • Set up tracking variables:
    • train_metrics, test_metrics, raw_dfs for storing evaluation results
  • Convert training dataset to a DataFrame for easy updates

2. Baseline Evaluation

  • Run an initial experiment using the test set
  • Establish a baseline metric (e.g., accuracy, F1) to compare against future improvements

3. Early Exit Check

  • If the initial prompt already meets the performance threshold, skip further optimization to save time and compute

4. Main Optimization Loop

For each iteration (up to the loops limit):

4a. Run Training Experiment

  • Execute the current prompt on the training set
  • Use LLM evaluators to generate natural language feedback

4b. Process Feedback

  • Extract structured information from evaluator outputs:
    • Correctness
    • Explanation
    • Confusion reason
    • Error type
    • Prompt fix suggestions
  • Update the training DataFrame with this feedback

4c. Generate Learning Annotations

  • Convert feedback into structured annotations for the optimizer to learn from
  • This allows learning from evaluator insights in a consistent format

4d. Optimize the Prompt

  • Pass feedback to the PromptLearningOptimizer
  • Generate an improved prompt that attempts to correct issues found in the previous iteration

4e. Evaluate on Test Set

  • Evaluate the updated prompt on the held-out test set
  • Assess generalization beyond the training data

4f. Track Metrics

  • Log metrics for:
    • Training set performance
    • Test set performance
  • Store raw results for further analysis or visualization

4g. Convergence Check

  • If the new prompt's test metric meets or exceeds the threshold, exit the loop early
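The steps above can be compressed into a skeleton like the one below, with the experiment runs, evaluator feedback, and PromptLearningOptimizer call stubbed out behind two function parameters. The names run_experiment and optimize are placeholders for illustration, not the notebook's exact API:

```python
def optimize_loop(prompt, run_experiment, optimize, threshold=0.9, loops=5):
    # 2. Baseline evaluation on the held-out test set
    test_metric = run_experiment(prompt, split="test")
    test_metrics = [test_metric]

    # 3. Early exit if the base prompt already meets the threshold
    if test_metric >= threshold:
        return prompt, test_metrics

    for _ in range(loops):
        # 4a-4c. Train experiment + evaluator feedback (folded into the
        # stubbed calls here; the notebook processes feedback explicitly)
        run_experiment(prompt, split="train")
        # 4d. Ask the optimizer for an improved prompt
        prompt = optimize(prompt)
        # 4e-4f. Evaluate the new prompt on the test set and track the metric
        test_metric = run_experiment(prompt, split="test")
        test_metrics.append(test_metric)
        # 4g. Convergence check
        if test_metric >= threshold:
            break
    return prompt, test_metrics
```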
[ ]

Prompt Optimized!

The code below picks the prompt with the highest score on the test set, and displays the training/test metrics and delta for that prompt.
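A sketch of that selection, assuming train_metrics and test_metrics are parallel lists of accuracy scores per iteration (the numbers below are made up for illustration), with the delta interpreted here as improvement over the base prompt:

```python
# Illustrative per-iteration accuracies (iteration 0 is the base prompt)
train_metrics = [0.62, 0.71, 0.78]
test_metrics = [0.58, 0.69, 0.74]

# Pick the iteration whose prompt scored best on the test set
best = max(range(len(test_metrics)), key=test_metrics.__getitem__)
delta_vs_base = test_metrics[best] - test_metrics[0]

print(f"best iteration:  {best}")
print(f"train accuracy:  {train_metrics[best]:.2f}")
print(f"test accuracy:   {test_metrics[best]:.2f}")
print(f"delta vs. base:  {delta_vs_base:+.2f}")
```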

[ ]
[ ]