BizNorm 100 Evaluator Optimization

Evaluator Prompt Optimization

In this notebook we'll use the Prompt Learning SDK to optimize an LLM-as-Judge eval prompt. LLM-as-Judge evaluators use an LLM to evaluate LLM outputs, and they are an effective, versatile way to test your LLM applications. You can learn more here.

Since your evals use LLMs, the prompts you provide to those LLMs dictate what your eval does. In practice, the goal is to ALIGN your eval with your goals: you want the eval to reach the level of competence you would expect from a human who evaluates outputs manually.

This notebook shows you how to build an evaluator that checks whether outputs are normalized/sanitized, and then how to align that evaluator with your expectations for normalization/sanitization by optimizing its prompt, so you can trust the eval in production.

Learn more about Arize Prompt Learning here.

[ ]
[10]
[11]

BizNorm-100 Benchmark

BizNorm-100 is a synthetic dataset containing 100 queries. The goal is to normalize these queries according to a fixed ruleset.

For example, the query

My card 3333-4444-5555-6666 was charged $1200 on 1/12/2025. The record still shows my old phone, 646-555-2201, and the system emailed the receipt to anthony.rogers@company.org. Can you fix this ASAP?

should be normalized to

[PII ALERT] My card [CARD] was charged usd 1200.00 on 2025-01-12. The record still shows my old phone, [PHONE], and the system emailed the receipt to [EMAIL]. Can you fix this as soon as possible? -- Company Confidential --

See the normalization ruleset in BizNorm-ruleset.md.
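
To make the target transformation concrete, here is a minimal sketch that reproduces the single example above with regular expressions. Every rule in it (the patterns, the US-style month/day/year date order, the `[PII ALERT]` prefix only when something was masked, the footer on every output) is an assumption inferred from that one example. The authoritative rules are in BizNorm-ruleset.md, and `normalize_sketch` is a hypothetical name, not part of the notebook.

```python
import re
from datetime import datetime

# Patterns inferred from the one worked example; NOT the official ruleset.
CARD_RE = re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
MONEY_RE = re.compile(r"\$(\d+(?:\.\d{1,2})?)")


def _iso_date(match):
    # Assumes month/day/year order, as in "1/12/2025" -> "2025-01-12".
    month, day, year = (int(g) for g in match.groups())
    return datetime(year, month, day).strftime("%Y-%m-%d")


def _usd(match):
    # "$1200" -> "usd 1200.00"
    return f"usd {float(match.group(1)):.2f}"


def normalize_sketch(text):
    # Mask cards before phones so a phone-shaped slice of a card number
    # is never matched on its own.
    masked = CARD_RE.sub("[CARD]", text)
    masked = PHONE_RE.sub("[PHONE]", masked)
    masked = EMAIL_RE.sub("[EMAIL]", masked)
    has_pii = masked != text  # True if any placeholder was inserted

    masked = DATE_RE.sub(_iso_date, masked)
    masked = MONEY_RE.sub(_usd, masked)
    masked = re.sub(r"\bASAP\b", "as soon as possible", masked)

    prefix = "[PII ALERT] " if has_pii else ""
    return f"{prefix}{masked} -- Company Confidential --"
```

A rule-based function like this is only useful as a mental model of the task. In the notebook, the normalization itself is done by an LLM, and the evaluator judges whether that LLM followed the ruleset.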

Train/Test Split

We will use the training set to train our evaluator with Prompt Learning, and the test set to measure the evaluator's accuracy on data it has not been trained on.
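
As an illustration of the split, the sketch below shuffles rows with a fixed seed and holds out a fraction for testing. The 70/30 ratio, the seed, and the function name `split_rows` are assumptions made for this example; the notebook's own cell may use different values or a library helper.

```python
import random


def split_rows(rows, test_fraction=0.3, seed=42):
    """Return (train, test) lists from a reproducible shuffle of rows."""
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(100))
```

Fixing the seed keeps the held-out set the same across optimization rounds, so metric changes reflect the prompt and not a different sample.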

[12]

Application System Prompt

This is the application system prompt, or the prompt to the LLM used to generate outputs.

This is NOT the prompt we are optimizing! This simply generates outputs.

We are optimizing the evaluator prompt, or the prompt for the LLM-as-judge eval which EVALUATES the generated outputs.

[7]

Output Generator

Uses the application system prompt to generate outputs.

[6]

Sanitization Helpers

clean and clean_series are used to normalize text before comparing generated outputs with ground truths. This prevents false mismatches caused by superficial formatting differences.

For example, the string:

"today's year is 1/1/2025"

might be normalized to:

"today's year is 2025-01-01"

If we compare it against a ground truth like:

"today’s year is 2025-01-01"

a raw string comparison would incorrectly flag them as different because of the straight vs. curly apostrophe. Normalization ensures both strings are treated as equivalent, so the comparison is judged correctly.
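
A minimal sketch of what such helpers could look like is shown here. The names `clean_sketch` and `clean_all_sketch` are hypothetical stand-ins, and the exact set of characters handled is an assumption; the notebook's actual clean and clean_series may differ (for instance, by operating on pandas Series).

```python
import re
import unicodedata

# Map typographic punctuation to plain ASCII equivalents.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})


def clean_sketch(text):
    """Normalize Unicode form, punctuation, and whitespace for comparison."""
    text = unicodedata.normalize("NFKC", str(text))
    text = text.translate(_PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def clean_all_sketch(values):
    """Apply clean_sketch to every item of an iterable."""
    return [clean_sketch(v) for v in values]
```

With this in place, the curly-apostrophe ground truth and the straight-apostrophe output from the example compare as equal.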

[5]

Accuracy Computation

Computes accuracy, F1, precision, and recall.
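
For reference, these four metrics can be computed from label pairs as in the sketch below. Treating "correct" as the positive class and returning 0.0 when a denominator is zero are assumptions of this example, and `binary_metrics` is a hypothetical name.

```python
def binary_metrics(y_true, y_pred, positive="correct"):
    """Accuracy, precision, recall, and F1 for two parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```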

[4]

Evaluator

This is the code for our LLM-as-Judge evaluator.

It checks whether outputs are normalized properly.

You can see the prompt below. THIS IS THE PROMPT WE ARE OPTIMIZING.

We want to build evals that align with how we expect them to perform. Good evals are very important: they allow you to filter and classify the information you feed to your users. Because LLM outputs are not deterministic, you need something to check those outputs. It's too time-consuming to do this manually, so employing an LLM to evaluate these outputs is a common and essential practice.

[ ]

Generate Output and Evaluate

This combines our output generator and our evaluator into one function. It also computes accuracy for both the generated outputs and the evaluator.

Evaluator accuracy is computed by comparing what the eval thinks ("correct" or "incorrect") versus whether the output is equal to the ground truth or not (actual "correct" or "incorrect").
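
The comparison described above can be sketched as follows. The function names are hypothetical, and the plain `strip()` comparison stands in for whatever text cleaning the notebook applies before checking equality.

```python
def actual_labels(outputs, ground_truths):
    """Label each output by exact match against its ground truth."""
    return [
        "correct" if out.strip() == truth.strip() else "incorrect"
        for out, truth in zip(outputs, ground_truths)
    ]


def output_accuracy(outputs, ground_truths):
    """Fraction of generated outputs that match the ground truth."""
    labels = actual_labels(outputs, ground_truths)
    return labels.count("correct") / len(labels) if labels else 0.0


def evaluator_accuracy(eval_labels, outputs, ground_truths):
    """Fraction of rows where the evaluator's verdict matches reality."""
    actual = actual_labels(outputs, ground_truths)
    agree = sum(e == a for e, a in zip(eval_labels, actual))
    return agree / len(actual) if actual else 0.0
```

Note that the two numbers measure different things: output accuracy scores the application prompt, while evaluator accuracy scores the judge. An evaluator can be highly accurate even when the generated outputs are mostly wrong, as long as it flags them as "incorrect".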

[14]

Run the cell below!

[ ]

Helper Function - calling the Prompt Learning SDK

The optimize_iteration helper function initializes the optimizer with feedback and produces a new, optimized prompt.

The next step is figuring out what feedback to provide to the optimizer in order for it to generate optimized prompts.

[ ]

🔄 Optimization Workflow using Prompt Learning

This notebook implements an interactive optimization loop where we:

  1. Collect Feedback — Display examples from the dataset, and let the user label correctness/explanations.
  2. Optimize Prompt — Use the feedback to generate an updated evaluator prompt.
  3. Review & Confirm — Show the optimized prompt, allow manual edits for formatting, and confirm it.
  4. Evaluate — Re-run the evaluator with the new prompt on train/test sets, log metrics, and save results.
  5. Loop — Repeat the cycle for N rounds, carrying forward the updated evaluator prompt and re-evaluated outputs.

The feedback we provide to the Prompt Learning optimizer is HUMAN ANNOTATED FEEDBACK. Annotating just 5 examples per loop is enough to see optimization boosts, which shows the data efficiency of Prompt Learning. RL or a programmatic optimizer needs lots of data to make effective accuracy gains; here, hand-annotating 5 outputs and giving that feedback to Prompt Learning can produce large boosts in accuracy.

The workflow is composed of modular helper functions:

  • collect_feedback_ui: interactive widget interface for gathering manual feedback.
  • review_and_confirm_prompt: UI for reviewing and editing the optimized prompt before saving.
  • run_one_round: runs a single loop round (feedback → optimize → confirm → evaluate).
  • interactive_optimization_loop: orchestrates the full multi-round optimization process.
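
Stripped of the widgets, the control flow of the workflow above can be sketched as a plain synchronous loop. This is a simplification under stated assumptions: the notebook drives each step from button callbacks, and the three callables passed in here (`evaluate`, `collect_feedback`, `optimize`) are hypothetical placeholders for the evaluation step, the annotation UI, and the Prompt Learning call. The manual review step is omitted.

```python
def optimization_loop_sketch(initial_prompt, evaluate, collect_feedback,
                             optimize, rounds=3):
    """Run feedback -> optimize -> evaluate for a fixed number of rounds.

    evaluate(prompt) -> dict of metrics
    collect_feedback(prompt) -> annotated feedback (any object)
    optimize(prompt, feedback) -> new prompt string
    """
    # Round 0: baseline performance of the initial evaluator prompt.
    results = {"prompts": [initial_prompt],
               "metrics": [evaluate(initial_prompt)]}
    prompt = initial_prompt
    for _ in range(rounds):
        feedback = collect_feedback(prompt)
        prompt = optimize(prompt, feedback)
        results["prompts"].append(prompt)
        results["metrics"].append(evaluate(prompt))
    return results
```

Passing the steps in as functions keeps the loop testable with stubs before any LLM or UI is wired in.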

📝 collect_feedback_ui

This function creates an interactive feedback form using ipywidgets:

  • Displays a sample of query, ground_truth, output, and evaluator outputs.
  • Provides dropdowns / textareas for feedback fields (evaluator_correctness, evaluator_explanation).
  • Saves the annotated feedback set (feedback_set) to CSV.
  • Calls on_save(feedback_set) after the user clicks Save Feedback, triggering the next step in the workflow.
[ ]

🔍 review_and_confirm_prompt

This function displays the auto-optimized evaluator prompt:

  • Shows the generated prompt in a styled block.
  • Provides a large text area for manual edits (to fix formatting, braces, JSON requirements, etc.).
  • Only after the user clicks Confirm Prompt does it call on_confirm(edited_prompt).
  • Ensures the downstream evaluation always uses a user-validated prompt.
[ ]

🔁 run_one_round

Runs a single optimization cycle:

  1. Samples a batch of examples from the dataset for feedback.
  2. Calls collect_feedback_ui to gather manual corrections.
  3. Optimizes the evaluator prompt using optimize_iteration.
  4. Calls review_and_confirm_prompt to display and edit the new prompt.
  5. After confirmation:
    • Saves the new prompt to file.
    • Re-evaluates train/test with the updated prompt.
    • Logs metrics and appends results.
    • Starts the next round (if any).
[ ]

🚀 interactive_optimization_loop

The master orchestrator of the workflow:

  • Computes baseline evaluator performance with the initial prompt.
  • Logs round 0 metrics.
  • Iteratively calls run_one_round for the specified number of loops.
  • Maintains a record of:
    • All prompts across rounds (results["prompts"])
    • Evaluation metrics (results["metrics"])
  • Saves metrics history to all_metrics.csv.
  • Stops after N confirmed rounds of optimization.
[ ]
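
As a small illustration of the metrics history mentioned above, the sketch below writes one row per round using only the standard library. The column layout (a `round` column followed by the metric names) and the function name are assumptions; the notebook's own all_metrics.csv may be structured differently.

```python
import csv
import io


def write_metrics_history(metrics_history, fileobj):
    """Write a list of per-round metric dicts as CSV, one row per round."""
    metric_names = sorted({key for m in metrics_history for key in m})
    writer = csv.DictWriter(fileobj, fieldnames=["round"] + metric_names)
    writer.writeheader()
    for round_index, metrics in enumerate(metrics_history):
        writer.writerow({"round": round_index, **metrics})


# Example with an in-memory buffer; the values are made up for illustration.
buffer = io.StringIO()
write_metrics_history([{"accuracy": 0.5}, {"accuracy": 0.75}], buffer)
```

To write to disk instead, open the target with `open("all_metrics.csv", "w", newline="")` and pass the file object in place of the buffer.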

Run the Prompt Learning Loop!

[ ]

View your Results

[ ]