Evaluate Human vs AI Classifications
Human/GroundTruth Versus AI Evals
Arize provides tooling to evaluate LLM applications, including tools to determine whether AI answers match human ground-truth answers. In many Q&A systems it's important to test the AI answers against human answers prior to deployment. These comparisons help assess how often the AI system generates correct answers.
The purpose of this notebook is:
- to evaluate the performance of LLM-assisted Evals that compare AI answers with human answers
- to provide an experimental framework for users to iterate and improve on the default classification template.
Note: This notebook was last updated on May 30, 2025.
Install Dependencies and Import Libraries
Download the Dataset
We've crafted a dataset of common questions and answers about the Arize platform.
Visualization of Prompts/Templates Evals in Phoenix (Optional Section)
Visualizing the Evals is not required, but it can be helpful to see the actual calls made to the LLM. The code below starts the Phoenix UI/server and provides a link to Phoenix running locally.
Human vs AI Template
View the default template used to evaluate the AI answers.
The template variables are:
- question: the question asked by a user
- correct_answer: human labeled correct answer
- ai_answer: AI generated answer
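To make the role of the three variables concrete, here is a minimal sketch of how a template with these placeholders could be filled in for a single row. The template text and the example row are made-up stand-ins for illustration, not the default Phoenix template or real dataset content; only the variable names come from the list above.

```python
# Hypothetical template text for illustration only; the real default
# template is the one displayed by this notebook.
EXAMPLE_TEMPLATE = (
    "Question: {question}\n"
    "Human ground truth answer: {correct_answer}\n"
    "AI answer: {ai_answer}\n"
    "Does the AI answer agree with the human answer? "
    "Reply with 'correct' or 'incorrect'."
)


def fill_template(template: str, row: dict) -> str:
    """Substitute the three template variables with values from one row."""
    return template.format(
        question=row["question"],
        correct_answer=row["correct_answer"],
        ai_answer=row["ai_answer"],
    )


# Placeholder row; real rows would come from the downloaded dataset.
example_row = {
    "question": "Example question text",
    "correct_answer": "Example human-labeled answer",
    "ai_answer": "Example AI-generated answer",
}

prompt = fill_template(EXAMPLE_TEMPLATE, example_row)
print(prompt)
```

Each row of the dataset yields one such prompt, which is what the evaluating LLM sees when it classifies the AI answer.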
Configure the LLM
Configure your OpenAI API key.
LLM Evals: Human Groundtruth vs AI GPT-4
Run Human vs AI Eval against a subset of the data. Instantiate the LLM and set parameters.
Classifications with explanations
When evaluating AI answers against human ground truth, it can be useful to know why the LLM classified an AI answer as correct or incorrect. The following code block runs llm_classify with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.
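Once explanations are available, one way to use them is to read only the rows where the eval label disagrees with the human label. The sketch below uses plain Python dictionaries so it runs without any dependencies. The field names `label`, `explanation`, and `true_label`, as well as the example rows, are assumptions made for illustration and may not match the actual output columns.

```python
def find_disagreements(rows: list[dict]) -> list[dict]:
    """Return the rows where the eval label differs from the human label."""
    return [row for row in rows if row["label"] != row["true_label"]]


# Made-up rows for illustration; not real eval output.
example_rows = [
    {
        "true_label": "correct",
        "label": "correct",
        "explanation": "The AI answer states the same fact as the human answer.",
    },
    {
        "true_label": "correct",
        "label": "incorrect",
        "explanation": "The AI answer omits a step the human answer includes.",
    },
]

disagreements = find_disagreements(example_rows)
for row in disagreements:
    # Print the explanation next to both labels for manual review.
    print(f"human={row['true_label']} eval={row['label']}: {row['explanation']}")
```

Reading these explanations often shows whether a disagreement comes from the AI answer, from the eval template, or from an ambiguous ground-truth label.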
Evaluate Classifications
Evaluate the predictions against the human-labeled ground-truth labels.
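As a rough illustration of what this comparison computes, the following dependency-free sketch derives confusion-matrix counts, precision, recall, F1, and accuracy from two label lists. The label names `correct` and `incorrect` and the toy labels are assumptions for illustration; they are not results from this notebook.

```python
def binary_classification_report(true_labels, predicted_labels, positive="correct"):
    """Compute confusion counts and summary metrics for a binary eval."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "f1": f1, "accuracy": accuracy,
    }


# Toy labels for illustration only.
human_labels = ["correct", "correct", "incorrect", "incorrect", "correct"]
eval_labels = ["correct", "incorrect", "incorrect", "correct", "correct"]

report = binary_classification_report(human_labels, eval_labels)
print(report)
```

Precision and recall are reported separately because an eval that rarely flags answers as incorrect can score high accuracy while still missing most wrong answers.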
LLM Evals: Human Groundtruth vs AI Classifications GPT-3.5 Turbo
Run against a subset of the data using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.