Evaluate QA Classifications
Q&A Classification Evals
The purpose of this notebook is:
- to evaluate the performance of an LLM-assisted approach to detecting issues with Q&A systems on retrieved context data
- to provide an experimental framework for users to iterate and improve on the default classification template.
Install Dependencies and Import Libraries
Note: This notebook was last updated on May 30, 2025.
ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.
Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
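A minimal sketch of the optional patch described above, assuming the nest_asyncio package is installed. The import guard and the NEST_ASYNCIO_APPLIED flag are additions for illustration, so the cell still runs where the package is missing.

```python
try:
    import nest_asyncio  # third-party package: pip install nest_asyncio

    # Patch asyncio globally so an already-running event loop
    # (as in Jupyter or Colab) can be re-entered.
    nest_asyncio.apply()
    NEST_ASYNCIO_APPLIED = True
except ImportError:
    # Assumption: continue without the patch; per the note above,
    # eval submission may then be much slower.
    NEST_ASYNCIO_APPLIED = False

print(f"nest_asyncio applied: {NEST_ASYNCIO_APPLIED}")
```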
Download Benchmark Dataset
- SQuAD 2.0: Version 2.0 of the Stanford Question Answering Dataset (SQuAD), a large-scale dataset that lets researchers design AI models for reading comprehension tasks under challenging constraints. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf
- Supplemental data to SQuAD 2.0: To test the detection of incorrect answers, we created wrong answers based on the context data. The wrong answers are intermixed with right answers.
- sampled_answer is a column that randomly holds either the original SQuAD 2.0 answer or the incorrect answer.
- question: This is the question the Q&A system is running against
- sampled_answer: This is randomly either the correct answer from SQuAD 2.0 (answers) or the made-up incorrect answer (wrong_answer). This is the column we test against, since it contains both right and wrong answers.
- correct_answer: True if sampled_answer is correct, False if not. This is the ground truth to test against.
- answers: This is the right answer to the question.
- wrong_answer: This is an incorrect answer generated from the context.
- context: This is the context to be used to answer the question, and is what the Q&A Eval must use to check whether the answer is correct.
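The column relationships above can be sketched in plain Python. This is an illustration only, not the code used to build the benchmark: the build_row helper, the 50/50 mixing ratio, and the example question are all made up here.

```python
import random


def build_row(question, context, answers, wrong_answer, rng):
    """Return one benchmark-style row with a randomly chosen sampled_answer."""
    # Assumption: right and wrong answers are mixed roughly 50/50.
    use_correct = rng.random() < 0.5
    return {
        "question": question,
        "context": context,
        "answers": answers,
        "wrong_answer": wrong_answer,
        # sampled_answer is the column the eval is tested against.
        "sampled_answer": answers if use_correct else wrong_answer,
        # correct_answer is the ground truth for sampled_answer.
        "correct_answer": use_correct,
    }


rng = random.Random(0)  # seeded for reproducibility
row = build_row(
    question="In what year was the bridge completed?",  # made-up example
    context="The bridge was completed in 1932 after six years of construction.",
    answers="1932",
    wrong_answer="1926",
    rng=rng,
)
print(row["sampled_answer"], row["correct_answer"])
```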
Display Binary Q&A Classification Template
View the default template used to classify Q&A answers as correct or incorrect. You can tweak this template and evaluate its performance relative to the default.
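As a rough sketch of what tweaking a template can look like, the snippet below fills a custom prompt from one dataset row. The template wording, the CUSTOM_QA_TEMPLATE name, the render_prompt helper, and the "correct"/"incorrect" output labels are hypothetical stand-ins, not the library's default template.

```python
# Hypothetical custom template; NOT the default template's wording.
# Placeholders are assumed to match the dataset column names.
CUSTOM_QA_TEMPLATE = """You are given a question, a reference context, and an answer.
Decide whether the answer correctly answers the question using only the context.

[Question]: {question}
[Context]: {context}
[Answer]: {sampled_answer}

Respond with a single word: "correct" or "incorrect"."""


def render_prompt(template, row):
    """Fill the template placeholders from one dataset row (a dict)."""
    return template.format(
        question=row["question"],
        context=row["context"],
        sampled_answer=row["sampled_answer"],
    )


example_row = {  # made-up example row
    "question": "In what year was the bridge completed?",
    "context": "The bridge was completed in 1932 after six years of construction.",
    "sampled_answer": "1932",
}
prompt = render_prompt(CUSTOM_QA_TEMPLATE, example_row)
print(prompt)
```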
Configure the API Key
Configure your OpenAI API key.
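One possible way to do this without hard-coding the key, assuming the key is read from the OPENAI_API_KEY environment variable. The ensure_api_key helper and its prompt_fn parameter are hypothetical names introduced for this sketch.

```python
import os
from getpass import getpass


def ensure_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so the key is not echoed
        # into the notebook output.
        key = prompt_fn(f"Enter your {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling ensure_api_key() once near the top of the notebook leaves the key in the process environment for later cells.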
Benchmark Dataset Sample
Sample size determines run time. We recommend iterating on a small sample first (100 rows), then increasing to a larger test set.
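A small sketch of reproducible subsampling, using a list of dicts as a stand-in for the benchmark rows. The sample_rows helper and the fixed seed are illustrative assumptions.

```python
import random


def sample_rows(rows, n_samples, seed=42):
    """Draw up to n_samples rows without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reruns see the same subset
    n = min(n_samples, len(rows))  # never ask for more rows than exist
    return rng.sample(rows, n)


rows = [{"id": i} for i in range(1000)]  # stand-in for the benchmark rows
small_sample = sample_rows(rows, 100)
print(len(small_sample))
```

Keeping the seed fixed while iterating on a template means that score changes come from the template rather than from a different draw of rows.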
LLM Evals: Q&A Classifications GPT-4
Run Q&A classifications against a subset of the data.
Instantiate the LLM and set parameters.
Run the LLM Eval using the template against the dataset. This is the main Eval function.
Evaluate the predictions against human-labeled ground-truth Q&A labels.
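The comparison can be sketched with plain Python as below. This assumes the predictions and the ground truth are both expressed as the strings "correct" and "incorrect" (for example, after mapping the boolean correct_answer column to those strings); the label names, the binary_metrics helper, and the toy label lists are illustrative.

```python
def binary_metrics(y_true, y_pred, positive="correct"):
    """Compute confusion counts and summary metrics for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall,
        "f1": f1, "accuracy": accuracy,
    }


# Toy labels for illustration only; not results from any model.
y_true = ["correct", "correct", "incorrect", "incorrect", "correct"]
y_pred = ["correct", "incorrect", "incorrect", "correct", "correct"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```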
LLM Evals: Q&A Classifications GPT-3.5
Evaluate the predictions against human-labeled ground-truth Q&A labels.
LLM Evals: Q&A Classifications GPT-4 Turbo
Evaluate the predictions against human-labeled ground-truth Q&A labels.