Evaluate Hallucination Classifications

Hallucination Classification Evals

The purpose of this notebook is:

  • to evaluate the performance of an LLM-assisted approach to detecting hallucinations,
  • to provide an experimental framework for users to iterate and improve on the default classification template.

Install Dependencies and Import Libraries

Note: This notebook was last updated on May 30, 2025.
[ ]
[ ]

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
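
As a minimal sketch of the optional patch described above (assuming the third-party nest_asyncio package is installed; the NEST_ASYNCIO_APPLIED flag is only an illustrative name), the call can be guarded so the cell still runs where the package is missing:

```python
# Optional: make the running event loop re-entrant in Jupyter or Colab.
try:
    import nest_asyncio  # third-party package: pip install nest_asyncio
except ImportError:
    # Package not available: evals still run, just without the patch.
    NEST_ASYNCIO_APPLIED = False
else:
    nest_asyncio.apply()  # globally patches asyncio
    NEST_ASYNCIO_APPLIED = True

print("nest_asyncio applied:", NEST_ASYNCIO_APPLIED)
```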

[ ]
[ ]

Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of questions, reference contexts, and answers with ground-truth hallucination labels. Currently supported datasets include "halueval_qa_data" from the HaluEval benchmark:
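
A possible way to fetch the benchmark is sketched below. It assumes the phoenix.evals package is installed and that its download_benchmark_dataset helper accepts the task name shown; the task string and the load_halueval_qa wrapper are assumptions for illustration, not confirmed by this page.

```python
def load_halueval_qa():
    """Download the HaluEval QA benchmark (expected to be a pandas DataFrame)."""
    # Imported lazily so this cell can be defined before phoenix.evals is installed.
    from phoenix.evals import download_benchmark_dataset

    return download_benchmark_dataset(
        task="binary-hallucination-classification",  # assumed task name
        dataset_name="halueval_qa_data",
    )


# Usage (requires network access):
# df = load_halueval_qa()
# df.head()
```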

[ ]

Display Binary Hallucination Classification Template

View the default template used to classify hallucinations. You can tweak this template and evaluate its performance relative to the default.

[ ]

Template variables:

  • input : The question or prompt asked about the context data.
  • reference : The context data used to answer the question.
  • output : The answer generated from the context data. This answer is checked for hallucinations relative to the reference context.
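
To make the role of the three variables concrete, here is a small sketch of how a template with these placeholders gets filled for one row. EXAMPLE_TEMPLATE is a hypothetical stand-in, not the Phoenix default text, and fill_template is an illustrative helper rather than a library function.

```python
# Hypothetical stand-in template; NOT the Phoenix default wording.
EXAMPLE_TEMPLATE = (
    "Question: {input}\n"
    "Reference: {reference}\n"
    "Answer: {output}\n"
    'Is the answer "factual" or "hallucinated" relative to the reference?'
)


def fill_template(template: str, row: dict) -> str:
    """Substitute the three template variables from one dataset row."""
    return template.format(
        input=row["input"],
        reference=row["reference"],
        output=row["output"],
    )


row = {
    "input": "What is the capital of France?",
    "reference": "Paris is the capital and largest city of France.",
    "output": "The capital of France is Lyon.",
}
prompt = fill_template(EXAMPLE_TEMPLATE, row)
print(prompt)
```

When tweaking the real template, keeping the same three placeholder names means the dataset columns do not need to change.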

Configure the LLM

Configure your OpenAI API key.
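
One common pattern, sketched here with only the standard library, is to read the key from the OPENAI_API_KEY environment variable and prompt for it only when it is missing. The ensure_openai_api_key helper and its parameters are illustrative names, not part of any library.

```python
import os
from getpass import getpass


def ensure_openai_api_key(env=os.environ, prompt=getpass) -> str:
    """Return the key from env, prompting (without echo) only if it is unset."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        key = prompt("Enter your OpenAI API key: ")
        env["OPENAI_API_KEY"] = key  # make it visible to client libraries
    return key


# Usage in a notebook cell:
# ensure_openai_api_key()
```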

[ ]

Benchmark Dataset Sample

Sample size determines run time. We recommend iterating on a small sample (100 rows), then increasing to a larger test set.
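
The idea of iterating on a small, reproducible sample can be sketched with the standard library as below. sample_rows and N_EVAL_SAMPLE_SIZE are illustrative names, and the rows are synthetic placeholders; on a pandas DataFrame the equivalent would be df.sample(n=..., random_state=...).

```python
import random

N_EVAL_SAMPLE_SIZE = 100  # start small, then raise for the full test run


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen without replacement, reproducibly."""
    rng = random.Random(seed)  # fixed seed so reruns see the same rows
    return rng.sample(rows, min(n, len(rows)))


# Synthetic placeholder rows standing in for the benchmark dataset.
rows = [{"id": i} for i in range(1000)]
small_sample = sample_rows(rows, N_EVAL_SAMPLE_SIZE)
print(len(small_sample))
```

Fixing the seed keeps the sample stable across template changes, so differences in scores come from the template rather than from the rows drawn.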

[ ]

LLM Evals: Hallucination Classifications GPT-4

Run the hallucination eval against a subset of the data.

Instantiate the LLM and set parameters.

[ ]
[ ]
[ ]

Evaluate the predictions against human-labeled ground-truth hallucination labels.
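
As a rough illustration of this comparison, the sketch below computes a confusion matrix, precision, recall, and F1 in plain Python. It assumes string labels "hallucinated" and "factual"; the binary_report helper and the toy label lists are made up for illustration and are not results from this notebook.

```python
def binary_report(y_true, y_pred, positive="hallucinated"):
    """Confusion counts and precision/recall/F1 for the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
y_true = ["hallucinated", "factual", "hallucinated", "factual", "hallucinated"]
y_pred = ["hallucinated", "factual", "factual", "hallucinated", "hallucinated"]
report = binary_report(y_true, y_pred)
print(report)
```

Any predicted label other than the positive one is counted as negative here, so outputs that fail to parse into a valid label would need separate handling.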

[ ]

Classifications with explanations

When evaluating a dataset for hallucinations, it can be useful to know why the LLM classified a response as a hallucination or not. The following code block runs llm_classify with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff since more tokens are generated, but the explanations can be highly informative when troubleshooting.
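
A sketch of such a call is shown below, assuming the phoenix.evals package and its llm_classify, OpenAIModel, HALLUCINATION_PROMPT_TEMPLATE, and HALLUCINATION_PROMPT_RAILS_MAP exports. Keyword names have differed between releases, so treat the exact arguments as assumptions to check against your installed version; classify_with_explanations is an illustrative wrapper, not a library function.

```python
def classify_with_explanations(df_sample, model_name="gpt-4"):
    """Run the hallucination eval and ask the LLM to explain each label."""
    # Imported lazily so this cell can be defined before phoenix.evals is installed.
    from phoenix.evals import (
        HALLUCINATION_PROMPT_RAILS_MAP,
        HALLUCINATION_PROMPT_TEMPLATE,
        OpenAIModel,
        llm_classify,
    )

    model = OpenAIModel(model=model_name, temperature=0.0)  # assumed keyword names
    rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())   # allowed output labels
    return llm_classify(
        df_sample,  # DataFrame with input, reference, and output columns
        model=model,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        rails=rails,
        provide_explanation=True,  # slower: more tokens are generated per row
    )


# Usage (requires an OpenAI API key and network access):
# result = classify_with_explanations(df_sample.head(5))
# result[["label", "explanation"]]
```

Running this on a handful of rows first keeps the extra token cost small while still showing why the model chose each label.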

[ ]
[ ]

LLM Evals: Hallucination Classifications GPT-3.5

Run the hallucination eval against a subset of the data.

[ ]
[ ]
[ ]

Preview: GPT-4 Turbo

[ ]
[ ]