Evaluate Relevance Classifications

Retrieval Relevance Evals

Arize provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by retrieval-augmented generation (RAG) applications. This relevance is then used to measure the quality of each retrieval using ranking metrics such as precision@k. In order to determine whether each retrieved document is relevant or irrelevant to the corresponding query, our approach is straightforward: ask an LLM.
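To make the ranking metric concrete, here is a minimal sketch of precision@k computed from per-document relevance labels. The function name and the sample labels are illustrative assumptions, not part of the notebook or the Phoenix library.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents that are relevant.

    relevance_labels: booleans in retrieval rank order (True = relevant).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance_labels[:k]
    # Divide by k rather than len(top_k) so missing documents count as misses.
    return sum(top_k) / k


# Hypothetical labels for one query's ranked retrievals.
labels = [True, False, True, True]
p_at_2 = precision_at_k(labels, 2)
p_at_4 = precision_at_k(labels, 4)
print(p_at_2, p_at_4)
```

In the workflow described here, the boolean labels would come from the LLM's relevant/irrelevant classification of each retrieved document.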

The purpose of this notebook is:

  • to evaluate the performance of an LLM-assisted approach to relevance classification against information retrieval datasets with ground-truth relevance labels,
  • to provide an experimental framework for users to iterate and improve on the default classification template.

Note: This notebook was last updated on May 30, 2025.

Install Dependencies and Import Libraries

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. Enabling it typically speeds up submission by about 5x.
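The speedup comes from submitting requests concurrently rather than one at a time. The sketch below illustrates that idea with the standard library only: `fake_eval_request` is a hypothetical stand-in for an LLM call, and the concurrency cap is an assumed value. In a notebook, where an event loop is already running, calling `nest_asyncio.apply()` first is what allows a nested call like `asyncio.run` to work.

```python
import asyncio


async def fake_eval_request(doc_id, delay=0.01):
    # Hypothetical stand-in for one LLM call; the sleep simulates latency.
    await asyncio.sleep(delay)
    return f"label-for-{doc_id}"


async def run_concurrently(doc_ids, max_concurrency=5):
    # The semaphore caps in-flight requests, loosely mimicking a rate limit.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(doc_id):
        async with semaphore:
            return await fake_eval_request(doc_id)

    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(d) for d in doc_ids))


# In a plain script this works as-is; inside Jupyter or Colab an event loop
# is already running, which is the situation nest_asyncio addresses.
results = asyncio.run(run_concurrently(range(10)))
print(results[:3])
```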

Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against benchmark datasets of queries and retrieved documents with ground-truth relevance labels. Currently supported datasets include:

  • "wiki_qa-train"
  • "ms_marco-v1.1-train"

Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

The template variables are:

  • input: the question asked by a user
  • reference: the text of the retrieved document
  • output: a ground-truth relevance label
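As a rough illustration of how such variables are filled in, the sketch below formats a hypothetical template for one row. The template wording is invented for this example and is not the library's default. It also assumes that only `input` and `reference` appear in the prompt text, with the ground-truth label held out for scoring.

```python
# Hypothetical template for illustration; not the library's default wording.
CUSTOM_TEMPLATE = (
    "Question: {input}\n"
    "Reference text: {reference}\n"
    'Answer with a single word, "relevant" or "unrelated".'
)

# One made-up row; the keys match the template variable names.
row = {
    "input": "What is the capital of France?",
    "reference": "Paris is the capital and largest city of France.",
}

prompt = CUSTOM_TEMPLATE.format(**row)
print(prompt)
```

When iterating on a template, keeping the variable names unchanged means the same dataset columns can be reused across variants.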

Configure the LLM

Configure your OpenAI API key.

Benchmark Dataset Sample

Sample size determines run time. We recommend iterating on a small sample of about 100 rows, then increasing to a larger test set.
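A seeded sample keeps the subset identical between runs, so that changes in the metrics reflect changes to the template or model rather than to the data. The sketch below uses a made-up list of rows in place of the benchmark dataset; the helper name and seed value are assumptions.

```python
import random

# Hypothetical stand-in for the benchmark rows (query, document, label).
dataset = [
    {"query": f"q{i}", "document": f"d{i}", "relevant": i % 2 == 0}
    for i in range(1000)
]


def sample_rows(rows, n_samples, seed=2023):
    # A dedicated Random instance with a fixed seed gives a repeatable sample
    # without touching the global random state.
    rng = random.Random(seed)
    return rng.sample(rows, min(n_samples, len(rows)))


small_sample = sample_rows(dataset, 100)
print(len(small_sample))
```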

LLM Evals: Retrieval Relevance Classifications GPT-4

Instantiate the LLM and set its parameters, then run relevance classifications against a subset of the data.

Run Relevance Classifications

Run relevance classifications against a subset of the data.
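One detail worth understanding in this step is how free-form model output becomes a fixed label. The sketch below shows one possible approach: normalize the raw text and accept it only if it matches an allowed label. The label set, the `NOT_PARSABLE` marker, and `fake_llm` are all assumptions made for illustration; the notebook itself delegates this work to the library.

```python
RAILS = ["relevant", "unrelated"]  # assumed set of allowed labels
NOT_PARSABLE = "NOT_PARSABLE"      # hypothetical marker for unusable output


def snap_to_rails(raw_output, rails=RAILS):
    # Trim whitespace, then surrounding quotes and periods, then lowercase.
    cleaned = raw_output.strip().strip('."\'').lower()
    return cleaned if cleaned in rails else NOT_PARSABLE


def fake_llm(prompt):
    # Hypothetical stand-in for a model call, so the sketch runs offline.
    return " Relevant.\n" if "Paris" in prompt else "I am not sure"


prompts = [
    "Question: capital of France? Reference: Paris is the capital.",
    "Question: capital of France? Reference: Bananas are yellow.",
]
labels = [snap_to_rails(fake_llm(p)) for p in prompts]
print(labels)
```

Exact matching after normalization is deliberately strict: a substring check would wrongly accept "irrelevant" as "relevant".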

Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth relevance labels.
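A comparison of this kind typically reports precision, recall, and F1 for the positive class. The following is a minimal stdlib sketch of those metrics; the label strings and the sample data are hypothetical and do not represent results from this notebook.

```python
def binary_metrics(true_labels, predicted_labels, positive="relevant"):
    pairs = list(zip(true_labels, predicted_labels))
    # Count true positives, false positives, and false negatives.
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up labels, for illustration only.
true_labels = ["relevant", "relevant", "unrelated", "unrelated"]
predicted = ["relevant", "unrelated", "relevant", "unrelated"]
metrics = binary_metrics(true_labels, predicted)
print(metrics)
```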

Classifications with explanations

When evaluating a dataset for relevance, it can be useful to know why the LLM classified a document as relevant or irrelevant. The following code block runs llm_classify with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff, since more tokens are generated, but the explanations can be highly informative when troubleshooting.
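Explanations are most useful on the rows where the LLM disagrees with the ground truth. The sketch below filters for those rows and prints the explanation for each. The rows, field names, and explanation text are hypothetical; the actual output columns of llm_classify may differ.

```python
# Hypothetical rows; field names are assumptions for this sketch.
rows = [
    {
        "query": "Who wrote Hamlet?",
        "true_label": "relevant",
        "predicted_label": "relevant",
        "explanation": "The passage names Shakespeare as the author.",
    },
    {
        "query": "How tall is Mount Everest?",
        "true_label": "relevant",
        "predicted_label": "unrelated",
        "explanation": "The passage covers climbing history, not height.",
    },
]


def disagreements(rows):
    # Keep only rows where the prediction differs from the ground truth.
    return [r for r in rows if r["predicted_label"] != r["true_label"]]


to_review = disagreements(rows)
for r in to_review:
    print(f"{r['query']}: predicted {r['predicted_label']}, "
          f"expected {r['true_label']}. Reason: {r['explanation']}")
```

Reading a handful of these disagreements often shows whether the template, the model, or the ground-truth label itself is at fault.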

LLM Evals: Relevance Classifications GPT-3.5 Turbo

Run relevance classifications against a subset of the data using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.

Preview: Running with GPT-4 Turbo
