Evaluate Reference Link Correctness Classifications

Reference Link Evals

The purpose of this notebook is:

  • to evaluate the performance of an LLM-assisted approach to assessing the quality of reference links provided in Q&A answers,
  • to provide an experimental framework for users to iterate and improve on the default classification template.
Note: This notebook was last updated on May 30, 2025.

Reference Links in Q&A

In chatbots and Q&A systems, reference links are often provided along with an answer to point users to documentation or pages that contain more information or the source for the answer.

EXAMPLE: Q&A from Arize-Phoenix Documentation

QUESTION: Does Phoenix Evals support models besides OpenAI for running Evals?

ANSWER: Phoenix does support a large set of LLM models through the model object. Phoenix supports OpenAI (GPT-4, GPT-4-32k, GPT-3.5 Turbo, GPT-3.5 Instruct, etc...), Azure OpenAI, Google Palm2 Text Bison, and All AWS Bedrock models (Claude, Mistral, etc...).

REFERENCE LINK: https://arize.com/docs/phoenix/api/evaluation-models

This Eval checks whether the reference link returned answers the question asked in a conversation.

[ ]

Install Dependencies and Import Libraries

[ ]

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use nest_asyncio. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without nest_asyncio, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
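
A minimal sketch of the optional patch described above, assuming the third-party nest_asyncio package is installed (for example via pip). The helper name enable_nested_event_loops is hypothetical; it only wraps the package's apply() call so that the cell still runs when the package is missing.

```python
def enable_nested_event_loops() -> bool:
    """Patch asyncio so event loops are re-entrant; return True if patched."""
    try:
        import nest_asyncio  # third-party package, assumed to be installed
    except ImportError:
        # Without the patch, evals still run, only more slowly in notebooks.
        return False
    nest_asyncio.apply()  # globally patches asyncio
    return True


patched = enable_nested_event_loops()
print("nest_asyncio applied:", patched)
```

Outside a notebook (for example in a plain Python script) this step can be skipped entirely.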

[ ]
[ ]
[ ]


Visualize your evals using Phoenix. Click the link above to open a local Phoenix session.

Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of queries and ground-truth labels. This dataset was created from questions and answers on the Arize documentation. Some answers have correct reference links and others have wrong ones.

[ ]
[ ]

Display Binary Ref Link Eval Template

This Eval template checks for a correct link based on a question or conversation. It checks whether the text of the page that the reference link points to correctly answers the question.

[ ]

Template variables:

  • input : The customer and assistant conversation, in which the assistant supplies a link to answer the customer's question
  • reference : The text content of the page that the supplied link points to
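
To make the two variables concrete, here is a minimal sketch of how a template with {input} and {reference} placeholders might be filled in. The template text, the "correct"/"incorrect" labels, and the helper name fill_template are all assumptions made for illustration; this is not the actual Phoenix template.

```python
# Hypothetical stand-in template; the real Phoenix template text differs.
EXAMPLE_TEMPLATE = (
    "You are given a conversation and the text of a linked page.\n"
    "[Conversation]: {input}\n"
    "[Page text]: {reference}\n"
    "Does the page text answer the question asked in the conversation? "
    "Respond with a single word: correct or incorrect."  # assumed labels
)


def fill_template(template: str, conversation: str, page_text: str) -> str:
    """Substitute the two template variables into the prompt."""
    return template.format(input=conversation, reference=page_text)


prompt = fill_template(
    EXAMPLE_TEMPLATE,
    conversation=(
        "Customer: Does Phoenix Evals support models besides OpenAI?\n"
        "Assistant: Yes, see https://arize.com/docs/phoenix/api/evaluation-models"
    ),
    page_text="(text of the linked page goes here)",
)
print(prompt)
```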

Configure the LLM

Configure your OpenAI API key.

[ ]
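
One possible way to configure the key is sketched below, assuming it is read from the standard OPENAI_API_KEY environment variable and requested interactively only when that variable is unset. The helper name ensure_openai_api_key is hypothetical.

```python
import os
from getpass import getpass


def ensure_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting once if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed key so it does not end up in notebook output.
        key = getpass("Enter your OpenAI API key: ")
        os.environ[env_var] = key
    return key
```

Calling ensure_openai_api_key() once before instantiating the model is enough; later cells can then rely on the environment variable.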

LLM Evals: Reference Link Classifications GPT-4

Run reference link classifications against a subset of the data.

Instantiate the LLM and set parameters.

[ ]
[ ]
[ ]

Evaluate the predictions against human-labeled ground-truth labels.

[ ]
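
As a rough sketch of what comparing predictions with ground truth involves, the snippet below computes confusion counts, precision, recall, and F1 using only the standard library. The label names "correct" and "incorrect" and the function name binary_report are assumptions, and the toy labels are made up for illustration; they are not results from the benchmark.

```python
from collections import Counter


def binary_report(true_labels, predicted_labels, positive="correct"):
    """Compute confusion counts, precision, recall, and F1 for one label."""
    counts = Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (assumed label names).
truth = ["correct", "correct", "incorrect", "incorrect"]
preds = ["correct", "incorrect", "incorrect", "correct"]
report = binary_report(truth, preds)
print(report)
```

The same function can be reused for each model's predictions so that the runs are scored in the same way.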

LLM Evals: Reference Link Classifications GPT-3.5

Run reference link evaluations against a subset of the data.

[ ]
[ ]
[ ]

LLM Evals: Reference Link Classifications GPT-4 Turbo

Run reference link evaluations against the data.

[ ]
[ ]
[ ]