UQLM × Arize Phoenix — Response-Level Confidence & Hallucination Risk

This notebook shows how to compute model-agnostic, ground-truth-free confidence / risk scores using UQLM after creating a dataset and starting an experiment in Arize Phoenix.

What you'll do:

Install and configure Phoenix & UQLM
Create a small demo dataset with prompts & responses
Upload your dataset into Phoenix and configure a task to create your pre-sampled responses
Compute UQLM per-scorer confidences and an ensemble confidence
Derive risk = 1 - confidence and an optional high_risk flag

🧩 Why: UQLM provides a production-friendly, model-agnostic uncertainty signal that complements judge-style evals and helps you flag risky answers without labeled ground truth.

In your terminal, after running pip install arize-phoenix, run phoenix serve to host Phoenix locally. Then proceed with this notebook.

0) Install packages

[ ]

1) Imports & version check

[ ]
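A version check along these lines can confirm that both packages are installed before continuing. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution names are assumed to match the PyPI package names `arize-phoenix` and `uqlm`.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names are assumed to match the PyPI package names.
for dist in ("arize-phoenix", "uqlm"):
    print(f"{dist}: {installed_version(dist) or 'not installed'}")
```

If either line prints "not installed", rerun the install step before moving on.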

2) Configure Phoenix connection

Set these if you're sending results to Phoenix Cloud or your own self-hosted Phoenix.

[ ]
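One way to configure the connection is through environment variables, set before any Phoenix client is created. The sketch below assumes the `PHOENIX_COLLECTOR_ENDPOINT` and `PHOENIX_API_KEY` variables and the default local address `http://localhost:6006` used by `phoenix serve`; `configure_phoenix` is a hypothetical helper, and your deployment may require different settings.

```python
import os
from typing import Optional


def configure_phoenix(endpoint: str = "http://localhost:6006",
                      api_key: Optional[str] = None) -> str:
    """Point the Phoenix client at an endpoint; returns the endpoint in effect."""
    os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = endpoint
    if api_key:
        # Only needed for Phoenix Cloud or a self-hosted instance with auth enabled.
        os.environ["PHOENIX_API_KEY"] = api_key
    return os.environ["PHOENIX_COLLECTOR_ENDPOINT"]


# Local default; pass your own endpoint and key for a hosted instance.
configure_phoenix()
```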

3) Demo dataset

We'll make a small dataset with:

input – the user prompt
output – the model response
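A dataset with these two fields could look like the sketch below. The rows are illustrative placeholders, not the notebook's original data. The upload helper is hypothetical and assumes `pandas` is installed and that `px.Client().upload_dataset(...)` accepts `dataframe`, `dataset_name`, `input_keys`, and `output_keys`; check this against your installed Phoenix version.

```python
# Illustrative rows only: one prompt and one model response per row.
demo_rows = [
    {"input": "What is the capital of France?",
     "output": "The capital of France is Paris."},
    {"input": "Who wrote Pride and Prejudice?",
     "output": "Pride and Prejudice was written by Jane Austen."},
    {"input": "At what Celsius temperature does water boil at sea level?",
     "output": "Water boils at 100 degrees Celsius at sea level."},
]


def upload_demo_dataset(rows, dataset_name="uqlm-confidence-demo"):
    """Upload rows to a running Phoenix instance (hypothetical helper)."""
    # Imported lazily so the rows above can be inspected without these packages.
    import pandas as pd
    import phoenix as px

    return px.Client().upload_dataset(
        dataset_name=dataset_name,
        dataframe=pd.DataFrame(rows),
        input_keys=["input"],
        output_keys=["output"],
    )
```

Calling `upload_demo_dataset(demo_rows)` requires a reachable Phoenix server.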

Once we upload this dataset into Phoenix, we can run a task to create our:

sampled_responses – a list of stochastic responses for the same prompt (for black-box UQLM)

[ ]
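The sampling task can be sketched as a function that calls a generator K times for each prompt. Here `make_sampling_task` and `generate` are hypothetical names, and the sketch assumes that Phoenix passes each example's input dictionary to a task parameter named `input`.

```python
from typing import Callable, Dict, List


def make_sampling_task(generate: Callable[[str], str], k: int = 5):
    """Build a task that samples k responses for the prompt in each example."""

    def task(input: Dict[str, str]) -> Dict[str, List[str]]:
        prompt = input["input"]
        # Each call should be stochastic (temperature > 0) for the samples to differ.
        return {"sampled_responses": [generate(prompt) for _ in range(k)]}

    return task


# Assumed usage with Phoenix experiments (not executed here):
#   from phoenix.experiments import run_experiment
#   experiment = run_experiment(dataset, task=make_sampling_task(my_generate, k=5))
```

Building the task with a factory keeps the LLM call swappable, so a stub generator can stand in during testing.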

4) UQLM adapter: per-scorer + ensemble confidence

Below is a compact adapter that:

Runs BlackBoxUQ score(...) if you already have responses and sampled_responses.
Optionally runs BlackBoxUQ generate_and_score(...) if you pass an llm and set num_responses > 0.
Optionally runs WhiteBoxUQ if you pass an llm that returns token-level logprobs.
Computes an ensemble confidence (mean/median/weighted) and adds uqlm_confidence, uqlm_risk, and uqlm_high_risk.

[ ]
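The ensemble step of such an adapter can be illustrated in plain Python. This is a sketch, not the notebook's original adapter. It assumes per-scorer confidences in [0, 1], and the 0.5 risk threshold and the `>=` comparison are illustrative choices. `ensemble_confidence` and `add_uqlm_columns` are hypothetical names.

```python
from statistics import mean, median
from typing import Dict, Optional


def ensemble_confidence(scores: Dict[str, float], method: str = "mean",
                        weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-scorer confidences (each in [0, 1]) into a single value."""
    if not scores:
        raise ValueError("at least one scorer value is required")
    values = list(scores.values())
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "weighted":
        if not weights:
            raise ValueError("weights are required for method='weighted'")
        total = sum(weights[name] for name in scores)
        return sum(scores[name] * weights[name] for name in scores) / total
    raise ValueError(f"unknown method: {method!r}")


def add_uqlm_columns(scores: Dict[str, float], method: str = "mean",
                     weights: Optional[Dict[str, float]] = None,
                     risk_threshold: float = 0.5) -> Dict[str, object]:
    """Return the scores plus uqlm_confidence, uqlm_risk and uqlm_high_risk."""
    confidence = ensemble_confidence(scores, method, weights)
    risk = 1.0 - confidence  # risk = 1 - confidence
    return {**scores,
            "uqlm_confidence": confidence,
            "uqlm_risk": risk,
            "uqlm_high_risk": risk >= risk_threshold}  # threshold is an assumption
```

For example, scores of 0.5 and 1.0 average to a confidence of 0.75 and a risk of 0.25, which falls below the 0.5 threshold and is not flagged.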

4.a) Run BlackBoxUQ scoring on pre-sampled responses

This path requires no LLM calls—it's the fastest way to test-drive UQLM.

[ ]
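A sketch of this path is shown below. It assumes that `BlackBoxUQ` can be constructed without an `llm`, that `score(responses=..., sampled_responses=...)` is synchronous, and that the result exposes `to_df()`; verify these against your installed UQLM version. The scorer names and the helper functions `check_presampled` and `score_presampled` are illustrative.

```python
from typing import List, Sequence


def check_presampled(responses: List[str],
                     sampled_responses: List[List[str]]) -> None:
    """Validate that every response has a non-empty list of sampled responses."""
    if not responses:
        raise ValueError("responses must not be empty")
    if len(responses) != len(sampled_responses):
        raise ValueError("responses and sampled_responses must have equal length")
    if any(len(samples) == 0 for samples in sampled_responses):
        raise ValueError("each prompt needs at least one sampled response")


def score_presampled(responses: List[str],
                     sampled_responses: List[List[str]],
                     scorers: Sequence[str] = ("exact_match", "noncontradiction")):
    """Score existing responses with BlackBoxUQ; no LLM calls are made."""
    from uqlm import BlackBoxUQ  # imported lazily; requires `pip install uqlm`

    check_presampled(responses, sampled_responses)
    bbuq = BlackBoxUQ(scorers=list(scorers))
    result = bbuq.score(responses=responses, sampled_responses=sampled_responses)
    return result.to_df()
```

Some scorers download a model on first use, so the first call may be slow even though no LLM is queried.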

4.b) (Optional) Generate-and-score with your LLM client

If you provide an llm client to UQLM (e.g., OpenAI/Anthropic), set num_responses > 0 to have UQLM sample K responses per prompt and score on the fly.

⚠️ Note: Replace the placeholder MyLLMClient with a real client that UQLM supports.

[ ]
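The generate-and-score path might look like the following sketch. It assumes `BlackBoxUQ.generate_and_score(prompts=..., num_responses=...)` is a coroutine and that `llm` is a LangChain-style chat model; check both against your UQLM version. `generate_and_score_df` and `run_generate_and_score` are hypothetical wrappers, and the scorer names are examples.

```python
import asyncio
from typing import List, Sequence


async def generate_and_score_df(prompts: List[str], llm, num_responses: int = 5,
                                scorers: Sequence[str] = ("exact_match",
                                                          "noncontradiction")):
    """Sample num_responses answers per prompt with `llm`, then score them."""
    if num_responses <= 0:
        raise ValueError("num_responses must be > 0 to sample responses")
    from uqlm import BlackBoxUQ  # imported lazily; requires `pip install uqlm`

    bbuq = BlackBoxUQ(llm=llm, scorers=list(scorers))
    result = await bbuq.generate_and_score(prompts=prompts,
                                           num_responses=num_responses)
    return result.to_df()


def run_generate_and_score(prompts: List[str], llm, num_responses: int = 5):
    """Synchronous entry point for scripts; in a notebook, await the coroutine."""
    return asyncio.run(generate_and_score_df(prompts, llm, num_responses))
```

In a notebook cell, an event loop is already running, so use `await generate_and_score_df(...)` instead of the synchronous wrapper.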

4.c) (Optional) WhiteBox scoring (token-level logprobs)

If your client returns token logprobs, UQLM can compute white-box signals like minimum token probability. Replace the placeholder client below with a real logprob-capable client and set USE_WHITEBOX=True.

[ ]
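To make the white-box signal concrete, the sketch below computes minimum token probability directly from a list of token log-probabilities. It shows the underlying arithmetic rather than UQLM's implementation. The length-normalized variant is included as a common companion metric, and both function names are hypothetical.

```python
import math
from typing import List


def min_token_probability(logprobs: List[float]) -> float:
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(min(logprobs))


def normalized_probability(logprobs: List[float]) -> float:
    """Geometric mean of the token probabilities (length-normalized)."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(sum(logprobs) / len(logprobs))


# Illustrative log-probabilities for a three-token response.
example_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
print(min_token_probability(example_logprobs))
print(normalized_probability(example_logprobs))
```

A single low-probability token pulls the minimum down sharply, which is why it can flag a response whose average token probability still looks healthy.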