UQLM Phoenix Confidence Example
UQLM × Arize Phoenix: Response-Level Confidence & Hallucination Risk
This notebook shows how to compute model-agnostic, ground-truth-free confidence / risk scores using UQLM after creating a dataset and starting an experiment in Arize Phoenix.
What you'll do:
- Install and configure Phoenix & UQLM
- Create a small demo dataset with prompts & responses
- Upload the dataset into Phoenix and configure a task that creates your pre-sampled responses
- Compute UQLM per-scorer confidences and an ensemble confidence
- Derive risk = 1 - confidence and an optional high_risk flag
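The last step in the list above is simple arithmetic, so a minimal sketch may help. It assumes the confidence is a float in [0, 1]; the function name add_risk and the 0.5 threshold are hypothetical choices for illustration, not values taken from UQLM or Phoenix.

```python
def add_risk(confidence: float, threshold: float = 0.5) -> dict:
    """Derive risk = 1 - confidence and an optional high-risk flag.

    `threshold` is a hypothetical cut-off; tune it for your own use case.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    risk = 1.0 - confidence
    return {
        "uqlm_confidence": confidence,
        "uqlm_risk": risk,
        "uqlm_high_risk": risk >= threshold,  # flag answers at or above the cut-off
    }


print(add_risk(0.25))
```

A lower threshold flags more answers for review; a higher one flags only the least confident responses.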
🧩 Why: UQLM provides a production-friendly, model-agnostic uncertainty signal that complements judge-style evals and helps you flag risky answers without labeled ground truth.
In your terminal, after running pip install arize-phoenix, run phoenix serve to host Phoenix locally. Then proceed with this notebook.
0) Install packages
1) Imports & version check
2) Configure Phoenix connection
Set these if you're sending results to Phoenix Cloud or your own self-hosted Phoenix.
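As a rough sketch of what this configuration might look like, the snippet below sets connection details through environment variables. It assumes that Phoenix reads PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY and that a local phoenix serve listens on port 6006; check both against the Phoenix version you have installed.

```python
import os

# Assumption: a local `phoenix serve` is reachable at http://localhost:6006.
# setdefault keeps any endpoint you have already exported in your shell.
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")

# For Phoenix Cloud or an authenticated self-hosted instance, export
# PHOENIX_API_KEY outside the notebook rather than hard-coding it here.
api_key = os.environ.get("PHOENIX_API_KEY")
if api_key is None:
    print("PHOENIX_API_KEY not set; assuming an unauthenticated local Phoenix.")

endpoint = os.environ["PHOENIX_COLLECTOR_ENDPOINT"]
print(f"Phoenix endpoint: {endpoint}")
```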
3) Demo dataset
We'll make a small dataset with:
- input: the user prompt
- output: the model response
Once we upload this dataset into Phoenix, we can run a task to create our:
- sampled_responses: a list of stochastic responses for the same prompt (for black-box UQLM)
4) UQLM adapter: per-scorer + ensemble confidence
Below is a compact adapter that:
- Runs BlackBoxUQ score(...) if you already have responses and sampled_responses.
- Optionally runs BlackBoxUQ generate_and_score(...) if you pass an llm and set num_responses > 0.
- Optionally runs WhiteBoxUQ if you pass an llm that returns token-level logprobs.
- Computes an ensemble confidence (mean/median/weighted) and adds uqlm_confidence, uqlm_risk, and uqlm_high_risk.
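The ensemble step can be sketched independently of UQLM. The helper below is a hypothetical illustration, not the adapter's actual code: it assumes each scorer yields a confidence in [0, 1], and the scorer names in the usage lines are placeholders.

```python
from statistics import mean, median


def ensemble_confidence(scores, method="mean", weights=None):
    """Combine per-scorer confidences into one value.

    scores:  dict mapping scorer name -> confidence in [0, 1]
    method:  "mean", "median", or "weighted"
    weights: dict mapping scorer name -> non-negative weight ("weighted" only)
    """
    if not scores:
        raise ValueError("scores must not be empty")
    values = list(scores.values())
    if method == "mean":
        return mean(values)
    if method == "median":
        return median(values)
    if method == "weighted":
        if not weights:
            raise ValueError("weights are required for method='weighted'")
        total = sum(weights[name] for name in scores)
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        # Weighted average, normalized so the result stays in [0, 1].
        return sum(scores[name] * weights[name] for name in scores) / total
    raise ValueError(f"unknown method: {method}")


# Placeholder scorer names and values, for illustration only.
per_scorer = {"scorer_a": 0.5, "scorer_b": 1.0, "scorer_c": 0.75}
print(ensemble_confidence(per_scorer, "median"))
print(ensemble_confidence(per_scorer, "weighted",
                          {"scorer_a": 3, "scorer_b": 1, "scorer_c": 0}))
```

The median is less sensitive to a single outlying scorer, while the weighted form lets you favor scorers you trust more.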
4.a) Run BlackBoxUQ scoring on pre-sampled responses
This path requires no LLM calls, so it's the fastest way to test-drive UQLM.
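To make the idea of scoring pre-sampled responses concrete, here is a toy consistency scorer: the fraction of sampled responses that match the original response after light normalization. It is a simplified stand-in written for this walkthrough, not UQLM's implementation, and the function name is hypothetical.

```python
from typing import Sequence


def exact_match_rate(response: str, sampled_responses: Sequence[str]) -> float:
    """Share of sampled responses equal to `response`.

    Comparison ignores case and collapses whitespace. Higher agreement
    across samples is read as higher confidence.
    """
    if not sampled_responses:
        raise ValueError("sampled_responses must not be empty")

    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    target = normalize(response)
    hits = sum(normalize(s) == target for s in sampled_responses)
    return hits / len(sampled_responses)


print(exact_match_rate("Paris", ["Paris", "paris ", "Lyon", "Paris"]))
```

Exact matching only suits short, factual answers; longer free-text responses need a semantic comparison instead.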
4.b) (Optional) Generate-and-score with your LLM client
If you provide an llm client to UQLM (e.g., OpenAI/Anthropic), set num_responses > 0 to have UQLM sample K responses per prompt and score on the fly.
⚠️ Note: Replace the placeholder MyLLMClient with a real client that UQLM supports.
4.c) (Optional) WhiteBox scoring (token-level logprobs)
If your client returns token logprobs, UQLM can compute white-box signals like minimum token probability. Replace the placeholder client below with a real logprob-capable client and set USE_WHITEBOX=True.
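For intuition about what a white-box signal measures, the sketch below computes two simple quantities from a list of token log-probabilities: the minimum token probability and the length-normalized (geometric-mean) probability. The function names and the sample values are illustrative assumptions; this is not UQLM's code.

```python
import math
from typing import Sequence


def min_token_probability(logprobs: Sequence[float]) -> float:
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(min(logprobs))


def normalized_probability(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (length-normalized)."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(sum(logprobs) / len(logprobs))


# Made-up token probabilities for a three-token response.
token_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
print(min_token_probability(token_logprobs))
print(normalized_probability(token_logprobs))
```

The minimum is sensitive to a single uncertain token, whereas the normalized form summarizes the whole response without penalizing length.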