Running Weave Scorers

This notebook walks you through loading and calling Weave Scorers. It also shows how to use them both in a Weave Evaluation and as a Weave Guardrail.

To learn more about how these local model scorers were trained and evaluated, see this W&B Report.

Note: This notebook runs best with an L4 GPU or higher.

Setup

Installation & Login


Log in to Weights & Biases and start Weave


Weave Scorers

Initialising scorers with local models

These local models first need to be downloaded from W&B Artifacts on initialisation:

	from weave.scorers import WeaveHallucinationScorerV1

	hallu_scorer = WeaveHallucinationScorerV1()

Running scorers

All Weave scorers are called via their .score method, passing the scorer-specific parameters required.

	scores = hallu_scorer.score(
	    query="What is the capital of Antarctica?",
	    context="Penguins love Antarctica.",
	    output="The capital of Antarctica is Quito.",
	)

Example - Running a single Scorer

Here we will run the hallucination scorer.


Example - Running an Eval with 2 Weave Scorers

For a full understanding of Weave Evaluations, please see the Evaluation documentation here.


Let's evaluate our data using 2 scorers, the WeaveBiasScorer and the WeaveContextRelevanceScorer. First we'll download the model weights for each.

Customising your Scorer for Evaluations

Sometimes when running a Weave Evaluation it is necessary to modify the signature of your scorer's score method so that it works as expected with the outputs from your model.

For example, in this case WeaveBiasScorer.score expects only a string to be passed to its output parameter. However, our AI model outputs a dict.

To pass the "query" string from the model's output dict to the WeaveBiasScorer, you can subclass WeaveBiasScorer and override its score method to extract the value for "query" and pass it to the output parameter.


We do the same mapping for the WeaveContextRelevanceScorer: it expects a query param and an output param, where output is the context.


Now let's initialise the scorers and download the model weights.


Now let's run the evaluation. You can click on the Weave link generated once the evaluation has finished to see the results.


Weave Guardrails

When using Weave Guardrails you can see the metrics from the guardrail inline with your function's inputs and outputs.

Below is an example function which calls the WeaveToxicityScorer and returns different outputs depending on whether or not the Guardrail scorer was triggered.

The two main points are:

  • retrieve the Call object from a weave op'd function after it has been called
  • use call.apply_scorer to apply a scorer to the output of that call

For a full understanding of Weave Guardrails, please see the Guardrails documentation here.


Safe input:


Unsafe input:


All Scorers

Context Relevance

The context relevance scorer returns a pass boolean indicating whether or not the output is relevant to the input and context.

For additional granularity, it also returns a score representing the degree of relevance.

Passing verbose=True to the score method will also return a score for each context span (chunk of text) given.


Returning per-chunk scores on a text with longer chunks:


Hallucination


Hallucinated output:


Non-hallucinated output:


Adjusting the threshold - a lower threshold results in higher recall but lower precision.


Testing a longer text that contains a hallucination - Edison's last words aren't mentioned in the query or context.


Bias/Stereotype


Toxicity


The model scores 5 different categories from 0 to 3. If the sum of these scores is above total_threshold (default 5) then the input will be flagged. If any single category has a score higher than category_threshold (default 2) then the input will also be flagged. We tuned these default values to decrease false positives and improve recall.

If you want more aggressive filtering, you can override the category_threshold and total_threshold parameters in the constructor:


Coherence


Incoherent output


Coherent output


Fluency


Low fluency


High fluency


Trustworthiness

The Trustworthiness scorer runs 5 scorers in parallel for an overall assessment of the query, context and output:

  • 3 "critical" scorers: WeaveToxicityScorer, WeaveHallucinationScorer, WeaveContextRelevanceScorer

  • 2 "advisory" scorers: WeaveCoherenceScorer, WeaveFluencyScorer


There are 2 issues with the following:

  • irrelevant context
  • hallucinated output

Personally Identifiable Information (PII)

The PresidioScorer uses Microsoft's Presidio library to detect and anonymize PII.

Parameters:

  • selected_entities: a list of entity types to detect in the text. If no value is passed, Presidio will try to detect all entity types in its default entities list.

  • language: the language of the input text.

  • custom_recognizers: a list of custom Presidio recognizers of type presidio.EntityRecognizer.


Helper function to display results:


Run the scorer:


Running again, but now only detecting email addresses:
