Local Weave Scorers
Running Weave Scorers
This notebook walks you through how to load and call Weave Scorers, and shows how to use them in a Weave Evaluation as well as a Weave Guardrail.
To learn more about how these local model scorers were trained and evaluated, see the accompanying W&B Report.
Note: This notebook runs best with an L4 GPU or higher.
Setup
Installation & Login
Log in to Weights & Biases and start Weave
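The setup can be sketched as follows; the project name below is a placeholder (not from this notebook), and `wandb.login` / `weave.init` are the standard entry points:

```python
PROJECT = "local-scorers-demo"  # placeholder W&B project name

def start_tracking(project: str = PROJECT):
    """Log in to W&B and initialise Weave tracking for the given project."""
    # Imports are kept inside the function so this sketch only needs the
    # libraries when it is actually run.
    import wandb
    import weave

    wandb.login()  # prompts for a W&B API key if one isn't cached
    return weave.init(project)
```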
Weave Scorers
Initialising scorers with local models
These local models first need to be downloaded from W&B Artifacts on initialisation:
from weave.scorers import WeaveHallucinationScorerV1
hallu_scorer = WeaveHallucinationScorerV1()
Running scorers
All Weave Scorers are called via their .score method, passing the scorer-specific parameters it requires.
scores = hallu_scorer.score(
    query="What is the capital of Antarctica?",
    context="Penguins love Antarctica.",
    output="The capital of Antarctica is Quito.",
)
Example - Running a single Scorer
Here we will run the hallucination scorer
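As a minimal sketch, a single run might look like the following; the `passed` attribute on the result is an assumption based on the pass/fail behaviour described later in this notebook, so check the returned object's actual schema:

```python
def run_hallucination_check():
    # Weights are downloaded from W&B Artifacts the first time the scorer
    # is initialised, so this needs a W&B login.
    from weave.scorers import WeaveHallucinationScorerV1

    hallu_scorer = WeaveHallucinationScorerV1()
    result = hallu_scorer.score(
        query="What is the capital of Antarctica?",
        context="Penguins love Antarctica.",
        output="The capital of Antarctica is Quito.",
    )
    print(result.passed)  # assumed flag: False when a hallucination is flagged
    return result
```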
Example - Running an Eval with 2 Weave Scorers
For a full understanding of Weave Evaluations, please see the Weave Evaluation documentation.
Let's evaluate our data using 2 Scorers, the WeaveBiasScorer and WeaveContextRelevanceScorer. First we'll download the model weights for each.
Customising your Scorer for Evaluations
Sometimes when running a Weave Evaluation, you need to modify the signature of your scorer's score method so that it works as expected with the outputs from your model.
For example, in this case WeaveBiasScorer.score expects a string for its output parameter, but our AI model returns a dict.
To pass the "query" string from the model's output dict to the WeaveBiasScorer, you can subclass WeaveBiasScorer, extract the value for "query", and pass it to the output param.
We do the same mapping for WeaveContextRelevanceScorer, which expects a query param and an output param, where output is the context.
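A hedged sketch of both subclasses; the base-class method signatures and the "context" key in the model's dict are assumptions for illustration, so check them against your model's actual output:

```python
def extract_query(model_output: dict) -> str:
    """Pull the 'query' string out of the model's output dict."""
    return model_output["query"]

def build_scorers():
    # Imports are kept inside so this sketch only needs weave when used.
    import weave
    from weave.scorers import WeaveBiasScorer, WeaveContextRelevanceScorer

    class BiasScorerFromDict(WeaveBiasScorer):
        @weave.op
        def score(self, output: dict):
            # WeaveBiasScorer.score expects a plain string for `output`
            return super().score(output=extract_query(output))

    class ContextRelevanceFromDict(WeaveContextRelevanceScorer):
        @weave.op
        def score(self, query: str, output: dict):
            # the scorer's `output` param is the context; "context" is an
            # assumed key in the model's dict, for illustration
            return super().score(query=query, output=output["context"])

    return BiasScorerFromDict, ContextRelevanceFromDict
```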
Now let's initialise and download the model weights.
Now let's run the evaluation. Once it finishes, you can click the generated Weave link to see the results.
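Putting it together, the evaluation run can be sketched like this; the dataset row is a made-up example, and `model` stands for any weave Model or op'd function:

```python
def run_evaluation(model, bias_scorer, relevance_scorer):
    # Sketch of a Weave Evaluation over two scorers; check the Weave
    # Evaluation docs for the full set of constructor options.
    import asyncio
    import weave

    dataset = [
        {"query": "What is the capital of Antarctica?",
         "context": "Penguins love Antarctica."},
    ]
    evaluation = weave.Evaluation(
        dataset=dataset,
        scorers=[bias_scorer, relevance_scorer],
    )
    # evaluation.evaluate is async, so drive it with asyncio.run
    return asyncio.run(evaluation.evaluate(model))
```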
Weave Guardrails
When using Weave Guardrails you can see the metrics from the guardrail inline with your function's inputs and outputs.
Below is an example function which calls the WeaveToxicityScorer and returns different outputs depending on whether or not the Guardrail scorer was triggered.
The two main points are:
- retrieve the call from a weave op'd function that has been called
- use call.apply_scorer to apply a scorer to the output of that function
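The two steps above can be sketched as follows; `generate` is a stand-in for a real model call, and the exact shape of the object returned by call.apply_scorer is an assumption (here taken to expose `.result.passed`):

```python
FALLBACK = "I'm sorry, I can't respond to that."

def choose_response(passed: bool, output: str) -> str:
    """Pure guardrail decision: keep the output only if the scorer passed."""
    return output if passed else FALLBACK

async def guarded_generate(prompt: str) -> str:
    import weave
    from weave.scorers import WeaveToxicityScorer

    @weave.op
    def generate(prompt: str) -> str:
        return f"Echoing: {prompt}"  # stand-in for a real model call

    # .call() returns both the output and the Call object, so a scorer can
    # be attached to this specific invocation.
    output, call = generate.call(prompt)
    scorer_result = await call.apply_scorer(WeaveToxicityScorer())
    return choose_response(scorer_result.result.passed, output)
```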
For a full understanding of Weave Guardrails, please see the Weave Guardrails documentation.
Safe input:
Unsafe input:
All Scorers
Context Relevance
The context relevance scorer returns a pass boolean to determine whether or not the output is relevant to the input and context.
For finer granularity it also returns a score, which is the degree of relevance.
Passing verbose=True to the score method will return scores for each context span (chunk of text) given.
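Usage might look like the following sketch; per the parameter mapping used earlier in this notebook, the context is passed via the output param, and where the per-span scores live on the result object is an assumption:

```python
def relevance_with_spans(query: str, context: str):
    # Assumed usage: verbose=True asks the scorer to include one relevance
    # score per context span alongside the overall pass boolean.
    from weave.scorers import WeaveContextRelevanceScorer

    scorer = WeaveContextRelevanceScorer()
    return scorer.score(query=query, output=context, verbose=True)
```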
Return per-chunk scores for a longer text:
Hallucination
Hallucinated output:
Non-hallucinated output:
Adjusting the threshold - a lower threshold results in higher recall but lower precision.
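The trade-off can be illustrated in plain Python, assuming the scorer flags an output when its internal hallucination score exceeds the threshold (the direction of the comparison is an assumption, not the scorer's real internals):

```python
def flags(scores, threshold):
    """Flag every candidate whose score exceeds the threshold."""
    return [s > threshold for s in scores]

candidate_scores = [0.2, 0.45, 0.8]     # made-up hallucination scores
strict = flags(candidate_scores, 0.75)  # [False, False, True]  - fewer flags
loose = flags(candidate_scores, 0.35)   # [False, True, True]   - more flags
```

Lowering the threshold flags more outputs, catching more true hallucinations (recall) at the cost of more false alarms (precision).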
Testing a longer text that contains a hallucination - Edison's last words aren't mentioned in the query or context.
Bias/Stereotype
Toxicity
The model scores 5 different categories from 0 to 3. If the sum of these scores is above total_threshold (default 5) then the input will be flagged. If any single category has a score higher than category_threshold (default 2) then the input will also be flagged. We tuned these default values to decrease false positives and improve recall.
If you want more aggressive filtering, you can override the category_threshold or total_threshold parameter in the constructor:
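The flagging rule described above can be restated in plain Python for illustration (the real scorer computes the category scores with a local model):

```python
def is_flagged(category_scores, total_threshold=5, category_threshold=2):
    """Flag if the summed score exceeds total_threshold, or any single
    category exceeds category_threshold. Each category runs from 0 to 3."""
    return (sum(category_scores) > total_threshold
            or any(s > category_threshold for s in category_scores))

is_flagged([1, 1, 1, 1, 1])  # False: total of 5 is not above 5, no category > 2
is_flagged([0, 0, 3, 0, 0])  # True: one category exceeds 2

# More aggressive settings, passed to the real scorer's constructor (assumed):
# toxicity_scorer = WeaveToxicityScorer(total_threshold=4, category_threshold=1)
```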
Coherence
Incoherent output
Coherent output
Fluency
Low fluency
High fluency
Trustworthiness
The Trustworthiness scorer runs 5 scorers in parallel for an overall assessment of the query, context and output:
- 3 "critical" scorers: WeaveToxicityScorer, WeaveHallucinationScorer, WeaveContextRelevanceScorer
- 2 "advisory" scorers: WeaveCoherenceScorer, WeaveFluencyScorer
There are 2 issues with the following:
- irrelevant context
- hallucinated output
Personally Identifiable Information (PII)
The PresidioScorer uses Microsoft's Presidio library to detect and anonymize PII.
Parameters:
selected_entities: A list of entity types to detect in the text. If no value is passed, Presidio will try to detect all entity types in its default entities list
language: The language of the input text
custom_recognizers: A list of custom presidio recognizers of type presidio.EntityRecognizer
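Usage might look like the following sketch; the entity names come from Presidio's standard recognizers, and passing the text via the output param follows the pattern of the other scorers (an assumption):

```python
def detect_pii(text: str):
    # Restrict detection to two Presidio entity types; omit selected_entities
    # to scan for everything in Presidio's default list.
    from weave.scorers import PresidioScorer

    scorer = PresidioScorer(
        selected_entities=["EMAIL_ADDRESS", "PHONE_NUMBER"],
        language="en",
    )
    return scorer.score(output=text)
```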
Helper function to display results:
Run the scorer:
Running again, but now only detecting email addresses: