02 Finetune And Eval
Mistral and Weights & Biases: Finetune an LLM judge to detect hallucination
In this notebook you will learn how to trace your MistralAI API calls using W&B Weave, how to evaluate the performance of your models, and how to close the gap by leveraging MistralAI's fine-tuning capabilities.
In this notebook we will fine-tune a Mistral 7B model as an LLM judge. This idea comes from the amazing blog post from Eugene. The main goal is to fine-tune a small model like Mistral 7B to act as a hallucination judge. We will do this in two steps:
- Training on the Factual Inconsistency Benchmark (FIB), a challenging dataset, to improve the model's ability to detect hallucination by spotting inconsistencies between a piece of text and a "summary"
- Then mixing that dataset with a Wikipedia summaries dataset to increase performance even more.
- Weights & Biases: https://wandb.ai/
- Mistral finetuning docs: https://docs.mistral.ai/capabilities/finetuning/
- Tracing with W&B Weave: https://wandb.me/weave
Load some data
Let's import the relevant pieces
Some globals

We are going to map the labels to 0 and 1 for simplicity!
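As a minimal sketch, assuming the dataset's labels are the strings "consistent" and "inconsistent" (the exact label names here are illustrative), the mapping could look like:

```python
# Illustrative label mapping; the original string labels are an assumption.
LABEL_MAP = {"inconsistent": 0, "consistent": 1}

def map_label(example: dict) -> dict:
    # Usable as a datasets.Dataset.map-style function: replace the
    # string label with its integer counterpart.
    example["label"] = LABEL_MAP[example["label"]]
    return example
```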
You will probably integrate MistralAI API calls in your codebase by creating a function like the one below:
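A hedged sketch of such a wrapper is below. The exact client interface depends on your `mistralai` SDK version; here we only assume a chat method that takes a model name plus a list of role/content messages and returns an OpenAI-style response object, so `call_model` is illustrative rather than a verified call sequence.

```python
def build_messages(prompt: str) -> list[dict]:
    # Single-turn user message in the common role/content format.
    return [{"role": "user", "content": prompt}]

def call_model(client, model: str, prompt: str) -> str:
    # Assumed chat interface; adapt to your mistralai SDK version.
    response = client.chat(model=model, messages=build_messages(prompt))
    return response.choices[0].message.content
```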
Let's create a prompt that explains the task...
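The wording below is an illustrative stand-in for the notebook's actual prompt: it explains the task and constrains the judge to answer with a single digit.

```python
# Illustrative judge prompt (wording is an assumption, not the
# notebook's exact prompt). The model must answer 0 or 1.
PROMPT_TEMPLATE = """You are an expert at detecting factual inconsistencies.
Given a text and a summary, answer 1 if the summary is factually
consistent with the text and 0 otherwise. Answer with a single digit.

Text: {text}

Summary: {summary}

Answer:"""

def format_prompt(text: str, summary: str) -> str:
    return PROMPT_TEMPLATE.format(text=text, summary=summary)
```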
Eval
Let's evaluate the model on the validation split of the dataset
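At its core the evaluation is just accuracy over the validation labels; a bare-bones sketch is below (in the notebook this would typically run through W&B Weave's evaluation tooling rather than a hand-rolled loop).

```python
def accuracy(predictions: list[int], labels: list[int]) -> float:
    # Fraction of predictions matching the ground-truth labels.
    assert len(predictions) == len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)
```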
7B

Iterate a bit on the prompt...
Let's try adding the example from Eugene's blog:
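One simple way to do this is to prepend a worked (few-shot) example to the judge prompt. The example text below is a placeholder, not the actual example from Eugene's post:

```python
# Placeholder few-shot example; substitute the real one from the blog post.
FEW_SHOT_EXAMPLE = """Example:
Text: The cat sat on the mat.
Summary: The dog sat on the mat.
Answer: 0
"""

def with_few_shot(prompt: str) -> str:
    # Prepend the worked example so the model sees the expected format.
    return FEW_SHOT_EXAMPLE + "\n" + prompt
```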
This is a hard dataset!
Large

This model is considerably better! Over 80% accuracy is great on this hard task 😎
Fine-Tune FTW
Let's see if fine-tuning improves this.
You will need to format your prompts slightly differently for fine-tuning:
- instead of `ChatMessage`, use a `dict`
- add the output
You could use other fancy datasets or pandas, but this is a small dataset so let's not add more complexity...
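A sketch of that conversion, assuming the dict-based chat format with a user message and the label as the assistant message (per the Mistral fine-tuning docs), plus writing one JSON object per line:

```python
import json

def to_ft_sample(prompt: str, label: int) -> dict:
    # One training sample in dict-based chat format: the prompt as the
    # user turn, the expected 0/1 answer as the assistant turn.
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": str(label)},
        ]
    }

def write_jsonl(samples: list[dict], path: str) -> None:
    # One JSON object per line, as the fine-tuning API expects.
    with open(path, "w") as f:
        for s in samples:
            f.write(json.dumps(s) + "\n")
```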
Upload dataset
Create a fine-tuning job
Okay, now let's create a fine-tuning job with the Mistral API. Some things to know:
- You only have two parameters to play with: `training_steps` and `learning_rate`
- You can use `dry_run=True` to get a cost estimate
- `training_steps` is not linked to epochs in a direct way; there is a rule of thumb in the docs. If you do a dry run, the number of epochs will be calculated for you.
We want to run for 10 epochs to reproduce Eugene's results.
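The upload-then-launch flow can be sketched as below. Method and field names follow the Mistral fine-tuning docs but may differ across SDK versions, so treat `launch_finetuning` as a hedged sketch rather than a verified call sequence:

```python
def make_hyperparameters(training_steps: int, learning_rate: float) -> dict:
    # The only two knobs the fine-tuning API exposes.
    return {"training_steps": training_steps, "learning_rate": learning_rate}

def launch_finetuning(client, train_file_path: str, training_steps: int,
                      learning_rate: float, dry_run: bool = True):
    # Upload the JSONL training file, then create the job. With
    # dry_run=True the API returns a cost/epoch estimate instead of
    # actually training. (Call names are assumptions; check your SDK.)
    with open(train_file_path, "rb") as f:
        train_file = client.files.create(file=("train.jsonl", f))
    return client.jobs.create(
        model="open-mistral-7b",
        training_files=[train_file.id],
        hyperparameters=make_hyperparameters(training_steps, learning_rate),
        dry_run=dry_run,
    )
```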

Use a fine-tuned model
Let's compute the predictions using the fine-tuned 7B model
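Since the judge is expected to answer with a single digit, a small parsing helper keeps the prediction loop robust to stray whitespace (a sketch; the fine-tuned model name itself comes from the completed job object):

```python
def parse_judgment(text: str) -> int:
    # Extract the 0/1 verdict from the model's raw completion.
    text = text.strip()
    if text.startswith("1"):
        return 1
    if text.startswith("0"):
        return 0
    raise ValueError(f"unparseable judgment: {text!r}")
```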
Quite a substantial improvement! Some takeaways:
- Mistral 7B is a much more powerful model than the original BART that Eugene was using in his blog post
- With a relatively small, high-quality dataset the improvements on this downstream task are enormous!
- We can now leverage a faster and cheaper 7B instead of tapping into `mistral-large`. Of course, we could add some filtering logic to decide when to use the big gun anyway.
Pre-finetuning on USB to improve performance on FIB
The Unified Summarization Benchmark (USB) is made up of eight summarization tasks including abstractive summarization, evidence extraction, and factuality classification. While FIB documents are based on news, USB documents are based on a different domain—Wikipedia. Labels for factual consistency were created based on edits to summary sentences; inconsistent and consistent labels were assigned to the before and after versions respectively. Here’s the first sample in the dataset:
Check Eugene's analysis here
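The before/after labeling described above can be sketched as a small helper that turns one USB edit pair into two labeled samples (field names are illustrative):

```python
def usb_pair_to_samples(document: str, before: str, after: str) -> list[dict]:
    # Pre-edit summary sentence -> inconsistent (0);
    # post-edit (corrected) version -> consistent (1).
    return [
        {"text": document, "summary": before, "label": 0},
        {"text": document, "summary": after, "label": 1},
    ]
```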
Let's mix the USB dataset in the training data...
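A minimal sketch of the mixing step: concatenate the two sample lists and shuffle with a fixed seed so the fine-tuning batches see both domains in a reproducible order.

```python
import random

def mix_datasets(fib: list[dict], usb: list[dict], seed: int = 42) -> list[dict]:
    # Combine both training sets, then shuffle deterministically.
    mixed = fib + usb
    random.Random(seed).shuffle(mixed)
    return mixed
```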
Final results
The fine-tuned model over USB + FIB is now 90%+ accurate!

