04 Human-Like RAG Evaluation
RAG Human-Like Evaluation - LLM-as-a-Judge
This notebook demonstrates how to use a high-quality LLM to generate human-like evaluation scores for the final outputs of a RAG system.
The notebook uses the Llama 3 70B Instruct model to evaluate the example RAG pipeline. Scores range from 1 to 5, where:
- Score 1: Answer is invalid or irrelevant; it does not follow the context of the question
- Score 2: Answer is barely usable, missing significant accurate information
- Score 3: Answer is mostly helpful, but misses some information or adds erroneous information
- Score 4: Answer is helpful with room for some improvement; it could be more concise
- Score 5: Answer is helpful, accurate, relevant, and concise
Step 1: Load the Data
Let's first load the JSON dataset. Each record should have the following structure:
{
    'gt_context': chunk,
    'document': filename,
    'question': "xxxxx",
    'gt_answer': "xxx xxx xxxx",
    'contexts': "xxx xxx xxxx",
    'answer': "xxx xxx xxxx",
}
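A minimal loading sketch, assuming the dataset is stored as a JSON list of such records in a file named eval_data.json (the file name is an assumption):

```python
import json

# Hypothetical path to the RAG results file; adjust to your dataset location.
DATA_PATH = "eval_data.json"

with open(DATA_PATH, "r", encoding="utf-8") as f:
    rag_results = json.load(f)  # expected to be a list of dicts with the keys shown above

print(f"Loaded {len(rag_results)} question/answer records")
```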
Populate your NVIDIA API key as the bearer token in the following cell.
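A minimal setup sketch, assuming the key is read from an environment variable and the judge is called through the OpenAI-compatible NVIDIA API Catalog chat endpoint via the requests library (the endpoint URL and model identifier below are assumptions to verify against your account):

```python
import os
import requests

# NVIDIA API key supplied as the bearer token (assumption: stored in an environment variable).
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

INVOKE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # OpenAI-compatible endpoint
JUDGE_MODEL = "meta/llama3-70b-instruct"

HEADERS = {
    "Authorization": f"Bearer {NVIDIA_API_KEY}",
    "Accept": "application/json",
}
```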
Step 2: Design the LLM-as-a-Judge Prompt
The evaluation axes are helpfulness, relevance, accuracy, and level of detail. Prompting a high-quality LLM to generate human-like evaluations requires careful prompt engineering with explicit instructions.
We must provide the evaluation criteria and methodology in the same way we would instruct a human evaluator. We also ask the LLM to consider both the reference answer and the context (ground truth) when evaluating the response produced by the RAG pipeline. Finally, we ask the LLM to provide a score on a 1-5 (Likert) scale and to give an explanation for that score.
Here is an example of the judge_template that we will use with Llama 3 70B Instruct. Notice the evaluation examples provided in the prompt; they help guide the LLM.
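For illustration only, a simplified variant of such a template might look like the following (the exact wording and few-shot examples in the notebook's judge_template may differ; the placeholders {question}, {gt_answer}, {contexts}, and {answer} are assumptions matching the dataset keys):

```python
judge_template = """You are an expert evaluator of question-answering systems.
Given a question, a reference (ground-truth) answer, the retrieved context, and a candidate answer,
rate the candidate answer for helpfulness, relevance, accuracy, and level of detail on a scale of 1 to 5:
1 - invalid or irrelevant, 2 - barely usable, 3 - mostly helpful but missing or adding information,
4 - helpful with minor room for improvement, 5 - helpful, accurate, relevant, and concise.

Respond in exactly this format:
Rating: <1-5>
Explanation: <one or two sentences justifying the rating>

Example:
Question: What color is the sky on a clear day?
Reference answer: The sky appears blue on a clear day.
Candidate answer: The sky is green.
Rating: 1
Explanation: The candidate answer contradicts the reference answer and is factually wrong.

Question: {question}
Reference answer: {gt_answer}
Context: {contexts}
Candidate answer: {answer}
"""
```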
Now call the judge LLM on the RAG results.
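A sketch of the evaluation loop, reusing the rag_results, judge_template, HEADERS, INVOKE_URL, and JUDGE_MODEL names assumed in the earlier sketches:

```python
judge_responses = []

for record in rag_results:
    prompt = judge_template.format(
        question=record["question"],
        gt_answer=record["gt_answer"],
        contexts=record["contexts"],
        answer=record["answer"],
    )
    payload = {
        "model": JUDGE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
        "max_tokens": 512,
    }
    response = requests.post(INVOKE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    judge_responses.append(response.json()["choices"][0]["message"]["content"])
```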
Parse the rating and evaluations from the judge responses.
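A parsing sketch, assuming the judge follows the `Rating:` / `Explanation:` format requested in the template; responses that don't match fall back to empty fields:

```python
import re

parsed = []
for record, reply in zip(rag_results, judge_responses):
    rating_match = re.search(r"Rating:\s*([1-5])", reply)
    explanation_match = re.search(r"Explanation:\s*(.+)", reply, re.DOTALL)
    parsed.append({
        "question": record["question"],
        "answer": record["answer"],
        "rating": int(rating_match.group(1)) if rating_match else None,
        "explanation": explanation_match.group(1).strip() if explanation_match else "",
    })
```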
Let's take a peek at the results!
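For example, loading the parsed results into a pandas DataFrame makes them easy to inspect (the use of pandas here is an assumption):

```python
import pandas as pd

results_df = pd.DataFrame(parsed)
results_df.head()
```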
Now let's calculate the mean Likert score and then display a histogram of all the ratings.
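A plotting sketch, assuming the results_df DataFrame from the previous step and matplotlib for the histogram:

```python
import matplotlib.pyplot as plt

# Drop records whose judge response could not be parsed before computing statistics.
ratings = results_df["rating"].dropna()
print(f"Mean Likert score: {ratings.mean():.2f}")

ratings.plot.hist(bins=[1, 2, 3, 4, 5, 6], rwidth=0.9)
plt.xlabel("LLM judge rating")
plt.ylabel("Count")
plt.show()
```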
Lastly, write your evaluation results to a CSV file so you can examine them in more detail later.
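For example, with the results_df DataFrame above (the output file name is a hypothetical choice):

```python
results_df.to_csv("llm_judge_evaluation_results.csv", index=False)
```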
Be aware that a few of the LLM judge's evaluation responses might be malformed and therefore unparseable. In these cases, the rating and explanation fields are left empty.
Bonus! A good practice for improving a RAG pipeline is to look at the responses that were rated poorly and then determine actions to improve them.
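For instance, a quick way to surface the weakest answers from results_df (the threshold of 2 is an arbitrary choice):

```python
# Inspect the lowest-rated answers and the judge's explanations to guide pipeline fixes.
low_rated = results_df[results_df["rating"] <= 2]
low_rated[["question", "answer", "explanation"]]
```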