04 Human-Like RAG Evaluation
RAG Human-Like Evaluation - LLM-as-a-Judge
This notebook demonstrates how to use a high-quality LLM to generate human-like evaluation scores for the final outputs of a RAG system.
The notebook uses the Llama 3 70B Instruct model to evaluate the example RAG pipeline. Scores range from 1 to 5, where:
- Score 1: Answer is invalid or irrelevant; it does not follow the context of the question
- Score 2: Answer is barely usable, missing significant accurate information
- Score 3: Answer is mostly helpful, but misses some information or adds erroneous information
- Score 4: Answer is helpful with room for some improvement; it could be more concise
- Score 5: Answer is helpful, accurate, relevant, and concise
Step 1: Load the Data
Let's first load the JSON dataset. Each record should have the following structure:
{
    'gt_context': chunk,
    'document': filename,
    'question': "xxxxx",
    'gt_answer': "xxx xxx xxxx",
    'contexts': "xxx xxx xxxx",
    'answer': "xxx xxx xxxx",
}
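A minimal loading sketch, assuming the dataset is stored as a JSON list of such records in a file named eval_data.json (the file name is an assumption):

```python
import json

# Hypothetical path to the RAG results file; adjust to your dataset location.
DATA_PATH = "eval_data.json"

with open(DATA_PATH, "r", encoding="utf-8") as f:
    rag_results = json.load(f)  # expected to be a list of dicts with the keys shown above

print(f"Loaded {len(rag_results)} question/answer records")
```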
Populate your NVIDIA API key as the bearer token in the following cell.
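A minimal setup sketch, assuming the key is read from an environment variable and the judge is called through the OpenAI-compatible NVIDIA API Catalog chat endpoint via the requests library (the endpoint URL and model identifier below are assumptions to verify against your account):

```python
import os
import requests

# NVIDIA API key supplied as the bearer token (assumption: stored in an environment variable).
NVIDIA_API_KEY = os.environ["NVIDIA_API_KEY"]

INVOKE_URL = "https://integrate.api.nvidia.com/v1/chat/completions"  # OpenAI-compatible endpoint
JUDGE_MODEL = "meta/llama3-70b-instruct"

HEADERS = {
    "Authorization": f"Bearer {NVIDIA_API_KEY}",
    "Accept": "application/json",
}
```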
Step 2: Design the LLM-as-a-Judge Prompt
The evaluation axes are helpfulness, relevance, accuracy, and level of detail. Prompting a high-quality LLM to generate human-like evaluations requires careful prompt engineering with explicit instructions.
We must provide the evaluation criteria and methodology in the same way we would instruct a human evaluator. We also ask the LLM to consider both the reference answer and the context (ground truth) when evaluating the response produced by the RAG pipeline. Finally, we ask the LLM to provide a score on a 1-5 (Likert) scale and to give an explanation for that score.
Here is an example of the judge_template that we will use with Llama 3 70B Instruct. Notice the evaluation examples provided in the prompt; they help guide the LLM.
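For illustration only, a simplified variant of such a template might look like the following (the exact wording and few-shot examples in the notebook's judge_template may differ; the placeholders {question}, {gt_answer}, {contexts}, and {answer} are assumptions matching the dataset keys):

```python
judge_template = """You are an expert evaluator of question-answering systems.
Given a question, a reference (ground-truth) answer, the retrieved context, and a candidate answer,
rate the candidate answer for helpfulness, relevance, accuracy, and level of detail on a scale of 1 to 5:
1 - invalid or irrelevant, 2 - barely usable, 3 - mostly helpful but missing or adding information,
4 - helpful with minor room for improvement, 5 - helpful, accurate, relevant, and concise.

Respond in exactly this format:
Rating: <1-5>
Explanation: <one or two sentences justifying the rating>

Example:
Question: What color is the sky on a clear day?
Reference answer: The sky appears blue on a clear day.
Candidate answer: The sky is green.
Rating: 1
Explanation: The candidate answer contradicts the reference answer and is factually wrong.

Question: {question}
Reference answer: {gt_answer}
Context: {contexts}
Candidate answer: {answer}
"""
```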
Now call the judge LLM on the RAG results.
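A sketch of the evaluation loop, reusing the rag_results, judge_template, HEADERS, INVOKE_URL, and JUDGE_MODEL names assumed in the earlier sketches:

```python
judge_responses = []

for record in rag_results:
    prompt = judge_template.format(
        question=record["question"],
        gt_answer=record["gt_answer"],
        contexts=record["contexts"],
        answer=record["answer"],
    )
    payload = {
        "model": JUDGE_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # deterministic judging
        "max_tokens": 512,
    }
    response = requests.post(INVOKE_URL, headers=HEADERS, json=payload)
    response.raise_for_status()
    judge_responses.append(response.json()["choices"][0]["message"]["content"])
```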
Parse the rating and evaluations from the judge responses.
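A parsing sketch, assuming the judge follows the `Rating:` / `Explanation:` format requested in the template; responses that don't match fall back to empty fields:

```python
import re

parsed = []
for record, reply in zip(rag_results, judge_responses):
    rating_match = re.search(r"Rating:\s*([1-5])", reply)
    explanation_match = re.search(r"Explanation:\s*(.+)", reply, re.DOTALL)
    parsed.append({
        "question": record["question"],
        "answer": record["answer"],
        "rating": int(rating_match.group(1)) if rating_match else None,
        "explanation": explanation_match.group(1).strip() if explanation_match else "",
    })
```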
Let's take a peek at the results!
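For example, loading the parsed results into a pandas DataFrame makes them easy to inspect (the use of pandas here is an assumption):

```python
import pandas as pd

results_df = pd.DataFrame(parsed)
results_df.head()
```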
Now let's calculate the mean Likert score and then display a histogram of all the ratings.
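A plotting sketch, assuming the results_df DataFrame from the previous step and matplotlib for the histogram:

```python
import matplotlib.pyplot as plt

# Drop records whose judge response could not be parsed before computing statistics.
ratings = results_df["rating"].dropna()
print(f"Mean Likert score: {ratings.mean():.2f}")

ratings.plot.hist(bins=[1, 2, 3, 4, 5, 6], rwidth=0.9)
plt.xlabel("LLM judge rating")
plt.ylabel("Count")
plt.show()
```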
Lastly, write your evaluation results to a CSV file so you can examine them in more detail later.
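For example, with the results_df DataFrame above (the output file name is a hypothetical choice):

```python
results_df.to_csv("llm_judge_evaluation_results.csv", index=False)
```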
Be aware that a few of the LLM judge's evaluation responses might be malformed and therefore unparseable. In these cases, the rating and explanation fields are left empty.
Bonus! A good practice for improving a RAG pipeline is to look at the responses that were rated poorly and then determine actions to improve them.
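For instance, a quick way to surface the weakest answers from results_df (the threshold of 2 is an arbitrary choice):

```python
# Inspect the lowest-rated answers and the judge's explanations to guide pipeline fixes.
low_rated = results_df[results_df["rating"] <= 2]
low_rated[["question", "answer", "explanation"]]
```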