deepset Prometheus2 Evaluation

RAG Evaluation with Prometheus 2

Evaluating the responses of Language Models and LLM-based applications often involves using model-based metrics that do not require ground truth labels. Large proprietary models like GPT-4 and Claude 3 Opus are frequently employed as evaluators and demonstrate a good correlation with human evaluations.

However, relying on closed models poses several challenges:

- fairness: the training data of these models is unknown.
- controllability: the behavior of these models can change unpredictably.
- data privacy: sending data to external providers may raise privacy concerns.
- affordability: using these powerful models can be expensive.

Using open models for evaluation is an active research area, but their practical use is often limited. They typically do not correlate well with human judgments and lack flexibility.

🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:

- two variants, fine-tuned from Mistral-7B and Mixtral-8x7B respectively
- trained on open-source data
- high correlation with human evaluations and proprietary models
- highly flexible: capable of performing direct assessments and pairwise rankings, and allowing the definition of custom evaluation criteria.

In this experimental notebook, we will use Prometheus 2 to evaluate the responses of a RAG pipeline.

First, we will build the RAG pipeline and collect some results. Then, we will code a custom Prometheus Evaluator component for Haystack. Finally, we will initialize three different evaluators and run them in an evaluation pipeline.

Create the RAG pipeline to evaluate

We want to use Prometheus 2 to evaluate the answers generated by a RAG pipeline, so we first need to build that pipeline.

This part is quite similar to the "Evaluating RAG Pipelines" tutorial. Take a look at it for more details.

If you want, you can simply read this section. We will provide the generated data for later evaluation steps.

[ ]

We will be using a labeled PubMed dataset with questions, contexts and answers. This allows us to use the contexts as Documents and provides the necessary labeled data for some of the evaluation metrics we will define.

In this example, we will use the first 100 rows.

First, let's fetch the dataset and extract all_documents, all_questions and all_ground_truth_answers.

[ ]

Indexing pipeline

Next, let’s build a simple indexing pipeline and write the documents into a Document Store.

[ ]

{'document_writer': {'documents_written': 100}}

RAG pipeline

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we'll be using:

- InMemoryEmbeddingRetriever to retrieve the relevant documents for the query.
- HuggingFaceLocalGenerator with google/gemma-1.1-2b-it to generate answers to queries. It is a small model, and later we will evaluate the quality of the generated responses based on custom criteria.

[ ]

<haystack.core.pipeline.pipeline.Pipeline object at 0x7b1b5c30bdc0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: HuggingFaceLocalGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
  - generator.replies -> answer_builder.replies (List[str])

You can try the RAG pipeline by asking a question:

[ ]

 **Yes.**

The study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.

[ ]

{'answers': [GeneratedAnswer(data=' **Yes.**\n\nThe study found that high levels of procalcitonin (PCT) on postoperative day (POD) 2 were associated with worse outcomes, including higher International Normalized Ratio values, primary graft non-function, longer stay in the pediatric intensive care unit, and mechanical ventilation.', query='Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?', documents=[Document(id=9928bb3fd5bfd294a30717df6f590301c0f7c82f65fec5ff9ae7a00ac4956571, content: 'To date, no data is available about procalcitonin (PCT) levels and its relevance to morbidity and gr...', score: 0.7619273394960041), Document(id=2f1be411b8673646b72551e57af84872e39a788c3602c9b22af2ae901eda0da4, content: 'Intrahepatic cholestasis of pregnancy (ICP) is defined by pruritus, elevated total fasting serum bil...', score: 0.4159278001751194), Document(id=b112787486a85ff8086de3f2562d80497bc4cc76bc9d8cf9d3d5b3ee3b663975, content: 'Most hepatocellular carcinomas (HCCs) are associated with cirrhosis. Portal hypertension (PHT) and e...', score: 0.34273266043157447)], meta={})]}

Run the RAG pipeline and save results

Let's run our RAG pipeline with a set of questions, and make sure to save the data we need for evaluation: questions, ground truth answers and generated answers.

In this example, we will use 10 random questions.
In the evaluation part, we will not evaluate the retrieved context, so we will not save it. However, you can choose to consider context in the evaluation: as we will see later, evaluation with Prometheus is very customizable.
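
As a rough illustration of this step, the sketch below samples 10 aligned question/answer pairs and serializes the data needed for evaluation. It is not the notebook's own cell: the placeholder lists, the `sample_pairs` helper, the fixed seed and the dictionary keys are all assumptions made for the example, and the generated answers are stand-ins for real pipeline outputs.

```python
import json
import random

# Hypothetical placeholder data; in the notebook these lists come from the PubMed dataset.
all_questions = [f"hypothetical question {i}" for i in range(100)]
all_ground_truth_answers = [f"hypothetical answer {i}" for i in range(100)]


def sample_pairs(questions, answers, k=10, seed=42):
    """Pick k random (question, ground-truth answer) pairs, keeping them aligned."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    indices = rng.sample(range(len(questions)), k)
    return [questions[i] for i in indices], [answers[i] for i in indices]


questions, ground_truth_answers = sample_pairs(all_questions, all_ground_truth_answers)

# Stand-in for the generated answers; in practice each one would come from
# running the RAG pipeline on the corresponding question.
rag_answers = ["<generated answer>" for _ in questions]

# Key names are assumptions chosen for this sketch.
results = {
    "questions": questions,
    "ground_truth_answers": ground_truth_answers,
    "rag_answers": rag_answers,
}
serialized = json.dumps(results, indent=2)
```

Sampling indices rather than the lists themselves keeps each question paired with its ground truth answer, which the reference-based evaluators rely on.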

[ ]

Evaluation with Prometheus 2

After the preparation work, we can use Prometheus 2 to evaluate the responses generated along several desired axes.

This model expects a prompt like the one below and returns a text containing feedback and a score.

	###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:
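
To make the placeholders concrete, here is a minimal sketch of filling such a prompt with plain Python string formatting. The template is an abbreviated copy of the one above (the Task Description block is left out for brevity), `build_prompt` is a hypothetical helper, and the reference answer, criteria and score descriptions are illustrative placeholders rather than rubrics from this notebook.

```python
# Abbreviated copy of the prompt above: the Task Description block is omitted.
PROMPT_TEMPLATE = """###The instruction to evaluate:
{orig_instruction}

###Response to evaluate:
{orig_response}

###Reference Answer (Score 5):
{orig_reference_answer}

###Score Rubrics:
[{orig_criteria}]
Score 1: {orig_score1_description}
Score 2: {orig_score2_description}
Score 3: {orig_score3_description}
Score 4: {orig_score4_description}
Score 5: {orig_score5_description}

###Feedback:"""


def build_prompt(instruction, response, reference_answer, criteria, score_descriptions):
    """Fill the template; score_descriptions must hold five strings, for scores 1 to 5."""
    if len(score_descriptions) != 5:
        raise ValueError("expected exactly five score descriptions")
    # Map each description to its placeholder name, e.g. orig_score1_description.
    fields = {
        f"orig_score{i}_description": text
        for i, text in enumerate(score_descriptions, start=1)
    }
    return PROMPT_TEMPLATE.format(
        orig_instruction=instruction,
        orig_response=response,
        orig_reference_answer=reference_answer,
        orig_criteria=criteria,
        **fields,
    )


prompt = build_prompt(
    instruction=(
        "Do high levels of procalcitonin in the early phase after pediatric "
        "liver transplantation indicate poor postoperative outcome?"
    ),
    response="Yes.",
    reference_answer="<ground truth answer from the dataset>",  # placeholder
    criteria="Is the response logically organized and easy to follow?",  # illustrative
    score_descriptions=[f"<description for score {i}>" for i in range(1, 6)],  # placeholders
)
```

The resulting string ends with `###Feedback:`, so the model continues from there with its feedback and the `[RESULT]` score.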

Create a Prometheus Evaluator component

To perform evaluation, we create a custom Haystack Evaluator component. In Haystack, it is easy to create custom components, and we can implement Prometheus Evaluator with just a few lines of code.

Design choices

Our implementation is hacky and aimed at experimentation, but some choices are worth explaining.

- The component is inspired by our LLMEvaluator and extends it, with specific adaptations for Prometheus.
- init parameters:
  - template: Prometheus is highly customizable, so we can easily create different evaluators with different prompt templates.
  - inputs: the inputs that the evaluator expects and evaluates. They should match those defined in the template.
  - generator: (hacky) allows passing different types of Haystack generators to use the Prometheus model. Examples: HuggingFaceLocalGenerator, LlamaCPPGenerator, etc.
- run method: for each example to evaluate, the inputs are integrated into the prompt and passed to the model; then the model output is parsed to extract the score and the feedback. This method returns a dictionary containing an aggregate score, individual_scores and feedbacks.
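
The parsing and aggregation part of the run method can be sketched with the standard library alone. This is a hypothetical stand-in, not the component's actual code: the regular expression, the `parse_output` and `aggregate` helpers, and the choice to record a missing score as `None` are assumptions for illustration.

```python
import re
from statistics import mean

# Matches "[RESULT] 4" as well as "[RESULT] (4)".
RESULT_PATTERN = re.compile(r"\[RESULT\]\s*\(?([1-5])\)?")


def parse_output(text):
    """Split a raw model output into (feedback, score); score is None if no [RESULT] marker is found."""
    match = RESULT_PATTERN.search(text)
    if match is None:
        return text.strip(), None
    feedback = text[: match.start()].strip()
    # Drop the optional "Feedback:" prefix requested by the output format.
    if feedback.startswith("Feedback:"):
        feedback = feedback[len("Feedback:"):].strip()
    return feedback, int(match.group(1))


def aggregate(outputs):
    """Build a result dictionary with an aggregate score, individual_scores and feedbacks."""
    parsed = [parse_output(output) for output in outputs]
    scores = [score for _, score in parsed]
    valid = [score for score in scores if score is not None]
    return {
        "score": float(mean(valid)) if valid else None,  # mean over parsable outputs only
        "individual_scores": scores,
        "feedbacks": [feedback for feedback, _ in parsed],
    }


example = aggregate([
    "Feedback: Clear and logical. [RESULT] 5",
    "The response is disorganized. [RESULT] 1",
])
```

Keeping unparsable outputs as `None` instead of raising makes it easier to spot examples where the model ignored the expected output format.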

[ ]

Load the Prometheus 2 model

We are going to use prometheus-7b-v2.0: the smallest variant of Prometheus 2, which can run on a standard Colab notebook with 8-bit quantization.

In particular, we will use the model via HuggingFaceLocalGenerator, based on the Transformers library.

The generation_kwargs simply replicate those used in the prometheus-eval library. For practical applications, it would be worth experimenting and seeing if there is a better combination of parameters that provides good evaluation performance and reproducibility.

As mentioned earlier, there are several other options for running this open model with Haystack:

- resource-constrained environments: LlamaCPPGenerator (can run on CPU-only environments thanks to the GGUF quantized format; example commented below)
- production, with available GPU resources: TGI (via HuggingFaceAPIGenerator) or vLLM.

[ ]

[ ]

Initialize different Prometheus Evaluators

We will define 3 prompt templates and corresponding Prometheus Evaluators:

- Correctness: Evaluates the generated answer considering both relevance to the question and similarity to the ground truth answer.
- Response Relevance: Evaluates the generated answer in terms of its relevance to the user's question.
- Logical Robustness: Evaluates the logical organization and progression of the response.

As shown, by customizing the prompt template, a diverse range of evaluators can be created.

In general, the first section (Task Description) should be left intact. The only aspect to change, as illustrated in the following examples, is whether or not to use a reference answer.
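
One way to picture the reference-answer switch is the sketch below, which assembles the variable part of a template with or without the reference section and lists the inputs each variant expects. It uses `str.format`-style placeholders and omits the Task Description and rubric sections for brevity; `make_template` and `template_inputs` are hypothetical helpers, not part of Haystack or of this notebook.

```python
from string import Formatter

INSTRUCTION_SECTION = "###The instruction to evaluate:\n{orig_instruction}\n\n"
RESPONSE_SECTION = "###Response to evaluate:\n{orig_response}\n\n"
REFERENCE_SECTION = "###Reference Answer (Score 5):\n{orig_reference_answer}\n\n"


def make_template(use_reference):
    """Assemble the variable sections, optionally including the reference answer."""
    parts = [INSTRUCTION_SECTION, RESPONSE_SECTION]
    if use_reference:
        parts.append(REFERENCE_SECTION)
    parts.append("###Feedback:")
    return "".join(parts)


def template_inputs(template):
    """List the placeholder names found in a str.format-style template, in order."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


with_reference = make_template(use_reference=True)
without_reference = make_template(use_reference=False)

# The inputs declared for an evaluator should match the placeholders of its template.
inputs_with_reference = template_inputs(with_reference)
inputs_without_reference = template_inputs(without_reference)
```

Deriving the input names from the template is a simple guard against declaring an input that the prompt never uses, or the reverse.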

⚠️ Although these evaluator names may be similar to evaluation metrics used in Haystack or other libraries, it is important to understand that they are created specifically for Prometheus and produce scores between 1 and 5. They are not comparable to conceptually similar but differently defined metrics.

[ ]

Let's try the logical_robustness_evaluator:

[ ]

{'score': 3.0,
 'individual_scores': [5, 1],
 'feedbacks': ["The generated response is well-organized and presents a clear progression of ideas. It starts by establishing a link between ILC2s and CRSwNP, then describes the role of ILC2s in nasal polyps formation and eosinophilia. The response then draws a conclusion about the pathogenesis of CRS, which is a coherent and logical flow of information. Each sentence builds on the previous, ensuring that the reader is able to follow the argument without confusion. The response maintains a consistent structure and makes smooth transitions between the different points, making it easy to follow. The logical flow and seamless transitions indicate a high level of organization, which aligns well with the score rubric's criteria for a score of 5. Therefore, the response is of high quality in terms of logical organization.",
  'The response provided does not follow the logical structure expected as per the score rubric. There is a lack of clear organization and progression of ideas. The statement is abrupt and does not flow into a logical argument or question, making it difficult to follow the reasoning behind it. It fails to establish a connection between poor sleep, symptoms of depression, and disability retirement due to depression, which is the main focus of the question. The lack of a clear progression of ideas and arguments, and the absence of smooth transitions, makes it challenging to follow the response. Thus, the response fails to meet the criteria for a well-organized and logically flowing answer. Therefore, based on the score rubric, the response is disorganized and lacks a clear structure, making it difficult to follow. So the overall score is 1.']}

Ok, nice!

Evaluation pipeline

We can now add our evaluators to an Evaluation pipeline and run the pipeline with our RAG results.

[ ]

Let's download the RAG results. If you have run the RAG pipeline, you can skip the next cell.

[ ]

Evaluation results

Once we've run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an EvaluationRunResult which we can use to display a score_report.

[ ]

In general, in our small sample, Gemma-1.1-2b-it seems to generate relevant answers, but the responses are different from ground truth answers and the logical organization is not optimal.

Let's inspect the specific metrics in a dataframe.

[ ]

Since Prometheus provides feedback for each evaluation, it can be interesting to take a look at it.

[ ]

['The generated answer, while accurate, does not exhibit a strong logical organization. It simply states the conclusion without a detailed explanation of the underlying data or the process that led to this conclusion. Furthermore, there are no transition phrases or linking sentences that would guide the reader from one point to the next, making it hard to follow the progression of ideas.\n\nDespite the absence of transition phrases or linking sentences, the answer maintains a certain degree of coherence, but this coherence could be greatly improved by providing more context or by elaborating on the reasons behind the observed relationship between cDK1 and CDK2 activity and renal cell carcinoma recurrence. For example, it could explain why a lower CDK2SA-CDK1SA ratio is associated with better survival outcomes.\n\nTherefore, although the response contains the necessary information, it lacks the clear progression of ideas and arguments that would make it easy to follow. In contrast, a response with excellent organization would include detailed explanations, smoothly transitioning from one point to the next, and a clear progression of ideas. The absence of these elements in the response means that it falls short of the expected standard of logical organization and flow. \n\nSo the overall score is 2.',
, "This response provides a concise answer to the question, effectively stating that metabolic control analysis identified tryparedoxin as a suitable drug target. It succinctly describes the pathway's regulation by the redox pairs TXN-TXNPx and TXN-GPxA, which demonstrates the clear flow of information and aligns well with the expected logical structure of the response.\n\nHowever, while this response is accurate and follows a logical progression, it lacks the detail found in more elaborate answers. For instance, it does not explicitly mention the percentage of pathway flux controlled by these redox pairs, which could have added more depth to the answer. Moreover, the explanation could be further refined to improve the clarity of the connections between the different components.\n\nDespite these minor drawbacks, the response maintains a well-organized structure and smooth transitions, making it easy to follow. The information is presented in a logical sequence, which helps to enhance the overall coherence of the answer.\n\nIn light of the criteria outlined in the score rubric, the response fulfills the expectations for a score of 4. It presents the information in a logical, coherent, and well-structured manner, although there is room for improvement in terms of detail and connection clarity.\n\nSo the overall score is 4.",
, 'This response is disorganized and lacks clear structure. It does not provide any details or reasoning behind its claim. The transition from presenting the study to confirming the link between the promoter variant and schizophrenia is abrupt and lacks any logical flow. The reader is left without any explanation or understanding of how the study reached its conclusion, making it difficult to follow. This failure to elaborate or substantiate the claim results in a response that does not meet the required standards for logical organization. Thus, it can be concluded that this response falls short in fulfilling the criteria outlined in the score rubric.',
, "This response succinctly affirms the question, with a clear structure that logically follows from the contextual information provided. The connection between the increase in PP release and the subsequent inhibition of glucagon release is presented in a logical sequence that's easy to understand. There are no abrupt transitions or unclear connections in the response, ensuring a smooth flow from one point to another. This response effectively demonstrates a coherent and seamless logical progression of ideas. As per the scoring rubric, it shows that the answer is not only well-organized but also has clear and smooth transitions. Therefore, it adheres to the criteria of being easy to follow and exhibiting a flawless logical flow, hence it is awarded a score of 5.",
, 'The response provided has shown an excellent logical flow, which aligns with the requirements of the score rubric. The answer directly addresses the question, presenting a clear and well-structured argument. It starts with an affirmation of the initial question, then elaborates on the process of tetraploid complementation, explaining the implications in terms of the pluripotency of the induced pluripotent stem cells. The transition from the premise to the conclusion is seamless, making it easy for the reader to follow the logic. There are no abrupt transitions or disorganized elements in the response, which further contributes to its overall clarity and coherence. So, the response fully meets the criteria of a score 5, as it is excellently organized with flawless logical flow and seamless transitions.',
, 'The generated answer demonstrates an excellent logical flow and seamless transitions between the information provided, which aligns with the highest score of the rubric. It effectively establishes the connection between osteoprotegerin (OPG) and subclinical left ventricular systolic dysfunction in diabetic hypertensive patients. The response succinctly presents the results of the speckle tracking study and clearly defines how OPG is an independent predictor of impaired GLS. The conclusion drawn from the receiver operating characteristic curve analysis reinforces the connection between OPG levels and the identification of patients with GLS ≤ 18.5. The organization of the response is logical and clear, making it easy for readers to follow the line of reasoning from the introduction to the conclusion. Therefore, according to the score rubric, the response is well-structured and offers an in-depth and coherent understanding of the topic. So the overall score is 5.',
, 'When evaluating the organization of the response, the primary concern is the clarity and smoothness of the progression of ideas. In this case, the answer is well-structured with a clear statement followed by supporting evidence from the study. The transition from stating the conclusion ("Yes, CD30 expression is a novel prognostic indicator") to presenting the study\'s findings is smooth and logical.\n\nHowever, there is room for improvement in terms of providing more context to the initial statement. By mentioning what the study found in relation to the 5-year OS and PFS, the answer could have provided a more thorough explanation that directly relates to the question. The connection between the initial statement and the supporting evidence is clear but could benefit from a more explicit explanation.\n\nDespite these minor areas for improvement, the response does a good job at presenting the argument in a coherent manner. Therefore, according to the score rubric, which emphasizes the clear progression of ideas and arguments, this response meets the requirements for a score of 4. The overall structure is sound, but a slightly more detailed presentation of the evidence would have elevated it to a perfect score.',
, "Upon reviewing the generated response, it is evident that there is a lack of content that directly addresses the posed question. The response fails to provide any argument or information related to the relationship between mild cognitive dysfunction and diabetes mellitus control in minority elderly adults. The text's structure is disorganized, as it merely states the inability to answer, without any attempt to explore the question or provide a logical flow of information. This makes it very difficult for the reader to follow or understand the content. Consequently, according to the score rubric, this response would be evaluated as having a score of 1, as it is disorganized, lacks clear structure, and is difficult to follow.",
, 'This response presents a straightforward statement that walks through the central point in a linear fashion. The progression of ideas is logical and easy to follow, as it moves from indicating a difference in rates to specifying the relationship between these rates and the PEDS score. However, the response does not provide the depth of analysis that could have made the argument more robust. For example, it does not delve into why this significant difference might exist or consider any potential variables that could affect these rates. The logical flow and clarity of the response meet the requirements of a score of 4, but it falls short of achieving a score of 5 due to the absence of more detailed explanations or comparisons. Therefore, while the response is generally organized, it could benefit from further elaboration and a more comprehensive analysis of the data. So the overall score is 4.',
, "The generated response presents a clear and structured argument, aligning with the scoring rubric's criteria for a score of 4. The response successfully establishes the presence of enteroviruses in CSF samples, acknowledging the need for more research to definitively link these viruses to neurological impairments. The argument flows logically, from acknowledging the initial findings to suggesting the necessity of additional studies. This structure, along with the smooth transitions between ideas, facilitates easy comprehension, which is a critical aspect as per the score rubric. However, the response could be further enhanced by providing a bit more context or detail about the research process or the specific types of neurological impairments associated with the virus, which might elevate it to a score of 5. Nevertheless, the response does not present abrupt transitions, nor does it contain unclear connections, which are key factors negatively impacting the scoring."]