Using Phi-3 as a relevance judge
In this notebook we will use Phi-3 as a relevance judge between a query and a document.
Requirements
For this notebook, you will need a working Python environment (Python 3.10.x or later) and some Python dependencies:
- `torch`, to use PyTorch as our backend
- Hugging Face's `transformers` library
- `accelerate` and `bitsandbytes` for quantization support (a GPU is required to enable quantization)
- `scikit-learn` for metrics computation
- `pandas` for generic data handling
Installing packages
Let's start by installing the necessary Python libraries (preferably in a virtual environment)
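The packages above are available under their usual PyPI names, so a plain `pip install` is enough (pin versions as needed for your setup):

```shell
# Install the dependencies, ideally inside a virtual environment
pip install torch transformers accelerate bitsandbytes scikit-learn pandas
```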
Implementation
First, the necessary imports:
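A minimal set of imports covering the dependencies listed above might look like this (the exact selection depends on the code in the rest of the notebook):

```python
import re
from collections import Counter

import pandas as pd
import torch
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,  # 4-bit quantization configuration
    pipeline,
)
```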
Now, let's create a class that will be responsible for loading the Phi-3 model and performing inference on its inputs. A few notes before we dive into the code:
- Even though Phi-3 is a small language model (SLM) with a parameter count of 3.8B, we load it with 4-bit quantization, which makes it a good choice even for consumer-grade GPUs
- Following the example code provided on the corresponding HF model page, we also use text-generation pipelines to perform inference. More optimized setups are possible but out of scope for this notebook
- Regular expressions are used to extract the answer from the LLM output. The `response_types` argument defines the set of acceptable classes (e.g. `Relevant`, `Not Relevant`)
- There are two options for decoding:
  - greedy decoding, where sampling is disabled and the outputs are (more or less) deterministic
  - beam decoding, where multiple LLM calls for the same set of inputs are performed and the results are aggregated through majority voting. In the code below, `iterations` is the number of LLM calls requested, with an appropriate setting for the `temperature` (e.g. 0.5)
Prompts
In this section we define the prompts that we will use later for LLM inference.
There are three types of prompt templates namely:
- pointwise
- pointwise with chain-of-thought
- pairwise
We also define a helper structure containing:
- `prompt_inputs`, the list of attributes that need to be set in the prompt template. These attributes have the same names in the training data
- `prompt_template`, the prompt template to use
- `response_types`, the names of the expected output classes
- `metadata`, the extra attributes that need to be preserved
- `max_output_tokens`, the maximum number of tokens that the LLM outputs
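For the pointwise task, such a helper structure could look roughly like this (the template wording, token budget, and helper function name are illustrative, not the notebook's actual values):

```python
# Illustrative helper structure for the pointwise task
QA_POINTWISE = {
    "prompt_inputs": ["query_text", "retrieved_text"],
    "prompt_template": (
        "Query: {query_text}\n"
        "Document: {retrieved_text}\n"
        "Is the document relevant to the query? "
        "Answer with 'Relevant' or 'Not Relevant'."
    ),
    "response_types": ["Relevant", "Not Relevant"],
    "metadata": ["qid", "retrieved_doc_id", "human_judgment"],
    "max_output_tokens": 8,
}

def render_prompt(task: dict, item: dict) -> str:
    """Fill the prompt template with the attributes named in prompt_inputs."""
    return task["prompt_template"].format(**{k: item[k] for k in task["prompt_inputs"]})
```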
We are now ready to define the parameters of our run:
- `MODEL_NAME`, the name of the language model
- `BATCH_SIZE`, the batch size to use for inference
- `TASK_TYPE`, one of `qa_pointwise`, `qa_pairwise`, `chain_of_thought`
- `TEMPERATURE` & `ITERATIONS`, decoding options explained at the beginning of the notebook
and create an instance of our evaluator
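A plausible configuration (the values are illustrative, and the evaluator class name in the comment is hypothetical):

```python
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BATCH_SIZE = 8
TASK_TYPE = "qa_pointwise"   # or "qa_pairwise" / "chain_of_thought"
TEMPERATURE = 0.5            # only relevant when sampling is enabled
ITERATIONS = 1               # >1 enables sampling plus majority voting

# Hypothetical instantiation; the actual class name and signature
# come from the evaluator defined earlier in the notebook:
# evaluator = Phi3RelevanceEvaluator(MODEL_NAME, batch_size=BATCH_SIZE,
#                                    temperature=TEMPERATURE, iterations=ITERATIONS)
```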
Running the pipeline
Let's execute the pipeline by first adding a few test data points
Each item in the list is a dictionary with the following keys:
- `qid`, the query id in the original MSMARCO dataset
- `query_text`, self-explanatory
- `positive_text`, the text of the document that has been marked as relevant in the original `qrels` file
- `retrieved_doc_id`, the id of the retrieved document (after reranking) which is being judged for relevance
- `retrieved_text`, the text of the retrieved document
- `human_judgment`, the result of the human annotation; here it is either "Relevant" or "Not Relevant"
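A couple of entries with this shape might look like the following (the ids and texts are invented for illustration, not actual MSMARCO rows):

```python
# Illustrative data points following the schema described above
data = [
    {
        "qid": "1102432",
        "query_text": "what is a relevance judgment",
        "positive_text": "A relevance judgment records whether a document answers an information need.",
        "retrieved_doc_id": "D301595",
        "retrieved_text": "A relevance judgment records whether a document answers an information need.",
        "human_judgment": "Relevant",
    },
    {
        "qid": "1102433",
        "query_text": "largest freshwater lake by volume",
        "positive_text": "Lake Baikal is the largest freshwater lake in the world by volume.",
        "retrieved_doc_id": "D884120",
        "retrieved_text": "Mountain lakes are usually formed by glacial activity.",
        "human_judgment": "Not Relevant",
    },
]
```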
Let's also add two helper functions that allow us to iterate over the data
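The two helpers might look like this (the names are mine; `prompt_inputs` and `prompt_template` refer to the helper structure defined earlier):

```python
def batched(items: list, batch_size: int):
    """Yield successive fixed-size batches from a list of data points."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def prompts_for_batch(batch: list[dict], task: dict) -> list[str]:
    """Render one prompt per data point, filling the template's prompt_inputs."""
    return [
        task["prompt_template"].format(**{k: d[k] for k in task["prompt_inputs"]})
        for d in batch
    ]
```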
And now we are ready to execute the pipeline and store the results
Collect outputs into a pandas DataFrame
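Assuming each pipeline result is a dictionary carrying the preserved metadata plus the LLM's judgment (the field names below are hypothetical), the collection step is a one-liner:

```python
import pandas as pd

# Hypothetical results: one dict per judged (query, document) pair
results = [
    {"qid": "1", "retrieved_doc_id": "D305", "human_judgment": "Relevant", "llm_judgment": "Relevant"},
    {"qid": "2", "retrieved_doc_id": "D711", "human_judgment": "Not Relevant", "llm_judgment": "Relevant"},
]
df = pd.DataFrame(results)
```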
Quick scan of the outputs
And finally, let's measure the performance of the LLM.
First, we compute the micro-F1 score, which takes both classes into account
or we can focus on the Relevant class
Precision
Recall
Binary F1
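With scikit-learn, all four metrics can be computed as follows; the label lists here are dummy values standing in for the human and LLM judgment columns of the dataframe:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Dummy stand-ins for df["human_judgment"] and df["llm_judgment"]
y_true = ["Relevant", "Not Relevant", "Relevant", "Not Relevant"]
y_pred = ["Relevant", "Relevant", "Relevant", "Not Relevant"]

# Micro-F1 takes both classes into account
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Metrics focused on the "Relevant" class only
precision = precision_score(y_true, y_pred, pos_label="Relevant")
recall = recall_score(y_true, y_pred, pos_label="Relevant")
binary_f1 = f1_score(y_true, y_pred, pos_label="Relevant")
```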