Using Phi-3 as a relevance judge
In this notebook we will use Phi-3 as a relevance judge between a query and a document.
Requirements
For this notebook, you will need a working Python environment (Python 3.10.x or later) and some Python dependencies:
- `torch`, to use PyTorch as our backend
- Hugging Face's `transformers` library
- `accelerate` and `bitsandbytes` for quantization support (a GPU is required to enable quantization)
- `scikit-learn` for metrics computation
- `pandas` for generic data handling
Installing packages
Let's start by installing the necessary Python libraries (preferably in a virtual environment)
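The packages above are available under their usual PyPI names, so a plain `pip install` is enough (pin versions as needed for your setup):

```shell
# Install the dependencies, ideally inside a virtual environment
pip install torch transformers accelerate bitsandbytes scikit-learn pandas
```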
Implementation
First, the necessary imports:
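A minimal set of imports covering the dependencies listed above might look like this (the exact selection depends on the code in the rest of the notebook):

```python
import re
from collections import Counter

import pandas as pd
import torch
from sklearn.metrics import f1_score, precision_score, recall_score
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,  # 4-bit quantization configuration
    pipeline,
)
```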
Now, let's create a class that will be responsible for loading the Phi-3 model and performing inference on its inputs. A few notes before we dive into the code:
- Even though Phi-3 is a small language model (SLM) with a parameter count of 3.8B, we load it with 4-bit quantization, which makes it a good choice even for consumer-grade GPUs
- Following the example code provided on the corresponding HF model page, we also use text-generation pipelines to perform inference. More optimized setups are possible but out of scope for this notebook
- Regular expressions are used to extract the answer from the LLM output. The `response_types` argument defines the set of acceptable classes (e.g. `Relevant`, `Not Relevant`)
- There are two options for decoding:
  - greedy decoding, where sampling is disabled and the outputs are (more or less) deterministic
  - beam decoding, where multiple LLM calls for the same set of inputs are performed and the results are aggregated through majority voting. In the code below, `iterations` is the number of LLM calls requested, with an appropriate setting for the `temperature` (e.g. 0.5)
Prompts
In this section we define the prompts that we will use later for LLM inference.
There are three types of prompt templates namely:
- pointwise
- pointwise with chain-of-thought
- pairwise
We also define a helper structure containing:
- `prompt_inputs`, the list of attributes that need to be set in the prompt template. These attributes have the same names in the training data
- `prompt_template`, the prompt template to use
- `response_types`, the names of the expected output classes
- `metadata`, the extra attributes that need to be preserved
- `max_output_tokens`, the maximum number of tokens that the LLM outputs
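For the pointwise task, such a helper structure could look roughly like this (the template wording, token budget, and helper function name are illustrative, not the notebook's actual values):

```python
# Illustrative helper structure for the pointwise task
QA_POINTWISE = {
    "prompt_inputs": ["query_text", "retrieved_text"],
    "prompt_template": (
        "Query: {query_text}\n"
        "Document: {retrieved_text}\n"
        "Is the document relevant to the query? "
        "Answer with 'Relevant' or 'Not Relevant'."
    ),
    "response_types": ["Relevant", "Not Relevant"],
    "metadata": ["qid", "retrieved_doc_id", "human_judgment"],
    "max_output_tokens": 8,
}

def render_prompt(task: dict, item: dict) -> str:
    """Fill the prompt template with the attributes named in prompt_inputs."""
    return task["prompt_template"].format(**{k: item[k] for k in task["prompt_inputs"]})
```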
We are now ready to define the parameters of our run:
- `MODEL_NAME`, the name of the language model
- `BATCH_SIZE`, the batch size to use for inference
- `TASK_TYPE`, one of `qa_pointwise`, `qa_pairwise`, `chain_of_thought`
- `TEMPERATURE` & `ITERATIONS`, decoding options explained at the beginning of the notebook
and create an instance of our evaluator
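A plausible configuration (the values are illustrative, and the evaluator class name in the comment is hypothetical):

```python
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BATCH_SIZE = 8
TASK_TYPE = "qa_pointwise"   # or "qa_pairwise" / "chain_of_thought"
TEMPERATURE = 0.5            # only relevant when sampling is enabled
ITERATIONS = 1               # >1 enables sampling plus majority voting

# Hypothetical instantiation; the actual class name and signature
# come from the evaluator defined earlier in the notebook:
# evaluator = Phi3RelevanceEvaluator(MODEL_NAME, batch_size=BATCH_SIZE,
#                                    temperature=TEMPERATURE, iterations=ITERATIONS)
```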
Running the pipeline
Let's execute the pipeline by first adding a few test data points
Each item in the list is a dictionary with the following keys:
- `qid`, the query id in the original MSMARCO dataset
- `query_text`, self-explanatory
- `positive_text`, the text of the document that has been marked as relevant in the original `qrels` file
- `retrieved_doc_id`, the id of the retrieved document (after reranking) which is being judged for relevance
- `retrieved_text`, the text of the retrieved document
- `human_judgment`, the result of the human annotation; here it is either "Relevant" or "Not Relevant"
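A couple of entries with this shape might look like the following (the ids and texts are invented for illustration, not actual MSMARCO rows):

```python
# Illustrative data points following the schema described above
data = [
    {
        "qid": "1102432",
        "query_text": "what is a relevance judgment",
        "positive_text": "A relevance judgment records whether a document answers an information need.",
        "retrieved_doc_id": "D301595",
        "retrieved_text": "A relevance judgment records whether a document answers an information need.",
        "human_judgment": "Relevant",
    },
    {
        "qid": "1102433",
        "query_text": "largest freshwater lake by volume",
        "positive_text": "Lake Baikal is the largest freshwater lake in the world by volume.",
        "retrieved_doc_id": "D884120",
        "retrieved_text": "Mountain lakes are usually formed by glacial activity.",
        "human_judgment": "Not Relevant",
    },
]
```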
Let's also add two helper functions that allow us to iterate over the data
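The two helpers might look like this (the names are mine; `prompt_inputs` and `prompt_template` refer to the helper structure defined earlier):

```python
def batched(items: list, batch_size: int):
    """Yield successive fixed-size batches from a list of data points."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def prompts_for_batch(batch: list[dict], task: dict) -> list[str]:
    """Render one prompt per data point, filling the template's prompt_inputs."""
    return [
        task["prompt_template"].format(**{k: d[k] for k in task["prompt_inputs"]})
        for d in batch
    ]
```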
And now we are ready to execute the pipeline and store the results
Collect outputs into a pandas DataFrame
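Assuming each pipeline result is a dictionary carrying the preserved metadata plus the LLM's judgment (the field names below are hypothetical), the collection step is a one-liner:

```python
import pandas as pd

# Hypothetical results: one dict per judged (query, document) pair
results = [
    {"qid": "1", "retrieved_doc_id": "D305", "human_judgment": "Relevant", "llm_judgment": "Relevant"},
    {"qid": "2", "retrieved_doc_id": "D711", "human_judgment": "Not Relevant", "llm_judgment": "Relevant"},
]
df = pd.DataFrame(results)
```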
Quick scan of the outputs
And finally, let's measure the performance of the LLM.
First, we compute the micro-F1 score, which takes both classes into account
or we can focus on the Relevant class
Precision
Recall
Binary F1
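With scikit-learn, all four metrics can be computed as follows; the label lists here are dummy values standing in for the human and LLM judgment columns of the dataframe:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Dummy stand-ins for df["human_judgment"] and df["llm_judgment"]
y_true = ["Relevant", "Not Relevant", "Relevant", "Not Relevant"]
y_pred = ["Relevant", "Relevant", "Relevant", "Not Relevant"]

# Micro-F1 takes both classes into account
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Metrics focused on the "Relevant" class only
precision = precision_score(y_true, y_pred, pos_label="Relevant")
recall = recall_score(y_true, y_pred, pos_label="Relevant")
binary_f1 = f1_score(y_true, y_pred, pos_label="Relevant")
```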