
Using Phi-3 as relevance judge

In this notebook we will use Phi-3 as a relevance judge between a query and a document.

Requirements

For this notebook, you will need a working Python environment (Python 3.10.x or later) and some Python dependencies:

  • torch, to use PyTorch as our backend
  • Huggingface's transformers library
  • accelerate and bitsandbytes for quantization support (a GPU is required to enable quantization)
  • scikit-learn for metrics computation
  • pandas for generic data handling

Installing packages

Let's start by installing the necessary Python libraries (preferably in a virtual environment).

[ ]
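One possible install command covering the dependencies listed above (versions unpinned; adjust to your environment):

```shell
pip install torch transformers accelerate bitsandbytes scikit-learn pandas
```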

Implementation

First, the necessary imports:

[ ]

Now, let's create a class that will be responsible for loading the Phi-3 model and performing inference on its inputs. A few notes before we dive into the code:

  • Even though Phi-3 is a small language model (SLM) with a parameter count of 3.8B, we load it with 4-bit quantization, which makes it a good choice even for consumer-grade GPUs
  • Following the example code provided on the corresponding Hugging Face model page, we also use text-generation pipelines to perform inference. More optimized setups are possible but out of scope for this notebook
  • Regular expressions are used to extract the answer from the LLM output. The response_types argument defines the set of acceptable classes (e.g. Relevant, Not Relevant)
  • There are two options for decoding:
    • greedy decoding, where sampling is disabled and the outputs are (more or less) deterministic
    • beam decoding, where multiple LLM calls for the same set of inputs are performed and the results are aggregated through majority voting. In the code below, iterations is the number of LLM calls requested, with an appropriate setting for the temperature (e.g. 0.5)
[ ]

Prompts

In this section we define the prompts that we will use later for LLM inference.

There are three types of prompt templates namely:

  • pointwise
  • pointwise with chain-of-thought
  • pairwise
[ ]
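To illustrate the shape of a pointwise template (the wording here is hypothetical; the notebook's actual prompts are defined in the cell above):

```python
# Hypothetical pointwise prompt template; the placeholders match the
# attribute names used in the training data.
POINTWISE_TEMPLATE = (
    "You are a relevance judge. Given a query and a document, answer with "
    "exactly one of: Relevant, Not Relevant.\n\n"
    "Query: {query_text}\n"
    "Document: {retrieved_text}\n"
    "Answer:"
)

prompt = POINTWISE_TEMPLATE.format(
    query_text="what is the boiling point of water",
    retrieved_text="Water boils at 100 degrees Celsius at sea level.",
)
```

The chain-of-thought variant would additionally ask the model to explain its reasoning before the final label, and the pairwise variant would present two documents and ask which is more relevant.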

We also define a helper structure containing:

  • prompt_inputs, the list of attributes that need to be filled into the prompt template. These attributes have the same names as in the training data
  • prompt_template, the prompt template to use
  • response_types, the names of the expected output classes
  • metadata, the extra attributes that need to be preserved
  • max_output_tokens, the maximum number of tokens that the LLM outputs
[ ]
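A minimal sketch of such a structure, assuming a dataclass (the class name is hypothetical; field names come from the bullets above):

```python
from dataclasses import dataclass, field

@dataclass
class TaskConfig:  # hypothetical name for the helper structure
    prompt_inputs: list[str]      # attributes to fill into the template
    prompt_template: str          # the prompt template to use
    response_types: list[str]     # expected output classes
    metadata: list[str] = field(default_factory=list)  # attributes to carry through
    max_output_tokens: int = 16   # cap on tokens the LLM may generate

config = TaskConfig(
    prompt_inputs=["query_text", "retrieved_text"],
    prompt_template="Query: {query_text}\nDocument: {retrieved_text}\nAnswer:",
    response_types=["Relevant", "Not Relevant"],
    metadata=["qid", "retrieved_doc_id"],
)
```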

We are now ready to define the parameters of our run:

  • MODEL_NAME, the name of the language model
  • BATCH_SIZE, the batch size to use for inference
  • TASK_TYPE, one of qa_pointwise, qa_pairwise, chain_of_thought
  • TEMPERATURE & ITERATIONS are decoding options explained at the beginning of the notebook
[ ]
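A plausible parameter cell, assuming the Hugging Face model id `microsoft/Phi-3-mini-4k-instruct` (all values here are illustrative, not the notebook's exact settings):

```python
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"  # assumed HF model id
BATCH_SIZE = 8
TASK_TYPE = "qa_pointwise"  # or "qa_pairwise", "chain_of_thought"
TEMPERATURE = 0.5           # only used when sampling is enabled
ITERATIONS = 1              # 1 = greedy decoding; >1 = majority voting
```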

and create an instance of our evaluator

[ ]

Running the pipeline

Let's execute the pipeline by first adding a few test data points

[ ]

Each item in the list is a dictionary with the following keys:

  • qid: the query id in the original MSMARCO dataset
  • query_text: the text of the query
  • positive_text: the text of the document that has been marked as relevant in the original qrels file
  • retrieved_doc_id: the id of the retrieved document (after reranking) which is being judged for relevance
  • retrieved_text: the text of the retrieved document
  • human_judgment: the result of the human annotation; here it is either "Relevant" or "Not Relevant"
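A single test item with these keys might look like the following (the ids and texts are invented for illustration; they are not taken from MSMARCO):

```python
sample_item = {
    "qid": "1185869",               # hypothetical query id
    "query_text": "what is the boiling point of water",
    "positive_text": "Water boils at 100 degrees Celsius at sea level.",
    "retrieved_doc_id": "D301595",  # hypothetical document id
    "retrieved_text": "The boiling point of water is 100 degrees C at 1 atm.",
    "human_judgment": "Relevant",
}
```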

Let's also add two helper functions that allow us to iterate over the data

[ ]
[ ]
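One common way to iterate over such a list in fixed-size batches (a generic sketch, not the notebook's exact helpers):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list of data points."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Example: 5 items in batches of 2 gives batch sizes [2, 2, 1]
sizes = [len(b) for b in batched(list(range(5)), 2)]
```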

And now we are ready to execute the pipeline and store the results

[ ]

Collect outputs into a Pandas dataframe

[ ]
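Assuming the pipeline produced a list of dictionaries (one per judged query/document pair), the collection and quick-scan steps can be sketched as follows; the `results` values and the `llm_judgment` column name are illustrative:

```python
import pandas as pd

# Hypothetical results; in the notebook these come from the LLM pipeline.
results = [
    {"qid": "1", "human_judgment": "Relevant", "llm_judgment": "Relevant"},
    {"qid": "2", "human_judgment": "Not Relevant", "llm_judgment": "Relevant"},
]
df = pd.DataFrame(results)

# Quick scan: how often the LLM predicted each class
counts = df["llm_judgment"].value_counts()
```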

Quick scan of the outputs

[ ]

And finally, let's measure the performance of the LLM.

First, we compute the micro-F1 score, which takes into account both classes

[ ]
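With scikit-learn this is a one-liner over the two label columns; the judgments below are invented for illustration:

```python
from sklearn.metrics import f1_score

# Hypothetical human and LLM judgments; in the notebook these come
# from the dataframe columns built above.
y_true = ["Relevant", "Relevant", "Not Relevant", "Not Relevant"]
y_pred = ["Relevant", "Not Relevant", "Not Relevant", "Relevant"]

micro_f1 = f1_score(y_true, y_pred, average="micro")
```

With `average="micro"`, true/false positives are pooled across both classes, so for a single-label task the score equals plain accuracy.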

or we can focus on the Relevant class

Precision

[ ]

Recall

[ ]

binary-F1

[ ]
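All three class-focused metrics can be computed by passing `pos_label="Relevant"` to the corresponding scikit-learn functions; the judgments below are invented for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical human and LLM judgments
y_true = ["Relevant", "Relevant", "Not Relevant", "Not Relevant"]
y_pred = ["Relevant", "Not Relevant", "Not Relevant", "Relevant"]

precision = precision_score(y_true, y_pred, pos_label="Relevant")
recall = recall_score(y_true, y_pred, pos_label="Relevant")
binary_f1 = f1_score(y_true, y_pred, pos_label="Relevant")
```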