Retrieve and Rerank
In this example we will:
- index a BEIR dataset to Elasticsearch
- retrieve data with BM25
- optimize relevance with a reranking model running locally on our machine
Regarding the last point: although we will focus on small reranking models, it is beneficial to run this notebook on a machine with access to a GPU to speed up execution.
Requirements
For this notebook you will need an Elastic deployment (we will be using Elastic Cloud; if you don't have a deployment, see below to set up a free trial), Python 3.10 or later, and some Python dependencies:
- elasticsearch (Elastic's Python client)
- sentence-transformers (to load the reranking model locally)
- datasets (Hugging Face's library to download datasets with minimal effort)
- pytrec_eval (needed to compute accuracy metrics such as nDCG@10)
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial. Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.
Installing packages
Let's start by installing the necessary Python libraries (preferably in a virtual environment)
and let's gradually build our code structure
Before we dive deeper into the code, let's set the dataset name as a constant variable in our script.
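As a sketch, we use scifact here (a small BEIR dataset that runs quickly); any BEIR dataset name works, and the index name is our own choice:

```python
# Name of the BEIR dataset to index and evaluate.
DATASET_NAME = "scifact"

# Name of the Elasticsearch index that will host the corpus.
INDEX_NAME = f"{DATASET_NAME}-index"
```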
Let's also define the credentials required to access the Elastic Cloud deployment
and initialize the Elasticsearch Python client.
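A minimal sketch: we read the Cloud ID and API key interactively with getpass so they stay out of the notebook. The helper name init_client is ours, not part of the Elasticsearch client library:

```python
from getpass import getpass


def init_client():
    # Prompt for Elastic Cloud credentials and return an authenticated client.
    # The import lives inside the function so this cell is self-contained.
    from elasticsearch import Elasticsearch

    cloud_id = getpass("Elastic Cloud ID: ")
    api_key = getpass("Elastic API key: ")
    return Elasticsearch(cloud_id=cloud_id, api_key=api_key)


# client = init_client()
# client.info()  # quick connectivity check
```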
Test the client
Before you continue, confirm that the client has connected with this test.
Helper functions
In this section we define some helper functions to increase the readability of our code.
Let's start with the functions that will handle the interaction with our Elastic Cloud deployment such as:
- creating an index
- storing the documents
- retrieving documents with BM25
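One possible sketch of these helpers, assuming the Elasticsearch 8.x Python client; the function names and mapping are our own choices:

```python
def create_index(client, index_name):
    # Recreate the index with a plain text mapping, which is enough for BM25.
    client.indices.delete(index=index_name, ignore_unavailable=True)
    client.indices.create(
        index=index_name,
        mappings={
            "properties": {
                "title": {"type": "text"},
                "text": {"type": "text"},
            }
        },
    )


def index_documents(client, index_name, corpus):
    # Bulk-index the corpus; each document keeps its original BEIR `_id`.
    from elasticsearch.helpers import bulk

    actions = (
        {
            "_index": index_name,
            "_id": doc["_id"],
            "_source": {"title": doc.get("title", ""), "text": doc["text"]},
        }
        for doc in corpus
    )
    bulk(client, actions)
    client.indices.refresh(index=index_name)


def bm25_search(client, index_name, query, size=100):
    # Plain BM25 retrieval via a multi_match query over title and text.
    response = client.search(
        index=index_name,
        query={"multi_match": {"query": query, "fields": ["title", "text"]}},
        size=size,
    )
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]
```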
Then, we move to functions that rely on Hugging Face's datasets library to fetch the corpus, queries, and qrels files.
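One way to implement these, assuming the BEIR datasets published under the BeIR organization on the Hugging Face Hub (repo ids like BeIR/scifact and BeIR/scifact-qrels, with query-id, corpus-id, and score columns); the helper names are ours:

```python
def load_corpus_and_queries(dataset_name):
    # Fetch the corpus and queries splits from the Hub.
    from datasets import load_dataset

    corpus = load_dataset(f"BeIR/{dataset_name}", "corpus", split="corpus")
    queries = load_dataset(f"BeIR/{dataset_name}", "queries", split="queries")
    return corpus, queries


def load_qrels(dataset_name, split="test"):
    # Fetch the relevance judgments and convert them for pytrec_eval.
    from datasets import load_dataset

    raw = load_dataset(f"BeIR/{dataset_name}-qrels", split=split)
    return to_pytrec_qrels(raw)


def to_pytrec_qrels(rows):
    # Convert (query-id, corpus-id, score) rows into the nested-dict format
    # expected by pytrec_eval: {query_id: {doc_id: score}}.
    qrels = {}
    for row in rows:
        qrels.setdefault(str(row["query-id"]), {})[str(row["corpus-id"])] = int(
            row["score"]
        )
    return qrels
```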
Running the pipeline
Now, we can execute the "retrieve and rerank" pipeline step by step
Corpus to our Elasticsearch index
First, we create the index that will host the corpus
Then, we download the corpus and push it into the index
Let's move to the retrieval part
1st stage retrieval with BM25
First, we download the test split of the dataset we have selected
- The queries file is a Hugging Face dataset with two keys, ['_id', 'text'].
- The qrels file contains the relationships between a query_id and a list of documents. We have transformed it into a pytrec_eval-compatible format, i.e. a nested dictionary where the outer key is the query id, pointing to a dictionary of (doc_id, score) key-value pairs (a score > 0 denotes relevance).
Now, let's retrieve the top-100 documents per query using BM25
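The retrieval loop can be sketched as a small helper that collects results into the pytrec_eval run format, {query_id: {doc_id: score}}. Both build_run and the search_fn callable (e.g. a lambda wrapping your BM25 search call) are names we introduce here:

```python
def build_run(queries, search_fn, k=100):
    # search_fn(query_text, k) -> list of (doc_id, score) pairs.
    # Returns a pytrec_eval-compatible run: {query_id: {doc_id: score}}.
    run = {}
    for query in queries:
        hits = search_fn(query["text"], k)
        run[str(query["_id"])] = {doc_id: float(score) for doc_id, score in hits}
    return run


# Example wiring with the BM25 helper (not executed here):
# bm25_run = build_run(queries, lambda text, k: bm25_search(client, INDEX_NAME, text, k))
```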
And finally, let's compute the performance of BM25 on this dataset. We are using nDCG@10 as our metric
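Using pytrec_eval, the metric computation might look like the sketch below; ndcg_at_10 is our helper name, while the measure string "ndcg_cut.10" is pytrec_eval's syntax for nDCG truncated at 10:

```python
def ndcg_at_10(qrels, run):
    # qrels and run are nested dicts: {query_id: {doc_id: score}}.
    # Returns the mean nDCG@10 over all evaluated queries.
    import pytrec_eval

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
    per_query = evaluator.evaluate(run)
    return sum(s["ndcg_cut_10"] for s in per_query.values()) / max(len(per_query), 1)
```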
2nd stage reranking
Now, let's move to the reranking part. In this example we are using a small cross-encoder model to optimize the ordering of our results. We will use the sentence-transformers library to load the model and do the scoring
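A sketch of loading such a model; cross-encoder/ms-marco-MiniLM-L-6-v2 is one small publicly available checkpoint, but any sentence-transformers CrossEncoder model works here:

```python
def load_reranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    # Load a small cross-encoder; on a machine with a GPU the model is
    # placed there automatically by sentence-transformers.
    from sentence_transformers import CrossEncoder

    return CrossEncoder(model_name)
```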
Some helper structures to speed up processing
and now it's time for the reranking part
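Assuming a corpus_lookup dictionary mapping doc ids to their text (one of the helper structures mentioned above, built once from the corpus), reranking a single query's candidates could be sketched as:

```python
def rerank_query(model, query_text, doc_ids, corpus_lookup):
    # model.predict returns one relevance score per (query, passage) pair;
    # the scores themselves define the new ordering, so returning a
    # {doc_id: score} dict is enough for pytrec_eval.
    pairs = [(query_text, corpus_lookup[doc_id]) for doc_id in doc_ids]
    scores = model.predict(pairs)
    return {doc_id: float(score) for doc_id, score in zip(doc_ids, scores)}
```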
and let's calculate the metric scores for the reranked results
which in most cases will provide a significant boost in performance
Bonus section
Judge rate
Let's do some extra analysis and try to answer the question "How many times is an evaluator presented with (query, document) pairs for which there is no ground truth information?"
In other words, we calculate the percentage of retrieved documents for which the qrels file does contain a relevance judgment (the "judge rate"); the remainder are unjudged pairs.
Let's start with BM25 by focusing on the top-10 retrieved documents
while for the reranked documents it is:
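The computation for both cases can be sketched as one helper over a pytrec_eval-style run ({query_id: {doc_id: score}}); judged_at_k is our own name:

```python
def judged_at_k(qrels, run, k=10):
    # Fraction of top-k retrieved documents that carry a relevance
    # judgment in the qrels (the "judge rate").
    judged, total = 0, 0
    for query_id, doc_scores in run.items():
        top = sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]
        judged += sum(1 for doc_id in top if doc_id in qrels.get(query_id, {}))
        total += len(top)
    return judged / total if total else 0.0
```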
Confidence intervals
In this section we will briefly touch upon the concepts of confidence intervals and statistical significance and we will see how we can use them to determine whether improvements in our pipelines are significant or not.
We can think of it as follows: Our goal is to estimate the performance of our pipeline (retrieval and/or reranking) on a target corpus. Ideally, we would like to have access to all queries that our end-users will run against it but of course this is impossible. Instead, we have the set of test queries provided by the benchmark and we implicitly assume that the performance on this set can act as an accurate proxy of the overall performance (in the ideal scenario).
But we can make some extra assumptions to increase the reliability of our analysis. Confidence intervals, a concept from statistical theory, give us a tool to handle our uncertainty. By setting a certain level of confidence, let's go with 95% in this example, we can derive a range of values that will likely contain the parameter of interest (here the performance in the ideal scenario). In other words, if we repeated the same process an infinite number of times (by drawing different test sets) we could be confident that in 95% of them the confidence interval would encompass the true value.
The code below shows an example of deriving confidence intervals using bootstrapping combined with the percentile method. Note that this statistic is strongly affected by the number and variability of queries in the dataset, i.e. smaller confidence intervals are expected for larger query sets and vice versa.
and we can apply it to our results as follows:
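A minimal sketch using only the standard library; per_query_scores would be the per-query nDCG@10 values produced during evaluation, and the function name and defaults are ours:

```python
import random


def bootstrap_ci(per_query_scores, confidence=0.95, n_resamples=1000, seed=42):
    # Percentile-method bootstrap: resample the per-query scores with
    # replacement, compute the mean of each resample, and take the
    # percentiles of those means as the interval bounds.
    rng = random.Random(seed)
    scores = list(per_query_scores)
    means = sorted(
        sum(rng.choice(scores) for _ in scores) / len(scores)
        for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Example application (not executed here):
# lower, upper = bootstrap_ci(per_query_ndcg_scores)
```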
The way to interpret this would be to say that we are 95% confident that the nDCG@10 score in the ideal scenario lies within that interval
Confidence intervals can be used in the context of significance testing. For example, if we wanted to compare two pipelines (retrieval and/or reranking) on a dataset one way to do this would be to:
- Decide on a confidence level (e.g. 90% or 95%)
- Compute confidence intervals for the performance of model A
- Compute confidence intervals for the performance of model B
- Check whether there is an overlap between the two intervals.
In the last step, if there is no overlap we can say that the observed difference in performance between the two pipelines is statistically significant.
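The overlap check in the last step can be sketched as a one-line helper over (lower, upper) interval tuples:

```python
def intervals_overlap(ci_a, ci_b):
    # Each argument is a (lower, upper) confidence interval.
    # No overlap suggests a statistically significant difference
    # at the chosen confidence level.
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```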