Retrieve and Rerank

In this example we will:

  • index a BEIR dataset to Elasticsearch
  • retrieve data with BM25
  • optimize relevance with a reranking module running locally on our machine

Regarding the last point: even though we will focus on small reranking models, it would be beneficial to run this notebook on a machine with access to a GPU to speed up execution.

Requirements

For this notebook you will need an Elastic deployment (we will be using Elastic Cloud; if you don't have a deployment, see below to set up a free trial), Python 3.10.x or later, and some Python dependencies:

  • elasticsearch (Elastic's Python client)
  • sentence-transformers (to load the reranking module locally)
  • datasets (Hugging Face's library to download datasets with minimal effort)
  • pytrec_eval (needed to compute relevance metrics such as nDCG@10)

Create Elastic Cloud deployment

If you don't have an Elastic Cloud deployment, sign up here for a free trial. Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.

Installing packages

Let's start by installing the necessary Python libraries (preferably in a virtual environment).

[ ]
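The install cell isn't shown here; based on the requirements listed above, it would be something like:

```shell
pip install elasticsearch sentence-transformers datasets pytrec_eval
```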

and let's gradually build our code structure

[ ]

Before we dive deeper into the code, let's set the dataset name as a constant variable in our script.

[ ]
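The original cell isn't included; as a sketch, any BEIR dataset name works here — `scifact` is a small one that runs quickly (the constant name is our own choice):

```python
# Hypothetical constant; swap in any BEIR dataset name you want to evaluate on.
DATASET_NAME = "scifact"
```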

Let's also define, in one place, the necessary credentials required to access the Elastic Cloud deployment

[ ]
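A minimal sketch of a credentials cell, reading from environment variables (the variable names are hypothetical; the Cloud ID and API key come from your Elastic Cloud console):

```python
import os

# Hypothetical variable names; export them to match your own deployment.
ELASTIC_CLOUD_ID = os.environ.get("ELASTIC_CLOUD_ID", "")
ELASTIC_API_KEY = os.environ.get("ELASTIC_API_KEY", "")
```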

and initialize the Elasticsearch Python client

[ ]
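A sketch of the client initialization, assuming the credentials live in the environment variables above; it is wrapped in a helper so the import and the credential lookup only happen at call time:

```python
import os

def make_client():
    """Build an Elasticsearch client for an Elastic Cloud deployment."""
    # Deferred import so the sketch can be read without the dependency installed.
    from elasticsearch import Elasticsearch
    return Elasticsearch(
        cloud_id=os.environ["ELASTIC_CLOUD_ID"],
        api_key=os.environ["ELASTIC_API_KEY"],
    )
```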

Test the client

Before you continue, confirm that the client has connected with this test.

[ ]

Helper functions

In this section we define some helper functions to increase the readability of our code.

Let's start with the functions that will handle the interaction with our Elastic Cloud deployment such as:

  • creating an index
  • storing the documents
  • retrieving documents with BM25
[ ]
[ ]
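As a sketch of those helpers, the parts that don't need a live cluster (the index mapping, the bulk-API actions, and the BM25 request body) might look like this; the index name and field names are assumptions about the corpus:

```python
INDEX_NAME = "beir-corpus"  # hypothetical index name

def index_mappings():
    """Text fields for BM25 scoring; pass to client.indices.create(mappings=...)."""
    return {"properties": {
        "title": {"type": "text"},
        "text": {"type": "text"},
    }}

def bulk_actions(index, docs):
    """Yield action/source line pairs for the bulk API (client.bulk(operations=...))."""
    for doc in docs:
        yield {"index": {"_index": index, "_id": doc["_id"]}}
        yield {"title": doc.get("title", ""), "text": doc["text"]}

def bm25_query(text):
    """BM25 match query over both text fields."""
    return {"multi_match": {"query": text, "fields": ["title", "text"]}}
```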

Then, we move on to functions that rely on Hugging Face's datasets library to fetch the corpus, queries and qrels files

[ ]
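A sketch of the download helper; the repository naming convention (`BeIR/<name>` with `corpus` and `queries` configs, plus `BeIR/<name>-qrels`) is an assumption to verify against the Hugging Face Hub:

```python
def load_beir(dataset_name):
    """Fetch corpus, queries and qrels for a BEIR dataset from Hugging Face."""
    from datasets import load_dataset  # deferred heavy dependency
    corpus = load_dataset(f"BeIR/{dataset_name}", "corpus", split="corpus")
    queries = load_dataset(f"BeIR/{dataset_name}", "queries", split="queries")
    qrels = load_dataset(f"BeIR/{dataset_name}-qrels", split="test")
    return corpus, queries, qrels
```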

Running the pipeline

Now, we can execute the "retrieve and rerank" pipeline step by step

Corpus to our Elasticsearch index

First, we create the index that will host the corpus

[ ]

Then, we download the corpus and push it into the index

[ ]
[ ]
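Pushing a large corpus is usually done in fixed-size batches so each bulk request stays small; a stdlib-only helper for that (the batch size is arbitrary):

```python
def chunked(iterable, size=500):
    """Yield successive fixed-size batches from any iterable."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch
```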

Let's move to the retrieval part

1st stage retrieval with BM25

First, we download the test split of the dataset we have selected

[ ]
  • The queries file is a Hugging Face dataset with two keys ['_id', 'text'].
  • The qrels file contains the relationships between a query_id and a list of documents. We have transformed it into a pytrec_eval-compatible format, i.e. a nested dictionary where the outer key is the query id, pointing to a dictionary of (doc_id, score) key-value pairs (a score > 0 denotes relevance)
[ ]
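The qrels transformation described above can be sketched as a small pure-Python helper; the column names `query-id`, `corpus-id` and `score` are the ones BEIR qrels files use on Hugging Face:

```python
def qrels_to_pytrec(rows):
    """Convert flat qrels rows into the nested {query_id: {doc_id: score}}
    dictionary that pytrec_eval expects."""
    qrels = {}
    for row in rows:
        qid = str(row["query-id"])
        qrels.setdefault(qid, {})[str(row["corpus-id"])] = int(row["score"])
    return qrels
```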

Now, let's retrieve the top-100 documents per query using BM25

[ ]
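A sketch of the retrieval loop, building a pytrec_eval-style run from the BM25 hits; the `title`/`text` field names are assumptions about the mapping, and the client is passed in so the logic is easy to test:

```python
def bm25_run(client, index, queries, k=100):
    """Retrieve the top-k documents per query; return {qid: {doc_id: score}}."""
    run = {}
    for q in queries:
        resp = client.search(
            index=index,
            query={"multi_match": {"query": q["text"], "fields": ["title", "text"]}},
            size=k,
        )
        run[str(q["_id"])] = {h["_id"]: h["_score"] for h in resp["hits"]["hits"]}
    return run
```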

And finally, let's compute the performance of BM25 on this dataset. We are using nDCG@10 as our metric

[ ]
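pytrec_eval does the actual scoring in the notebook; purely for intuition, here is a minimal nDCG@10 in plain Python (one common formulation with linear gains):

```python
import math

def ndcg_at_10(qrels, run):
    """Average nDCG@10 over queries, given pytrec_eval-style nested dicts."""
    per_query = []
    for qid, rels in qrels.items():
        retrieved = run.get(qid, {})
        ranked = sorted(retrieved, key=retrieved.get, reverse=True)[:10]
        # Discounted cumulative gain of the actual ranking...
        dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
        # ...normalized by the best achievable ranking.
        ideal = sorted(rels.values(), reverse=True)[:10]
        idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query)
```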

2nd stage reranking

Now, let's move to the reranking part. In this example we are using a small cross-encoder model to optimize the ordering of our results. We will use the sentence-transformers library to load the model and do the scoring

[ ]
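Loading the model might look like this; the model id is a commonly used small cross-encoder, not necessarily the one in the original notebook, and the import is deferred so the snippet can be read without the dependency installed:

```python
def load_reranker(model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Load a small cross-encoder reranker via sentence-transformers."""
    from sentence_transformers import CrossEncoder  # deferred heavy import
    return CrossEncoder(model_name)
```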

Some helper structures to speed up processing

[ ]

and now it's time for the reranking part

[ ]
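The reranking step itself can be sketched with the scoring function injected (e.g. `CrossEncoder.predict`, which maps a list of (query, document) pairs to scores); the names here are hypothetical, and the injection also makes the logic testable without a model:

```python
def rerank(score_pairs, query, run_for_query, doc_text, k=100):
    """Re-score the BM25 top-k with a cross-encoder and rebuild the run entry."""
    # Take the top-k documents by first-stage (BM25) score.
    doc_ids = sorted(run_for_query, key=run_for_query.get, reverse=True)[:k]
    # Score each (query, document text) pair with the injected model.
    scores = score_pairs([(query, doc_text[d]) for d in doc_ids])
    return {d: float(s) for d, s in zip(doc_ids, scores)}
```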

and let's calculate the metric scores for the reranked results

[ ]

which in most cases will provide a significant boost in performance

Bonus section

Judge rate

Let's do some extra analysis and try to answer the question: "How often is an evaluator presented with (query, document) pairs for which there is no ground truth information?" In other words, we calculate the percentage of documents in the result list for which the qrels file contains a relevance judgement. Let's start with BM25, focusing on the top-10 retrieved documents

[ ]
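The judge rate described above can be computed with a small helper (a sketch; `qrels` and `run` use the nested-dict format from earlier):

```python
def judge_rate(qrels, run, k=10):
    """Fraction of top-k retrieved documents that have a relevance judgement."""
    judged = total = 0
    for qid, retrieved in run.items():
        top = sorted(retrieved, key=retrieved.get, reverse=True)[:k]
        judged += sum(1 for d in top if d in qrels.get(qid, {}))
        total += len(top)
    return judged / total if total else 0.0
```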

while for the reranked documents it is:

[ ]

Confidence intervals

In this section we will briefly touch upon the concepts of confidence intervals and statistical significance and we will see how we can use them to determine whether improvements in our pipelines are significant or not.

We can think of it as follows: Our goal is to estimate the performance of our pipeline (retrieval and/or reranking) on a target corpus. Ideally, we would like to have access to all queries that our end-users will run against it but of course this is impossible. Instead, we have the set of test queries provided by the benchmark and we implicitly assume that the performance on this set can act as an accurate proxy of the overall performance (in the ideal scenario).

But we can make some extra assumptions to increase the reliability of our analysis. Confidence intervals, a concept from statistical theory, give us a tool to handle our uncertainty. By setting a certain level of confidence, let's go with 95% in this example, we can derive a range of values that will likely contain the parameter of interest (here the performance in the ideal scenario). In other words, if we repeated the same process an infinite number of times (by drawing different test sets) we could be confident that in 95% of them the confidence interval would encompass the true value.

The code below shows an example of deriving confidence intervals using bootstrapping combined with the percentile method. Note that this statistic is strongly affected by the number and variability of queries in the dataset, i.e. smaller confidence intervals are expected for larger query sets and vice versa.

[ ]
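A stdlib-only sketch of bootstrapping with the percentile method over per-query scores (resample count and seed are arbitrary choices):

```python
import random
import statistics

def bootstrap_ci(per_query_scores, n_resamples=1000, confidence=0.95, seed=42):
    """Percentile-method bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    # Resample the per-query scores with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(per_query_scores, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - confidence) / 2 * n_resamples)
    hi_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```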

and we can apply it to our results as follows:

[ ]

The way to interpret this is to say that we are 95% confident that the nDCG@10 score in the ideal scenario lies within that interval.

Confidence intervals can be used in the context of significance testing. For example, if we wanted to compare two pipelines (retrieval and/or reranking) on a dataset one way to do this would be to:

  • Decide on a confidence level (e.g. 90% or 95%)
  • Compute confidence intervals for the performance of model A
  • Compute confidence intervals for the performance of model B
  • Check whether there is an overlap between the two intervals.

In the last step, if there is no overlap we can say that the observed difference in performance between the two pipelines is statistically significant.