BM25 Quora
Overview
BM25 is a popular ranking function for text retrieval. It scores a document against a query by how often each query term appears in the document, weighted by how rare that term is across the corpus. It is simple but effective, and only requires knowing the number of documents in the corpus and the frequency of terms across documents. This guide shows how to use BM25 with Pinecone's sparse-dense index for hybrid search.
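To make the scoring idea concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not necessarily the encoder used to produce the sparse vectors in this guide; the function name `bm25_scores`, the toy corpus, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed inverse document frequency (non-negative variant).
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            # Term-frequency saturation with document-length normalization.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical), tokenized by whitespace.
corpus = [
    "how can i teach my kids the alphabet".split(),
    "what is the best way to learn python".split(),
    "how do i teach a dog to sit".split(),
]
query = "teach kids alphabet".split()

scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no terms with the query score zero, which is why BM25 maps naturally onto a sparse vector: only the terms that actually occur need a non-zero weight.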
Learn how to create embeddings in the companion guide.
Prerequisites
We'll install the required libraries: the pinecone-client for interacting with Pinecone, the pinecone-datasets library that we will use for fast processing of the Quora dataset, and numpy.
Quora Dataset
We'll load the popular Quora dataset, which ships with both dense and sparse embeddings precomputed using the following models:
- Sparse: BM25
The data already includes the sparse and dense representations of each document. To learn how these values were generated, see this walkthrough.
Index Creation
We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free API key. We initialize the connection like so:
WhoAmIResponse(username='load', user_label='label', projectname='load-test')
Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all available providers and regions here.
We create the index like so:
Upsert
Now let's upsert vectors to the index. We are using async upload with batching. For more information on performance boosting, see the Pinecone documentation for Performance Tuning.
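The batching step can be sketched in plain Python as follows. The helper name `chunked` and the batch size of 500 are assumptions; a batch size of 500 is merely consistent with the counts in the output shown in this section (522931 vectors, 1046 async responses). The actual upsert call is left out because it depends on the client version in use.

```python
import math
from itertools import islice


def chunked(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Assumed batch size; each batch would be sent as one upsert request.
batch_size = 500
total_vectors = 522931
n_batches = math.ceil(total_vectors / batch_size)
print(n_batches)

# Small demonstration on stand-in data (integers instead of vectors).
batches = list(chunked(range(1234), batch_size))
print([len(batch) for batch in batches])
```

Sending each batch asynchronously and collecting the responses afterwards lets requests overlap instead of waiting for each round trip in turn.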
sending upsert requests: 0%| | 0/522931 [00:00<?, ?it/s]
collecting async responses: 0%| | 0/1046 [00:00<?, ?it/s]
upserted_count: 522931
{'dimension': 384,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 522931}},
 'total_vector_count': 522931}
Query
The dataset comes with a set of prewritten queries that can be used. We view them like so:
Here we define a function that merges the query results with the actual texts of the documents and shows them as a dataframe.
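A minimal sketch of such a merge, using plain dictionaries rather than a dataframe, might look like this. The function name `results_with_text`, the shape of `matches` (a list of dicts with `id` and `score` keys), and the `texts_by_id` lookup are all assumptions for illustration; the real response object and dataset layout may differ.

```python
def results_with_text(matches, texts_by_id):
    """Attach the document text to each match and sort by descending score."""
    rows = [
        {
            "id": match["id"],
            "score": match["score"],
            # Fall back to an empty string if the id is missing from the lookup.
            "text": texts_by_id.get(match["id"], ""),
        }
        for match in matches
    ]
    rows.sort(key=lambda row: row["score"], reverse=True)
    return rows


# Hypothetical query matches and document texts.
matches = [
    {"id": "b", "score": 0.42},
    {"id": "a", "score": 0.87},
    {"id": "z", "score": 0.10},
]
texts_by_id = {
    "a": "How can I teach my kids the alphabet?",
    "b": "What is the best way to teach children to read?",
}

rows = results_with_text(matches, texts_by_id)
for row in rows:
    print(f"{row['score']:.2f}  {row['id']}  {row['text']}")
```

If pandas is available, passing `rows` to `pandas.DataFrame` yields a table with `id`, `score`, and `text` columns.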
We can load a sample query like so:
'How can I teach my kids the alphabet?'
Now we find the similarity scores for the top 5 returned items from the index:
Because we have both dense and sparse vectors in the index, the score above is calculated like so:
alpha * dense_score + (1 - alpha) * sparse_score
The alpha parameter specifies the weighting of the two scores. In the following code, we explore the impact of various alpha values using a sample query.
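One common way to apply this weighting is to scale the query vectors before sending them: the dense vector by alpha and the sparse values by (1 - alpha). The sketch below assumes a dot-product similarity, under which this scaling reproduces the formula above exactly. The names `hybrid_scale` and `hybrid_score`, the `indices`/`values` layout of the sparse vector, and the toy vectors are assumptions for illustration.

```python
def hybrid_scale(dense, sparse, alpha):
    """Scale a dense/sparse query pair so that alpha weights the dense part."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [value * alpha for value in dense]
    scaled_sparse = {
        "indices": list(sparse["indices"]),
        "values": [value * (1 - alpha) for value in sparse["values"]],
    }
    return scaled_dense, scaled_sparse


def dot_dense(a, b):
    """Dot product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def dot_sparse(a, b):
    """Dot product of two sparse vectors in indices/values form."""
    b_lookup = dict(zip(b["indices"], b["values"]))
    return sum(
        value * b_lookup.get(index, 0.0)
        for index, value in zip(a["indices"], a["values"])
    )


def hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha):
    """Dot-product score of a document against an alpha-scaled query."""
    scaled_dense, scaled_sparse = hybrid_scale(q_dense, q_sparse, alpha)
    return dot_dense(scaled_dense, d_dense) + dot_sparse(scaled_sparse, d_sparse)


# Toy query and document (hypothetical values).
q_dense, d_dense = [0.5, 0.5], [1.0, 0.0]
q_sparse = {"indices": [3, 7], "values": [1.0, 2.0]}
d_sparse = {"indices": [7, 9], "values": [0.5, 1.0]}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, hybrid_score(q_dense, q_sparse, d_dense, d_sparse, alpha))
```

With alpha = 0.0 only the sparse term contributes, with alpha = 1.0 only the dense term does, and values in between blend the two.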
Only Sparse (alpha = 0.0)
Hybrid (0 < alpha < 1)
Only Dense (alpha = 1.0)
Once we're done, delete the index to save resources: