Hybrid Retrieval: BM42 + Dense Retrieval

In this notebook, we will see how to create Hybrid Retrieval pipelines, combining BM42 (a new Sparse embedding Retrieval approach) and Dense embedding Retrieval.

We will use the Qdrant Document Store and Fastembed Embedders.

⚠️ Recent evaluations have raised questions about the validity of BM42. Future developments may address these concerns. Please keep this in mind while reviewing the content.

Why BM42?

Qdrant introduced BM42, an algorithm designed to replace BM25 in hybrid RAG pipelines (dense + sparse retrieval).

They found that BM25, while relevant for a long time, has some limitations in common RAG scenarios.

Let's first take a look at BM25 and SPLADE to understand the motivation and the inspiration for BM42.

BM25

\begin{equation} \text{score}(D,Q) = \sum_{i=1}^{N} \text{IDF}(q_i) \times \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)} \end{equation}

BM25 is an evolution of TF-IDF and has two components:

  • Inverse Document Frequency = term importance within a collection
  • a component incorporating Term Frequency = term importance within a document

The Qdrant team observed that the TF component relies on document statistics (term counts and document length), which are only meaningful for longer texts. This is not the case in common RAG pipelines, where documents are short.
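
To make the formula concrete, here is a minimal pure-Python sketch of BM25 scoring. The whitespace tokenizer, the `k1=1.2` and `b=0.75` defaults, the smoothed IDF variant, and the toy documents are all assumptions chosen for illustration; they are not taken from Qdrant or Haystack.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with BM25.

    Assumptions: lowercase whitespace tokenization and the smoothed
    IDF log(1 + (n - df + 0.5) / (df + 0.5)).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


# Toy corpus (hypothetical), one short "chunk" per document.
docs = [
    "the capybara is a giant rodent",
    "the chipmunk is a small rodent that stores food",
    "lions and tigers are big cats",
]
scores = bm25_scores("giant rodent", docs)
```

With chunks this short, almost every term frequency is 0 or 1, so the TF part of the score contributes little beyond the length normalization. That is the limitation described above.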

SPLADE

Another interesting approach is SPLADE, which uses a BERT-based model to create a bag-of-words representation of the text. While it generally performs better than BM25, it has some drawbacks:

  • tokenization issues with out-of-vocabulary words
  • adaptation to new domains requires fine-tuning
  • computationally heavy

To use SPLADE with Haystack, see this notebook.

BM42

\begin{equation} \text{score}(D,Q) = \sum_{i=1}^{N} \text{IDF}(q_i) \times \text{Attention}(\text{CLS}, q_i) \end{equation}

Taking inspiration from SPLADE, the Qdrant team developed BM42 to improve BM25.

IDF works well, so they kept it.

But how to quantify term importance within a document?

The attention matrix of Transformer models comes to our aid: we can use the attention row for the [CLS] token!

To fix tokenization issues, BM42 merges subwords and sums their attention weights.

In their implementation, the Qdrant team used the all-MiniLM-L6-v2 model, but the technique can work with any Transformer, with no fine-tuning needed.
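
The scoring idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the Fastembed implementation: the function names are hypothetical, the tokens and attention weights are made-up toy values standing in for the [CLS] attention row of a real Transformer, the `##` prefix assumes WordPiece-style subwords, and accumulating the attention of repeated words is an assumption.

```python
def merge_subwords(tokens, cls_attention):
    """Merge WordPiece-style subwords ("##" prefix) back into words,
    summing the [CLS] attention weights of their pieces."""
    words, weights = [], []
    for token, weight in zip(tokens, cls_attention):
        if token.startswith("##") and words:
            words[-1] += token[2:]
            weights[-1] += weight
        else:
            words.append(token)
            weights.append(weight)
    return words, weights


def bm42_score(query_terms, doc_words, doc_weights, idf):
    """score(D, Q) = sum over query terms of IDF(q_i) * Attention(CLS, q_i)."""
    attention = {}
    for word, weight in zip(doc_words, doc_weights):
        # Assumption: a word occurring several times accumulates its attention.
        attention[word] = attention.get(word, 0.0) + weight
    return sum(idf.get(term, 0.0) * attention.get(term, 0.0) for term in query_terms)


# Toy values (hypothetical): a real pipeline would read them from the model.
tokens = ["cap", "##yb", "##ara", "is", "a", "rodent"]
cls_attention = [0.20, 0.15, 0.15, 0.05, 0.05, 0.40]
idf = {"capybara": 2.0, "rodent": 1.0}

words, weights = merge_subwords(tokens, cls_attention)
score = bm42_score(["capybara", "rodent"], words, weights, idf)
```

Here the three pieces of "capybara" are merged into one word whose weight is the sum of their attention values, so an out-of-vocabulary word still gets a single, meaningful weight.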

Install dependencies

[ ]

Hybrid Retrieval

Indexing

Create a Qdrant Document Store

[12]

Download Wikipedia pages and create raw documents

We download a few Wikipedia pages about animals and create Haystack documents from them.

[3]

Indexing pipeline

Our indexing pipeline includes both a Sparse Document Embedder (based on BM42) and a Dense Document Embedder.

[13]
[14]
<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33a2f0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])

Let's index our documents!

⚠️ If you are running this notebook on Google Colab, note that Colab only provides 2 CPU cores, so embedding generation with Fastembed may be slower than on a standard machine.

[15]
Calculating sparse embeddings: 100%|██████████| 340/340 [00:27<00:00, 12.52it/s]
Calculating embeddings: 100%|██████████| 340/340 [01:23<00:00,  4.07it/s]
400it [00:00, 1179.66it/s]                         
{'writer': {'documents_written': 340}}
[16]
340

Retrieval

Retrieval pipeline

As already mentioned, BM42 is designed to perform best in Hybrid Retrieval (and Hybrid RAG) pipelines.

Our retrieval pipeline consists of three components:

  • FastembedSparseTextEmbedder: transforms the query into a sparse embedding
  • FastembedTextEmbedder: transforms the query into a dense embedding
  • QdrantHybridRetriever: looks for relevant documents, based on the similarity of both embeddings
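
As a rough illustration of what "similarity of both embeddings" means, the sketch below compares a sparse pair and a dense pair. The `{index: value}` representation, the choice of dot product for sparse vectors and cosine similarity for dense ones, and all numbers are assumptions for illustration; the metric actually used depends on how the Document Store is configured.

```python
import math


def sparse_dot(query, doc):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    # Iterate over the smaller dict; only shared indices contribute.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(value * doc[index] for index, value in query.items() if index in doc)


def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values, not produced by a real model).
sparse_query = {101: 0.8, 2045: 0.3}
sparse_doc = {101: 0.5, 777: 0.9}
dense_query = [1.0, 0.0, 1.0]
dense_doc = [1.0, 1.0, 0.0]

sparse_similarity = sparse_dot(sparse_query, sparse_doc)
dense_similarity = cosine(dense_query, dense_doc)
```

Only index 101 is shared by the two sparse vectors, so it alone contributes to the sparse similarity, whereas every dimension takes part in the dense comparison.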

QdrantHybridRetriever compares the dense and sparse embeddings of the query with those of the documents and retrieves the most relevant documents, merging the two result lists with Reciprocal Rank Fusion.
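
For intuition, Reciprocal Rank Fusion can be sketched as follows. This is the textbook formulation with the commonly used constant `k=60`; the function name, the constant, and the document ids are assumptions, and Qdrant's internal implementation may differ in its details.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a common default, assumed here.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical result lists from the sparse and the dense search.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]
fused_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because only ranks are used, the sparse and dense scores never need to be on the same scale, which is why this kind of fusion is convenient for hybrid retrieval.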

For finer control over the fusion behavior, see the Hybrid Retrieval Pipelines tutorial.

[28]
<haystack.core.pipeline.pipeline.Pipeline object at 0x7fb6bc33ae30>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])

Try the retrieval pipeline

[29]
Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 82.10it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  7.75it/s]
[30]
[31]
Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 71.98it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00,  8.90it/s]
[32]

(Notebook by Stefano Fiorucci)