Sparse Embedding Retrieval with Qdrant and FastEmbed
In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.
We will use the Qdrant Document Store and FastEmbed Sparse Embedders.
Why SPLADE?
- Sparse Keyword-Based Retrieval (based on BM25 or similar algorithms) is simple and fast and requires few resources, but it relies on lexical matching and struggles to capture semantic meaning.
- Dense Embedding-Based Retrieval takes semantics into account, but requires considerable computational resources, usually does not work well in novel domains, and does not consider precise wording.
While good results can be achieved by combining the two approaches (tutorial), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a new method that encapsulates the positive aspects of both techniques. In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and perform automatic term expansions, reducing the vocabulary mismatch problem (queries and relevant documents often lack term overlap).
Main features:
- Better than dense embedding Retrievers on precise keyword matching
- Better than BM25 on semantic matching
- Slower than BM25
- Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores
Resources
Install dependencies
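The dependencies can be installed with pip. The package names below are an assumption based on how these integrations are usually published (Haystack core, the Qdrant and FastEmbed integrations, and the wikipedia client used later):

```shell
pip install -U haystack-ai qdrant-haystack fastembed-haystack wikipedia
```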
Sparse Embedding Retrieval
Indexing
Create a Qdrant Document Store
Download Wikipedia pages and create raw documents
We download a few Wikipedia pages about animals and create Haystack documents from them.
Initialize a FastembedSparseDocumentEmbedder
The FastembedSparseDocumentEmbedder enriches a list of documents with their sparse embeddings.
We are using prithvida/Splade_PP_en_v1, a good sparse embedding model with a permissive license.
We also want to embed the title of the document, because it contains relevant information.
For more customization options, refer to the docs.
{'documents': [Document(id=cd69a8e89f3c179f243c483a337c5ecb178c58373a253e461a64545b669de12d, content: 'An example document', sparse_embedding: vector with 19 non-zero elements)]}
Indexing pipeline
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21068632e0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> writer.documents (List[Document])
Let's index our documents!
⚠️ If you are running this notebook on Google Colab, note that Colab provides only 2 CPU cores, so sparse embedding generation may be slower than on a standard machine.
{'writer': {'documents_written': 152}}
152
Retrieval
Retrieval pipeline
Now, we create a simple retrieval Pipeline:
- FastembedSparseTextEmbedder: transforms the query into a sparse embedding
- QdrantSparseEmbeddingRetriever: looks for relevant documents, based on the similarity of the sparse embeddings
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21067cf3d0>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - sparse_retriever: QdrantSparseEmbeddingRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> sparse_retriever.query_sparse_embedding (SparseEmbedding)
Try the retrieval pipeline
Understanding SPLADE vectors
(Inspiration: FastEmbed SPLADE notebook)
We have seen that our model encodes text into a sparse vector (a vector in which most elements are zero). An efficient way to represent a sparse vector is to store only the indices and values of its nonzero elements.
Let's try to understand what information resides in these vectors...
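A toy illustration of the idea: the vocabulary, indices, and weights below are made up, but the mechanics are the same as in the notebook, where the model's real tokenizer maps indices back to tokens:

```python
# A sparse vector as parallel lists of indices and values; mapping the indices
# back through the (made-up) vocabulary shows which tokens the model weighted.
vocab = {2173: "location", 4761: "habitat", 9986: "capybara"}
indices = [2173, 4761, 9986]
values = [0.62, 0.74, 1.31]

# Sort tokens by weight, highest first.
token_weights = sorted(
    ((vocab[i], v) for i, v in zip(indices, values)),
    key=lambda kv: kv[1],
    reverse=True,
)
print(token_weights)
```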
Very nice! 🦫
- tokens are ordered by relevance
- the query is expanded with relevant tokens/terms: "location", "habitat"...
Hybrid Retrieval
In principle, techniques like SPLADE are intended to replace the other approaches (BM25 and Dense Embedding Retrieval) and their combinations.
However, sometimes it may make sense to combine, for example, Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of this paper (An Analysis of Fusion Functions for Hybrid Retrieval). Make sure this works for your use case and conduct an evaluation.
Below we show how to create such an application in Haystack.
In the example, we use the Qdrant Hybrid Retriever: it compares dense and sparse query and document embeddings and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.
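Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. This is a toy illustration with hypothetical document ids and the conventional k=60, not Qdrant's implementation:

```python
# Each ranking contributes 1 / (k + rank) to a document's fused score,
# so documents ranked well in both lists rise to the top.
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense results
sparse_ranking = ["doc_b", "doc_c", "doc_d"]  # hypothetical sparse results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # doc_b and doc_c appear in both rankings and move to the top
```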
If you want to customize the behavior more, see Hybrid Retrieval Pipelines (tutorial).
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8292170>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])
{'writer': {'documents_written': 152}}

Document(id=5e2d65ac05a8a238b359773c3d855e026aca6e617df8a011964b401d8b242a1e, content: ' Overall, they tend to be dwarfed by other Cetartiodactyls. Several species have female-biased sexua...', meta: {'title': 'Dolphin', 'url': 'https://en.wikipedia.org/wiki/Dolphin', 'source_id': '6584a10fad50d363f203669ff6efc19e7ae2a5a28ca9351f5cceb5ba88f8e847'}, embedding: vector of size 384, sparse_embedding: vector with 129 non-zero elements)

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8293190>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])
📚 Docs on Sparse Embedding support in Haystack
(Notebook by Stefano Fiorucci)