Sparse Embedding Retrieval with Qdrant and FastEmbed
In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.
We will use the Qdrant Document Store and FastEmbed Sparse Embedders.
Why SPLADE?
- Sparse Keyword-Based Retrieval (based on BM25 or similar algorithms) is simple and fast and requires few resources, but it relies on lexical matching and struggles to capture semantic meaning.
- Dense Embedding-Based Retrieval takes semantics into account, but requires considerable computational resources, usually does not work well in novel domains, and does not consider precise wording.
While good results can be achieved by combining the two approaches (tutorial), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a new method that encapsulates the positive aspects of both techniques. In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and perform automatic term expansions, reducing the vocabulary mismatch problem (queries and relevant documents often lack term overlap).
Main features:
- Better than dense embedding Retrievers on precise keyword matching
- Better than BM25 on semantic matching
- Slower than BM25
- Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores
Resources
Install dependencies
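The dependencies can be installed with pip. The package names below are an assumption based on how these integrations are usually published (Haystack core, the Qdrant and FastEmbed integrations, and the wikipedia client used later):

```shell
pip install -U haystack-ai qdrant-haystack fastembed-haystack wikipedia
```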
Sparse Embedding Retrieval
Indexing
Create a Qdrant Document Store
Download Wikipedia pages and create raw documents
We download a few Wikipedia pages about animals and create Haystack documents from them.
Initialize a FastembedSparseDocumentEmbedder
The FastembedSparseDocumentEmbedder enriches a list of documents with their sparse embeddings.
We are using prithvida/Splade_PP_en_v1, a good sparse embedding model with a permissive license.
We also want to embed the title of the document, because it contains relevant information.
For more customization options, refer to the docs.
{'documents': [Document(id=cd69a8e89f3c179f243c483a337c5ecb178c58373a253e461a64545b669de12d, content: 'An example document', sparse_embedding: vector with 19 non-zero elements)]}
Indexing pipeline
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21068632e0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> writer.documents (List[Document])
Let's index our documents!
⚠️ If you are running this notebook on Google Colab, note that Colab provides only 2 CPU cores, so sparse embedding generation may be slower than on a standard machine.
{'writer': {'documents_written': 152}}
152
Retrieval
Retrieval pipeline
Now, we create a simple retrieval Pipeline:
- FastembedSparseTextEmbedder: transforms the query into a sparse embedding
- QdrantSparseEmbeddingRetriever: looks for relevant documents, based on the similarity of the sparse embeddings
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21067cf3d0>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - sparse_retriever: QdrantSparseEmbeddingRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> sparse_retriever.query_sparse_embedding (SparseEmbedding)
Try the retrieval pipeline
Understanding SPLADE vectors
(Inspiration: FastEmbed SPLADE notebook)
We have seen that our model encodes text into a sparse vector (a vector in which most elements are zero). An efficient way to represent a sparse vector is to store only the indices and values of its nonzero elements.
Let's try to understand what information resides in these vectors...
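A toy illustration of the idea: the vocabulary, indices, and weights below are made up, but the mechanics are the same as in the notebook, where the model's real tokenizer maps indices back to tokens:

```python
# A sparse vector as parallel lists of indices and values; mapping the indices
# back through the (made-up) vocabulary shows which tokens the model weighted.
vocab = {2173: "location", 4761: "habitat", 9986: "capybara"}
indices = [2173, 4761, 9986]
values = [0.62, 0.74, 1.31]

# Sort tokens by weight, highest first.
token_weights = sorted(
    ((vocab[i], v) for i, v in zip(indices, values)),
    key=lambda kv: kv[1],
    reverse=True,
)
print(token_weights)
```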
Very nice! 🦫
- tokens are ordered by relevance
- the query is expanded with relevant tokens/terms: "location", "habitat"...
Hybrid Retrieval
In principle, techniques like SPLADE are intended to replace the other approaches (BM25 and Dense Embedding Retrieval) and their combinations.
However, sometimes it may make sense to combine, for example, Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of this paper (An Analysis of Fusion Functions for Hybrid Retrieval). Make sure this works for your use case and conduct an evaluation.
Below we show how to create such an application in Haystack.
In the example, we use the Qdrant Hybrid Retriever: it compares dense and sparse query and document embeddings and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.
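Reciprocal Rank Fusion itself is simple enough to sketch in a few lines. This is a toy illustration with hypothetical document ids and the conventional k=60, not Qdrant's implementation:

```python
# Each ranking contributes 1 / (k + rank) to a document's fused score,
# so documents ranked well in both lists rise to the top.
def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense results
sparse_ranking = ["doc_b", "doc_c", "doc_d"]  # hypothetical sparse results
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # doc_b and doc_c appear in both rankings and move to the top
```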
If you want to customize the behavior more, see Hybrid Retrieval Pipelines (tutorial).
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8292170>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])
{'writer': {'documents_written': 152}}

Document(id=5e2d65ac05a8a238b359773c3d855e026aca6e617df8a011964b401d8b242a1e, content: ' Overall, they tend to be dwarfed by other Cetartiodactyls. Several species have female-biased sexua...', meta: {'title': 'Dolphin', 'url': 'https://en.wikipedia.org/wiki/Dolphin', 'source_id': '6584a10fad50d363f203669ff6efc19e7ae2a5a28ca9351f5cceb5ba88f8e847'}, embedding: vector of size 384, sparse_embedding: vector with 129 non-zero elements)

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8293190>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])
📚 Docs on Sparse Embedding support in Haystack
(Notebook by Stefano Fiorucci)