Introduction to Sparse (Neural) Text Retrieval in Qdrant
Qdrant Essentials: Day 3 - Using Sparse Vectors for Keyword-Based Text Retrieval in Qdrant
To interact with Qdrant, we'll install the Qdrant Python client and Qdrant's lightweight embedding library called FastEmbed.
Step 1: Install the Qdrant Client & FastEmbed
Step 2: Import Required Libraries
Step 3: Connect to Qdrant Cloud
Lexical Retrieval with BM25 in Qdrant
The BM25 formula can be represented as follows:
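As a reference, the standard BM25 scoring function for a query $Q = (q_1, \ldots, q_n)$ and a document $D$ is:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
\frac{f(q_i, D) \cdot (k + 1)}{f(q_i, D) + k \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avg}_{\text{corpus}}|D|}\right)}
```

where $f(q_i, D)$ is the frequency of term $q_i$ in $D$, $|D|$ is the document length, $\mathrm{avg}_{\text{corpus}}|D|$ is the average document length in the corpus, and $k$ and $b$ are tunable parameters controlling term-frequency saturation and length normalization.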
Qdrant provides tooling to compute IDF on the server side. To enable this, we need to activate the IDF modifier when configuring sparse vectors in a collection.
Once enabled, IDF is maintained at the collection level.
When using any retrieval formula that includes IDF, such as BM25, we no longer need to include the IDF component in the sparse document representations. This leaves us with the following values for the documents' words:
The IDF component will be applied by Qdrant automatically when computing similarity scores.
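As an illustration, here is a small pure-Python sketch (not FastEmbed's actual implementation) of the per-term value that remains once IDF is dropped, using the default parameters k = 1.2 and b = 0.75:

```python
def bm25_term_value(tf: float, doc_len: int, avg_doc_len: float,
                    k: float = 1.2, b: float = 0.75) -> float:
    """The per-term BM25 component without IDF: term-frequency saturation
    plus document-length normalization. This is what gets stored in the
    sparse vector; Qdrant multiplies in the IDF at query time."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k + 1.0) / (tf + k * length_norm)

# "Mac and cheese": 2 terms remain after stop-word removal; corpus avg length is 10/3
print(bm25_term_value(tf=1, doc_len=2, avg_doc_len=10 / 3))
# matches the 1.1956521 stored for each of its terms (up to float32 precision)
```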
Step 4: Create a Collection for BM25-based Retrieval
True
Step 5: Create BM25-based Sparse Vectors with FastEmbed & Insert Them into the Collection
The FastEmbed Qdrant library provides a way to generate BM25 formula-based sparse representations tailored for Qdrant specifics.
The integration between Qdrant and FastEmbed allows you to simply pass your texts and BM25 formula parameters when indexing documents to Qdrant. The conversion to sparse vectors happens under the hood.
Update: Since Qdrant's release 1.15.2, the conversion to BM25 sparse vectors happens directly in Qdrant, for all supported Qdrant clients.
Interface-wise, it looks the same as the local inference with FastEmbed, as shown in this notebook.
Note: Don’t forget to enable the IDF modifier when using BM25-based sparse representations generated by FastEmbed, as they intentionally exclude this component.
BM25 in FastEmbed: Implementation Details
Corpus Average Length
Qdrant and FastEmbed do not compute $\mathrm{avg}_{\text{corpus}}|D|$ (the average document length in the corpus). You must estimate and provide this value as a BM25 parameter.
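For example, a simple estimate over the three grocery descriptions used in this notebook (counting whitespace-separated words; the exact counting scheme is up to you):

```python
documents = [
    "Grated hard cheese",
    "White crusty bread roll",
    "Mac and cheese",
]

# Rough estimate: whitespace tokenization, no stop-word handling
avg_doc_len = sum(len(doc.split()) for doc in documents) / len(documents)
print(avg_doc_len)  # 3.3333333333333335
```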
Default BM25 Parameters in FastEmbed
k = 1.2, b = 0.75
Text Processing Pipeline
FastEmbed uses the Snowball stemmer to reduce words to their root or base form, and applies language-specific stop word lists (e.g., and, or in English) to reduce vocabulary size and improve retrieval quality.
Therefore, FastEmbed’s BM25 works out of the box for the following languages:
Arabic, Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, and Turkish.
Average document length: 3.3333333333333335
UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
Here, FastEmbed downloads the Qdrant BM25 model from Hugging Face and performs the conversion to Qdrant-compatible sparse representations (so, arrays of indices and values) under the hood. These vectors are then upserted to Qdrant.
In this example, inference - computing sparse representations with FastEmbed - is performed locally, using Google Colab resources.
Qdrant also offers Cloud Inference for both sparse and dense vectors.
Step 6: Lexical Retrieval with BM25 & Qdrant
Now let's test our BM25-based lexical search in Qdrant.
Suppose we're searching for the word "cheese" — this is our query. Let's break down what happens with this query and the documents indexed to Qdrant in the previous step.
Step 1
For every keyword in the query that is not a stop word in the target language (in our case, English, and "cheese" is not a stop word):
- FastEmbed extracts the stem (root/base form) of the word.
"cheese" -> "chees"
- The stem is then mapped to a corresponding index from the vocabulary.
"chees" -> 1496964506
Step 2
Qdrant looks up this keyword index (1496964506) in the inverted index, introduced in the previous video.
For every document (found via the inverted index) that contains the keyword "cheese", we have the BM25-based score for "cheese" in that particular document, precomputed by FastEmbed and stored in Step 5:
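Conceptually, the inverted index maps a token index to a posting list of (document id, stored weight) pairs, so scoring only touches documents that actually contain the token. A toy pure-Python illustration (Qdrant's real index structure is far more elaborate), using the weights stored in Step 5:

```python
# token index -> [(doc_id, precomputed IDF-free BM25 value), ...]
inverted_index = {
    1496964506: [(0, 1.042654), (2, 1.1956521)],  # stem "chees"
}

def lookup(token_index: int):
    """Return every (doc_id, weight) pair for documents containing the token."""
    return inverted_index.get(token_index, [])

print(lookup(1496964506))  # [(0, 1.042654), (2, 1.1956521)]
```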
Step 3
Qdrant scales this document-specific score by the IDF of the keyword "cheese", calculated across the entire corpus:
Step 4
The final similarity score between the query and a document is the sum of the scores of all matching keywords:
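Steps 2 to 4 can be reproduced in a few lines of pure Python. The function below uses the standard BM25 IDF formula; with N = 3 documents and "cheese" occurring in 2 of them, it reproduces the scores Qdrant returns for this query:

```python
import math

def idf(num_docs: int, docs_with_term: int) -> float:
    """Standard BM25 IDF; applied by Qdrant when the IDF modifier is enabled."""
    return math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

# IDF-free values for the stem "chees", precomputed and stored in Step 5
stored_values = {"Mac and cheese": 1.1956521, "Grated hard cheese": 1.042654}

w = idf(num_docs=3, docs_with_term=2)  # "cheese" occurs in 2 of 3 documents

# Single-keyword query: the sum over matching keywords has just one term
scores = {text: w * value for text, value in stored_values.items()}
print(scores)  # ~0.562 for "Mac and cheese", ~0.490 for "Grated hard cheese"
```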
QueryResponse(points=[ScoredPoint(id=2, version=0, score=0.5619608, payload={'text': 'Mac and cheese'}, vector={'bm25_sparse_vector': SparseVector(indices=[1303191493, 1496964506], values=[1.1956521, 1.1956521])}, shard_key=None, order_value=None), ScoredPoint(id=0, version=0, score=0.49005118, payload={'text': 'Grated hard cheese'}, vector={'bm25_sparse_vector': SparseVector(indices=[862853134, 1277694805, 1496964506], values=[1.042654, 1.042654, 1.042654])}, shard_key=None, order_value=None)])

BM25 retrieved only the documents that contain the keyword "cheese", as BM25-based retrieval works strictly with exact keyword matches.
The description "Mac and cheese" was ranked higher because the BM25-estimated value of "cheese" is greater in this text than in "Grated hard cheese".
It's higher because "and" is a stop word and is excluded from the calculation.
So in "Mac and cheese", "cheese" is one of two considered words, whereas in "Grated hard cheese", it's one of three - giving it lower relative importance.
🎉 Now you know how to use BM25 in Qdrant. This will come in handy when you want to combine the precision and explainability of lexical search with the flexibility and semantic understanding of dense vectors - in a hybrid search scenario.
But before we dive into hybrid search in Qdrant, let’s explore an approach that makes keyword-based retrieval semantically aware: sparse neural retrieval.
Sparse Neural Retrieval with SPLADE++ in Qdrant
Sparse Lexical and Expansion Model (SPLADE) is a family of sparse neural retrievers built on top of Bidirectional Encoder Representations from Transformers (BERT).
These models are intended for retrieval in English, unless fine-tuned or retrained for other languages.
In addition to assigning weights to terms in the input text, SPLADE also expands inputs with contextually relevant terms. This is done to solve the vocabulary mismatch problem, allowing the model to match queries and documents that use different but semantically close terms.
ℹ️ Check out more about SPLADE and its architecture in the "Modern Sparse Neural Retrieval" article.
Step 7: Create a Collection for Sparse Neural Retrieval with SPLADE++
Note that we’re not configuring the Inverse Document Frequency (IDF) modifier here, unlike in BM25-based retrieval. SPLADE models don’t rely on corpus-level statistics like IDF to estimate word relevance. Instead, they generate term weights in sparse representations based on their interactions within the encoded text.
True
Step 8: Create SPLADE++ Sparse Vectors with FastEmbed & Insert Them into the Collection
The FastEmbed library provides SPLADE++, one of the latest models in the SPLADE family.
Update: Since the release of Qdrant Cloud Inference, you can move SPLADE++ embedding inference from local execution (as shown in this notebook) to the cloud, reducing latency and centralizing resource usage.
As a result, this step looks mostly identical to Step 5 of this tutorial. However, under the hood, the process of converting a document to a sparse representation is quite different.
Documents to SPLADE++ Sparse Representations
SPLADE models generate sparse text representations made up of tokens produced by the SPLADE tokenizer.
Tokenizers break text into smaller units called tokens, which form the model's vocabulary. Depending on the tokenizer, these tokens can be words, subwords, or even characters.
SPLADE models operate on a fixed vocabulary of 30,522 tokens.
Text to Tokens
Each document is first tokenized and the resulting tokens are mapped to their corresponding indices in the model’s vocabulary.
These indices are then used in the final sparse representation.
You can explore this process in the Tokenizer Playground by selecting the custom tokenizer and entering Qdrant/Splade_PP_en_v1.
For example, "cheese" is mapped to token index 8808, and "mac" to 6097.
Weighting Tokens
The tokenized text, now represented as token indices, is passed through the SPLADE model.
SPLADE expands the input by adding contextually relevant tokens and simultaneously assigns each token in the final sparse representation a weight that reflects its role in the text.
SPLADE++ Document Expansion
For example, "mac and cheese" will be expanded to: "mac and cheese dairy apple dish & variety brand food made , foods difference eat restaurant or", resulting in a SPLADE-generated sparse representation with 17 non-zero values.
If you’d like to experiment with SPLADE's expansion behavior, check out our documentation on using SPLADE in FastEmbed. It includes a utility function to decode SPLADE++ sparse representations back into tokens with their corresponding weights.
UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)
Step 9: Sparse Neural Retrieval with SPLADE++ & Qdrant
Conversion of a query to a sparse representation by SPLADE++ works exactly the same way as for documents.
Let’s see what sparse neural retrieval brings to the table compared to BM25-based lexical retrieval.
We'll test a query where the meaning of the keyword depends heavily on context: "a not soft cheese". In our toy dataset of grocery item descriptions, the most fitting result should be "grated hard cheese".
QueryResponse(points=[ScoredPoint(id=0, version=0, score=15.463482, payload={'text': 'Grated hard cheese'}, vector={'splade_sparse_vector': SparseVector(indices=[1010, 2081, 2524, 2828, 3067, 3528, 4383, 4435, 6211, 8808, 9841, 11825, 21774, 24665], values=[0.51333773, 0.26326102, 2.4472122, 0.30004033, 0.12786251, 0.79050905, 2.211527, 0.31894565, 1.1508049, 2.51135, 0.5997427, 0.99693406, 0.12898666, 2.4744098])}, shard_key=None, order_value=None), ScoredPoint(id=2, version=0, score=13.2577, payload={'text': 'Mac and cheese'}, vector={'splade_sparse_vector': SparseVector(indices=[1004, 1010, 1998, 2030, 2081, 2833, 3528, 4435, 4489, 4521, 4825, 6097, 6207, 8808, 9440, 9841, 11825], values=[0.7900756, 0.3015864, 1.7403926, 0.010728066, 0.33394834, 0.40943024, 0.48172736, 0.4619382, 0.11402316, 0.07198032, 0.055049647, 3.0968075, 1.1688995, 3.042247, 0.1626582, 0.94423723, 1.5706416])}, shard_key=None, order_value=None), ScoredPoint(id=1, version=0, score=2.1539285, payload={'text': 'White crusty bread roll'}, vector={'splade_sparse_vector': SparseVector(indices=[1010, 2100, 2317, 2806, 2828, 2833, 3528, 3609, 4435, 4897, 5291, 5660, 6243, 7852, 8808, 9372, 9841, 12461, 14704, 19116], values=[0.07693227, 1.1669595, 2.280204, 0.06772788, 0.56041586, 0.461585, 0.36701784, 0.6813059, 0.1603209, 2.2790196, 0.11719603, 0.2420436, 0.7851173, 2.2026603, 0.28555784, 1.0245491, 0.94225013, 0.8348646, 0.17922312, 2.32752])}, shard_key=None, order_value=None)])

Yet for BM25, "mac and cheese" would be ranked higher, since "cheese", the only matching keyword between the query and the documents, plays a more prominent role in that description compared to "grated hard cheese", as we saw in Step 6.
QueryResponse(points=[ScoredPoint(id=2, version=0, score=0.5619608, payload={'text': 'Mac and cheese'}, vector={'bm25_sparse_vector': SparseVector(indices=[1303191493, 1496964506], values=[1.1956521, 1.1956521])}, shard_key=None, order_value=None), ScoredPoint(id=0, version=0, score=0.49005118, payload={'text': 'Grated hard cheese'}, vector={'bm25_sparse_vector': SparseVector(indices=[862853134, 1277694805, 1496964506], values=[1.042654, 1.042654, 1.042654])}, shard_key=None, order_value=None)])

Role of SPLADE++ Document & Query Expansion in Vocabulary Mismatch
You may have noticed that SPLADE++ returned a non-zero similarity score between the query "A not soft cheese" and the document "White crusty bread roll", even though they have no overlapping keywords.
This happened due to SPLADE++’s internal expansion mechanism.
SPLADE expands both documents and queries.
Let’s now see SPLADE++ in action solving the vocabulary mismatch problem.
QueryResponse(points=[ScoredPoint(id=0, version=0, score=0.7621683, payload={'text': 'Grated hard cheese'}, vector={'splade_sparse_vector': SparseVector(indices=[1010, 2081, 2524, 2828, 3067, 3528, 4383, 4435, 6211, 8808, 9841, 11825, 21774, 24665], values=[0.51333773, 0.26326102, 2.4472122, 0.30004033, 0.12786251, 0.79050905, 2.211527, 0.31894565, 1.1508049, 2.51135, 0.5997427, 0.99693406, 0.12898666, 2.4744098])}, shard_key=None, order_value=None), ScoredPoint(id=2, version=0, score=0.5772799, payload={'text': 'Mac and cheese'}, vector={'splade_sparse_vector': SparseVector(indices=[1004, 1010, 1998, 2030, 2081, 2833, 3528, 4435, 4489, 4521, 4825, 6097, 6207, 8808, 9440, 9841, 11825], values=[0.7900756, 0.3015864, 1.7403926, 0.010728066, 0.33394834, 0.40943024, 0.48172736, 0.4619382, 0.11402316, 0.07198032, 0.055049647, 3.0968075, 1.1688995, 3.042247, 0.1626582, 0.94423723, 1.5706416])}, shard_key=None, order_value=None), ScoredPoint(id=1, version=0, score=0.3388096, payload={'text': 'White crusty bread roll'}, vector={'splade_sparse_vector': SparseVector(indices=[1010, 2100, 2317, 2806, 2828, 2833, 3528, 3609, 4435, 4897, 5291, 5660, 6243, 7852, 8808, 9372, 9841, 12461, 14704, 19116], values=[0.07693227, 1.1669595, 2.280204, 0.06772788, 0.56041586, 0.461585, 0.36701784, 0.6813059, 0.1603209, 2.2790196, 0.11719603, 0.2420436, 0.7851173, 2.2026603, 0.28555784, 1.0245491, 0.94225013, 0.8348646, 0.17922312, 2.32752])}, shard_key=None, order_value=None)])

SPLADE expands the query "parmesan" with 10+ additional tokens, making it possible to match and rank the (also expanded at indexing time) "grated hard cheese" as the top hit, even though "parmesan" doesn’t appear in any document in our dataset.
Qdrant's Sparse Neural Retrievers
We’ve been exploring sparse neural retrieval as a promising approach for domains where keyword-based matching is useful, but traditional methods like BM25 fall short due to their lack of semantic understanding.
Our goal is to push this field forward by fostering adoption of lightweight, explainable, and practical sparse neural retrievers.
To support this, we’ve developed and open-sourced two custom sparse neural retrievers, both built on top of the BM25 formula.
You can find all the details in the following articles: BM42 Sparse Neural Retriever and miniCOIL Sparse Neural Retriever.
Both models can be used with FastEmbed and Qdrant in the same way we demonstrated with BM25 and SPLADE++ in this tutorial.
- FastEmbed handle for BM42: Qdrant/bm42-all-minilm-l6-v2-attentions
- FastEmbed handle for miniCOIL: Qdrant/minicoil-v1
You can check all sparse retrievers supported in FastEmbed using:

```python
from fastembed import SparseTextEmbedding

SparseTextEmbedding.list_supported_models()
```
🚀 We encourage you to experiment and find the sparse retriever that fits your data best!
Congratulations! You’re now well-equipped with everything you need to know about sparse text retrieval in Qdrant.
This approach shines in domains where exact keyword matches are critical, like e-commerce, medical, legal, and many more.
However, sparse (even neural) retrieval has its limits. It’s less ideal when semantically similar content is expressed in entirely different ways.
Models like SPLADE++ try to close that semantic gap, but doing so makes their representations less sparse and harder to interpret. After all, why is "apple" related to "mac and cheese"? 🤔
That’s where dense retrieval comes in, great for discovery and bridging the vocabulary mismatch natively.
So now we have:
- Sparse retrieval: precise, lightweight, and explainable
- Dense retrieval: flexible and great for exploration
Why not combine both? In the next videos, we’ll show you how to do it with Hybrid Search.