Vector Recall Notebook
This notebook accompanies the article "Trading precision for performance: How much recall do you actually lose with quantized vector search?"
The point of the exercise is not to measure the embedding model or BBQ specifically; it's to demonstrate how easily you can measure the recall of your own dataset with minimal setup.
0. Install required Python libraries
NOTE: This notebook was tested with Python 3.12. Older versions may emit warnings but should still work overall.
1. Connect to Elasticsearch
First, we need to establish a secure connection to your Elastic Serverless instance. We will use getpass to handle the API Key securely so it doesn't appear in your history.
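A minimal sketch of that setup is below. The `collect_credentials` helper is hypothetical; its result can be passed straight to the official `elasticsearch` Python client (shown commented out so the sketch stays self-contained):

```python
from getpass import getpass

def collect_credentials(interactive: bool = True) -> dict:
    """Gather connection settings; the API key is read without echoing."""
    if interactive:
        url = input("Enter Elastic URL: ")
        api_key = getpass("Enter Elastic API Key: ")  # hidden from notebook history
    else:
        # Non-interactive placeholder values, e.g. for automated runs
        url, api_key = "https://localhost:9200", "dummy-key"
    return {"hosts": [url], "api_key": api_key}

# from elasticsearch import Elasticsearch
# client = Elasticsearch(**collect_credentials())
# client.info()  # raises if the connection or the API key is invalid
```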
--- Elastic Connection Setup --- Enter Elastic URL: https://chatty-mcchatbot-c0f827.es.us-east-1.aws.elastic.cloud:443 Enter Elastic API Key: ·········· ✅ Connected to Elastic
2. Load and Sample Data
We will use the DBPedia-14 dataset (specifically the "Film" category) to create a realistic test corpus.
To start, we will load a fixed sample of 2,500 movies. This is large enough to test vector search accuracy but small enough to run quickly in this notebook.
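The filter-and-sample step can be sketched as follows. The toy `pool` list stands in for the real `datasets` download, and `sample_films` is a hypothetical helper; the fixed seed makes the 2,500-document corpus reproducible:

```python
import random

FILM_LABEL = 12  # DBPedia-14 "Film" class (zero-indexed)

def sample_films(records, n=2500, seed=42):
    """Filter to the Film class and draw a reproducible fixed-size sample."""
    films = [r for r in records if r["label"] == FILM_LABEL]
    rng = random.Random(seed)  # fixed seed -> same corpus every run
    return rng.sample(films, min(n, len(films)))

# Toy stand-in for the downloaded dataset:
pool = [{"label": 12, "content": f"film {i}"} for i in range(10)] \
     + [{"label": 0, "content": "a company"}]
subset = sample_films(pool, n=5)
```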
Loading DBPedia-14 dataset...
✅ Dataset ready: 2500 movie summaries.
3. Ingestion: creating the "Server-Side Ground Truth"
This is the most critical part of the experiment setup. To measure accuracy, we need a "Gold Standard" to compare against.
We will map the content field twice in the same index:
- `content` (HNSW): This uses the default `semantic_text` settings. It is optimized for speed and scale using the HNSW graph algorithm and likely uses BBQ (Better Binary Quantization). This is the "Test Subject."
- `content.raw` (Flat): We explicitly configure this sub-field to use `type: "flat"`. This forces Elasticsearch to perform a brute-force exact scan of the full float32 vectors. This is the "Control Group" (Ground Truth).
By querying both fields and comparing the results, we can measure exactly how much accuracy we trade for speed.
We will be using the Jina Embedding V5 Text Model for this small test.
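The dual mapping can be sketched roughly as below. Treat the exact syntax as an assumption: `semantic_text` index options vary by Elasticsearch version, and the `content.raw` field layout shown here is illustrative rather than the notebook's literal mapping.

```python
# Hypothetical sketch of the dual mapping; exact semantic_text options
# depend on your Elasticsearch version and configured inference endpoint.
mapping = {
    "properties": {
        # "Test Subject": default settings -> HNSW graph (plus quantization)
        "content": {"type": "semantic_text"},
        # "Control Group": force a brute-force scan over full float32 vectors
        "content.raw": {
            "type": "semantic_text",
            "index_options": {"dense_vector": {"type": "flat"}},
        },
    }
}

# client.indices.create(index="movies-recall-demo", mappings=mapping)
```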
♻️ Refreshing index 'movies-recall-demo'... ✅ Index ready with Flat + HNSW fields. Preparing 2500 documents... 🚀 Indexing documents (Chunk Size = 500)... ✅ Ingestion complete! Indexed 2500 documents. Waiting 10 seconds for inference to catch up...
4. Phase 1: The Baseline Recall Test
Now we run the experiment. We will calculate Recall@10 to measure the performance of the HNSW algorithm.
How this works:
- Select 50 Random Queries: We pick 50 movies from our index to serve as search queries.
- Get the "Perfect" Answer: We query the `content.raw` (Flat) field. Because this uses a brute-force scan of uncompressed vectors, the top 10 results are mathematically guaranteed to be the closest neighbors.
- Get the "Real-World" Answer: We query the `content` (HNSW) field using the exact same text. This uses the fast, approximate graph search (which may use quantization).
- Compare: We count how many documents overlap. If the HNSW search finds 9 of the same movies as the Flat search, the Recall is 0.9 (90%).
Code Walkthrough:
- `test_indices`: Randomly selects ID numbers to use as queries.
- `client.search(..., field="content.raw")`: The slow, exact search (Truth).
- `client.search(..., field="content")`: The fast, approximate search (Test).
- `set(true_ids).intersection(...)`: The math that calculates the score.
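The overlap math boils down to a few lines. This sketch mirrors the walkthrough above, using hypothetical document IDs:

```python
def recall_at_k(true_ids, approx_ids, k=10):
    """Fraction of the exact top-k that the approximate top-k also found."""
    true_top = set(true_ids[:k])
    hits = true_top.intersection(approx_ids[:k])
    return len(hits) / k

# The flat (exact) search returns m1..m10; the HNSW search swaps one out:
truth  = [f"m{i}" for i in range(1, 11)]
approx = [f"m{i}" for i in range(1, 10)] + ["m42"]
score = recall_at_k(truth, approx)  # 9 of 10 overlap -> 0.9
```

Averaging this score over the 50 random queries gives the Recall@10 figure reported below.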
--- Phase 1: Baseline Evaluation --- Running 50 queries... (Comparing 'content' [HNSW] against 'content.raw' [Exact]) --- Phase 1 Results --- Baseline Recall: 0.9720
5. Phase 2: Hybrid Search Check
If the baseline recall is low (e.g., due to heavy quantization), Hybrid Search is often the best fix. By combining the vector score with a standard Keyword (BM25) score, we can often "rescue" relevant documents that the vector graph missed.
We will run the same 50 queries using a bool query that combines semantic and match clauses, and check if it gets closer to the Ground Truth.
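A sketch of such a hybrid query body is below. It assumes a `semantic` clause for the vector side; the exact clause names and whether `match` can target the `semantic_text` field directly depend on your Elasticsearch version.

```python
def hybrid_query(text: str) -> dict:
    """bool query combining semantic (vector) and match (BM25) signals."""
    return {
        "bool": {
            "should": [
                {"semantic": {"field": "content", "query": text}},  # vector side
                {"match": {"content": {"query": text}}},            # keyword side
            ]
        }
    }

# client.search(index="movies-recall-demo", query=hybrid_query("space heist"), size=10)
```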
--- Phase 2: Hybrid Search Evaluation --- (Checking if Keywords help fix any gaps) --- Phase 2 Results --- Baseline Recall: 0.9720 Hybrid Recall: 0.9700 Improvement: -0.2%
6. Bonus: The Scale Test (For the Blog)
This block runs a loop to test how Recall holds up as we increase the dataset size. We will test against: 1,000, 5,000, 10,000, 20,000, and 40,000 documents.
Note: This will take several minutes to run as it deletes, re-creates, and re-indexes the data for every iteration.
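The loop itself can be sketched as below; `index_and_measure` is a hypothetical callback standing in for the delete/re-create/re-index/evaluate work each iteration performs:

```python
def run_scale_test(pool, sizes, index_and_measure):
    """Re-run the recall experiment at increasing corpus sizes.

    `index_and_measure(docs)` is a hypothetical callback that deletes the
    index, re-creates it, indexes `docs`, waits for inference to finish,
    and returns the measured Recall@10.
    """
    results = []
    for n in sizes:
        docs = pool[:n]  # grow the corpus each iteration
        recall = index_and_measure(docs)
        results.append({"Docs": n, "Recall": recall})
    return results

# results = run_scale_test(all_docs, [1000, 5000, 10000, 20000, 40000],
#                          index_and_measure)
```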
🔄 Reloading full dataset for scale testing...
✅ Full pool available: 40000 documents.
🚀 Starting Scale Test on sizes: [1000, 5000, 10000, 20000, 40000]
=== 🧪 Testing Scale: 1000 Documents ===
Indexed 1000 docs. Waiting 15s for inference...
📊 Score for 1000 docs: 1.0000
=== 🧪 Testing Scale: 5000 Documents ===
Indexed 5000 docs. Waiting 15s for inference...
📊 Score for 5000 docs: 0.9980
=== 🧪 Testing Scale: 10000 Documents ===
Indexed 10000 docs. Waiting 15s for inference...
📊 Score for 10000 docs: 0.9920
=== 🧪 Testing Scale: 20000 Documents ===
Indexed 20000 docs. Waiting 15s for inference...
📊 Score for 20000 docs: 0.9960
=== 🧪 Testing Scale: 40000 Documents ===
Indexed 40000 docs. Waiting 15s for inference...
📊 Score for 40000 docs: 0.9920
🏆 --- Scale Test Final Results ---
Docs Recall
0 1000 1.000
1 5000 0.998
2 10000 0.992
3 20000 0.996
4 40000 0.992
7. Visualizing the Scale Test
Now we plot the data we just collected to visualize the stability of our vector search.
What this graph shows:
- X-Axis: The number of documents in the index (effectively a logarithmic scale, stepping from 1k to 40k).
- Y-Axis: The Recall Score (0.0 to 1.0).
- The Blue Line: This represents the accuracy of the approximate HNSW search compared to the perfect Brute Force scan.
What we want to see: We want to see a flat, horizontal line near the top (1.0). This indicates that even as we add more data (scaling up), the search quality does not degrade. A sharp drop-off would indicate that the HNSW graph is struggling to maintain accuracy at scale.
NOTE: Due to random sampling, the chart will differ slightly from run to run.
--- The "Can it scale" Test (100k) ---
We combine 3 categories to get enough data:
- Label 11: Album
- Label 12: Film
- Label 13: Written Work (Books)
🚀 Starting 'Go Big' 100k Scale Test...
1. Reloading and filtering for Albums, Films, and Books...
✅ Combined Pool Size: 120000 documents (Should be ~120k)
=== 🦖 Testing Massive Scale: 1000 Documents ===
Indexed 1000 docs. Waiting 20s for inference...
📊 Score for 1000 docs: 1.0000
=== 🦖 Testing Massive Scale: 10000 Documents ===
Indexed 10000 docs. Waiting 20s for inference...
📊 Score for 10000 docs: 0.9980
=== 🦖 Testing Massive Scale: 50000 Documents ===
Indexed 50000 docs. Waiting 20s for inference...
📊 Score for 50000 docs: 0.9980
=== 🦖 Testing Massive Scale: 100000 Documents ===
Indexed 100000 docs. Waiting 20s for inference...
📊 Score for 100000 docs: 0.9880
🏆 --- Massive Scale Test Results ---
Docs Recall
0 1000 1.000
1 10000 0.998
2 50000 0.998
3 100000 0.988
Let's chart it!
--- The "Throw the whole thing at it" test (560k) ---
We load the entire DBPedia-14 training set.
Total Docs: 560,000
( ) ) ( )
)\ ) ( /( * ) ( /( )\ ) ( /( (
( ( ( ( (()/( )\())` ) /( )\())(()/( )\()) )\ )
)\ )\ )\ )\ /(_))((_)\ ( )(_))((_)\ /(_))((_)\ (()/(
((_) ((_)((_)((_) (_)) __ ((_)(_(_()) _((_)(_)) _((_) /(_))_
| __|\ \ / / | __|| _ \\ \ / /|_ _| | || ||_ _| | \| |(_)) __|
| _| \ V / | _| | / \ V / | | | __ | | | | .` | | (_ |
|___| \_/ |___||_|_\ |_| |_| |_||_||___| |_|\_| \___|
🚀 Starting 'Whole Enchilada' Scale Test...
1. Reloading the ENTIRE dataset...
✅ Total Pool Size: 560,000 documents.
=== 🦖 Testing Mega Scale: 10,000 Documents ===
Uploading 10,000 documents...
✅ Indexed 10,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 10,000 docs: 1.0000
⏱️ Time taken: 150.6s (2.5 min)
=== 🦖 Testing Mega Scale: 50,000 Documents ===
Uploading 50,000 documents...
✅ Indexed 50,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 50,000 docs: 0.9950
⏱️ Time taken: 435.7s (7.3 min)
=== 🦖 Testing Mega Scale: 100,000 Documents ===
Uploading 100,000 documents...
✅ Indexed 100,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 100,000 docs: 0.9970
⏱️ Time taken: 850.5s (14.2 min)
=== 🦖 Testing Mega Scale: 200,000 Documents ===
Uploading 200,000 documents...
✅ Indexed 200,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 200,000 docs: 0.9960
⏱️ Time taken: 1730.0s (28.8 min)
=== 🦖 Testing Mega Scale: 400,000 Documents ===
Uploading 400,000 documents...
✅ Indexed 400,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 400,000 docs: 0.9900
⏱️ Time taken: 2462.6s (41.0 min)
=== 🦖 Testing Mega Scale: 560,000 Documents ===
Uploading 560,000 documents...
✅ Indexed 560,000 docs. Waiting 60s for inference/indexing to settle...
Running 100 queries...
📊 Score for 560,000 docs: 0.9800
⏱️ Time taken: 2509.9s (41.8 min)
🏆 --- Mega Scale Test Results ---
Docs Recall Time (s)
0 10000 1.000 150.6
1 50000 0.995 435.7
2 100000 0.997 850.5
3 200000 0.996 1730.0
4 400000 0.990 2462.6
5 560000 0.980 2509.9
Let's compare!
✅ Mapping prepared for model: .multilingual-e5-small-elasticsearch Target Index: movies-recall-e5
🚀 Starting E5 Scale Test on index: movies-recall-e5 === 🧪 Testing E5 Scale: 10,000 Documents ===
Uploading 10,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 10,000 docs: 0.9930
=== 🧪 Testing E5 Scale: 50,000 Documents ===
Uploading 50,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 50,000 docs: 0.9760
=== 🧪 Testing E5 Scale: 100,000 Documents ===
Uploading 100,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 100,000 docs: 0.9740
=== 🧪 Testing E5 Scale: 200,000 Documents ===
Uploading 200,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 200,000 docs: 0.9530
=== 🧪 Testing E5 Scale: 400,000 Documents ===
Uploading 400,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 400,000 docs: 0.9420
=== 🧪 Testing E5 Scale: 560,000 Documents ===
Uploading 560,000 docs to movies-recall-e5...
✅ Indexed. Waiting 60s for inference...
📊 E5 Score for 560,000 docs: 0.9290
🏆 --- E5 Scale Test Results ---
Docs Recall Model Time
0 10000 0.993 E5 Small 218.730276
1 50000 0.976 E5 Small 375.640476
2 100000 0.974 E5 Small 587.012662
3 200000 0.953 E5 Small 885.498575
4 400000 0.942 E5 Small 2270.705233
5 560000 0.929 E5 Small 2643.094693