Advanced Evaluation of Quantized Vectors: Cohere, MongoDB, and BEIR Integration
You can view an article that explains concepts in this notebook:
What to Expect
In this notebook, we will conduct an advanced evaluation of quantized vectors using Cohere's embedding models, MongoDB for vector storage and search, and the BEIR (Benchmarking IR) framework for performance assessment. Here's what you can expect to learn and explore:
- Vector Quantization: Understand the process and benefits of vector quantization in the context of embedding storage and retrieval.
- Cohere Integration: Learn how to generate embeddings using Cohere's state-of-the-art models, including both float32 and quantized int8 versions.
- MongoDB Vector Storage: Explore efficient ways to store and index vector embeddings in MongoDB, including BSON encoding for optimized storage.
- Vector Search Implementation: Implement and compare vector search capabilities using different vector representations (float32, BSON float32, and BSON int8).
- BEIR Framework Usage: Utilize the BEIR framework to evaluate the performance of our vector search implementations across various information retrieval metrics.
- Performance Comparison: Analyze and visualize the performance differences between non-quantized and quantized vector representations in terms of:
  - Storage efficiency
  - Search accuracy
  - Retrieval speed
- Advanced Metrics: Dive deep into advanced IR metrics such as NDCG, MAP, Recall, and Precision at different cut-off points.
- Visualization: Create informative plots to compare the performance of different vector representations across multiple metrics.
- Practical Insights: Gain insights into the trade-offs between storage efficiency and search performance, helping you make informed decisions for your own vector search applications.
By the end of this notebook, you will have a comprehensive understanding of how vector quantization affects the performance of embedding-based search systems, and you'll be equipped with the knowledge to implement and evaluate such systems using industry-standard tools and frameworks.
Step 1: Load the BEIR scifact dataset
('4983', {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 versus 1.1 microm2/ms). Relative anisotropy was higher the closer birth was to term with greater absolute values in the internal capsule than in the central white matter. Preterm infants at term showed higher mean diffusion coefficients in the central white matter (1.4 +/- 0.24 versus 1.15 +/- 0.09 microm2/ms, p = 0.016) and lower relative anisotropy in both areas compared with full-term infants (white matter, 10.9 +/- 0.6 versus 22.9 +/- 3.0%, p = 0.001; internal capsule, 24.0 +/- 4.44 versus 33.1 +/- 0.6% p = 0.006). Nonmyelinated fibers in the corpus callosum were visible by diffusion tensor MRI as early as 28 wk; full-term and preterm infants at term showed marked differences in white matter fiber organization. The data indicate that quantitative assessment of water diffusion by diffusion tensor MRI provides insight into microstructural development in cerebral white matter in living infants.', 'title': 'Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.'})
('1', '0-dimensional biomaterials show inductive properties.')
('1', {'31715818': 1})
Corpus: The corpus is a dictionary where each key is a document ID, and the value is another dictionary containing the document's text and title. For example:
'4983': {
'text': 'Alterations of the architecture of cerebral white matter...',
'title': 'Microstructural development of human newborn cerebral white matter...'
}
This corresponds to the scientific abstracts in our earlier example.
Queries: The queries dictionary contains the scientific claims, where the key is a query ID and the value is the claim text. For example:
'1': '0-dimensional biomaterials show inductive properties.'
Qrels: The qrels (query relevance) dictionary contains the ground truth relevance judgments. It's structured as a nested dictionary where the outer key is the query ID, the inner key is a document ID, and the value is the relevance score (typically 1 for relevant, 0 for non-relevant). For example:
'1': {'31715818': 1}
This indicates that for query '1', the document with ID '31715818' is relevant.
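As a minimal sketch of how these three structures fit together, the snippet below rebuilds tiny versions of them from the examples above and looks up the documents judged relevant for a query. The helper name `relevant_doc_ids` is hypothetical and not part of BEIR.

```python
# Tiny stand-ins for the structures described above, copied from the
# examples in this notebook (text truncated exactly as shown there).
corpus = {
    "4983": {
        "text": "Alterations of the architecture of cerebral white matter...",
        "title": "Microstructural development of human newborn cerebral white matter...",
    }
}
queries = {"1": "0-dimensional biomaterials show inductive properties."}
qrels = {"1": {"31715818": 1}}


def relevant_doc_ids(qrels, query_id):
    """Return the IDs of documents judged relevant (score > 0) for a query."""
    judgments = qrels.get(query_id, {})
    return [doc_id for doc_id, score in judgments.items() if score > 0]


for query_id, claim in queries.items():
    print(query_id, claim, relevant_doc_ids(qrels, query_id))
```

A query with no judgments simply yields an empty list, which is the behaviour an evaluation loop usually wants.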
Step 2: Generate float32 and int8 embeddings using Cohere
Generating embeddings: 100%|██████████| 5183/5183 [22:16<00:00, 3.88it/s]
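The int8 embeddings in this notebook are produced by Cohere alongside the float32 ones. To illustrate what quantization does to a vector in general, here is a sketch of simple symmetric scalar quantization. It is an illustrative stand-in and not Cohere's actual quantization scheme; the function names and the sample vector are hypothetical.

```python
import math


def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] using one scale per vector."""
    max_abs = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    scale = max_abs / 127.0
    return [round(x / scale) for x in vector], scale


def dequantize(codes, scale):
    """Approximate the original floats from the integer codes."""
    return [code * scale for code in codes]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vector = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up values for illustration
codes, scale = quantize_int8(vector)
print(codes)
print(cosine(vector, dequantize(codes, scale)))
```

Each value now fits in one byte instead of four, at the cost of a small rounding error that leaves the direction of the vector almost unchanged.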
Step 3: Generate BSON representations of the float32 and int8 embeddings
Step 4: Ingest float32, bsonfloat32 and bsonint8 into separate collections
Connection to MongoDB successful
DeleteResult({'n': 10366, 'electionId': ObjectId('7fffffff0000000000000033'), 'opTime': {'ts': Timestamp(1727642864, 1867), 't': 51}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1727642864, 1867), 'signature': {'hash': b"\xa7'$-\x1dj\xc4\x86=0\xf46)\t>\x010.\xa7\xa7", 'keyId': 7353740577831124994}}, 'operationTime': Timestamp(1727642864, 1867)}, acknowledged=True)
Preparing documents: 100%|██████████| 5183/5183 [00:00<00:00, 12963.83it/s]
Inserting documents into collections...
Inserted 5183 documents into float32_embeddings
Inserted 5183 documents into bson_float32_embeddings
Inserted 5183 documents into bson_int8_embeddings
Data ingestion complete.
Step 5: Create vector indexes for all collections
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
'vector_index'
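For reference, a vector index in Atlas is described by a small definition document. The sketch below only builds that definition. The field path `embedding`, the dimension of 1024, and the `cosine` similarity are assumptions for illustration, since the notebook output above does not show them; only the index name `vector_index` comes from the output.

```python
def vector_index_definition(path="embedding", num_dimensions=1024, similarity="cosine"):
    """Build an Atlas Vector Search index definition for one vector field.

    All default values are assumptions, not values confirmed by this notebook.
    """
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition()
print(definition)
```

With PyMongo, a definition like this would typically be wrapped in a `SearchIndexModel` (with `name="vector_index"` and `type="vectorSearch"`) and passed to `collection.create_search_index`, once per collection.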
Step 6: Create generic vector search function
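A generic search function can be thought of as building one aggregation pipeline that works for any of the three collections. The sketch below builds such a pipeline with the `$vectorSearch` stage and projects the `title` and `similarityScore` fields that appear in the search output of this notebook. The function name, the field path `embedding`, and the `num_candidates` default are assumptions, not values taken from the notebook.

```python
def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="embedding", limit=5, num_candidates=50):
    """Return an aggregation pipeline for an Atlas Vector Search query.

    `path` and `num_candidates` are illustrative assumptions.
    """
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "title": 1,
                # Expose the search score under the name used in the output.
                "similarityScore": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3])
print(pipeline[0]["$vectorSearch"]["limit"])
```

Running it against a collection would then be a matter of `list(collection.aggregate(pipeline))`, with the query vector encoded in the representation that matches the collection being searched.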
Step 7: Compare vector search results across collections (search results, retrieval latency, and storage size)
Float32 Vector Search Result:
[{'similarityScore': 0.6950235366821289,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'similarityScore': 0.6885790824890137,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'similarityScore': 0.6674723029136658,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'similarityScore': 0.6613469123840332,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'similarityScore': 0.6603913307189941,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
BSON Float32 Vector Search Result:
[{'similarityScore': 0.6950235366821289,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'similarityScore': 0.6885790824890137,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'similarityScore': 0.6674723029136658,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'similarityScore': 0.6613469123840332,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'similarityScore': 0.6603913307189941,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
BSON Int8 Vector Search Result:
[{'similarityScore': 0.6947944164276123,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'similarityScore': 0.6891517639160156,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'similarityScore': 0.6680400371551514,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'similarityScore': 0.6613667607307434,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'similarityScore': 0.6589280366897583,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
Based on the provided vector search results for Float32, BSON Float32, and BSON Int8 encodings, we can conclude:
- Consistency across encodings: All three encoding methods (Float32, BSON Float32, and BSON Int8) returned the same top 5 documents in the same order, indicating a high level of consistency in search results regardless of the encoding method used.
- Similarity in scores: The similarity scores for each document are remarkably close across all three encoding methods, with only minor variations in the BSON Int8 results.
- Precision preservation: The BSON Float32 results are identical to the standard Float32 results, suggesting that BSON encoding of float32 values preserves full precision.
- Minimal impact of quantization: The BSON Int8 (quantized) results show only slight differences in similarity scores compared to the float32 versions. The largest difference is in the fifth result, with a score of 0.6589 vs 0.6604, a difference of only about 0.22%.
- Ranking stability: Despite the minor differences in similarity scores for the BSON Int8 encoding, the ranking of results remains unchanged, indicating that the quantization process maintains the relative similarities between documents.
For this sample query, both BSON encoding and quantization to int8 maintain high fidelity in vector search operations.
The consistency across all three methods suggests that developers can use BSON encoding, and even int8 quantization, to significantly reduce storage requirements (quantified in the storage analysis later in this step) with little or no loss in the quality of vector search results. Step 8 checks this across the BEIR query set.
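The score comparison above can be reproduced with a few lines of arithmetic. This sketch copies the float32 and int8 similarity scores from the search output and reports the relative difference at each rank; the variable and function names are illustrative.

```python
# Similarity scores copied from the Float32 and BSON Int8 results above.
float32_scores = [
    0.6950235366821289,
    0.6885790824890137,
    0.6674723029136658,
    0.6613469123840332,
    0.6603913307189941,
]
int8_scores = [
    0.6947944164276123,
    0.6891517639160156,
    0.6680400371551514,
    0.6613667607307434,
    0.6589280366897583,
]


def relative_differences(reference, other):
    """Percentage difference of each score relative to the reference score."""
    return [abs(r - o) / r * 100 for r, o in zip(reference, other)]


diffs = relative_differences(float32_scores, int8_scores)
worst_rank = max(range(len(diffs)), key=diffs.__getitem__) + 1
for rank, diff in enumerate(diffs, start=1):
    print(f"rank {rank}: {diff:.4f}%")
print(f"largest difference at rank {worst_rank}")
```

The largest gap is at the fifth result, at roughly 0.22%, in line with the figure quoted above.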
Float32 search time: 16.684341 milliseconds
BSON Float32 search time: 12.182835 milliseconds
BSON Int8 search time: 0.030998 milliseconds
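Single-run timings such as these can vary with caching and network conditions. One way to make the comparison more robust, sketched here under the assumption that each search is wrapped in a zero-argument callable (the hypothetical `run_query`), is to repeat the call and report the median.

```python
import statistics
import time


def median_latency_ms(run_query, repeats=20):
    """Call run_query several times and return the median latency in milliseconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; a real measurement would wrap the vector search call.
print(median_latency_ms(lambda: sum(range(10_000)), repeats=5))
```

If the callable returns a lazy iterator, it should consume the results (for example with `list(...)`) so that the measured time covers retrieval as well.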
Size information for float32 collection:
Total Size: 221.22 MB
Storage Size: 220.55 MB
Index Size: 0.67 MB
Document Size: 73.09 MB
Average Document Size: 14787.00 bytes
Number of Documents: 5183

Size information for bsonfloat32 collection:
Total Size: 136.86 MB
Storage Size: 136.18 MB
Index Size: 0.68 MB
Document Size: 27.97 MB
Average Document Size: 5659.00 bytes
Number of Documents: 5183

Size information for bsonint8 collection:
Total Size: 53.31 MB
Storage Size: 52.65 MB
Index Size: 0.66 MB
Document Size: 12.79 MB
Average Document Size: 2587.00 bytes
Number of Documents: 5183
Based on the results above, we can conclude the following about the storage footprint of the different embedding types:
The use of BSON encoding and quantization significantly reduces storage requirements for vector embeddings in MongoDB collections while maintaining the same number of documents. The float32 collection requires the most storage, followed by the BSON-encoded float32 collection, with the BSON-encoded int8 (quantized) collection being the most storage-efficient.
Specifically:
- The BSON float32 encoding reduces the total size by approximately 38% compared to the standard float32 collection (136.86 MB vs 221.22 MB).
- The BSON int8 (quantized) encoding reduces the total size by about 76% compared to the standard float32 collection (53.31 MB vs 221.22 MB), and by about 61% compared to the BSON float32 collection (53.31 MB vs 136.86 MB).
- The average document size follows a similar pattern, with BSON int8 documents being about 82% smaller than standard float32 documents (2587 bytes vs 14787 bytes).
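The differences in average document size can be checked with a little arithmetic on the BSON layout. The sketch below assumes 1024-dimensional embeddings (the dimension is not stated above), that the plain float32 collection stores each value as an 8-byte BSON double inside an array, and that a BSON vector is a binary value with a 2-byte header followed by packed values. The function names are hypothetical. Under those assumptions the computed differences match the reported averages: 14787 - 5659 = 9128 bytes and 5659 - 2587 = 3072 bytes.

```python
def bson_array_of_doubles_size(n):
    """Bytes used by a BSON array holding n doubles.

    Layout: int32 length + n elements + trailing NUL byte. Each element is a
    type byte, the index as a NUL-terminated string key, and an 8-byte double.
    """
    elements = sum(1 + len(str(i)) + 1 + 8 for i in range(n))
    return 4 + elements + 1


def bson_vector_size(n, bytes_per_value):
    """Bytes used by a BSON binary vector holding n packed values.

    Layout: int32 length + subtype byte + 2-byte vector header + packed values.
    """
    return 4 + 1 + 2 + n * bytes_per_value


dims = 1024  # assumption: not stated in the notebook output
array_size = bson_array_of_doubles_size(dims)
float32_size = bson_vector_size(dims, 4)
int8_size = bson_vector_size(dims, 1)

print(array_size - float32_size)  # 9128
print(float32_size - int8_size)   # 3072
```

Most of the saving from the first step comes from dropping the per-element keys and the 8-byte doubles of a plain array; the second step is the move from 4 bytes to 1 byte per dimension.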
These results demonstrate that BSON encoding, particularly when combined with quantization to int8, offers substantial space savings for storing vector embeddings in MongoDB. This can lead to improved query performance and reduced storage costs, especially for large-scale AI applications dealing with numerous vector embeddings.
Step 8: Evaluate key BEIR metrics for each collection
Float32 Vector Search Result:
[{'_id': '17388232',
'similarityScore': 0.6950235366821289,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'_id': '4346436',
'similarityScore': 0.6885790824890137,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'_id': '14082855',
'similarityScore': 0.6674723029136658,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'_id': '8290953',
'similarityScore': 0.6613469123840332,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'_id': '1922901',
'similarityScore': 0.6603913307189941,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
BSON Float32 Vector Search Result:
[{'_id': '17388232',
'similarityScore': 0.6950235366821289,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'_id': '4346436',
'similarityScore': 0.6885790824890137,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'_id': '14082855',
'similarityScore': 0.6674723029136658,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'_id': '8290953',
'similarityScore': 0.6613469123840332,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'_id': '1922901',
'similarityScore': 0.6603913307189941,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
BSON Int8 Vector Search Result:
[{'_id': '17388232',
'similarityScore': 0.6947944164276123,
'title': 'Mechanical regulation of cell function with geometrically '
'modulated elastomeric substrates'},
{'_id': '4346436',
'similarityScore': 0.6891517639160156,
'title': 'Nonlinear Elasticity in Biological Gels'},
{'_id': '14082855',
'similarityScore': 0.6680400371551514,
'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
'an Early and Susceptible Event after Mesh Implantation'},
{'_id': '8290953',
'similarityScore': 0.6613667607307434,
'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
'a structural matrix that supports angiogenesis in infarcted heart '
'tissue.'},
{'_id': '1922901',
'similarityScore': 0.6589280366897583,
'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------
Let's evaluate Float32 vector embeddings
Sample of retrieved results:

Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6993089914321899
  Doc ID: 4346436, Score: 0.6975740790367126
  Doc ID: 1922901, Score: 0.6674870848655701

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8058915138244629
  Doc ID: 2739854, Score: 0.7782564163208008
  Doc ID: 14717500, Score: 0.7675504684448242

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.7203654050827026
  Doc ID: 23531592, Score: 0.6829813122749329
  Doc ID: 5850219, Score: 0.6786690950393677

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7334267497062683
  Doc ID: 4791384, Score: 0.7181892395019531
  Doc ID: 26611834, Score: 0.7133195996284485

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8081403970718384
  Doc ID: 3215494, Score: 0.7896957397460938
  Doc ID: 11705328, Score: 0.7879230976104736
NDCG:
  NDCG@1: 0.5967
  NDCG@3: 0.6556
  NDCG@5: 0.6781
  NDCG@10: 0.6781
  NDCG@100: 0.6781
  NDCG@1000: 0.6781
MAP:
  MAP@1: 0.5649
  MAP@3: 0.6301
  MAP@5: 0.6458
  MAP@10: 0.6458
  MAP@100: 0.6458
  MAP@1000: 0.6458
Recall:
  Recall@1: 0.5649
  Recall@3: 0.6964
  Recall@5: 0.7551
  Recall@10: 0.7551
  Recall@100: 0.7551
  Recall@1000: 0.7551
Precision:
  P@1: 0.5967
  P@3: 0.2556
  P@5: 0.1687
  P@10: 0.0843
  P@100: 0.0084
  P@1000: 0.0008
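To make these metrics concrete, here is a simplified per-query sketch of Precision@k, Recall@k, and NDCG@k. BEIR computes its scores with pytrec_eval and averages them over all queries, so this is an illustration of the definitions rather than the evaluation code used above; the function names are hypothetical. The example reuses the top 3 documents retrieved for query '1' and its single relevance judgment from Step 1.

```python
import math


def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k slots filled by relevant documents."""
    hits = sum(1 for doc_id in ranked_ids[:k] if relevant.get(doc_id, 0) > 0)
    return hits / k


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    total = sum(1 for score in relevant.values() if score > 0)
    if total == 0:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if relevant.get(doc_id, 0) > 0)
    return hits / total


def ndcg_at_k(ranked_ids, relevant, k):
    """Discounted cumulative gain of the ranking, normalized by the ideal ranking."""
    dcg = sum(
        relevant.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Query '1': top 3 retrieved IDs from the sample output, judgment from the qrels.
ranked = ["17388232", "4346436", "1922901"]
relevant = {"31715818": 1}
print(precision_at_k(ranked, relevant, 3))  # 0.0, the relevant document is not in the top 3
print(recall_at_k(ranked, relevant, 3))     # 0.0
print(ndcg_at_k(ranked, relevant, 3))       # 0.0
```

Note that precision divides by k even when fewer than k documents are retrieved, which would be consistent with P@10 being half of P@5 in the output above if each query returned only five documents.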
Let's evaluate Float32(BSON) vector embeddings
Sample of retrieved results:

Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6993553638458252
  Doc ID: 4346436, Score: 0.6978418827056885
  Doc ID: 1922901, Score: 0.6676774024963379

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8050355315208435
  Doc ID: 2739854, Score: 0.7785025835037231
  Doc ID: 14717500, Score: 0.7682878971099854

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.7210093140602112
  Doc ID: 23531592, Score: 0.6826974749565125
  Doc ID: 5850219, Score: 0.6789515018463135

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7330269813537598
  Doc ID: 4791384, Score: 0.718242883682251
  Doc ID: 26611834, Score: 0.7134110331535339

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8084380626678467
  Doc ID: 3215494, Score: 0.7901670336723328
  Doc ID: 11705328, Score: 0.7884390354156494
NDCG:
  NDCG@1: 0.6000
  NDCG@3: 0.6573
  NDCG@5: 0.6804
  NDCG@10: 0.6804
  NDCG@100: 0.6804
  NDCG@1000: 0.6804
MAP:
  MAP@1: 0.5683
  MAP@3: 0.6323
  MAP@5: 0.6487
  MAP@10: 0.6487
  MAP@100: 0.6487
  MAP@1000: 0.6487
Recall:
  Recall@1: 0.5683
  Recall@3: 0.6964
  Recall@5: 0.7562
  Recall@10: 0.7562
  Recall@100: 0.7562
  Recall@1000: 0.7562
Precision:
  P@1: 0.6000
  P@3: 0.2556
  P@5: 0.1693
  P@10: 0.0847
  P@100: 0.0085
  P@1000: 0.0008
Let's evaluate Int8(BSON) vector embeddings
Sample of retrieved results:

Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6991255283355713
  Doc ID: 4346436, Score: 0.6984505653381348
  Doc ID: 1922901, Score: 0.6661576628684998

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8047915697097778
  Doc ID: 2739854, Score: 0.7789502143859863
  Doc ID: 14717500, Score: 0.768173098564148

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.721397876739502
  Doc ID: 23531592, Score: 0.6823171973228455
  Doc ID: 5850219, Score: 0.6804669499397278

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7327826023101807
  Doc ID: 4791384, Score: 0.7189810276031494
  Doc ID: 26611834, Score: 0.7132070064544678

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8082408905029297
  Doc ID: 3215494, Score: 0.7897244691848755
  Doc ID: 11705328, Score: 0.7882398962974548
NDCG:
  NDCG@1: 0.6000
  NDCG@3: 0.6552
  NDCG@5: 0.6780
  NDCG@10: 0.6780
  NDCG@100: 0.6780
  NDCG@1000: 0.6780
MAP:
  MAP@1: 0.5683
  MAP@3: 0.6307
  MAP@5: 0.6467
  MAP@10: 0.6467
  MAP@100: 0.6467
  MAP@1000: 0.6467
Recall:
  Recall@1: 0.5683
  Recall@3: 0.6931
  Recall@5: 0.7517
  Recall@10: 0.7517
  Recall@100: 0.7517
  Recall@1000: 0.7517
Precision:
  P@1: 0.6000
  P@3: 0.2544
  P@5: 0.1680
  P@10: 0.0840
  P@100: 0.0084
  P@1000: 0.0008