Advanced Evaluation of Quantized Vectors: Cohere, MongoDB, and BEIR Integration


What to Expect

In this notebook, we will conduct an advanced evaluation of quantized vectors using Cohere's embedding models, MongoDB for vector storage and search, and the BEIR (Benchmarking IR) framework for performance assessment. Here's what you can expect to learn and explore:

  1. Vector Quantization: Understand the process and benefits of vector quantization in the context of embedding storage and retrieval.

  2. Cohere Integration: Learn how to generate embeddings using Cohere's state-of-the-art models, including both float32 and quantized int8 versions.

  3. MongoDB Vector Storage: Explore efficient ways to store and index vector embeddings in MongoDB, including BSON encoding for optimized storage.

  4. Vector Search Implementation: Implement and compare vector search capabilities using different vector representations (float32, BSON float32, and BSON int8).

  5. BEIR Framework Usage: Utilize the BEIR framework to evaluate the performance of our vector search implementations across various information retrieval metrics.

  6. Performance Comparison: Analyze and visualize the performance differences between non-quantized and quantized vector representations in terms of:

    • Storage efficiency
    • Search accuracy
    • Retrieval speed
  7. Advanced Metrics: Dive deep into advanced IR metrics such as NDCG, MAP, Recall, and Precision at different cut-off points.

  8. Visualization: Create informative plots to compare the performance of different vector representations across multiple metrics.

  9. Practical Insights: Gain insights into the trade-offs between storage efficiency and search performance, helping you make informed decisions for your own vector search applications.

By the end of this notebook, you will have a comprehensive understanding of how vector quantization affects the performance of embedding-based search systems, and you'll be equipped with the knowledge to implement and evaluate such systems using industry-standard tools and frameworks.
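To make the idea of vector quantization concrete before the steps begin, here is a minimal sketch of symmetric scalar quantization to int8 in plain Python. It is only an illustration of the general technique: the notebook obtains its int8 embeddings from Cohere, and the function names, the random test vector, and the 1024 dimensionality below are assumptions made for this example.

```python
import math
import random


def quantize_int8(vec):
    """Symmetric scalar quantization: map floats to integers in [-127, 127]."""
    # One scale for the whole vector; fall back to 1.0 for an all-zero vector.
    scale = max(abs(x) for x in vec) / 127.0 or 1.0
    return [round(x / scale) for x in vec], scale


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


random.seed(0)
dim = 1024  # assumed dimensionality, not stated in the notebook
original = [random.gauss(0.0, 1.0) for _ in range(dim)]

codes, scale = quantize_int8(original)
restored = [c * scale for c in codes]  # approximate reconstruction

similarity = cosine(original, restored)
print(f"cosine(original, restored) = {similarity:.6f}")
```

Each component now fits in one byte instead of four, and the reconstructed vector points in almost the same direction as the original, which is why similarity rankings tend to survive quantization.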


Step 1: Load the BEIR scifact dataset

  0%|          | 0/5183 [00:00<?, ?it/s]
('4983', {'text': 'Alterations of the architecture of cerebral white matter in the developing human brain can affect cortical development and result in functional disabilities. A line scan diffusion-weighted magnetic resonance imaging (MRI) sequence with diffusion tensor analysis was applied to measure the apparent diffusion coefficient, to calculate relative anisotropy, and to delineate three-dimensional fiber architecture in cerebral white matter in preterm (n = 17) and full-term infants (n = 7). To assess effects of prematurity on cerebral white matter development, early gestation preterm infants (n = 10) were studied a second time at term. In the central white matter the mean apparent diffusion coefficient at 28 wk was high, 1.8 microm2/ms, and decreased toward term to 1.2 microm2/ms. In the posterior limb of the internal capsule, the mean apparent diffusion coefficients at both times were similar (1.2 versus 1.1 microm2/ms). Relative anisotropy was higher the closer birth was to term with greater absolute values in the internal capsule than in the central white matter. Preterm infants at term showed higher mean diffusion coefficients in the central white matter (1.4 +/- 0.24 versus 1.15 +/- 0.09 microm2/ms, p = 0.016) and lower relative anisotropy in both areas compared with full-term infants (white matter, 10.9 +/- 0.6 versus 22.9 +/- 3.0%, p = 0.001; internal capsule, 24.0 +/- 4.44 versus 33.1 +/- 0.6% p = 0.006). Nonmyelinated fibers in the corpus callosum were visible by diffusion tensor MRI as early as 28 wk; full-term and preterm infants at term showed marked differences in white matter fiber organization. The data indicate that quantitative assessment of water diffusion by diffusion tensor MRI provides insight into microstructural development in cerebral white matter in living infants.', 'title': 'Microstructural development of human newborn cerebral white matter assessed in vivo by diffusion tensor magnetic resonance imaging.'})

('1', '0-dimensional biomaterials show inductive properties.')

('1', {'31715818': 1})

Corpus: The corpus is a dictionary where each key is a document ID, and the value is another dictionary containing the document's text and title. For example:

	'4983': {
	    'text': 'Alterations of the architecture of cerebral white matter...',
	    'title': 'Microstructural development of human newborn cerebral white matter...'
	}

Each entry is a scientific abstract, like the full sample printed above.

Queries: The queries dictionary contains the scientific claims, where the key is a query ID and the value is the claim text. For example:

	'1': '0-dimensional biomaterials show inductive properties.'

Qrels: The qrels (query relevance) dictionary contains the ground truth relevance judgments. It's structured as a nested dictionary where the outer key is the query ID, the inner key is a document ID, and the value is the relevance score (typically 1 for relevant, 0 for non-relevant). For example:

	'1': {'31715818': 1}

This indicates that for query '1', the document with ID '31715818' is relevant.
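As a small sketch of how the three structures fit together, the snippet below looks up the relevant document IDs for a query. The sample values are copied from the examples above (with the same truncation); the helper name `relevant_doc_ids` is hypothetical and not part of BEIR.

```python
# Tiny in-memory stand-ins shaped like the BEIR structures described above.
corpus = {
    "4983": {
        "text": "Alterations of the architecture of cerebral white matter...",
        "title": "Microstructural development of human newborn cerebral white matter...",
    },
}
queries = {"1": "0-dimensional biomaterials show inductive properties."}
qrels = {"1": {"31715818": 1}}


def relevant_doc_ids(query_id, qrels):
    """Return the IDs of documents judged relevant (score > 0) for a query."""
    return [doc_id for doc_id, score in qrels.get(query_id, {}).items() if score > 0]


for query_id, claim in queries.items():
    print(query_id, claim, relevant_doc_ids(query_id, qrels))
```

Note that a relevant document ID from qrels (here '31715818') is looked up in the full corpus; the one-entry corpus above is only there to show the shape.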


Step 2: Generate float32 and int8 embeddings using Cohere

Enter Cohere API Key: ··········
Generating embeddings: 100%|██████████| 5183/5183 [22:16<00:00,  3.88it/s]

Step 3: Generate BSON representations of the float32 and int8 embeddings
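The point of a binary representation is that each float32 value takes 4 bytes and each int8 value takes 1 byte, whereas a plain BSON array of Python floats stores every element as a separately keyed 8-byte double. The sketch below shows only the packed payload sizes using the standard library. It is not the BSON encoder itself: PyMongo 4.10 and later provide `bson.binary.Binary.from_vector` with `BinaryVectorDtype.FLOAT32` and `BinaryVectorDtype.INT8` for producing the actual BSON value, and whether the notebook uses that helper is not shown here. The vector contents and the 1024 dimensionality are assumptions.

```python
import struct

dim = 1024  # assumed dimensionality, not stated in the notebook

# Made-up sample vectors purely for measuring packed sizes.
float_vector = [0.01 * (i % 7 - 3) for i in range(dim)]
int8_vector = [(i % 255) - 127 for i in range(dim)]  # values in [-127, 127]

# Little-endian packing: 'f' is a 4-byte float, 'b' is a 1-byte signed integer.
float32_bytes = struct.pack(f"<{dim}f", *float_vector)
int8_bytes = struct.pack(f"<{dim}b", *int8_vector)

print("float32 payload:", len(float32_bytes), "bytes")
print("int8 payload:   ", len(int8_bytes), "bytes")

# int8 packing is lossless for values that are already 8-bit integers.
round_trip = list(struct.unpack(f"<{dim}b", int8_bytes))
```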


Step 4: Ingest the float32, BSON float32, and BSON int8 embeddings into separate collections

Enter MongoDB URI: ··········
Connection to MongoDB successful
DeleteResult({'n': 10366, 'electionId': ObjectId('7fffffff0000000000000033'), 'opTime': {'ts': Timestamp(1727642864, 1867), 't': 51}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1727642864, 1867), 'signature': {'hash': b"\xa7'$-\x1dj\xc4\x86=0\xf46)\t>\x010.\xa7\xa7", 'keyId': 7353740577831124994}}, 'operationTime': Timestamp(1727642864, 1867)}, acknowledged=True)
Preparing documents: 100%|██████████| 5183/5183 [00:00<00:00, 12963.83it/s]
Inserting documents into collections...
Inserted 5183 documents into float32_embeddings
Inserted 5183 documents into bson_float32_embeddings
Inserted 5183 documents into bson_int8_embeddings
Data ingestion complete.

Step 5: Create vector indexes for all collections
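For orientation, an Atlas Vector Search index is described by a small definition document. The sketch below only builds that definition as a plain dict. The index name `vector_index` comes from the output below; the field path `embedding`, the 1024 dimensions, and cosine similarity are assumptions, since the notebook's own definition is not visible here.

```python
def build_vector_index_definition(path="embedding", num_dimensions=1024,
                                  similarity="cosine"):
    """Return an Atlas Vector Search index definition as a plain dict.

    All default values are assumptions for illustration.
    """
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = build_vector_index_definition()
print(definition)

# With PyMongo, a definition like this is typically wrapped as
#   SearchIndexModel(definition=definition, name="vector_index", type="vectorSearch")
# and passed to collection.create_search_index(model=...).
```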

Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
'vector_index'

Step 6: Create generic vector search function
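A generic search function of this kind usually comes down to one aggregation pipeline with a `$vectorSearch` stage followed by a `$project` stage. The sketch below builds such a pipeline without connecting to a database. The output field names (`_id`, `title`, `similarityScore`) and the index name `vector_index` match the outputs in this notebook; the helper name, the `embedding` path, and the `num_candidates` default are assumptions.

```python
def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="embedding", limit=5, num_candidates=100):
    """Build a MongoDB aggregation pipeline for an Atlas Vector Search query."""
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,                     # assumed field holding the vector
                "queryVector": query_vector,
                "numCandidates": num_candidates,  # candidates considered before the final top-k
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 1,
                "title": 1,
                "similarityScore": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3])
print(pipeline)
# Against a live collection this would be run as: list(collection.aggregate(pipeline))
```

Because only the collection differs between the three representations, the same pipeline builder can serve all of them, which keeps the comparison in the next step fair.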


Step 7: Compare vector search results across collections (search results, retrieval latency, and storage size)

Float32 Vector Search Result:
[{'similarityScore': 0.6950235366821289,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'similarityScore': 0.6885790824890137,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'similarityScore': 0.6674723029136658,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'similarityScore': 0.6613469123840332,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'similarityScore': 0.6603913307189941,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------

BSON Float32 Vector Search Result:
[{'similarityScore': 0.6950235366821289,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'similarityScore': 0.6885790824890137,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'similarityScore': 0.6674723029136658,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'similarityScore': 0.6613469123840332,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'similarityScore': 0.6603913307189941,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------

BSON Int8 Vector Search Result:
[{'similarityScore': 0.6947944164276123,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'similarityScore': 0.6891517639160156,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'similarityScore': 0.6680400371551514,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'similarityScore': 0.6613667607307434,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'similarityScore': 0.6589280366897583,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------

Based on the provided vector search results for Float32, BSON Float32, and BSON Int8 encodings, we can conclude:

  1. Consistency across encodings: All three encoding methods (Float32, BSON Float32, and BSON Int8) returned the same top 5 documents in the same order, indicating a high level of consistency in search results regardless of the encoding method used.
  2. Similarity in scores: The similarity scores for each document are remarkably close across all three encoding methods, with only minor variations in the BSON Int8 results.
  3. Precision preservation: The BSON Float32 results are identical to the standard Float32 results, suggesting that BSON encoding of float32 values preserves full precision.
  4. Minimal impact of quantization: The BSON Int8 (quantized) results show only slight differences in similarity scores compared to the float32 versions. The largest difference is in the fifth result, with a score of 0.6589 vs 0.6604, a difference of only about 0.22%.
  5. Ranking stability: Despite the minor differences in similarity scores for the BSON Int8 encoding, the ranking of results remains unchanged, indicating that the quantization process maintains the relative similarities between documents.

These results suggest that both BSON encoding and quantization to int8 maintain high fidelity in vector search operations, at least for this sample query.

Given this consistency, developers can reasonably use BSON encoding, and even int8 quantization, to significantly reduce storage requirements (as seen in the storage analysis later in this step) with little or no loss in the quality of vector search results. The BEIR evaluation in Step 8 checks whether this holds across the full query set.
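The score comparison can be checked with a few lines of arithmetic. The scores below are copied from the Float32 and BSON Int8 outputs above and are aligned by position, since both result lists contain the same titles in the same order; the variable names are just for this sketch.

```python
# Similarity scores copied from the outputs above, in returned order.
float32_scores = [0.6950235366821289, 0.6885790824890137, 0.6674723029136658,
                  0.6613469123840332, 0.6603913307189941]
bson_int8_scores = [0.6947944164276123, 0.6891517639160156, 0.6680400371551514,
                    0.6613667607307434, 0.6589280366897583]

# Relative difference of each int8 score from its float32 counterpart, in percent.
relative_diffs = [abs(f - q) / f * 100 for f, q in zip(float32_scores, bson_int8_scores)]
largest = max(relative_diffs)
rank_of_largest = relative_diffs.index(largest) + 1

# Re-rank the five documents by their int8 scores and compare with the float32 order.
int8_order = sorted(range(len(bson_int8_scores)), key=lambda i: -bson_int8_scores[i])
order_preserved = int8_order == list(range(len(float32_scores)))

for rank, diff in enumerate(relative_diffs, start=1):
    print(f"rank {rank}: {diff:.3f}% difference")
print(f"largest difference: {largest:.2f}% at rank {rank_of_largest}")
print("ranking preserved:", order_preserved)
```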

Float32 search time: 16.684341 milliseconds
BSON Float32 search time: 12.182835 milliseconds
BSON Int8 search time: 0.030998 milliseconds

Size information for float32 collection:
Total Size: 221.22 MB
Storage Size: 220.55 MB
Index Size: 0.67 MB
Document Size: 73.09 MB
Average Document Size: 14787.00 bytes
Number of Documents: 5183

Size information for bsonfloat32 collection:
Total Size: 136.86 MB
Storage Size: 136.18 MB
Index Size: 0.68 MB
Document Size: 27.97 MB
Average Document Size: 5659.00 bytes
Number of Documents: 5183

Size information for bsonint8 collection:
Total Size: 53.31 MB
Storage Size: 52.65 MB
Index Size: 0.66 MB
Document Size: 12.79 MB
Average Document Size: 2587.00 bytes
Number of Documents: 5183

Based on the results above, we can conclude the following about the storage footprint of each embedding type:

The use of BSON encoding and quantization significantly reduces storage requirements for vector embeddings in MongoDB collections while maintaining the same number of documents. The float32 collection requires the most storage, followed by the BSON-encoded float32 collection, with the BSON-encoded int8 (quantized) collection being the most storage-efficient.

Specifically:

  1. The BSON float32 encoding reduces the total size by approximately 38% compared to the standard float32 collection (136.86 MB vs 221.22 MB).
  2. The BSON int8 (quantized) encoding further reduces the total size by about 76% compared to the standard float32 collection, and by 61% compared to the BSON float32 collection (53.31 MB vs 221.22 MB and 136.86 MB respectively).
  3. The average document size follows a similar pattern, with BSON int8 documents being about 82.5% smaller than standard float32 documents (2587 bytes vs 14787 bytes).

These results demonstrate that BSON encoding, particularly when combined with quantization to int8, offers substantial space savings for storing vector embeddings in MongoDB. This can lead to improved query performance and reduced storage costs, especially for large-scale AI applications dealing with numerous vector embeddings.
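The percentages above follow directly from the reported sizes, as this short calculation shows. The numbers are copied from the size output; the dictionary keys and the helper name `reduction` are only labels for this sketch.

```python
# Sizes copied from the collection statistics above.
sizes_mb = {"float32": 221.22, "bson_float32": 136.86, "bson_int8": 53.31}
avg_doc_bytes = {"float32": 14787, "bson_float32": 5659, "bson_int8": 2587}


def reduction(before, after):
    """Percentage reduction when going from `before` to `after`."""
    return (1 - after / before) * 100


print(f"BSON float32 vs float32:   {reduction(sizes_mb['float32'], sizes_mb['bson_float32']):.1f}%")
print(f"BSON int8 vs float32:      {reduction(sizes_mb['float32'], sizes_mb['bson_int8']):.1f}%")
print(f"BSON int8 vs BSON float32: {reduction(sizes_mb['bson_float32'], sizes_mb['bson_int8']):.1f}%")
print(f"Avg doc, int8 vs float32:  {reduction(avg_doc_bytes['float32'], avg_doc_bytes['bson_int8']):.1f}%")

# Byte gap between the two BSON representations per document.
bson_gap = avg_doc_bytes["bson_float32"] - avg_doc_bytes["bson_int8"]
print("BSON float32 minus BSON int8:", bson_gap, "bytes per document")
```

The per-document gap of 3072 bytes between the two BSON collections equals 3 bytes times 1024, which is what a 1024-dimension vector would save when moving from 4 bytes to 1 byte per dimension. The dimensionality is an inference from these numbers, not something stated in the notebook.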

Step 8: Evaluate each collection on key BEIR metrics

Float32 Vector Search Result:
[{'_id': '17388232',
  'similarityScore': 0.6950235366821289,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'_id': '4346436',
  'similarityScore': 0.6885790824890137,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'_id': '14082855',
  'similarityScore': 0.6674723029136658,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'_id': '8290953',
  'similarityScore': 0.6613469123840332,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'_id': '1922901',
  'similarityScore': 0.6603913307189941,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------

BSON Float32 Vector Search Result:
[{'_id': '17388232',
  'similarityScore': 0.6950235366821289,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'_id': '4346436',
  'similarityScore': 0.6885790824890137,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'_id': '14082855',
  'similarityScore': 0.6674723029136658,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'_id': '8290953',
  'similarityScore': 0.6613469123840332,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'_id': '1922901',
  'similarityScore': 0.6603913307189941,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------

BSON Int8 Vector Search Result:
[{'_id': '17388232',
  'similarityScore': 0.6947944164276123,
  'title': 'Mechanical regulation of cell function with geometrically '
           'modulated elastomeric substrates'},
 {'_id': '4346436',
  'similarityScore': 0.6891517639160156,
  'title': 'Nonlinear Elasticity in Biological Gels'},
 {'_id': '14082855',
  'similarityScore': 0.6680400371551514,
  'title': 'Inflammatory Reaction as Determinant of Foreign Body Reaction Is '
           'an Early and Susceptible Event after Mesh Implantation'},
 {'_id': '8290953',
  'similarityScore': 0.6613667607307434,
  'title': 'Scaffold-based three-dimensional human fibroblast culture provides '
           'a structural matrix that supports angiogenesis in infarcted heart '
           'tissue.'},
 {'_id': '1922901',
  'similarityScore': 0.6589280366897583,
  'title': 'Forces in Tissue Morphogenesis and Patterning'}]
----------------


Let's evaluate Float32 vector embeddings

Sample of retrieved results:
Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6993089914321899
  Doc ID: 4346436, Score: 0.6975740790367126
  Doc ID: 1922901, Score: 0.6674870848655701

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8058915138244629
  Doc ID: 2739854, Score: 0.7782564163208008
  Doc ID: 14717500, Score: 0.7675504684448242

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.7203654050827026
  Doc ID: 23531592, Score: 0.6829813122749329
  Doc ID: 5850219, Score: 0.6786690950393677

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7334267497062683
  Doc ID: 4791384, Score: 0.7181892395019531
  Doc ID: 26611834, Score: 0.7133195996284485

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8081403970718384
  Doc ID: 3215494, Score: 0.7896957397460938
  Doc ID: 11705328, Score: 0.7879230976104736


NDCG:
  NDCG@1: 0.5967
  NDCG@3: 0.6556
  NDCG@5: 0.6781
  NDCG@10: 0.6781
  NDCG@100: 0.6781
  NDCG@1000: 0.6781

MAP:
  MAP@1: 0.5649
  MAP@3: 0.6301
  MAP@5: 0.6458
  MAP@10: 0.6458
  MAP@100: 0.6458
  MAP@1000: 0.6458

Recall:
  Recall@1: 0.5649
  Recall@3: 0.6964
  Recall@5: 0.7551
  Recall@10: 0.7551
  Recall@100: 0.7551
  Recall@1000: 0.7551

Precision:
  P@1: 0.5967
  P@3: 0.2556
  P@5: 0.1687
  P@10: 0.0843
  P@100: 0.0084
  P@1000: 0.0008
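To help read these numbers, here is a simplified sketch of Precision@k, Recall@k, and NDCG@k for a single query with binary relevance. BEIR delegates the real computation to `pytrec_eval`; this is not that implementation, and the document IDs below are hypothetical. The example also shows a pattern visible in the output above: NDCG, MAP, and Recall stop changing after k=5 while precision keeps halving, which is what one would see if only five documents were retrieved per query. That reading is an inference from the numbers, not something the notebook states.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant documents (divides by k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical query: five retrieved documents, one relevant document at rank 2.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2"}

for k in (1, 3, 5, 10):
    print(f"k={k:<2}  P={precision_at_k(ranked, relevant, k):.4f}  "
          f"Recall={recall_at_k(ranked, relevant, k):.4f}  "
          f"NDCG={ndcg_at_k(ranked, relevant, k):.4f}")
```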

Let's evaluate Float32 (BSON) vector embeddings

Sample of retrieved results:
Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6993553638458252
  Doc ID: 4346436, Score: 0.6978418827056885
  Doc ID: 1922901, Score: 0.6676774024963379

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8050355315208435
  Doc ID: 2739854, Score: 0.7785025835037231
  Doc ID: 14717500, Score: 0.7682878971099854

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.7210093140602112
  Doc ID: 23531592, Score: 0.6826974749565125
  Doc ID: 5850219, Score: 0.6789515018463135

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7330269813537598
  Doc ID: 4791384, Score: 0.718242883682251
  Doc ID: 26611834, Score: 0.7134110331535339

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8084380626678467
  Doc ID: 3215494, Score: 0.7901670336723328
  Doc ID: 11705328, Score: 0.7884390354156494


NDCG:
  NDCG@1: 0.6000
  NDCG@3: 0.6573
  NDCG@5: 0.6804
  NDCG@10: 0.6804
  NDCG@100: 0.6804
  NDCG@1000: 0.6804

MAP:
  MAP@1: 0.5683
  MAP@3: 0.6323
  MAP@5: 0.6487
  MAP@10: 0.6487
  MAP@100: 0.6487
  MAP@1000: 0.6487

Recall:
  Recall@1: 0.5683
  Recall@3: 0.6964
  Recall@5: 0.7562
  Recall@10: 0.7562
  Recall@100: 0.7562
  Recall@1000: 0.7562

Precision:
  P@1: 0.6000
  P@3: 0.2556
  P@5: 0.1693
  P@10: 0.0847
  P@100: 0.0085
  P@1000: 0.0008

Let's evaluate Int8 (BSON) vector embeddings

Sample of retrieved results:
Query ID: 1
Query text: 0-dimensional biomaterials show inductive properties.
Top 3 retrieved documents:
  Doc ID: 17388232, Score: 0.6991255283355713
  Doc ID: 4346436, Score: 0.6984505653381348
  Doc ID: 1922901, Score: 0.6661576628684998

Query ID: 3
Query text: 1,000 genomes project enables mapping of genetic sequence variation consisting of rare variants with larger penetrance effects than common variants.
Top 3 retrieved documents:
  Doc ID: 19058822, Score: 0.8047915697097778
  Doc ID: 2739854, Score: 0.7789502143859863
  Doc ID: 14717500, Score: 0.768173098564148

Query ID: 5
Query text: 1/2000 in UK have abnormal PrP positivity.
Top 3 retrieved documents:
  Doc ID: 13734012, Score: 0.721397876739502
  Doc ID: 23531592, Score: 0.6823171973228455
  Doc ID: 5850219, Score: 0.6804669499397278

Query ID: 13
Query text: 5% of perinatal mortality is due to low birth weight.
Top 3 retrieved documents:
  Doc ID: 1263446, Score: 0.7327826023101807
  Doc ID: 4791384, Score: 0.7189810276031494
  Doc ID: 26611834, Score: 0.7132070064544678

Query ID: 36
Query text: A deficiency of vitamin B12 increases blood levels of homocysteine.
Top 3 retrieved documents:
  Doc ID: 18557974, Score: 0.8082408905029297
  Doc ID: 3215494, Score: 0.7897244691848755
  Doc ID: 11705328, Score: 0.7882398962974548


NDCG:
  NDCG@1: 0.6000
  NDCG@3: 0.6552
  NDCG@5: 0.6780
  NDCG@10: 0.6780
  NDCG@100: 0.6780
  NDCG@1000: 0.6780

MAP:
  MAP@1: 0.5683
  MAP@3: 0.6307
  MAP@5: 0.6467
  MAP@10: 0.6467
  MAP@100: 0.6467
  MAP@1000: 0.6467

Recall:
  Recall@1: 0.5683
  Recall@3: 0.6931
  Recall@5: 0.7517
  Recall@10: 0.7517
  Recall@100: 0.7517
  Recall@1000: 0.7517

Precision:
  P@1: 0.6000
  P@3: 0.2544
  P@5: 0.1680
  P@10: 0.0840
  P@100: 0.0084
  P@1000: 0.0008

Step 9: Compare and Visualize Results
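As a text-only stand-in for a plot, the sketch below lines up the @5 metrics reported in Step 8 for the three representations and computes how far each BSON variant is from the float32 baseline. The values are copied from the outputs above; the table layout and variable names are choices made for this sketch, not the notebook's own visualization code.

```python
# @5 metrics copied from the Step 8 outputs.
results = {
    "float32":      {"NDCG@5": 0.6781, "MAP@5": 0.6458, "Recall@5": 0.7551, "P@5": 0.1687},
    "bson_float32": {"NDCG@5": 0.6804, "MAP@5": 0.6487, "Recall@5": 0.7562, "P@5": 0.1693},
    "bson_int8":    {"NDCG@5": 0.6780, "MAP@5": 0.6467, "Recall@5": 0.7517, "P@5": 0.1680},
}
metrics = ["NDCG@5", "MAP@5", "Recall@5", "P@5"]

# Print a simple comparison table.
print(f"{'representation':<14}" + "".join(f"{m:>10}" for m in metrics))
for name, scores in results.items():
    print(f"{name:<14}" + "".join(f"{scores[m]:>10.4f}" for m in metrics))

# Absolute difference of each BSON variant from the float32 baseline.
baseline = results["float32"]
deltas = {
    name: {m: abs(scores[m] - baseline[m]) for m in metrics}
    for name, scores in results.items()
    if name != "float32"
}
max_delta = max(d for per_metric in deltas.values() for d in per_metric.values())
print(f"largest absolute difference from float32: {max_delta:.4f}")
```

On these numbers the largest gap is in Recall@5 for the int8 collection, and every metric stays within half a percentage point of the float32 baseline.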
