
Efficient Quantized Vector Ingestion with Cohere and MongoDB


Open In Colab

You can view an article version of this notebook here:

View Article

In this notebook, we cover the following:

  • What are quantization and vector quantization?
  • Example of scalar quantization
  • Efficient quantized vector ingestion with Cohere and MongoDB
  • Comparison of float32, BSON float32, and BSON int8 embeddings in vector search operations
  • Practical considerations for choosing the right embedding format for your AI application

Prerequisite: Understanding Scalar Quantization

Vector quantization is a lossy compression technique designed to reduce the memory requirements of high-dimensional vector data. It achieves this by mapping the original vectors to a smaller set of representative values, allowing significant data compression while preserving much of the essential information contained in the original vectors. Quantization is particularly useful in AI applications where some loss of precision is acceptable in exchange for reduced storage needs and lower latency in data retrieval.

Let's demonstrate vector quantization by compressing float32 values to 8-bit integers. We'll use a simplified example with a single one-dimensional vector for clarity.

	Float32: [1.23, 4.56, 2.34, 5.67, 3.45, 6.78, 1.89, 4.90]

	Int8: [0, 153, 51, 204, 102, 255, 30, 169]

Step 1: Determine the range of values.

	Min: 1.23
Max: 6.78

Step 2: Define the quantization scale

	Scale = (Max - Min) / 255 = (6.78 - 1.23) / 255 ≈ 0.0217

Step 3: Quantize to 8-bit integers using the formula:

	quantized = round((original - Min) / Scale)
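The three steps above can be reproduced with a short NumPy sketch (variable names are illustrative); it produces the quantized values, error, and compression ratio shown in the output below:

```python
import numpy as np

# Original float32 vector from the example above
original = np.array([1.23, 4.56, 2.34, 5.67, 3.45, 6.78, 1.89, 4.90], dtype=np.float32)

# Step 1: determine the range of values
min_val, max_val = original.min(), original.max()

# Step 2: define the quantization scale over the 0-255 uint8 range
scale = (max_val - min_val) / 255

# Step 3: quantize each value to an 8-bit integer
quantized = np.round((original - min_val) / scale).astype(np.uint8)

# Reconstruct the floats to estimate the precision lost in quantization
reconstructed = quantized.astype(np.float32) * scale + min_val
mae = np.mean(np.abs(original - reconstructed))

print("Quantized (uint8):", quantized)
print("Mean Absolute Error:", mae)
print(f"Compression ratio: {original.nbytes / quantized.nbytes:.2f}x")
```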


[ ]
Original (float32):
[1.23 4.56 2.34 5.67 3.45 6.78 1.89 4.9 ]

Quantized (uint8):
[  0 153  51 204 102 255  30 169]

Quantization parameters - Min: 1.2300000190734863, Scale: 0.021764706630332798

Reconstructed (float32):
[1.23      4.5600004 2.3400002 5.67      3.45      6.78      1.8829412
 4.9082355]

Mean Absolute Error:
0.0019118637

Original memory usage: 32 bytes
Quantized memory usage: 8 bytes
Compression ratio: 4.00x

What We Are Building

This section provides a comprehensive guide to implementing efficient quantized vector ingestion and search using Cohere and MongoDB. We'll walk through the entire process, from generating quantized and non-quantized embeddings with Cohere's API to converting them into BSON format, ingesting them into MongoDB, and performing vector searches on the stored embeddings.

The end result of this section is a fully functional system that demonstrates the practical implementation of quantized vector storage and retrieval. This setup will allow you to compare the performance and accuracy of different embedding types (float32, BSON float32, and BSON int8) in vector search operations.


Step 1: Install Libraries and Set Environment Variables

For the code implementation in this tutorial, the following libraries are utilized:

  • pandas:  A data manipulation library for efficient handling of structured data. It's used for loading, cleaning, transforming, and analyzing data in various formats.
  • cohere: The official Cohere Python library. It will provide access to advanced language models, embedding generation, and text generation.
  • pymongo: The official Python driver for MongoDB. It is used to connect to the database, ingest documents, and run vector search queries.
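A typical installation cell for these libraries might look like the following (versions unpinned; adjust to your environment):

```shell
pip install --quiet --upgrade pandas cohere pymongo
```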
[ ]
[ ]
Enter Cohere API Key: ··········
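The API key prompt above can come from a small helper along these lines (the `set_env_securely` name is illustrative, not part of the Cohere SDK):

```python
import getpass
import os

def set_env_securely(var_name: str, prompt: str) -> None:
    """Prompt for a secret without echoing it and store it as an environment variable."""
    value = getpass.getpass(prompt)
    os.environ[var_name] = value

# Usage in a notebook cell:
#   set_env_securely("COHERE_API_KEY", "Enter Cohere API Key: ")
```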

Step 2: Data Loading

This step involves generating a small structured dataset of sentences with placeholder fields for various embedding types. The dataset is designed to facilitate the demonstration and comparison of different embedding techniques in natural language processing and database storage.

The data is structured as a list of dictionaries, where each dictionary represents a sentence and its associated embedding attributes. Each entry contains a sentence key with a factual statement as its value, along with four embedding-related keys: float32_embedding, int8_embedding, bson_float32_embedding, and bson_int8_embedding. These keys are initially set to None.
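A sketch of this structure, using a few of the sentences that appear later in the search results:

```python
# Each record holds a sentence plus placeholders for the four embedding formats
sentences = [
    "The speed of light in vacuum is 299,792,458 meters/second.",
    "Water boils at 100 degrees Celsius at standard atmospheric pressure.",
    "Jupiter is the largest planet in our solar system.",
]

data = [
    {
        "sentence": sentence,
        "float32_embedding": None,
        "int8_embedding": None,
        "bson_float32_embedding": None,
        "bson_int8_embedding": None,
    }
    for sentence in sentences
]
```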

[ ]
[ ]
[ ]

Step 3: Embedding Generation With Cohere

This step demonstrates the process of generating embeddings for our sentence dataset using the Cohere API. We will define a custom function, get_cohere_embeddings, for efficient embedding generation.
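A sketch of such a function is shown below. The client setup and model name are assumptions (the conclusion of this tutorial references cohere-embed-multilingual-v3.0 at 1024 dimensions); passing embedding_types=["float", "int8"] asks Cohere for both formats in a single request:

```python
# Assumes an initialized Cohere client, e.g.:
#   import cohere, os
#   co = cohere.Client(os.environ["COHERE_API_KEY"])

def get_cohere_embeddings(co, texts, model="embed-multilingual-v3.0"):
    """Request float and int8 embeddings for a batch of texts in one API call."""
    response = co.embed(
        texts=texts,
        model=model,
        input_type="search_document",
        embedding_types=["float", "int8"],
    )
    # response.embeddings carries one list of vectors per requested type
    return response.embeddings.float, response.embeddings.int8
```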

[ ]
[ ]
[ ]
[ ]

Step 4: Generate BSON Vectors

This step introduces a data transformation process that, while potentially unfamiliar to some developers, is integral to building high-performance AI applications. When optimizing AI systems for scale and efficiency, certain techniques and data formats become essential to application development. The operations demonstrated here showcase the memory optimization benefits of BSON (Binary JSON) and MongoDB's storage capabilities: by converting our embeddings into BSON format, we can compare the storage efficiency of the BSON and non-BSON representations.

[ ]
[ ]
[ ]
[ ]

Step 5: Data Ingestion With MongoDB

For this step, you will need a MongoDB account to retrieve a connection string to a cluster.

Creating a database and collection within MongoDB is simple with MongoDB Atlas.

  1. First, register for a MongoDB Atlas account. Existing users can sign into MongoDB Atlas.
  2. Follow the instructions. Select Atlas UI as the procedure to deploy your first cluster.
  3. Follow MongoDB’s steps to get the connection string from the Atlas UI. Securely store the URI within your development environment after setting up the database and obtaining the Atlas cluster connection URI.

NOTE: Don’t forget to add the IP of your host machine to the IP access list for your cluster.

[ ]
Enter MongoDB URI: ··········
[ ]
[ ]
Connection to MongoDB successful
[ ]
DeleteResult({'n': 20, 'electionId': ObjectId('7fffffff0000000000000033'), 'opTime': {'ts': Timestamp(1727822733, 20), 't': 51}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1727822733, 20), 'signature': {'hash': b'?e_ \xaf\x8c1\xb4\x86\x83%W\xc1U\xa8Rc\x9dX\xab', 'keyId': 7390008424139849730}}, 'operationTime': Timestamp(1727822733, 20)}, acknowledged=True)
[ ]
InsertManyResult([ObjectId('66fc7b8f828da1d0760be746'), ObjectId('66fc7b8f828da1d0760be747'), ObjectId('66fc7b8f828da1d0760be748'), ObjectId('66fc7b8f828da1d0760be749'), ObjectId('66fc7b8f828da1d0760be74a'), ObjectId('66fc7b8f828da1d0760be74b'), ObjectId('66fc7b8f828da1d0760be74c'), ObjectId('66fc7b8f828da1d0760be74d'), ObjectId('66fc7b8f828da1d0760be74e'), ObjectId('66fc7b8f828da1d0760be74f'), ObjectId('66fc7b8f828da1d0760be750'), ObjectId('66fc7b8f828da1d0760be751'), ObjectId('66fc7b8f828da1d0760be752'), ObjectId('66fc7b8f828da1d0760be753'), ObjectId('66fc7b8f828da1d0760be754'), ObjectId('66fc7b8f828da1d0760be755'), ObjectId('66fc7b8f828da1d0760be756'), ObjectId('66fc7b8f828da1d0760be757'), ObjectId('66fc7b8f828da1d0760be758'), ObjectId('66fc7b8f828da1d0760be759')], acknowledged=True)

Step 6: Create Vector Index

In this step, we will create a critical component for vector search operations: the vector index. The setup_vector_search_index function creates this specialized structure, which significantly enhances the performance of similarity searches across embedding vectors.

[ ]

The code snippet above utilizes MongoDB's SearchIndexModel to define a vector search index. One operation to highlight is the 30-second wait period post-index creation. This ensures the index is fully built and optimized, preventing potential race conditions in subsequent query operations.

The next step is to define the variable vector_search_index_definition, which holds the actual vector index configuration. The index covers multiple vector fields, each corresponding to a different embedding format (float32, BSON float32, and BSON int8). By setting up this index, we enable efficient cosine similarity searches across our 1024-dimensional embedding space.

[ ]
[ ]
Creating index 'vector_index'...
Waiting for 30 seconds to allow index 'vector_index' to be created...
30-second wait completed for index 'vector_index'.
'vector_index'

Step 7: Create Vector Search Function

In this step, we'll define the vector_search function that leverages MongoDB's powerful aggregation framework to perform efficient vector search operations. This function is the intermediary layer between user queries and our database of sentence embeddings, enabling semantic search or RAG capabilities in your AI applications.

The vector_search function accommodates different types of embeddings and search strategies. It takes a user's input query, typically a sentence from the application layer, and transforms it into a vector representation using our previously defined get_cohere_embeddings function. This embedding, stored in the query_embedding variable, becomes the basis for our similarity search.

A key feature of this function is its ability to handle both quantized and non-quantized embeddings. By incorporating a check for the embedding type, we can dynamically adjust our search parameters to utilize either the space-efficient quantized (int8) embeddings or the higher-precision non-quantized (float32) embeddings.

By leveraging MongoDB's aggregation pipeline, we'll construct a series of stages that process our data, perform the vector similarity search, and return the most relevant results.

[ ]

The key operations of the code snippet above and the vector_search function are as follows:

  1. Embedding generation: The get_cohere_embeddings function converts the user's query into a vector representation.
  2. Embedding type selection: Based on the quantized flag, it chooses between float32 and int8 embeddings, adjusting the search path accordingly.
  3. BSON conversion: If use_bson is True, the embedding is converted to BSON format using the generate_bson_vector function, optimizing MongoDB storage and retrieval.
  4. Search pipeline construction: The function builds a MongoDB aggregation pipeline with two stages:
  • $vectorSearch: Utilizes the pre-created vector index for efficient similarity search. This operator takes the query embedding as input, the path to the embedding field in the collection, the number of candidates to consider for similarity, and the number of documents the stage should return.
  • $project: Shapes the output, including the original sentence and a similarity score.
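The function described above might be sketched as follows. The embedding function is passed in as a parameter (matching the Step 3 helper's return shape), and the numCandidates value is illustrative; the BSON conversion uses PyMongo's Binary.from_vector:

```python
def vector_search(user_query, collection, get_embeddings, quantized=False, use_bson=True, limit=5):
    """Search the collection for sentences semantically similar to user_query.

    `get_embeddings` is assumed to return (float_vectors, int8_vectors)
    for a list of texts, as in the Step 3 helper.
    """
    float_vectors, int8_vectors = get_embeddings([user_query])

    # Choose the embedding type and the matching field path in the collection
    if quantized:
        query_embedding = int8_vectors[0]
        path = "bson_int8_embedding" if use_bson else "int8_embedding"
    else:
        query_embedding = float_vectors[0]
        path = "bson_float32_embedding" if use_bson else "float32_embedding"

    if use_bson:
        # Requires PyMongo >= 4.10 for BSON binary vector support
        from bson.binary import Binary, BinaryVectorDtype

        dtype = BinaryVectorDtype.INT8 if quantized else BinaryVectorDtype.FLOAT32
        query_embedding = Binary.from_vector(query_embedding, dtype)

    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": path,
                "numCandidates": 96,  # candidates considered for similarity (illustrative)
                "limit": limit,       # documents returned by the vector stage
            }
        },
        {
            "$project": {
                "_id": 0,
                "sentence": 1,
                "similarityScore": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(collection.aggregate(pipeline))
```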

Step 8: Run Vector Search Query

In this step, we will use the vector_search function across different embedding configurations. This comparison allows us to evaluate the performance and accuracy of various vector representations in semantic search tasks.

[ ]

In the code snippet above, we execute three distinct vector searches using the query "what is the speed of light?", each targeting a different embedding format:

  1. results_bson_float: Uses BSON-encoded float32 embeddings, combining the precision of floating-point numbers with the efficiency of BSON storage.
  2. results_bson_int8: Uses BSON-encoded quantized (int8) embeddings, offering a more compact representation that may trade some precision for improved storage and query efficiency.
  3. results_float: Uses standard float32 embeddings without BSON encoding, serving as a baseline for comparison.

These results will provide insights into how different embedding representations affect semantic search outcomes, helping you choose the most suitable approach for your specific use cases; typical factors to consider are storage optimization, search speed, and result precision.
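One way to sketch the three searches side by side, assuming the vector_search helper from the previous step and an embedding function (all names are illustrative):

```python
from pprint import pprint

def run_comparison(vector_search, collection, get_embeddings, query="what is the speed of light?"):
    """Run the three search configurations and print their results."""
    configs = [
        ("Float32 Results", {"quantized": False, "use_bson": False}),
        ("BSON Float32 Results", {"quantized": False, "use_bson": True}),
        ("BSON Int8 Results", {"quantized": True, "use_bson": True}),
    ]
    results = {}
    for title, kwargs in configs:
        results[title] = vector_search(query, collection, get_embeddings, **kwargs)
        print(title)
        pprint(results[title])
        print("-----------------\n")
    return results
```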

[ ]
Float32 Results
[{'sentence': 'The speed of light in vacuum is 299,792,458 meters/second.',
  'similarityScore': 0.8345031142234802},
 {'sentence': 'Water boils at 100 degrees Celsius at standard atmospheric '
              'pressure.',
  'similarityScore': 0.7116270065307617},
 {'sentence': 'Jupiter is the largest planet in our solar system.',
  'similarityScore': 0.7062681317329407},
 {'sentence': 'DNA contains the genetic instructions for all living organisms.',
  'similarityScore': 0.6969594955444336},
 {'sentence': 'There are 118 elements in the periodic table.',
  'similarityScore': 0.695982813835144}]
-----------------

BSON Float32 Results
[{'sentence': 'The speed of light in vacuum is 299,792,458 meters/second.',
  'similarityScore': 0.8345031142234802},
 {'sentence': 'Water boils at 100 degrees Celsius at standard atmospheric '
              'pressure.',
  'similarityScore': 0.7116270065307617},
 {'sentence': 'Jupiter is the largest planet in our solar system.',
  'similarityScore': 0.7062681317329407},
 {'sentence': 'DNA contains the genetic instructions for all living organisms.',
  'similarityScore': 0.6969594955444336},
 {'sentence': 'There are 118 elements in the periodic table.',
  'similarityScore': 0.695982813835144}]
-----------------

BSON Int8 Results
[{'sentence': 'The speed of light in vacuum is 299,792,458 meters/second.',
  'similarityScore': 0.8344717025756836},
 {'sentence': 'Water boils at 100 degrees Celsius at standard atmospheric '
              'pressure.',
  'similarityScore': 0.7123053669929504},
 {'sentence': 'Jupiter is the largest planet in our solar system.',
  'similarityScore': 0.7069313526153564},
 {'sentence': 'DNA contains the genetic instructions for all living organisms.',
  'similarityScore': 0.6975141763687134},
 {'sentence': 'There are 118 elements in the periodic table.',
  'similarityScore': 0.6969249844551086}]
-----------------

Conclusion

All code presented in this tutorial is available on GitHub.

Based on these results, we can draw several important conclusions about the benefits of quantization and BSON encoding in vector search operations:

We will start with the most important consideration: accuracy preservation. The results demonstrate that MongoDB’s native support for vectors in BSON and quantization maintain a high degree of accuracy in semantic search. The top result for all three methods is identical, correctly identifying the sentence about the speed of light, with very similar similarity scores (0.8345 for float32 and BSON float32, 0.8344 for BSON int8).

Another factor to consider is the consistency across formats. The order of results and the similarity scores are remarkably consistent across all three formats. This suggests that the use of BSON encoding and even quantization to int8 does not significantly compromise the quality of search results. Based on this, let’s recap the benefits of BSON and quantized vectors for AI applications.

1. Benefits of native vector BSON type:

  • Storage efficiency: While not directly visible in the results from this tutorial, BSON encoding typically offers more compact storage compared to raw float32 values, potentially reducing database size. Follow the advanced notebook to get an overview of BSON's storage efficiency.
  • Query performance: BSON's binary format can lead to faster serialization/deserialization, potentially improving query speeds in large-scale applications.

2. Advantages of quantization (int8):

  • Space efficiency: Int8 representation requires significantly less storage space compared to float32, which can be crucial for large datasets.
  • Potential performance gains: Reduced data size can lead to faster data transfer and potentially quicker computations, which is especially beneficial for resource-constrained environments.

3. Minimal precision loss: The slight variations in similarity scores for the BSON int8 results (e.g., 0.8344 vs 0.8345 for the top result) indicate that the precision loss from quantization is minimal and does not affect the overall ranking of results.

The results of this tutorial demonstrate that you can leverage the benefits of both BSON encoding and quantization without significant compromise to search accuracy.

In an internal experiment, we compared the storage efficiency of different representations for vector embeddings: arrays of floats and BSON vectors, both at float32 precision. The test dataset consisted of three million documents, each containing embeddings generated by the cohere-embed-multilingual-v3.0 model with 1024 dimensions, as in this tutorial. The results were as follows:

	Representation (3M docs, cohere-embed-multilingual-v3.0 @ 1024d) | MongoDB Storage Size (GB)
	Array of floats                                                  | 41
	BSON vectors (float32)                                           | 14

This comparison demonstrates a significant reduction in storage requirements when using BSON vectors with float32 precision. The BSON representation required only about 34% of the storage space needed for the array of floats representation, resulting in approximately 66% space savings. This substantial difference in storage efficiency has important implications for large-scale vector databases.

BSON encoding offers improved storage and potential performance benefits, while further quantization to int8 from float32 embeddings, as in this tutorial, can provide additional space savings with negligible impact on result quality.

The choice between these formats (array of floats, BSON vectors, or quantized representations) should be based on specific application needs, balancing storage constraints, processing speed requirements, and the level of precision required for the particular use case.

A key takeaway is that MongoDB Atlas enhances your AI stack by providing robust indexing support for quantized embedding model outputs from providers such as Cohere. This capability significantly improves the scalability of AI workloads, particularly those involving vector search, allowing them to handle substantially higher volumes. The smaller memory footprint of quantized embeddings results in more efficient storage utilization and cost-effectiveness. These features make MongoDB Atlas an excellent choice for organizations looking to optimize their AI applications, particularly those involving large-scale vector search operations.

This tutorial covers an introductory approach to evaluating the benefits of quantized embeddings. For a more detailed and advanced exploration of the memory optimization benefits of quantized vectors and BSON data format, check out the GitHub notebook.

[ ]