
Automatic Quantization of Nomic Embeddings with MongoDB

advanced_techniques, agents, artificial-intelligence, llms, mongodb-genai-showcase, notebooks, generative-ai, rag

Optimizing Vector Database Performance: Reducing Retrieval Latency with Quantization

Open In Colab


Summary

This notebook explores techniques for optimizing vector database performance, focusing on reducing retrieval latency through the use of quantization methods. We examine the practical application of three embedding types:

  • float32_embedding
  • int8_embedding
  • binary_embedding

We analyze their impact on query precision and retrieval speed.

By leveraging quantization strategies like scalar and binary quantization, we highlight the trade-offs between precision and efficiency.

The notebook also includes a step-by-step demonstration of executing vector searches, measuring retrieval latencies, and visualizing results in a comparative framework.

Use Case:

The notebook demonstrates how to optimize vector database performance, specifically focusing on reducing retrieval latency using quantization methods.

Scenario: You have a large dataset of text data (in this case, a book from Gutenberg) and you want to build a system that can efficiently find similar pieces of text based on a user's query.

Approach:

  • Embeddings: The notebook uses SentenceTransformer to convert text into numerical vectors (embeddings) which capture the semantic meaning of the text.
  • Vector Database: MongoDB is used as a vector database to store and search these embeddings efficiently.
  • Quantization: To speed up retrieval, the notebook applies quantization techniques (scalar and binary) to the embeddings. This reduces the size of the embeddings, making searches faster but potentially impacting precision.
  • Goal: By comparing the performance of different embedding types (float32, int8, binary), the notebook aims to show the trade-offs between retrieval speed and accuracy when using quantization. This helps in choosing the best approach for a given use case.

Step 1: Install Libraries

Here's a breakdown of the libraries and their roles:

  • unstructured: This library is used to process and structure various data formats, including text, enabling efficient analysis and extraction of information.
  • pymongo: This library provides the tools necessary to interact with MongoDB, allowing for the storage and retrieval of data within the project.
  • nomic: This library is used for vector embedding and other functions related to Nomic AI's models, specifically for generating and working with text embeddings.
  • pandas: This popular library is used for data manipulation and analysis, providing data structures and functions for efficient data handling and exploration.
  • sentence_transformers: This library is used for generating embeddings for text data using the SentenceTransformer model.

By installing these packages, the code sets up the tools necessary for data processing, embedding generation, and storage with MongoDB.
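The install step can be sketched as a single pip command; versions are left unpinned here, so pin them for reproducible environments.

```shell
# Install the data-processing, embedding, and database libraries used below.
# Note: sentence-transformers pulls in PyTorch, so this download is large.
pip install unstructured pymongo nomic pandas sentence-transformers
```

In a notebook, prefix the command with `%` (`%pip install ...`) so the packages land in the kernel's environment.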

[10]
Note: you may need to restart the kernel to use updated packages.
[3]

Step 2: Data Loading and Preparation

Dataset Information

The dataset used in this example is "Pushing to the Front," an ebook from Project Gutenberg. This book, focusing on self-improvement and success, is freely available for public use.

The code leverages the unstructured library to process this raw text data, transforming it into a structured format suitable for semantic analysis and search. By chunking the text based on titles, the code creates meaningful units that can be embedded and stored in a vector database for efficient retrieval. This preprocessing is essential for building a robust and performant semantic search system.

The code below uses the requests library to fetch the text content of the book "Pushing to the Front" from Project Gutenberg's website. The URL points to the raw text file of the book.
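The fetch can be sketched as follows. The exact URL is an assumption inferred from the eBook number (#21291) shown in the book's header below; verify it against the notebook's own cell.

```python
import requests

# Project Gutenberg plain-text URL for eBook #21291 ("Pushing to the Front").
# The path format is an assumption based on Gutenberg's cache layout.
URL = "https://www.gutenberg.org/cache/epub/21291/pg21291.txt"

response = requests.get(URL, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
raw_text = response.text
print(f"Fetched {len(raw_text):,} characters")
```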

[4]

Data Cleaning: The unstructured library is used to clean and structure the raw text. The group_broken_paragraphs function helps in combining fragmented paragraphs, ensuring better text flow.

[5]

The partition_text function further processes the cleaned text, dividing it into logical sections. These sections could represent chapters, sub-sections, or other meaningful units within the book.

[6]
The Project Gutenberg eBook of Pushing to the Front


This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.


Title: Pushing to the Front


Author: Orison Swett Marden


Release date: May 4, 2007 [eBook #21291]


Chunking by Title: The chunk_by_title function identifies titles or headings within the parsed sections and uses them to create distinct chunks of text. This step is crucial for organizing the data into manageable units for subsequent embedding generation and semantic search.

[7]
[8]
The Project Gutenberg eBook of Pushing to the Front

This ebook is for the use of anyone anywhere in the United States and most other parts of the world at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this ebook or online at www.gutenberg.org. If you are not located in the United States, you will have to check the laws of the country where you are located before using this eBook.

Step 3: Embeddings Generation

[11]
<All keys matched successfully>

The embedding generation might take approximately 20 minutes.

[12]
Generating embeddings: 100%|██████████| 4135/4135 [03:09<00:00, 21.77it/s]
[16]
[17]

When visualizing the dataset values, you will observe that the embedding attributes float32_embedding, int8_embedding, and binary_embedding all have the same values.

In downstream processes, the values of the int8_embedding and binary_embedding attributes for each data point will be converted to their respective data types as a result of MongoDB Atlas's automatic quantization feature.
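A minimal sketch of the resulting document shape (field names are taken from the index definition in Step 6; the sample values are illustrative):

```python
def build_document(text, embedding):
    """Store the same float32 vector under all three fields; Atlas later
    quantizes int8_embedding and binary_embedding at index time."""
    return {
        "text": text,
        "float32_embedding": embedding,
        "int8_embedding": embedding,
        "binary_embedding": embedding,
    }

doc = build_document("Success is the child of drudgery.", [0.12, -0.45, 0.98])
print(sorted(doc.keys()))
```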

[18]

Step 4: MongoDB (Operational and Vector Database)

MongoDB acts as both an operational and vector database for the RAG system. MongoDB Atlas specifically provides a database solution that efficiently stores, queries and retrieves vector embeddings.

Creating a database and collection within MongoDB is made simple with MongoDB Atlas.

  1. First, register for a MongoDB Atlas account. For existing users, sign into MongoDB Atlas.
  2. Follow the instructions. Select Atlas UI as the procedure to deploy your first cluster.

Follow MongoDB’s steps to get the connection string from the Atlas UI. After setting up the database and obtaining the Atlas cluster connection URI, securely store the URI within your development environment.

[19]
[20]
[21]
Connection to MongoDB successful
Collection 'pushing_to_the_front_orison_quantized' created successfully.

Step 5: Data Ingestion

[22]
DeleteResult({'n': 0, 'electionId': ObjectId('7fffffff0000000000000039'), 'opTime': {'ts': Timestamp(1734581901, 1), 't': 57}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1734581901, 1), 'signature': {'hash': b'\xd8+\x12\xed3V\x1bi\x9df-\xf8\x06\x97\xd3p\xa64\x05A', 'keyId': 7390008424139849730}}, 'operationTime': Timestamp(1734581901, 1)}, acknowledged=True)
[23]
Data ingestion into MongoDB completed
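The ingestion step can be sketched as follows; `collection` comes from Step 4 and `documents` from Step 3, and the batch size is an assumption.

```python
def ingest_documents(collection, documents, batch_size=500):
    """Clear out earlier runs, then bulk-insert documents in batches."""
    # This is the delete_many call whose DeleteResult appears above;
    # it makes re-runs of the notebook idempotent.
    collection.delete_many({})
    for i in range(0, len(documents), batch_size):
        collection.insert_many(documents[i : i + batch_size])
    print("Data ingestion into MongoDB completed")
```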

Step 6: Vector Search Index Creation

[24]
[25]
[26]
[27]
{'fields': [{'type': 'vector', 'path': 'float32_embedding', 'numDimensions': 768, 'similarity': 'cosine'}, {'type': 'vector', 'path': 'int8_embedding', 'quantization': 'scalar', 'numDimensions': 768, 'similarity': 'cosine'}, {'type': 'vector', 'path': 'binary_embedding', 'quantization': 'binary', 'numDimensions': 768, 'similarity': 'euclidean'}]}
[28]
Creating index 'vector_index'...
Waiting for 60 seconds to allow index 'vector_index' to be created...
60-second wait completed for index 'vector_index'.
'vector_index'

Step 7: Vector Search Operation

[75]
[77]
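The search can be sketched as one $vectorSearch aggregation per embedding field. The index name and field paths mirror the index definition in Step 6; numCandidates and limit are assumptions.

```python
def vector_search(collection, query_embedding, path="float32_embedding",
                  index_name="vector_index", limit=5):
    """Run a $vectorSearch over one embedding field and return scored results."""
    pipeline = [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,              # float32_embedding, int8_embedding,
                "queryVector": query_embedding,  # or binary_embedding
                "numCandidates": 100,      # candidates considered before limit
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(collection.aggregate(pipeline))
```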

Step 8: Retrieving Documents and Analysing Results

[82]

One key point to note: If you’ve followed this example with a small dataset, you likely won’t observe significant retrieval latency improvements. Quantization methods truly demonstrate their benefits when dealing with large-scale datasets—on the order of one million or more embeddings—where memory savings and speed gains become substantially more noticeable.
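Retrieval latency across the three embedding fields can be compared with a simple timing harness like the sketch below; `run_search` stands in for the notebook's actual query call.

```python
import time

def measure_latency(search_fn, runs=10):
    """Average wall-clock latency of `search_fn` over `runs` calls, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search_fn()
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

# Usage sketch -- run_search is a stand-in for the notebook's search call:
# for path in ("float32_embedding", "int8_embedding", "binary_embedding"):
#     ms = measure_latency(lambda: run_search(path))
#     print(f"{path}: {ms:.2f} ms")
```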

[83]

Quantization is a powerful tool for optimizing vector database performance, especially in applications that handle high-dimensional embeddings like semantic search and recommendation systems. This tutorial demonstrated the implementation of scalar and binary quantization methods using Nomic embeddings with MongoDB as the vector database. When leveraged appropriately, effective optimization extends beyond latency improvements: it enables scalability, reduces operational costs, and enhances the application user experience.

The benefits of database optimization:

  • Latency Reduction for Improved User Experience: Minimizing delays in data retrieval enhances user satisfaction and engagement.
  • Efficient Handling of Large-Scale Data: Optimized databases can more effectively manage vast amounts of data, improving performance and scalability.
  • Cost Reduction and Resource Efficiency: Efficient data storage and retrieval reduce the need for excessive computational resources, leading to cost savings.

By examining the trade-offs between retrieval accuracy and performance across different embedding formats (float32, int8, and binary), we showcased how MongoDB's capabilities, such as vector indexing and automatic quantization, can streamline data storage, retrieval, and analysis.

From this tutorial, we’ve explored Atlas Vector Search native capabilities for scalar quantization as well as binary quantization with rescoring. Our implementation showed that automatic quantization increases scalability and cost savings by reducing the storage and computational resources needed for efficient processing of vectors. In most cases, automatic quantization reduces the RAM required by mongot by 3.75x for scalar quantization and by 24x for binary; the vector values shrink by 4x and 32x, respectively, but the Hierarchical Navigable Small World (HNSW) graph itself does not shrink.
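The 4x and 32x vector-size reductions follow directly from component widths for these 768-dimensional vectors, as a quick calculation shows:

```python
DIMS = 768  # dimensionality from the index definition in Step 6

float32_bytes = DIMS * 4   # 4 bytes per float32 component
int8_bytes    = DIMS * 1   # 1 byte per int8 component (scalar quantization)
binary_bytes  = DIMS // 8  # 1 bit per component (binary quantization)

print(float32_bytes, int8_bytes, binary_bytes)  # 3072 768 96
print(float32_bytes / int8_bytes)               # 4.0  -> scalar shrink
print(float32_bytes / binary_bytes)             # 32.0 -> binary shrink
```

The RAM savings quoted above (3.75x and 24x) are smaller than these raw ratios because the HNSW graph itself is not compressed.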

We recommend automatic quantization if you have a large number of full-fidelity vectors, typically over 10M vectors. After quantization, you index reduced-representation vectors without compromising the accuracy of your retrieval. To further explore quantization techniques and their applications, refer to resources like Ingesting Quantized Vectors with Cohere. An additional notebook for comparing retrieval accuracy between quantized and non-quantized vectors is also available to deepen your understanding of these methods.