AI Database Performance Comparison For AI Workloads: PostgreSQL/PgVector vs MongoDB Atlas Vector Search

Open In Colab

	Note: This resource is intended to provide performance guidance for AI workloads using vector data within databases. It is not meant to be an official or comprehensive benchmark; rather, it is a guide to help you understand the performance characteristics of these databases under specific search patterns and workloads, enabling you to make an informed decision on which database to use for your AI workload.

Just because a database has traditionally been used for a specific workload doesn't mean that it is the best fit for that workload.

What this notebook doesn't provide:

  • A comprehensive benchmark for all databases
  • A cost analysis for the databases and workloads

Introduction:

Welcome to this comprehensive notebook, where we provide performance insights for MongoDB and PostgreSQL—two of the most widely used databases in AI workloads.

In this session, we analyze the performance results of a variety of search mechanisms, including:

  • Vector Search
  • Hybrid Search

What You’ll Learn:

  • PostgreSQL with pgvector:
    • How to set up a PostgreSQL database with the pgvector extension.
    • How to run text, vector, and hybrid searches on PostgreSQL.
  • MongoDB Atlas Vector Search:
    • How to set up a MongoDB Atlas database with native Vector Search capabilities.
    • How to execute text, vector, and hybrid searches on MongoDB Atlas.
  • AI Workload Overview:
    • This notebook showcases a standard AI workload involving vector embeddings and the retrieval of semantically similar documents.
    • The system leverages two different vector search solutions:
      • PostgreSQL with pgvector: A powerful extension that integrates vector search capabilities directly into PostgreSQL.
      • MongoDB Atlas Vector Search: A native vector search feature built into MongoDB, optimized for modern, document-based applications.
  • AI Workload Metrics:
    • Latency: The time it takes to retrieve the top n results
    • Throughput: The number of queries processed per second
    • P95 Latency: The 95th percentile latency of the queries

Database Platforms:

For this performance guidance, we utilize:

  • MongoDB Atlas: A fully managed, cloud-native database designed for modern applications.
  • Neon: A serverless, fully managed PostgreSQL database optimized for cloud deployments.

Whether your focus is on MongoDB or PostgreSQL, this notebook is designed to help you understand their performance characteristics and guide you in achieving optimal performance for your AI workloads.

Key Information

  1. System Configuration

MongoDB Atlas (M30 → M40) vs. Neon (4 → 8 vCPUs) Comparison

Important Note on Resource Allocation Disparities

When interpreting the performance results in this notebook, it's essential to consider the significant resource allocation differences between the tested systems:

MongoDB Atlas (M30 → M40)
  • Minimum: 2 vCPUs, 8 GB RAM (M30)
  • Maximum: 4 vCPUs, 16 GB RAM (M40)
Neon PostgreSQL
  • Minimum: 4 vCPUs, 16 GB RAM
  • Maximum: 8 vCPUs, 32 GB RAM

This means Neon PostgreSQL has twice the compute resources at both minimum and maximum configurations compared to MongoDB Atlas. This resource disparity significantly impacts performance results interpretation in several ways:

  1. Performance per Resource Unit: If MongoDB shows comparable or better performance despite having fewer resources, this suggests higher efficiency per compute unit.

  2. Cost Considerations: Higher resource allocation typically incurs higher costs.

  3. Scaling Behavior: Both systems can scale, but across different resource ranges. Performance gains from scaling might manifest differently due to these distinct scaling ranges.

| Attribute | MongoDB Atlas (M30 → M40) | Neon (Autoscaling: 4 → 8 vCPUs) |
| --- | --- | --- |
| vCPUs | Min: M30 → 2 vCPUs (8 GB RAM); Max: M40 → 4 vCPUs (16 GB RAM) | Min: 4 vCPUs (16 GB RAM); Max: 8 vCPUs (32 GB RAM) |
| Memory (RAM) | M30: 8 GB; M40: 16 GB | Min: 16 GB; Max: 32 GB |
| Storage | M30: ~40 GB included; M40: ~80 GB included (can scale further for additional cost) | Remote “pageserver” stores primary data; local disk for temp files: 20 GB or 15 GB × 8 CUs (whichever is higher) |
| Autoscaling (Compute) | Cluster tier scaling: can move between M30 and M40; storage scaling: automatically grows storage | Compute autoscaling: 4 to 8 vCPUs; Scale to Zero: optional (after 5 min idle) |
| IOPS | ~2000+ on M30, higher on M40 | “Burstable” IOPS from cloud storage, with a Local File Cache for frequently accessed data |
| Max Connections | ~6000 (M30); ~12000 (M40) | ~1802 (4 vCPUs); ~3604 (8 vCPUs) |
| Scale to Zero | Not supported | Optional; if enabled, compute suspends when idle (adds startup latency) |
| Restart Behavior on Resizing | Moving from M30 to M40 triggers a brief re-provisioning; minimal downtime, but connections can be interrupted | Autoscaling within 4–8 vCPUs does not restart connections; editing min/max or toggling Scale to Zero triggers a restart |
| Local Disk for Temp Files | Adequate for normal ops; M40 has more local disk | At least 20 GB local disk, or 15 GB × 8 CUs = 120 GB if that’s higher |
| Release Updates | Minor updates auto-applied; major version upgrades can be scheduled | Weekly or scheduled updates; manual restart may be needed if Scale to Zero is disabled and you want the latest compute engine updates |

Key Points

  • Resource Range

    • MongoDB Atlas scales from 2 vCPUs/8 GB (M30) to 4 vCPUs/16 GB (M40).
    • Neon ranges from 4 vCPUs/16 GB to 8 vCPUs/32 GB.
  • Closer Parity at M40

    • When Atlas scales to M40, it matches Neon’s minimum (4 vCPUs/16 GB), allowing more direct performance comparisons.
    • Neon can still go beyond M40, up to 8 vCPUs/32 GB, if workload spikes exceed M40 capacity.
  • IOPS and Connections

    • Atlas M30→M40 has higher IOPS and connection limits at each tier.
    • Neon’s IOPS is cloud-based and “burstable,” while connections scale with the CPU (CUs).

In summary, MongoDB Atlas (M30 → M40) is closer to Neon (4 → 8 vCPUs) than previous tiers, especially at the high end (4 vCPUs/16 GB). However, Neon still offers more headroom if your workload demands exceed M40’s capacity.

  2. Data Processing

    • Uses the Wikipedia dataset (100,000 entries) with embeddings (precision: float32, dimensions: 768) generated by Cohere
    • JSON data is generated from the dataset and stored in the databases
    • Stores data in both PostgreSQL and MongoDB
  3. Performance Testing

    • Tests different sizes of concurrent queries (1-400 queries)
    • Tests different insertion batch sizes and speed of insertion

| Operation | Metric | Description |
| --- | --- | --- |
| Insertion | Latency | Time taken to insert the data (average response time) |
| Insertion | Throughput | Number of queries processed per second |
| Retrieval | Latency | Time taken to retrieve the top n results (average response time) |
| Retrieval | Throughput | Number of queries processed per second |
| Retrieval | P95 Latency | Time taken to retrieve the top n results for 95% of the queries |

  4. Results Visualization

    • Interactive animations showing request-response cycles
    • Comparative charts for latency and throughput
    • Performance analysis across different batch sizes

Part 1: Data Setup

Setting up the performance results dictionary performance_guidance_results and the batch sizes to test CONCURRENT_QUERIES and TOTAL_QUERIES

  • performance_guidance_results is a dictionary that will store the results of the tests
  • CONCURRENT_QUERIES is a list of the number of queries that are run concurrently
  • TOTAL_QUERIES is the total number of queries that are run

Performance Guidance Configuration Example: When testing with a concurrency level of 10:

  • We run 100 iterations
  • Each iteration runs 10 concurrent queries
  • Total queries = 1,000 queries (TOTAL_ITERATIONS * CONCURRENT_QUERIES)

NOTE: For each concurrency level in CONCURRENT_QUERIES:

  1. Run TOTAL_ITERATIONS iterations
  2. In each iteration, execute that many concurrent queries
  3. Measure and collect latencies for all queries
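A minimal sketch of what this configuration might look like (the variable names come from the text above; the exact concurrency levels and dictionary shape are assumptions):

```python
# Concurrency levels to test (illustrative values; the notebook may use others).
CONCURRENT_QUERIES = [1, 10, 50, 100, 200, 400]

# Iterations per concurrency level, so total queries scale with concurrency.
TOTAL_ITERATIONS = 100

# Results dictionary keyed by database and operation, filled in during the tests.
performance_guidance_results = {
    "postgres": {"insertion": [], "vector_search": [], "hybrid_search": []},
    "mongodb": {"insertion": [], "vector_search": [], "hybrid_search": []},
}

# Example from the text: concurrency 10 × 100 iterations = 1,000 total queries.
total_queries = TOTAL_ITERATIONS * 10
```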
[ ]
[ ]

Step 1: Install Libraries

All the libraries are installed using pip and facilitate the sourcing of data, embedding generation, and data visualization.

[ ]

Step 2: Data Loading

The dataset for the notebook is sourced from the Hugging Face Cohere Wikipedia dataset.

The Cohere/wikipedia-22-12-en-embeddings dataset on Hugging Face comprises English Wikipedia articles embedded using Cohere's multilingual-22-12 model. Each entry includes the article's title, text, URL, Wikipedia ID, view count, paragraph ID, language codes, and a 768-dimensional embedding vector. This dataset is valuable for tasks like semantic search, information retrieval, and NLP model training.

For this notebook, we are using 100,000 rows of the dataset and have removed the id, wiki_id, paragraph_id, langs and views columns.

[ ]
[ ]
[ ]
[ ]
[ ]

Step 3: Embedding Generation

We use the Cohere API to generate embeddings for the test queries.

To get the Cohere API key, you can sign up for a free account on the Cohere website.

[ ]

Using the Cohere API to generate embeddings for the test queries.

Using the embed-multilingual-v2.0 model. This is the same model used in the Cohere Wikipedia dataset.

Embedding size is 768 dimensions and the precision is float32.

[ ]

Generate embeddings for the query templates

Store the embeddings in a dictionary for easy access

Note: We do this to avoid the overhead of generating embeddings for each query during the performance analysis, as embedding generation is time-consuming and expensive in terms of API usage.

Note: Feel free to add more queries to the query_templates list to test the performance of the vector database with a larger number of queries
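A minimal sketch of this caching step (the query strings are illustrative; in the notebook, `embed_fn` would wrap the Cohere client as shown in the comment):

```python
query_templates = [
    "What is machine learning?",
    "History of the Roman Empire",
]

def build_query_embedding_cache(queries, embed_fn):
    # Embed each query once up front and cache the result by query text,
    # so the benchmark loop never has to call the embedding API.
    return {q: embed_fn(q) for q in queries}

# In the notebook, embed_fn would wrap the Cohere client, for example:
# embed_fn = lambda q: co.embed(texts=[q], model="embed-multilingual-v2.0").embeddings[0]
```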

[ ]
[ ]
[ ]

Part 2: Retrieval Mechanisms with PostgreSQL and PgVector

In this section, we create a PostgreSQL database with the PgVector extension and insert the dataset into the database.

We are also going to implement various search mechanisms on the database to test the performance of the database under certain search patterns and workloads. Specifically, we are going to implement a semantic search mechanism on the database via vector search and a hybrid search mechanism on the database via vector search and text search.

The table wikipedia_data is created with the following columns:

  • id: The unique identifier for each row
  • title: The title of the Wikipedia article
  • text: The text of the Wikipedia article
  • url: The URL of the Wikipedia article
  • json_data: The JSON data of the Wikipedia article
  • embedding: The embedding vector for the Wikipedia article

The table is created with an HNSW index using m=16, ef_construction=64, and cosine similarity (these are the default parameters for the HNSW index in pgvector).

  • HNSW: Hierarchical Navigable Small World graphs, a graph-based index structure used for efficient approximate nearest neighbor search.
  • m=16: The maximum number of edges (connections) per node in the graph.
  • ef_construction=64: Short for exploration factor (construction); the size of the candidate list maintained while building the index. Larger values produce a higher-quality graph at the cost of slower index builds.
  • ef_search=100: Short for exploration factor (search); the size of the candidate list maintained while searching the index. Larger values improve recall at the cost of query latency.
  • cosine similarity: The similarity metric used for the index (formula: dot product(A, B) / (|A||B|))
  • cosine distance: The distance metric calculated from cosine similarity (1 - cosine similarity)
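As a quick numeric check of the two formulas above:

```python
import math

def cosine_similarity(a, b):
    # dot product(A, B) / (|A| * |B|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    # The distance pgvector derives from similarity: 1 - cosine similarity
    return 1.0 - cosine_similarity(a, b)

# Vectors pointing in the same direction → similarity ~1.0, distance ~0.0
same_dir = cosine_similarity([1, 2, 3], [2, 4, 6])
# Orthogonal vectors → similarity 0.0, distance 1.0
orthogonal = cosine_distance([1, 0], [0, 1])
```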

We perform a semantic search on the database using a single data point of the query templates and their corresponding embeddings.

Step 1: Install Libraries

[ ]

Step 2: Create Postgres Table

Key aspects of the PostgreSQL table creation:

  • id: The unique identifier for each row stored with the data type bigserial which is a 64-bit integer and auto-incremented.
  • title: The title of the Wikipedia article stored with the data type text which is a variable character string.
  • text: The text of the Wikipedia article stored with the data type text which is a variable character string.
  • url: The URL of the Wikipedia article stored with the data type text which is a variable character string.
  • json_data: The JSON data of the Wikipedia article stored with the data type jsonb which is a binary formatted JSON data type.
  • embedding: The embedding vector for the Wikipedia article stored with the data type vector(768), which is provided by pgvector and holds 768 dimensions.

Optimizing the table for search:

  • search_vector: The search vector for the Wikipedia article stored with the data type tsvector which is a text search data type in PostgreSQL.
  • The expression inside the GENERATED ALWAYS AS clause combines the text (title and text) to be tokenized and indexed for full-text search.
  • Using coalesce to handle any null values in the title or text columns.
  • STORED: This keyword indicates that the search_vector column is physically stored in the table; this avoids the overhead of recalculating the search_vector column during queries and improves performance.

Extra:

  • The search_vector column is computed automatically using the text in the title and text fields, making full-text search more efficient by avoiding on-the-fly computation.
  • The HNSW index on the embedding column is optimized for ANN queries using cosine similarity, which is crucial for semantic search.
  • The GIN indexes on both the json_data and search_vector columns ensure fast query performance on JSONB queries and full-text search, respectively.
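A hedged sketch of the DDL described above; the exact statement lives in the code cell below and may differ, and the index names here are illustrative (table and column names are taken from the text):

```python
# Sketch of the table + index DDL described above; connection handling omitted.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS wikipedia_data (
    id bigserial PRIMARY KEY,
    title text,
    text text,
    url text,
    json_data jsonb,
    embedding vector(768),
    -- Generated, stored tsvector over title + text for full-text search
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(text, ''))
    ) STORED
);

-- HNSW index for cosine-similarity ANN search (default m / ef_construction)
CREATE INDEX IF NOT EXISTS idx_wikipedia_embedding
    ON wikipedia_data USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- GIN indexes for JSONB queries and full-text search
CREATE INDEX IF NOT EXISTS idx_wikipedia_json ON wikipedia_data USING gin (json_data);
CREATE INDEX IF NOT EXISTS idx_wikipedia_fts ON wikipedia_data USING gin (search_vector);
"""
```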
[ ]

Step 4: Define insert function

For inserting JSON data, we convert the Python dictionary in the json_data attribute to a JSON string using the json.dumps() function.

This is a serialization step: the Python dictionary becomes a JSON string, which PostgreSQL then stores as binary data in the jsonb column.
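For example, with an illustrative record (real rows come from the Wikipedia dataset):

```python
import json

# Illustrative row; real rows come from the Wikipedia dataset.
row = {"title": "Alan Turing", "views": 12345, "langs": ["en"]}

# Serialize the dict to a JSON string suitable for the jsonb column.
json_str = json.dumps(row)

# PostgreSQL parses this string back into binary JSONB on insert;
# round-tripping locally shows the serialization is lossless for JSON types.
round_tripped = json.loads(json_str)
```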

[ ]
[ ]
[ ]

Step 5: Insert Data into Postgres

[ ]

Step 6: Define text search function with postgres

[ ]
[ ]

Step 7: Define vector search function with postgres

To avoid exhausting API key usage, we fetch the query embedding from the query_embeddings_dict dictionary.

In the vector_search_with_postgres function, we set the HNSW ef_search parameter to 100 using the execute_command function.

This sets the exploration factor for the HNSW index to 100, i.e., the number of candidate nodes to examine during the index search phase; a node corresponds to a vector in the index.
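A hypothetical sketch of the SQL this function might run (the actual implementation is in the cell below; the column names follow the table definition above):

```python
def build_pgvector_search_sql(top_n: int):
    # Session-level setting: candidate list size for HNSW search.
    setup = "SET hnsw.ef_search = 100;"
    # <=> is pgvector's cosine-distance operator; ordering by it ascending
    # returns the most similar rows first.
    query = (
        "SELECT title, text, url, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM wikipedia_data "
        "ORDER BY embedding <=> %s::vector "
        f"LIMIT {top_n};"
    )
    return setup, query
```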

[ ]
[ ]

Step 8: Define hybrid search function with postgres

[ ]
[ ]

Part 3: Retrieval Mechanisms with MongoDB Atlas

Step 1: Install Libraries

[ ]

Step 2: Create MongoDB Atlas Account

TODO: Place information required to create a MongoDB Atlas account.

Step 3: Connect to MongoDB and Create Database and Collection

[ ]

In the following code blocks, we do the following:

  1. Establish a connection to the MongoDB database
  2. Create a database and collection if they do not already exist
  3. Delete all data in the collection if it already exists
[ ]
[ ]
[ ]

Step 4: Vector Index Creation

The setup_vector_search_index function creates a vector search index for the MongoDB collection.

The index_name parameter is the name of the index to create.

The embedding_field_name parameter is the name of the field containing the text embeddings on each document within the wikipedia_data collection.

[ ]

Filtering your data is useful to narrow the scope of your semantic search and ensure that not all vectors are considered for comparison. It reduces the number of documents against which to run similarity comparisons, which can decrease query latency and increase the accuracy of search results.

You must index the fields that you want to filter by using the filter type inside the fields array.

[ ]
[ ]

An Atlas Search index is a data structure that categorizes data in an easily searchable format. It is a mapping between terms and the documents that contain those terms. Atlas Search indexes enable faster retrieval of documents using certain identifiers. You must configure an Atlas Search index to query data in your Atlas cluster using Atlas Search.

You can create an Atlas Search index on a single field or on multiple fields. We recommend that you index the fields that you regularly use to sort or filter your data in order to quickly retrieve the documents that contain the relevant data at query-time.

[ ]
[ ]

Step 5: Define Insert Data Function

Because of MongoDB's affinity for JSON data, we don't have to convert the Python dictionary in the json_data attribute to a JSON string using the json.dumps() function. Instead, we can insert the Python dictionary directly into the MongoDB collection.

This reduces the operational overhead of the insertion processes in AI workloads.

[ ]

Step 6: Insert Data into MongoDB

[ ]
[ ]

Step 7: Define Text Search Function

The text_search_with_mongodb function performs a text search in the MongoDB collection based on the user query.

  • query_text parameter is the user's query string.
  • collection parameter is the MongoDB collection to search.
  • top_n parameter is the number of top results to return.
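A hedged sketch of the aggregation pipeline such a function might build (the index name and search paths are assumptions):

```python
def build_text_search_pipeline(query_text: str, top_n: int,
                               index_name: str = "text_search_index"):
    # $search runs the Atlas Search full-text query; $project keeps the
    # useful fields plus the relevance score; $limit caps the result count.
    return [
        {"$search": {
            "index": index_name,
            "text": {"query": query_text, "path": ["title", "text"]},
        }},
        {"$project": {"title": 1, "text": 1, "url": 1,
                      "score": {"$meta": "searchScore"}}},
        {"$limit": top_n},
    ]
```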
[ ]
[ ]
[ ]

Step 8: Define Vector Search Function

The semantic_search_with_mongodb function performs a vector search in the MongoDB collection based on the user query.

  • user_query parameter is the user's query string.
  • collection parameter is the MongoDB collection to search.
  • top_n parameter is the number of top results to return.
  • vector_search_index_name parameter is the name of the vector search index to use for the search.

The numCandidates parameter is the number of candidate matches to consider. This is set to 100 to match the number of candidate matches to consider in the PostgreSQL vector search.

Another point to note is that queries in MongoDB are performed using the aggregate function, enabled by the MongoDB Query Language (MQL).

This allows for more flexibility in the queries and the ability to perform more complex searches, since data processing operations are defined as stages in a pipeline. If you are a data engineer, data scientist, or ML engineer, pipeline processing will be a familiar concept.
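A hedged sketch of the $vectorSearch pipeline such a function might build (index and field names are assumptions; numCandidates mirrors the pgvector ef_search setting described above):

```python
def build_vector_search_pipeline(query_embedding, top_n: int,
                                 vector_search_index_name: str = "vector_index"):
    # $vectorSearch performs the ANN query; numCandidates=100 matches
    # the candidate count used on the PostgreSQL side.
    return [
        {"$vectorSearch": {
            "index": vector_search_index_name,
            "path": "embedding",
            "queryVector": query_embedding,
            "numCandidates": 100,
            "limit": top_n,
        }},
        {"$project": {"title": 1, "text": 1, "url": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]
```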

[ ]
[ ]
[ ]

Step 9: Define Hybrid Search Function

The hybrid_search_with_mongodb function conducts a hybrid search on a MongoDB Atlas collection that combines a vector search and a full-text search using Atlas Search.

In the MongoDB hybrid search function, there are two weights:

  • vector_weight = 0.5: This weight scales the score obtained from the vector search portion.
  • full_text_weight = 0.5: This weight scales the score from the full-text search portion.

These two weights control the influence of each search component on the final score. Here's how they work:

Purpose: The weights allow you to adjust how much the vector (semantic) search and the full-text search contribute to the overall ranking. For example, a higher full_text_weight means that the full-text search results will have a larger impact on the final score, whereas a higher vector_weight would give more importance to the vector similarity score.

Usage in the Pipeline: Within the aggregation pipeline, after retrieving results from each search type, the function computes a reciprocal ranking score for each result (using an expression like 1/(rank + 60)). This score is then multiplied by the corresponding weight:

Vector Search:

	"vs_score": {
  "$multiply": [ vector_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
}

Full-Text Search:

	"fts_score": {
  "$multiply": [ full_text_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
}

Finally, these weighted scores are combined (typically by adding them together) to produce a final score that determines the ranking of the documents.

Impact: By adjusting these weights, you can fine-tune the search results to better match your application's needs. For instance, if the full-text component is more reliable for your dataset, you might set full_text_weight higher than vector_weight.

The weights in the MongoDB function allow you to balance the contributions from vector-based and full-text search components, ensuring that the final ranking score reflects the desired importance of each search method.
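The weighted reciprocal-rank computation above can be reproduced in plain Python:

```python
def rrf_term(rank: int, weight: float, k: int = 60) -> float:
    # Matches the pipeline expression: weight * (1 / (rank + 60))
    return weight * (1.0 / (rank + k))

# A document ranked 1st by vector search and 3rd by full-text search,
# with equal weights of 0.5, gets the sum of the two weighted terms:
final_score = rrf_term(1, 0.5) + rrf_term(3, 0.5)
```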

[ ]
[ ]
[ ]

Part 4: Vector Database Performance Analysis

1. Insertion Performance Analysis Process

We insert data incrementally with doubling batch sizes and record performance metrics: the time it takes to insert each batch and the number of rows inserted per second.

We use the insert_data_incrementally function for this.

It starts with a batch size of 1 and doubles the batch size until all the data is inserted, recording the insertion time and rows-per-second for each batch.

These metrics matter because AI workloads typically include data ingestion processes performed in batches from various data sources, so batch insertion latency and throughput directly reflect real-world ingestion performance.
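A minimal sketch of the doubling-batch loop described above (the real function is in the cell below; `insert_fn` stands in for the database-specific insert):

```python
import time

def insert_data_incrementally(insert_fn, data):
    """Insert `data` in doubling batches, recording per-batch latency
    and rows per second. `insert_fn(batch)` does the actual insert."""
    results, batch_size, i = [], 1, 0
    while i < len(data):
        batch = data[i:i + batch_size]
        start = time.perf_counter()
        insert_fn(batch)
        elapsed = time.perf_counter() - start
        results.append({
            "batch_size": len(batch),
            "latency_s": elapsed,
            "rows_per_s": len(batch) / elapsed if elapsed else float("inf"),
        })
        i += batch_size
        batch_size *= 2  # double the batch size each round
    return results
```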

[ ]

1.1 PostgreSQL Insertion Performance Analysis

[ ]

1.2 MongoDB Insertion Performance Analysis

[ ]

1.3 Visualize Insertion Performance Analysis

[ ]
[ ]

2. Semantic Search Performance Analysis

2.1 PostgreSQL and PgVector Semantic Search Performance Analysis

[ ]
[ ]
[ ]
[ ]

2.2 MongoDB Semantic Search Performance Analysis

[ ]
[ ]
[ ]

2.3 Visualize Vector Search Performance Analysis

[ ]
[ ]

Part 5: Extra Notes

5.1 PostgreSQL JSONB vs MongoDB BSON

| Feature | PostgreSQL JSONB | MongoDB BSON |
| --- | --- | --- |
| Integration | An extension to a relational database system. | Native to MongoDB, a document database. |
| Query Language | Uses SQL with JSONB-specific operators/functions. | Uses MongoDB Query Language (MQL), a JSON-like query syntax. |
| Storage Optimization | Optimized for relational data alongside JSONB. | Fully optimized for JSON-like document storage. |
| Data Type Support | Stores standard JSON data types (e.g., strings, numbers). | Includes additional types not in standard JSON (e.g., Date, ObjectId, Binary). |
| Use Case | Best for hybrid relational/JSON use cases. | Designed for flexible schemas and document-based databases. |
| Updates | Supports in-place updates for specific keys or paths. | Supports in-place updates with more native support for field-level atomic operations. |
| Size Overhead | Slightly more compact than BSON in some cases. | Includes metadata like type information, leading to slightly larger size. |

5.2 Limitations of pgvector for Handling Large-Dimensional Embeddings

While pgvector is a powerful tool for storing and searching vector embeddings in PostgreSQL, it does have inherent limitations when it comes to handling very high-dimensional embeddings. Here are the key points:

  • PostgreSQL Page Size Constraint:

    • Reason: PostgreSQL uses fixed 8KB pages for data storage.
    • Impact: Each 32-bit float occupies 4 bytes, so storing a vector with many dimensions quickly exhausts the available space on a page.
    • Practical Limit: This design limits indexed vectors to around 2000 dimensions unless alternative approaches (such as quantization or splitting the vector) are used.
  • Index Tuple Size Limit:

    • Reason: Even if the underlying table supports larger vectors (up to 16,000 dimensions), the index tuples themselves are constrained by the 8KB limit.
    • Impact: Attempting to build an index on vectors exceeding this limit results in errors or performance degradation.
  • Trade-offs in Workarounds:

    • Quantization: Converting vectors from 32-bit floats to lower precision (e.g., half-precision) can allow for more dimensions but may reduce accuracy.
    • Splitting Vectors: Dividing a high-dimensional vector across multiple columns or rows increases complexity in reconstructing the original vector for search and may affect retrieval speed.
    • Alternative Data Types: Some projects (like pgvecto.rs) bypass these limitations by handling indexing outside PostgreSQL, but this sacrifices the ACID guarantees that pgvector provides.
  • Implications for AI Workloads:

    • Model Compatibility: Many modern embedding models (e.g., OpenAI’s text-embedding-3-large with 3072 dimensions) produce embeddings that exceed pgvector’s optimal indexed dimension size, potentially forcing truncation or quantization.
    • Search Quality: These workarounds (truncation, quantization, or splitting) can impact the precision and recall of similarity searches—a critical factor for many AI applications.

5.3 Workaround Options for High-Dimensional Embeddings in pgvector

The pgvector project has received several suggestions and workaround proposals to mitigate the limitation of indexing high-dimensional vectors (beyond ~2000 dimensions). Two key comments from issues #326 and #395 detail some of these options.


5.3.1 Option 1: Use Lower Precision with halfvec

  • Description:
    Convert full-precision (fp32) vectors to half-precision (fp16) for indexing purposes.

  • How It Works:

    • Storage: Vectors are stored as vector(n) (still in fp32) in the table.
    • Indexing: When creating the index, the vector is cast to the halfvec type (fp16) using syntax like:
      	CREATE INDEX ON items USING hnsw ((embedding::halfvec(n)) halfvec_l2_ops);
      
      
  • Benefits:

    • Smaller Index Size: fp16 values require half the storage of fp32, which can allow more dimensions to fit within the 8KB index tuple limit.
    • Faster Index Build: Smaller data size can lead to quicker index creation.
  • Trade-offs:

    • Loss of Precision: Quantizing from 32-bit to 16-bit floats introduces rounding errors.
    • Impact on Recall: Testing has shown that recall remains nearly identical in many cases, but the loss in precision may not be acceptable for all applications.
  • Appropriateness:
    This workaround is appropriate if the application can tolerate a slight reduction in numerical precision without significantly affecting the quality of similarity search results.


5.3.2 Option 2: Split the Embedding Across Multiple Rows or Columns

  • Description:
    Divide a high-dimensional vector into multiple smaller vectors that can be stored and indexed separately.

  • How It Works:

    • Schema Change: Instead of storing one vector with dimensions greater than the limit, split it into two or more parts (e.g., a 3072-dimensional vector into one part of 2000 dimensions and another of 1072 dimensions).
    • Indexing: Build separate indexes for each part and then combine the results (possibly with a re-ranking step) during query time.
  • Benefits:

    • Full Precision Retained: No need to quantize the data, so the original accuracy is preserved.
    • Scalability: This approach can support arbitrarily high dimensions by splitting the data.
  • Trade-offs:

    • Complexity: Requires changes to the schema and additional logic in query processing to recombine or re-rank partial results.
    • Performance Overhead: Merging results from multiple indexes can add latency to the search process.
  • Appropriateness:
    This workaround is most appropriate when high precision is critical and the application cannot afford any loss in accuracy. It adds complexity but retains full precision for each vector.
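A minimal sketch of the splitting idea (dimension sizes follow the 3072 → 2000 + 1072 example above; storage and re-ranking logic are omitted):

```python
def split_embedding(vec, max_dim: int = 2000):
    # Split a high-dimensional embedding into chunks that each fit
    # under pgvector's indexable dimension limit.
    return [vec[i:i + max_dim] for i in range(0, len(vec), max_dim)]

# Each part would be stored in its own vector(...) column (or row) and
# indexed separately; query-time results are then merged and re-ranked.
parts = split_embedding([0.0] * 3072)
```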