Vector Database Performance Guidance: MongoDB vs. pgvector
Database Performance Comparison for AI Workloads: PostgreSQL/pgvector vs. MongoDB Atlas Vector Search
Note: This resource provides performance guidance for AI workloads that use vector data within databases. It is not an official or comprehensive benchmark, but a guide to help you understand the performance characteristics of each database under specific search patterns and workloads, so you can make an informed decision about which database to use for your AI workload.
Just because a database has traditionally been used for a specific workload doesn't mean it is the best fit for that workload.
What this notebook doesn't provide:
- A comprehensive benchmark for all databases
- A cost analysis for the databases and workloads
Introduction:
Welcome to this comprehensive notebook, where we provide performance insights for MongoDB and PostgreSQL—two of the most widely used databases in AI workloads.
In this session, we analyse the performance results of a variety of search mechanisms, including:
- Vector Search
- Hybrid Search
What You’ll Learn:
- PostgreSQL with pgvector:
- How to set up a PostgreSQL database with the pgvector extension.
- How to run text, vector, and hybrid searches on PostgreSQL.
- MongoDB Atlas Vector Search:
- How to set up a MongoDB Atlas database with native Vector Search capabilities.
- How to execute text, vector, and hybrid searches on MongoDB Atlas.
- AI Workload Overview:
- This notebook showcases a standard AI workload involving vector embeddings and the retrieval of semantically similar documents.
- The system leverages two different vector search solutions:
- PostgreSQL with pgvector: A powerful extension that integrates vector search capabilities directly into PostgreSQL.
- MongoDB Atlas Vector Search: A native vector search feature built into MongoDB, optimized for modern, document-based applications.
- AI Workload Metrics:
- Latency: The time it takes to retrieve the top n results
- Throughput: The number of queries processed per second
- P95 Latency: The 95th percentile latency of the queries
Database Platforms:
For this performance guidance, we utilize:
- MongoDB Atlas: A fully managed, cloud-native database designed for modern applications.
- Neon: A serverless, fully managed PostgreSQL database optimized for cloud deployments.
Whether your focus is on MongoDB or PostgreSQL, this notebook is designed to help you understand their performance characteristics and guide you in achieving optimal performance for your AI workloads.
Key Information
- System Configuration
MongoDB Atlas (M30 → M40) vs. Neon (4 → 8 vCPUs) Comparison
Important Note on Resource Allocation Disparities
When interpreting the performance results in this notebook, it's essential to consider the significant resource allocation differences between the tested systems:
MongoDB Atlas (M30 → M40)
- Minimum: 2 vCPUs, 8 GB RAM (M30)
- Maximum: 4 vCPUs, 16 GB RAM (M40)
Neon PostgreSQL
- Minimum: 4 vCPUs, 16 GB RAM
- Maximum: 8 vCPUs, 32 GB RAM
This means Neon PostgreSQL has twice the compute resources at both minimum and maximum configurations compared to MongoDB Atlas. This resource disparity significantly impacts performance results interpretation in several ways:
- Performance per Resource Unit: If MongoDB shows comparable or better performance despite having fewer resources, this suggests higher efficiency per compute unit.
- Cost Considerations: Higher resource allocation typically incurs higher costs.
- Scaling Behavior: Both systems can scale, but across different resource ranges. Performance gains from scaling might manifest differently due to these distinct scaling ranges.
| Attribute | MongoDB Atlas (M30 → M40) | Neon (Autoscaling: 4 → 8 vCPUs) |
|---|---|---|
| vCPUs | - Min: M30 → 2 vCPUs (8 GB RAM) - Max: M40 → 4 vCPUs (16 GB RAM) | - Min: 4 vCPUs (16 GB RAM) - Max: 8 vCPUs (32 GB RAM) |
| Memory (RAM) | - M30: 8 GB - M40: 16 GB | - Min: 16 GB - Max: 32 GB |
| Storage | - M30: ~40 GB included - M40: ~80 GB included (Can scale further for additional cost) | - Remote “pageserver” stores primary data - Local disk for temp files: 20 GB or 15 GB × 8 CUs (whichever is higher) |
| Autoscaling (Compute) | - Cluster Tier Scaling: can move between M30 and M40 - Storage Scaling: automatically grows storage | - Compute Autoscaling: 4 to 8 vCPUs - Scale to Zero: optional (after 5 min idle) |
| IOPS | ~2000+ on M30, higher on M40 | “Burstable” IOPS from cloud storage Local File Cache for frequently accessed data |
| Max Connections | - ~6000 (M30) - ~12000 (M40) | - ~1802 (4 vCPUs) - ~3604 (8 vCPUs) |
| Scale to Zero | Not supported | Optional. If enabled, compute suspends when idle (adds startup latency) |
| Restart Behavior on Resizing | - Moving from M30 to M40 triggers a brief re-provisioning - Minimal downtime but connections can be interrupted | - Autoscaling within 4–8 vCPUs does not restart connections - Editing min/max or toggling Scale to Zero triggers a restart |
| Local Disk for Temp Files | Adequate for normal ops; M40 has more local disk | - At least 20 GB local disk, or 15 GB × 8 CUs = 120 GB if that’s higher |
| Release Updates | - Minor updates auto-applied - Major version upgrades can be scheduled | - Weekly or scheduled updates - Manual restart may be needed if Scale to Zero is disabled and you want the latest compute engine updates |
Key Points
- Resource Range
  - MongoDB Atlas scales from 2 vCPUs/8 GB (M30) to 4 vCPUs/16 GB (M40).
  - Neon ranges from 4 vCPUs/16 GB to 8 vCPUs/32 GB.
- Closer Parity at M40
  - When Atlas scales to M40, it matches Neon’s minimum (4 vCPUs/16 GB), allowing more direct performance comparisons.
  - Neon can still go beyond M40, up to 8 vCPUs/32 GB, if workload spikes exceed M40 capacity.
- IOPS and Connections
  - Atlas M30→M40 has higher IOPS and connection limits at each tier.
  - Neon’s IOPS is cloud-based and “burstable,” while connections scale with the compute units (CUs).
In summary, MongoDB Atlas (M30 → M40) is closer to Neon (4 → 8 vCPUs) than previous tiers, especially at the high end (4 vCPUs/16 GB). However, Neon still offers more headroom if your workload demands exceed M40’s capacity.
- Data Processing
  - Uses the Wikipedia dataset (100,000 entries) with embeddings (precision: float32, dimensions: 768) generated by Cohere
  - JSON data is generated from the dataset and stored in both PostgreSQL and MongoDB
- Performance Testing
  - Tests different numbers of concurrent queries (1–400)
  - Tests different insertion batch sizes and insertion speed
| Operation | Metric | Description |
|---|---|---|
| Insertion | Latency | Time taken to insert the data (average response time) |
| Insertion | Throughput | Number of insert operations processed per second |
| Retrieval | Latency | Time taken to retrieve the top n results (average response time) |
| Retrieval | Throughput | Number of queries processed per second |
| Retrieval | P95 Latency | Time within which 95% of the queries retrieve the top n results |
- Results Visualization
- Interactive animations showing request-response cycles
- Comparative charts for latency and throughput
- Performance analysis across different batch sizes
Part 1: Data Setup
Setting up the performance results dictionary `performance_guidance_results` and the concurrency levels to test, `CONCURRENT_QUERIES`, along with the total query count `TOTAL_QUERIES`.
- `performance_guidance_results` is a dictionary that will store the results of the tests
- `CONCURRENT_QUERIES` is a list of the numbers of queries that are run concurrently
- `TOTAL_QUERIES` is the total number of queries that are run
Performance Guidance Configuration Example: When testing with a concurrency level of 10:
- We run 100 iterations
- Each iteration runs 10 concurrent queries
- Total queries = 1,000 queries (TOTAL_ITERATIONS * CONCURRENT_QUERIES)
NOTE: For each concurrency level in CONCURRENT_QUERIES:
- Run enough iterations to reach TOTAL_QUERIES
- In each iteration, execute that many concurrent queries
- Measure and collect latencies for all queries
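The configuration above can be sketched as follows. The concrete concurrency levels and the layout of the results dictionary are illustrative assumptions, not the notebook's exact code:

```python
# Hypothetical configuration for the performance runs (names taken from the
# text: performance_guidance_results, CONCURRENT_QUERIES, TOTAL_QUERIES).
CONCURRENT_QUERIES = [1, 10, 50, 100, 200, 400]  # concurrency levels under test
TOTAL_QUERIES = 1000                              # total queries per concurrency level

# Results are grouped per database and per operation (assumed structure).
performance_guidance_results = {
    "postgres": {"insertion": [], "vector_search": [], "hybrid_search": []},
    "mongodb": {"insertion": [], "vector_search": [], "hybrid_search": []},
}

def iterations_for(concurrency: int) -> int:
    """Iterations needed so that iterations * concurrency == TOTAL_QUERIES."""
    return TOTAL_QUERIES // concurrency
```

For a concurrency level of 10 this yields 100 iterations, matching the worked example above (100 × 10 = 1,000 total queries).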
Step 1: Install Libraries
All the libraries are installed using pip and facilitate the sourcing of data, embedding generation, and data visualization.
- datasets: Hugging Face library for managing and preprocessing datasets across text, image, and audio (https://huggingface.co/datasets)
- sentence_transformers: For creating sentence embeddings for tasks like semantic search and clustering (https://www.sbert.net/)
- pandas: A library for data manipulation and analysis with DataFrames and Series (https://pandas.pydata.org/)
- matplotlib: A library for creating static, interactive, and animated data visualizations (https://matplotlib.org/)
- seaborn: A library for creating statistical data visualizations (https://seaborn.pydata.org/)
- cohere: A library for generating embeddings and accessing the Cohere API or models (https://cohere.ai/)
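In a notebook environment, these can all be installed in a single cell; pin versions as needed for reproducibility:

```shell
pip install datasets sentence_transformers pandas matplotlib seaborn cohere
```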
Step 2: Data Loading
The dataset for the notebook is sourced from the Hugging Face Cohere Wikipedia dataset.
The Cohere/wikipedia-22-12-en-embeddings dataset on Hugging Face comprises English Wikipedia articles embedded using Cohere's multilingual-22-12 model. Each entry includes the article's title, text, URL, Wikipedia ID, view count, paragraph ID, language codes, and a 768-dimensional embedding vector. This dataset is valuable for tasks like semantic search, information retrieval, and NLP model training.
For this notebook, we are using 100,000 rows of the dataset and have removed the id, wiki_id, paragraph_id, langs and views columns.
Step 3: Embedding Generation
We use the Cohere API to generate embeddings for the test queries.
To get a Cohere API key, you can sign up for a free account on the Cohere website.
We use the embed-multilingual-v2.0 model, the same model used to produce the Cohere Wikipedia dataset.
The embedding size is 768 dimensions and the precision is float32.
We generate embeddings for the query templates and store them in a dictionary for easy access.
Note: We do this to avoid the overhead of generating embeddings for each query during the performance analysis, as that is time-consuming and expensive in terms of API usage.
Note: Feel free to add more queries to the query_templates list to test the performance of the vector databases with a larger number of queries.
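A minimal sketch of this caching step, assuming a `COHERE_API_KEY` environment variable and an illustrative `query_templates` list (the real notebook's queries will differ):

```python
# Cache one embedding per query template so the performance runs never pay
# the embedding API cost per query. The queries below are placeholders.
import os

query_templates = [
    "history of the Roman Empire",
    "how do vaccines work",
]

def build_query_embeddings(queries):
    """Embed each query once with Cohere and return a query -> vector dict."""
    import cohere  # deferred import so the sketch reads without the package

    co = cohere.Client(os.environ["COHERE_API_KEY"])
    response = co.embed(texts=queries, model="embed-multilingual-v2.0")
    return dict(zip(queries, response.embeddings))

# query_embeddings_dict = build_query_embeddings(query_templates)
```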
Part 2: Retrieval Mechanisms with PostgreSQL and PgVector
In this section, we create a PostgreSQL database with the PgVector extension and insert the dataset into the database.
We are also going to implement various search mechanisms on the database to test the performance of the database under certain search patterns and workloads. Specifically, we are going to implement a semantic search mechanism on the database via vector search and a hybrid search mechanism on the database via vector search and text search.
The table wikipedia_data is created with the following columns:
- id: The unique identifier for each row
- title: The title of the Wikipedia article
- text: The text of the Wikipedia article
- url: The URL of the Wikipedia article
- json_data: The JSON data of the Wikipedia article
- embedding: The embedding vector for the Wikipedia article
The table is created with a HNSW index with m=16, ef_construction=64 and cosine similarity (these are the default parameters for the HNSW index in pgvector).
- HNSW: Hierarchical Navigable Small World graphs, a graph-based index type used for efficient similarity search
- m=16: The number of edges per node in the graph
- ef_construction=64: Short for exploration factor (construction); the number of candidates considered while building the index
- ef_search=100: Short for exploration factor (search); the number of candidates considered during an index search
- cosine similarity: The similarity metric used for the index (formula: dot product(A, B) / (|A||B|))
- cosine distance: The distance metric calculated from cosine similarity (1 - cosine similarity)
We perform a semantic search on the database using a single data point of the query templates and their corresponding embeddings.
Step 1: Install Libraries
- pgvector (0.3.6): A PostgreSQL extension for vector similarity search (https://github.com/pgvector/pgvector)
- psycopg (3.2.3): A PostgreSQL database adapter for Python (https://www.psycopg.org/)
Step 2: Create Postgres Table
- id: The unique identifier for each row
- title: The title of the Wikipedia article
- text: The text of the Wikipedia article
- url: The URL of the Wikipedia article
- json_data: The JSON data of the Wikipedia article
- embedding: The embedding vector for the Wikipedia article
Key aspect of PostgreSQL table creation:
- id: The unique identifier for each row, stored with the data type `bigserial`, a 64-bit auto-incrementing integer.
- title: The title of the Wikipedia article, stored with the data type `text`, a variable-length character string.
- text: The text of the Wikipedia article, stored with the data type `text`, a variable-length character string.
- url: The URL of the Wikipedia article, stored with the data type `text`, a variable-length character string.
- json_data: The JSON data of the Wikipedia article, stored with the data type `jsonb`, a binary-formatted JSON data type.
- embedding: The embedding vector for the Wikipedia article, stored with the data type `vector(768)`, which is provided by pgvector and holds 768 dimensions.
Optimizing the table for search:
- search_vector: The search vector for the Wikipedia article, stored with the data type `tsvector`, PostgreSQL's text search data type.
- The expression inside the `GENERATED ALWAYS AS` clause is the text (title and text) to be tokenized and indexed for full-text search.
- `coalesce` is used to handle any null values in the title or text columns.
- `STORED`: This keyword indicates that the `search_vector` column is physically stored in the table, which avoids recalculating it during queries and improves performance.
Extra:
- The `search_vector` column is computed automatically from the text in the `title` and `text` fields, making full-text search more efficient by avoiding on-the-fly computation.
- The `HNSW` index on the `embedding` column is optimized for approximate nearest neighbor (ANN) queries using cosine similarity, which is crucial for semantic search.
- The `GIN` indexes on the `json_data` and `search_vector` columns ensure fast query performance for JSONB queries and full-text search, respectively.
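The schema described above can be sketched as the following DDL. The column names and index parameters follow the text; the exact statement layout and index names are illustrative assumptions:

```python
# DDL sketch for the wikipedia_data table, generated column, and indexes.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS wikipedia_data (
    id            bigserial PRIMARY KEY,
    title         text,
    text          text,
    url           text,
    json_data     jsonb,
    embedding     vector(768),
    search_vector tsvector GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(title, '') || ' ' || coalesce(text, ''))
    ) STORED
);
"""

CREATE_INDEXES_SQL = [
    # HNSW ANN index with the pgvector defaults named in the text
    "CREATE INDEX IF NOT EXISTS idx_embedding_hnsw ON wikipedia_data "
    "USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);",
    # GIN indexes for JSONB queries and full-text search
    "CREATE INDEX IF NOT EXISTS idx_json_data ON wikipedia_data USING gin (json_data);",
    "CREATE INDEX IF NOT EXISTS idx_search_vector ON wikipedia_data USING gin (search_vector);",
]

def create_schema(conn) -> None:
    """Run the DDL on an open psycopg connection (pgvector must be installed)."""
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        cur.execute(CREATE_TABLE_SQL)
        for stmt in CREATE_INDEXES_SQL:
            cur.execute(stmt)
    conn.commit()
```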
Step 4: Define insert function
For inserting JSON data, we convert the Python dictionary in the json_data attribute to a JSON string using the json.dumps() function.
This serialization step turns the dictionary into a JSON string, which is then stored as binary data (jsonb) in the database.
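A small sketch of this serialization step; the SQL statement and helper names are illustrative, but the `json.dumps()` call is exactly the conversion described above:

```python
import json

# Parameterized insert for the columns described in Step 2.
INSERT_SQL = (
    "INSERT INTO wikipedia_data (title, text, url, json_data, embedding) "
    "VALUES (%s, %s, %s, %s, %s)"
)

def row_params(doc: dict):
    """Build the parameter tuple, serializing the dict for the jsonb column."""
    return (
        doc["title"],
        doc["text"],
        doc["url"],
        json.dumps(doc["json_data"]),  # dict -> JSON string, stored as jsonb
        doc["embedding"],
    )

def insert_batch(conn, docs):
    """Insert a batch of documents over an open psycopg connection."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, [row_params(d) for d in docs])
    conn.commit()
```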
Step 5: Insert Data into Postgres
Step 6: Define text search function with postgres
Step 7: Define vector search function with postgres
To avoid exhausting our API key usage, we fetch the query embedding from the query_embeddings_dict dictionary.
In the vector_search_with_postgres function, we set the HNSW ef_search parameter to 100 using the execute_command function.
This sets the exploration factor for the HNSW index to 100, which corresponds to the number of nodes/candidates examined during the index search phase. A node corresponds to a vector in the index.
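The query-side logic can be sketched as follows; it assumes an open psycopg connection with the pgvector adapter registered (so Python lists map to `vector` values), which is an assumption beyond the text:

```python
# <=> is pgvector's cosine-distance operator; lower distance = more similar.
VECTOR_SEARCH_SQL = (
    "SELECT title, text, url, embedding <=> %(q)s AS distance "
    "FROM wikipedia_data "
    "ORDER BY embedding <=> %(q)s "
    "LIMIT %(top_n)s"
)

def vector_search_with_postgres(conn, query_embedding, top_n=10):
    """ANN search over the HNSW index, exploring 100 candidates per query."""
    with conn.cursor() as cur:
        cur.execute("SET hnsw.ef_search = 100;")
        cur.execute(VECTOR_SEARCH_SQL, {"q": query_embedding, "top_n": top_n})
        return cur.fetchall()
```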
Step 8: Define hybrid search function with postgres
Part 3: Retrieval Mechanisms with MongoDB Atlas
Step 1: Install Libraries
pymongo(4.10.1): A Python driver for MongoDB (https://pymongo.readthedocs.io/en/stable/)
Step 2: Create MongoDB Atlas Account
TODO: Add the required information.
Step 3: Connect to MongoDB and Create Database and Collection
In the following code blocks, we do the following:
- Establish a connection to the MongoDB database
- Create a database and collection if they do not already exist
- Delete all data in the collection if it already exists
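These steps can be sketched with PyMongo as below; the connection URI comes from your Atlas cluster, and the database and collection names are illustrative assumptions:

```python
from pymongo import MongoClient

def get_wikipedia_collection(mongo_uri,
                             db_name="performance_guidance",
                             collection_name="wikipedia_data"):
    """Connect, create the database/collection lazily, and clear old data."""
    client = MongoClient(mongo_uri)
    # MongoDB creates the database and collection on first write if missing.
    collection = client[db_name][collection_name]
    collection.delete_many({})  # start each run from an empty collection
    return collection
```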
Step 4: Vector Index Creation
The setup_vector_search_index function creates a vector search index for the MongoDB collection.
The index_name parameter is the name of the index to create.
The embedding_field_name parameter is the name of the field containing the text embeddings on each document within the wikipedia_data collection.
Filtering your data is useful to narrow the scope of your semantic search and ensure that not all vectors are considered for comparison. It reduces the number of documents against which to run similarity comparisons, which can decrease query latency and increase the accuracy of search results.
You must index the fields that you want to filter by using the filter type inside the fields array.
An Atlas Search index is a data structure that categorizes data in an easily searchable format. It is a mapping between terms and the documents that contain those terms. Atlas Search indexes enable faster retrieval of documents using certain identifiers. You must configure an Atlas Search index to query data in your Atlas cluster using Atlas Search.
You can create an Atlas Search index on a single field or on multiple fields. We recommend that you index the fields that you regularly use to sort or filter your data in order to quickly retrieve the documents that contain the relevant data at query-time.
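A sketch of `setup_vector_search_index` using PyMongo's search-index helpers (available in recent PyMongo versions). The dimension count and similarity follow the text; the filter field on `title` is an illustrative assumption:

```python
from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection,
                              index_name="vector_index",
                              embedding_field_name="embedding"):
    """Create an Atlas Vector Search index on the embedding field."""
    model = SearchIndexModel(
        definition={
            "fields": [
                {
                    "type": "vector",
                    "path": embedding_field_name,
                    "numDimensions": 768,   # matches the Cohere embeddings
                    "similarity": "cosine",
                },
                {"type": "filter", "path": "title"},  # enables pre-filtering
            ]
        },
        name=index_name,
        type="vectorSearch",
    )
    collection.create_search_index(model=model)
```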
Step 5: Define Insert Data Function
Because of MongoDB's affinity for JSON data, we don't have to convert the Python dictionary in the json_data attribute to a JSON string with json.dumps(). Instead, we can insert the Python dictionary directly into the MongoDB collection.
This reduces the operational overhead of insertion processes in AI workloads.
Step 6: Insert Data into MongoDB
Step 7: Define Text Search Function
The text_search_with_mongodb function performs a text search in the MongoDB collection based on the user query.
- `query_text` parameter is the user's query string.
- `collection` parameter is the MongoDB collection to search.
- `top_n` parameter is the number of top results to return.
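A sketch of this function using Atlas Search's `$search` stage; it assumes an Atlas Search index (here named `text_index`) already exists on the `title` and `text` fields:

```python
def build_text_search_pipeline(query_text, top_n=10, search_index_name="text_index"):
    """Aggregation pipeline for a full-text search via Atlas Search."""
    return [
        {
            "$search": {
                "index": search_index_name,  # assumed Atlas Search index name
                "text": {"query": query_text, "path": ["title", "text"]},
            }
        },
        {"$limit": top_n},
        {"$project": {"_id": 0, "title": 1, "text": 1, "url": 1,
                      "score": {"$meta": "searchScore"}}},
    ]

def text_search_with_mongodb(query_text, collection, top_n=10):
    return list(collection.aggregate(build_text_search_pipeline(query_text, top_n)))
```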
Step 8: Define Vector Search Function
The semantic_search_with_mongodb function performs a vector search in the MongoDB collection based on the user query.
- `user_query` parameter is the user's query string.
- `collection` parameter is the MongoDB collection to search.
- `top_n` parameter is the number of top results to return.
- `vector_search_index_name` parameter is the name of the vector search index to use for the search.
The numCandidates parameter is the number of candidate matches to consider. This is set to 100 to match the number of candidate matches to consider in the PostgreSQL vector search.
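The search can be sketched as the pipeline below; the builder is separated out so the `$vectorSearch` stage (index name, `numCandidates=100`) is easy to inspect, and the projected fields are illustrative:

```python
def build_vector_search_pipeline(query_embedding, top_n=10,
                                 vector_search_index_name="vector_index"):
    """Aggregation pipeline for an Atlas Vector Search query."""
    return [
        {
            "$vectorSearch": {
                "index": vector_search_index_name,
                "path": "embedding",
                "queryVector": query_embedding,
                "numCandidates": 100,  # mirrors ef_search=100 on the PostgreSQL side
                "limit": top_n,
            }
        },
        {"$project": {"_id": 0, "title": 1, "text": 1, "url": 1,
                      "score": {"$meta": "vectorSearchScore"}}},
    ]

def semantic_search_with_mongodb(collection, query_embedding, top_n=10,
                                 vector_search_index_name="vector_index"):
    pipeline = build_vector_search_pipeline(query_embedding, top_n,
                                            vector_search_index_name)
    return list(collection.aggregate(pipeline))
```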
Another point to note is that queries in MongoDB are performed using the aggregate function, enabled by the MongoDB Query Language (MQL).
This allows for more flexibility in queries and the ability to perform more complex searches, with data processing operations defined as stages in a pipeline. If you are a data engineer, data scientist, or ML engineer, pipeline processing will be a familiar concept.
Step 9: Define Hybrid Search Function
The hybrid_search_with_mongodb function conducts a hybrid search on a MongoDB Atlas collection that combines a vector search and a full-text search using Atlas Search.
In the MongoDB hybrid search function, there are two weights that control the influence of each search component on the final score:
- `vector_weight` = 0.5: scales the score obtained from the vector search portion.
- `full_text_weight` = 0.5: scales the score from the full-text search portion.
Here's how they work:
Purpose: The weights allow you to adjust how much the vector (semantic) search and the full-text search contribute to the overall ranking. For example, a higher full_text_weight means that the full-text search results will have a larger impact on the final score, whereas a higher vector_weight would give more importance to the vector similarity score.
Usage in the Pipeline:
Within the aggregation pipeline, after retrieving results from each search type, the function computes a reciprocal ranking score for each result (using an expression like 1/(rank + 60)).
This score is then multiplied by the corresponding weight:
Vector Search:

    "vs_score": {
        "$multiply": [ vector_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
    }

Full-Text Search:

    "fts_score": {
        "$multiply": [ full_text_weight, { "$divide": [1.0, { "$add": ["$rank", 60] } ] } ]
    }
Finally, these weighted scores are combined (typically by adding them together) to produce a final score that determines the ranking of the documents.
Impact: By adjusting these weights, you can fine-tune the search results to better match your application's needs. For instance, if the full-text component is more reliable for your dataset, you might set full_text_weight higher than vector_weight.
The weights in the MongoDB function allow you to balance the contributions from vector-based and full-text search components, ensuring that the final ranking score reflects the desired importance of each search method.
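The weighted reciprocal-rank combination can be illustrated outside the aggregation pipeline with a small pure-Python sketch. Ranks here start at 1, and the constant 60 matches the 1/(rank + 60) expression above; the function names are illustrative:

```python
def rrf_score(rank: int, weight: float, k: int = 60) -> float:
    """Weighted reciprocal-rank score, mirroring weight * 1 / (rank + 60)."""
    return weight * (1.0 / (rank + k))

def fuse(vector_results, text_results, vector_weight=0.5, full_text_weight=0.5):
    """Combine two ranked lists of document ids into one weighted ranking."""
    scores = {}
    for rank, doc_id in enumerate(vector_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + rrf_score(rank, vector_weight)
    for rank, doc_id in enumerate(text_results, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + rrf_score(rank, full_text_weight)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both result lists ("b" below) accumulates both weighted scores and therefore ranks first:

```python
fuse(["a", "b"], ["b", "c"])  # -> ["b", "a", "c"]
```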
Part 4: Vector Database Performance Analysis
1. Insertion Performance Analysis Process
We insert data incrementally with doubling batch sizes and record performance metrics. Notably, we measure the time taken to insert the data and the number of rows inserted per second.
We are using the insert_data_incrementally function to insert data incrementally.
It starts with a batch size of 1 and doubles the batch size until it has inserted all the data, recording the time it takes to insert the data and the number of rows inserted per second.
The key metrics we are interested in are the time taken to insert the data and the number of rows inserted per second. AI workloads typically include data ingestion processes that run in batches from various data sources, so these are the metrics that matter in practice.
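A minimal sketch of the doubling-batch loop, with a database-agnostic `insert_batch_fn` callback (the real notebook's function signature may differ):

```python
import time

def insert_data_incrementally(docs, insert_batch_fn):
    """Insert docs with doubling batch sizes, recording latency and rows/sec.

    insert_batch_fn(batch) is the database-specific insert
    (PostgreSQL executemany or MongoDB insert_many).
    """
    results = []
    batch_size, offset = 1, 0
    while offset < len(docs):
        batch = docs[offset:offset + batch_size]
        start = time.perf_counter()
        insert_batch_fn(batch)
        elapsed = time.perf_counter() - start
        results.append({
            "batch_size": len(batch),
            "seconds": elapsed,
            "rows_per_second": len(batch) / elapsed if elapsed > 0 else float("inf"),
        })
        offset += len(batch)
        batch_size *= 2  # double the batch size each step
    return results
```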
1.1 PostgreSQL Insertion Performance Analysis
1.2 MongoDB Insertion Performance Analysis
1.3 Visualize Insertion Performance Analysis
2. Vector Search Performance Analysis
2.1 PostgreSQL and PgVector Semantic Search Performance Analysis
2.2 MongoDB Semantic Search Performance Analysis
2.3 Visualize Vector Search Performance Analysis
Part 5: Extra Notes
5.1 PostgreSQL JSONB vs MongoDB BSON
| Feature | PostgreSQL JSONB | MongoDB BSON |
|---|---|---|
| Integration | An extension to a relational database system. | Native to MongoDB, a document database. |
| Query Language | Uses SQL with JSONB-specific operators/functions. | Uses MongoDB Query Language (MQL), a JSON-like query syntax. |
| Storage Optimization | Optimized for relational data alongside JSONB. | Fully optimized for JSON-like document storage. |
| Data Type Support | Stores standard JSON data types (e.g., strings, numbers). | Includes additional types not in standard JSON (e.g., Date, ObjectId, Binary). |
| Use Case | Best for hybrid relational/JSON use cases. | Designed for flexible schemas, document-based databases. |
| Updates | JSONB supports in-place updates for specific keys or paths. | BSON supports in-place updates with more native support for field-level atomic operations. |
| Size Overhead | Slightly more compact than BSON in some cases. | Includes metadata like type information, leading to slightly larger size. |
5.2 Limitations of pgvector for Handling Large-Dimensional Embeddings
While pgvector is a powerful tool for storing and searching vector embeddings in PostgreSQL, it does have inherent limitations when it comes to handling very high-dimensional embeddings. Here are the key points:
- PostgreSQL Page Size Constraint:
  - Reason: PostgreSQL uses fixed 8KB pages for data storage.
  - Impact: Each 32-bit float occupies 4 bytes, so storing a vector with many dimensions quickly exhausts the available space on a page.
  - Practical Limit: This design limits indexed vectors to around 2,000 dimensions unless alternative approaches (such as quantization or splitting the vector) are used.
- Index Tuple Size Limit:
  - Reason: Even if the underlying table supports larger vectors (up to 16,000 dimensions), the index tuples themselves are constrained by the 8KB limit.
  - Impact: Attempting to build an index on vectors exceeding this limit results in errors or performance degradation.
- Trade-offs in Workarounds:
  - Quantization: Converting vectors from 32-bit floats to lower precision (e.g., half-precision) can allow for more dimensions but may reduce accuracy.
  - Splitting Vectors: Dividing a high-dimensional vector across multiple columns or rows increases complexity in reconstructing the original vector for search and may affect retrieval speed.
  - Alternative Data Types: Some projects (like pgvecto.rs) bypass these limitations by handling indexing outside PostgreSQL, but this sacrifices the ACID guarantees that pgvector provides.
- Implications for AI Workloads:
  - Model Compatibility: Many modern embedding models (e.g., OpenAI's text-embedding-3-large with 3072 dimensions) produce embeddings that exceed pgvector's optimal indexed dimension size, potentially forcing truncation or quantization.
  - Search Quality: These workarounds (truncation, quantization, or splitting) can impact the precision and recall of similarity searches, a critical factor for many AI applications.
5.3 Workaround Options for High-Dimensional Embeddings in pgvector
The pgvector project has received several suggestions and workaround proposals to mitigate the limitation of indexing high-dimensional vectors (beyond ~2000 dimensions). Two key comments from issues #326 and #395 detail some of these options.
5.3.1 Option 1: Use Lower Precision with halfvec
- Description: Convert full-precision (fp32) vectors to half-precision (fp16) for indexing purposes.
- How It Works:
  - Storage: Vectors are stored as `vector(n)` (still in fp32) in the table.
  - Indexing: When creating the index, the vector is cast to the `halfvec` type (fp16) using syntax like:
    `CREATE INDEX ON items USING hnsw ((embedding::halfvec(n)) halfvec_l2_ops);`
- Storage: Vectors are stored as
- Benefits:
  - Smaller Index Size: fp16 values require half the storage of fp32, which can allow more dimensions to fit within the 8KB index tuple limit.
  - Faster Index Build: Smaller data size can lead to quicker index creation.
- Trade-offs:
  - Loss of Precision: Quantizing from 32-bit to 16-bit floats introduces rounding errors.
  - Impact on Recall: Testing has shown that recall remains nearly identical in many cases, but the loss in precision may not be acceptable for all applications.
- Appropriateness: This workaround is appropriate if the application can tolerate a slight reduction in numerical precision without significantly affecting the quality of similarity search results.
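A hypothetical example of Option 1 for a 3072-dimensional column. Note that the query must repeat the same `halfvec` cast as the index expression for the planner to use the index; the table and column names are illustrative:

```python
# Illustrative DDL/query strings for the halfvec workaround (hypothetical schema).
CREATE_HALFVEC_INDEX = (
    "CREATE INDEX ON items USING hnsw "
    "((embedding::halfvec(3072)) halfvec_l2_ops);"
)

# The ORDER BY uses the same cast so the query matches the indexed expression.
QUERY_WITH_HALFVEC = (
    "SELECT id FROM items "
    "ORDER BY embedding::halfvec(3072) <-> %s::halfvec(3072) "
    "LIMIT 10;"
)
```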
5.3.2 Option 2: Split the Embedding Across Multiple Rows or Columns
- Description: Divide a high-dimensional vector into multiple smaller vectors that can be stored and indexed separately.
- How It Works:
  - Schema Change: Instead of storing one vector with dimensions greater than the limit, split it into two or more parts (e.g., a 3072-dimensional vector into one part of 2000 dimensions and another of 1072 dimensions).
  - Indexing: Build separate indexes for each part and then combine the results (possibly with a re-ranking step) during query time.
- Benefits:
  - Full Precision Retained: No need to quantize the data, so the original accuracy is preserved.
  - Scalability: This approach can support arbitrarily high dimensions by splitting the data.
- Trade-offs:
  - Complexity: Requires changes to the schema and additional logic in query processing to recombine or re-rank partial results.
  - Performance Overhead: Merging results from multiple indexes can add latency to the search process.
- Appropriateness: This workaround is most appropriate when high precision is critical and the application cannot afford any loss in accuracy. It adds complexity but retains full precision for each vector.