Vector Database Comparison For AI Workloads: Elasticsearch vs MongoDB Vector Search

While both MongoDB Atlas and Elasticsearch can store vector embeddings for AI applications, they serve fundamentally different purposes. This notebook explores multiple approaches to implementing vector search, comparing their benefits and challenges:

Elasticsearch for Vector Search:
Elasticsearch is primarily a search engine optimized for information retrieval and analytics. It efficiently handles vector embeddings to enable semantic search capabilities. In this notebook, we demonstrate how to implement vector search with Elasticsearch, highlighting its search optimization capabilities.
MongoDB Atlas for Unified AI Workloads:
MongoDB Atlas is a fully-featured database with built-in vector search capabilities. As a true database, it offers ACID compliance—ensuring Atomicity, Consistency, Isolation, and Durability—which is essential for production AI applications that require data reliability. We illustrate how to implement vector search using MongoDB Atlas, showcasing its ability to handle both vector search and traditional database operations within a unified system.
Split Architecture vs. Unified Architecture:
- Split Architecture: In this approach, vector embeddings are stored in Elasticsearch to leverage its search capabilities, while metadata and other critical information are managed in MongoDB Atlas. This model allows you to utilize the strengths of both systems but introduces challenges such as data synchronization and consistency between the two systems.
- Unified Architecture: Alternatively, MongoDB Atlas can be used to handle both vector search and data storage in a single system. This unified approach simplifies the architecture by eliminating cross-database synchronization issues, ensuring robust ACID compliance and streamlined operations for AI workloads.

This notebook is organized into the following parts:

Part 1. Data Setup - Installing libraries, setting up connections, and preparing the dataset
Part 2. Elasticsearch Implementation - Setting up and using Elasticsearch for vector search
Part 3. MongoDB Atlas Implementation - Setting up and using MongoDB Atlas for vector search
Part 4. ACID Capabilities - Demonstrating MongoDB transactions for AI workloads
Part 5. Split Architecture - Implementing a hybrid approach (Elasticsearch as the vector database and MongoDB as the operational database) and the data synchronization challenges it introduces
Part 6. Performance Guidance - Overview of the benefits of a unified architecture

Part 1: Data Setup

[ ]

Step 1: Install Libraries

All the libraries are installed using pip and facilitate the sourcing of data, embedding generation, and data visualization.

datasets: Hugging Face library for managing and preprocessing datasets across text, image, and audio (https://huggingface.co/datasets)
voyageai: The Voyage AI Python client, used to generate text embeddings for tasks like semantic search and clustering (https://voyageai.com/)
pandas: A library for data manipulation and analysis with DataFrames and Series (https://pandas.pydata.org/)
matplotlib: A library for creating static, interactive, and animated data visualizations (https://matplotlib.org/)

[ ]

Note: you may need to restart the kernel to use updated packages.

Step 2: Data Loading

[ ]

/Users/richmondalake/miniconda3/envs/elasticmdbenv/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

[ ]

Step 3: Embedding Generation

[ ]

Generate embeddings for the query templates used in the benchmarking process.

Note: The embeddings are generated ahead of time to avoid the overhead of embedding each query during the benchmark.

Note: Feel free to add more queries to the query_templates list to test the vector database with a larger number of queries.
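The precomputation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the example queries, the precompute_query_embeddings helper, and the toy_embed stand-in are all assumptions. In practice, embed_fn would wrap a call to the embedding provider.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical query templates; the notebook's real list is not shown here.
query_templates: List[str] = [
    "history of the telescope",
    "how do vaccines work",
    "history of the telescope",  # duplicates are embedded only once
]


def precompute_query_embeddings(
    queries: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 8,
) -> Dict[str, List[float]]:
    """Embed each distinct query once and return a query -> vector lookup."""
    unique = list(dict.fromkeys(queries))  # de-duplicate while keeping order
    cache: Dict[str, List[float]] = {}
    for start in range(0, len(unique), batch_size):
        batch = unique[start:start + batch_size]
        vectors = embed_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_fn must return one vector per input text")
        cache.update(zip(batch, vectors))
    return cache


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in embedder for illustration only; NOT a real embedding model."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


query_embeddings = precompute_query_embeddings(query_templates, toy_embed)
```

During the benchmark, each search function can then look up query_embeddings[query] instead of calling the embedding service, so the measured latency reflects only the database.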

Part 2: Vector Search with Elasticsearch

Step 1: Install Libraries

[ ]

Note: you may need to restart the kernel to use updated packages.

Step 2: Installing Elasticsearch Locally

Run start-local

To set up Elasticsearch and Kibana locally, run the start-local script:

curl -fsSL https://elastic.co/start-local | sh

This script creates an elastic-start-local folder containing configuration files and starts both Elasticsearch and Kibana using Docker.

After running the script, you can access Elastic services at the following endpoints:

Elasticsearch: http://localhost:9200
Kibana: http://localhost:5601

Elasticsearch is installed using Docker, and the Elasticsearch Docker image is pulled from the elasticsearch repository.

More instructions on installing Elasticsearch with Docker are available in the Elasticsearch documentation.

NOTE: To uninstall Elasticsearch, run the following command from the elastic-start-local folder:

./uninstall.sh

The API key for your local Elasticsearch instance is shown in the terminal after running the start-local script.

[ ]

{'name': '7b112493bc71', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'RdOsdCvCQf6567TohRbagg', 'version': {'number': '8.17.4', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': 'c63c7f5f8ce7d2e4805b7b3d842e7e792d84dda1', 'build_date': '2025-03-20T15:39:59.811110136Z', 'build_snapshot': False, 'lucene_version': '9.12.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}

Step 3: Create Elasticsearch Index

Create an index with the name wikipedia_data and the following mapping:

title: The title of the Wikipedia article
text: The text of the Wikipedia article
url: The URL of the Wikipedia article
embedding: The embedding vector for the Wikipedia article

Create an index in Elasticsearch with the right index mappings to handle vector searches.

Note that, by default, Elasticsearch quantizes the embeddings to 8 bits. This means the precision of your vector embeddings is reduced unless you explicitly set the index_options type to hnsw. More information can be found in the Elasticsearch documentation.
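A mapping along these lines would cover the four fields listed above and keep full-precision vectors. It is a sketch under assumptions: the build_wikipedia_mapping helper is hypothetical, and dims=1024 and cosine similarity are placeholder choices that must match the embedding model actually used.

```python
from typing import Any, Dict


def build_wikipedia_mapping(dims: int, similarity: str = "cosine") -> Dict[str, Any]:
    """Return index mappings for the wikipedia_data index (illustrative helper)."""
    if dims <= 0:
        raise ValueError("dims must be a positive integer")
    return {
        "properties": {
            "title": {"type": "text"},
            "text": {"type": "text"},
            "url": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": dims,
                "index": True,
                "similarity": similarity,
                # Explicit "hnsw" keeps float vectors unquantized instead of
                # falling back to the 8-bit quantized default.
                "index_options": {"type": "hnsw"},
            },
        }
    }


# dims=1024 is an assumption; use the output size of your embedding model.
mappings = build_wikipedia_mapping(dims=1024)

# With the official Python client, the index could then be created with:
#   es_client.options(ignore_status=[400, 404]).indices.delete(index="wikipedia_data")
#   es_client.indices.create(index="wikipedia_data", mappings=mappings)
```

Calling options(ignore_status=...) before the delete is the replacement the client suggests for the deprecated ignore argument.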

[ ]

Deleting existing wikipedia_data
Creating index wikipedia_data

/var/folders/tw/h5zv0cns7yg3z_ytt6y14d3m0000gn/T/ipykernel_5541/4269052426.py:28: DeprecationWarning: Passing transport options in the API method is deprecated. Use 'Elasticsearch.options()' instead.
  es_client.indices.delete(index=es_index_name, ignore=[400, 404])

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'wikipedia_data'})

Step 4: Define insert function

[ ]

Step 5: Insert Data into Elasticsearch

[ ]

Inserting data into Elasticsearch
Performing bulk insert
Bulk insert completed: 1000 documents inserted, 0 failed
Data ingestion into Elasticsearch complete!
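A bulk insert like the one logged above is typically driven by a generator of actions. The sketch below is an assumption about how such a generator could look, not the notebook's own code; generate_bulk_actions and the sample rows are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def generate_bulk_actions(
    records: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per record, in the shape the bulk helper expects."""
    for position, record in enumerate(records):
        yield {
            "_index": index_name,
            # Fall back to the row position when the record has no id of its own.
            "_id": str(record.get("id", position)),
            "_source": {
                "title": record["title"],
                "text": record["text"],
                "url": record["url"],
                "embedding": list(record["embedding"]),
            },
        }


# Hypothetical sample rows standing in for the dataset.
rows = [
    {"id": 7, "title": "A", "text": "alpha", "url": "https://example.org/a", "embedding": (0.1, 0.2)},
    {"title": "B", "text": "beta", "url": "https://example.org/b", "embedding": (0.3, 0.4)},
]
actions = list(generate_bulk_actions(rows, "wikipedia_data"))

# With the official client, the actions could be sent with:
#   from elasticsearch import helpers
#   inserted, errors = helpers.bulk(es_client, actions, raise_on_error=False)
```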

Step 6: Define Full Text Search function

[ ]

/var/folders/tw/h5zv0cns7yg3z_ytt6y14d3m0000gn/T/ipykernel_5541/3667714330.py:10: DeprecationWarning: Received 'size' via a specific parameter in the presence of a 'body' parameter, which is deprecated and will be removed in a future version. Instead, use only 'body' or only specific parameters.
  response = client.search(

[ ]

Step 7: Define semantic search function

[ ]
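One way to express the semantic search request is sketched below. The build_knn_search_kwargs helper is hypothetical; it assumes the wikipedia_data index and embedding field defined earlier, and it passes only keyword parameters so that size is not mixed with a body argument.

```python
from typing import Any, Dict, Sequence


def build_knn_search_kwargs(
    index_name: str,
    query_vector: Sequence[float],
    top_n: int = 5,
    num_candidates: int = 150,
) -> Dict[str, Any]:
    """Build keyword arguments for an approximate kNN search (illustrative)."""
    if top_n > num_candidates:
        raise ValueError("top_n cannot exceed num_candidates")
    return {
        "index": index_name,
        "knn": {
            "field": "embedding",
            "query_vector": list(query_vector),
            "k": top_n,
            "num_candidates": num_candidates,
        },
        # Leave the large vector out of the returned documents.
        "source": {"excludes": ["embedding"]},
        "size": top_n,
    }


search_kwargs = build_knn_search_kwargs("wikipedia_data", [0.1, 0.2, 0.3], top_n=3)

# With the official client: response = es_client.search(**search_kwargs)
```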

Part 3: Search with MongoDB Vector Search

Step 1: Install Libraries

pymongo (4.10.1): A Python driver for MongoDB (https://pymongo.readthedocs.io/en/stable/)

[ ]

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
eland 8.17.0 requires pandas<2,>=1.5, but you have pandas 2.2.3 which is incompatible.
Note: you may need to restart the kernel to use updated packages.

Step 2: Installing MongoDB via Atlas CLI

The Atlas CLI is a command line interface built specifically for MongoDB Atlas. Interact with your Atlas database deployments and MongoDB Search from the terminal with short, intuitive commands, so you can accomplish complex database management tasks in seconds.

You can follow the instructions in the MongoDB documentation to install the Atlas CLI using Docker (other options are available) and get a local MongoDB database instance running.

Follow the steps in the MongoDB documentation to run Atlas CLI commands with Docker.

More information on the Atlas CLI is available in the MongoDB documentation.

Step 3: Connect to MongoDB and Create Database and Collection

After installing the Atlas CLI, you can run the following command to connect to your MongoDB database:

atlas deployments connect
You will be prompted to specify "How would you like to connect to local9410"
Select connectionString
Copy the connection string and paste it into the MONGO_URI environment variable

More information here.

[ ]

In the following code blocks, we:

Establish a connection to the MongoDB database
Create a database and collection if they do not already exist
Delete all data in the collection if it already exists
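The three steps above could look roughly like this. The get_clean_collection helper and the database name used in the comment are assumptions; the client argument is expected to behave like a pymongo MongoClient.

```python
from typing import Any, Tuple


def get_clean_collection(client: Any, db_name: str, collection_name: str) -> Tuple[Any, int]:
    """Ensure the collection exists, then empty it. Returns (collection, deleted_count)."""
    db = client[db_name]
    if collection_name in db.list_collection_names():
        print(f"Collection '{collection_name}' already exists.")
    else:
        db.create_collection(collection_name)
        print(f"Collection '{collection_name}' created.")
    collection = db[collection_name]
    # An empty filter matches every document, so this clears earlier runs.
    result = collection.delete_many({})
    return collection, result.deleted_count


# Usage with pymongo (the database name here is hypothetical):
#   import os
#   from pymongo import MongoClient
#   client = MongoClient(os.environ["MONGO_URI"])
#   collection, deleted = get_clean_collection(client, "vector_db", "wikipedia_data_test")
```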

[ ]

Connection to MongoDB successful
Collection 'wikipedia_data_test' already exists.

[ ]

DeleteResult({'n': 1000, 'electionId': ObjectId('7fffffff000000000000000c'), 'opTime': {'ts': Timestamp(1742987842, 100), 't': 12}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1742987842, 100), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1742987842, 100)}, acknowledged=True)

Step 4: Vector Index Creation

The setup_vector_search_index function creates a vector search index for the MongoDB collection.

The index_name parameter is the name of the index to create.

The embedding_field_name parameter is the name of the field containing the text embeddings on each document within the wikipedia_data collection.
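The index definition that such a function submits might be built as follows. This is a sketch: build_vector_index_definition is a hypothetical helper, and num_dimensions=1024 and cosine similarity are assumptions that must match the embedding model.

```python
from typing import Any, Dict

ALLOWED_SIMILARITIES = {"cosine", "euclidean", "dotProduct"}


def build_vector_index_definition(
    embedding_field_name: str = "embedding",
    num_dimensions: int = 1024,
    similarity: str = "cosine",
) -> Dict[str, Any]:
    """Return a vector search index definition (illustrative helper)."""
    if similarity not in ALLOWED_SIMILARITIES:
        raise ValueError(f"similarity must be one of {sorted(ALLOWED_SIMILARITIES)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": embedding_field_name,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


definition = build_vector_index_definition()

# With pymongo, the definition could be submitted like this:
#   from pymongo.operations import SearchIndexModel
#   model = SearchIndexModel(definition=definition, name="vector_index", type="vectorSearch")
#   collection.create_search_index(model=model)
```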

[ ]

Creating index 'vector_index'...

'vector_index'

Step 5: Create Search Index

[ ]

Creating index 'text_search_index'...

'text_search_index'

Step 5: Define Insert Data Function

Because MongoDB stores JSON-like documents natively, we don't have to convert the Python dictionary in the json_data attribute to a JSON string using the json.dumps() function. Instead, we can insert the Python dictionary directly into the MongoDB collection.

This reduces the operational overhead of the insertion process in AI workloads.
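As a sketch of that idea, the helper below inserts plain Python dictionaries in batches with no serialization step. The insert_in_batches name and the batch size are assumptions; the collection argument is expected to behave like a pymongo collection.

```python
from typing import Any, Dict, List


def insert_in_batches(collection: Any, documents: List[Dict[str, Any]], batch_size: int = 500) -> int:
    """Insert dictionaries as-is, batch by batch, and return how many were inserted."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    inserted = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        # No json.dumps() needed: the driver encodes dictionaries to BSON itself.
        result = collection.insert_many(batch, ordered=False)
        inserted += len(result.inserted_ids)
    return inserted


# Usage with a pandas DataFrame (names are hypothetical):
#   documents = df.to_dict("records")
#   total = insert_in_batches(collection, documents)
```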

[ ]

Step 6: Insert Data into MongoDB

Since we are creating scenarios for the optimization of vector search, which includes the optimization of vector data storage and retrieval, we need to convert the embedding field in the dataset to BSON binary objects.

BSON is the binary representation that MongoDB uses to store data in the database.

More specifically, we recommend the BSON binData vector subtype for the following use cases:

You need to index quantized vector output from embedding models.
You have a large number of float vectors but want to reduce the storage footprint (such as disk and memory usage) of the database.

Benefits:
The binData vector format requires about three times less disk space in your cluster compared to arrays of elements. It allows you to index your vectors with alternate types such as int1 or int8 vectors, reducing the memory needed to build the MongoDB Vector Search index for your collection. It reduces the RAM for mongot by 3.75x for scalar quantization and by 24x for binary quantization; the vector values shrink by 4x and 32x respectively, but the Hierarchical Navigable Small Worlds (HNSW) graph itself doesn't shrink.
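The 4x and 32x figures for the vector values follow from the bytes needed per dimension, which the standard-library sketch below reproduces. The quantization functions are deliberately simplified assumptions used only to count bytes; they are not MongoDB's quantizer, and the 1024-dimension vector is a made-up example.

```python
import struct
from typing import Sequence


def float32_bytes(vector: Sequence[float]) -> bytes:
    """Pack each value as a 4-byte float32."""
    return struct.pack(f"<{len(vector)}f", *vector)


def int8_bytes(vector: Sequence[float]) -> bytes:
    """Simplified scalar quantization: map [-1, 1] onto one signed byte per value."""
    quantized = (max(-127, min(127, round(x * 127))) for x in vector)
    return struct.pack(f"<{len(vector)}b", *quantized)


def packed_bit_bytes(vector: Sequence[float]) -> bytes:
    """Simplified binary quantization: one bit per value (1 if positive)."""
    out = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


# Hypothetical 1024-dimension vector with values in [-1, 1].
vector = [((i % 7) - 3) / 3 for i in range(1024)]

sizes = {
    "float32": len(float32_bytes(vector)),
    "int8": len(int8_bytes(vector)),
    "packed_bit": len(packed_bit_bytes(vector)),
}
print(sizes)  # {'float32': 4096, 'int8': 1024, 'packed_bit': 128}
print(sizes["float32"] // sizes["int8"], sizes["float32"] // sizes["packed_bit"])  # 4 32
```

The RAM savings quoted for mongot are smaller than these ratios because the HNSW graph is stored alongside the vectors and does not shrink.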

In this notebook, we will convert the embeddings to the BSON binData vector format by using the bson.binary module.

[ ]

In the following view of the first 5 rows of the dataset, we can see that the embedding field now holds BSON binary objects, which are more efficient for storage and retrieval in MongoDB.

[ ]

Step 7: Define Full Text Search Function

[ ]

Step 8: Define Semantic Search Function

The semantic_search_with_mongodb function performs a vector search in the MongoDB collection based on the user query.

user_query parameter is the user's query string.
collection parameter is the MongoDB collection to search.
top_n parameter is the number of top results to return.
vector_search_index_name parameter is the name of the vector search index to use for the search.

The numCandidates parameter is the number of candidate matches to consider. It is set to 150 to match the number of candidates considered in the Elasticsearch vector search.

Another point to note is that queries in MongoDB are performed using the aggregate function, enabled by the MongoDB Query Language (MQL).

This allows for more flexible queries and more complex searches, because data processing operations can be defined as stages in the pipeline. If you are a data engineer, data scientist, or ML engineer, pipeline processing will be a familiar concept.
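To make the pipeline idea concrete, the sketch below builds a two-stage aggregation: a $vectorSearch stage followed by a $project stage. The build_vector_search_pipeline helper and the projected field names are assumptions based on the dataset fields described earlier.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_embedding: Sequence[float],
    vector_search_index_name: str = "vector_index",
    top_n: int = 5,
    num_candidates: int = 150,
    embedding_field_name: str = "embedding",
) -> List[Dict[str, Any]]:
    """Return an aggregation pipeline for a vector search (illustrative helper)."""
    if top_n > num_candidates:
        raise ValueError("top_n cannot exceed num_candidates")
    return [
        {
            "$vectorSearch": {
                "index": vector_search_index_name,
                "path": embedding_field_name,
                "queryVector": list(query_embedding),
                "numCandidates": num_candidates,
                "limit": top_n,
            }
        },
        {
            # Second stage: keep readable fields and expose the similarity score.
            "$project": {
                "_id": 0,
                "title": 1,
                "text": 1,
                "url": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], top_n=3)

# With pymongo: results = list(collection.aggregate(pipeline))
```

Further stages, such as a $match on metadata, can be appended to the same list without changing how the search itself is issued.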

[ ]

Part 4: Demonstrating MongoDB's ACID Capabilities for AI Workloads

[ ]

INFO:mongodb_transaction:Starting transaction for document 67e3e2581c853f84b3767b4f
INFO:mongodb_transaction:Transaction committed successfully for document 67e3e2581c853f84b3767b4f
INFO:mongodb_transaction:Starting transaction for document 67e3e2581c853f84b3767b4f
WARNING:mongodb_transaction:Fail mode activated - simulating transaction failure
ERROR:mongodb_transaction:Transaction aborted: Simulated failure in transaction


===== DEMONSTRATING SUCCESSFUL TRANSACTION =====
Success Transaction Result: True
Document updated successfully: True

===== DEMONSTRATING FAILED TRANSACTION =====
Failed Transaction Result: False
Error: Simulated failure in transaction
Document unchanged after failed transaction: True
Operations attempted before failure: 1
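The behaviour in the output above, where a failed transaction leaves the document unchanged, can be sketched with a session and a transaction block. This is an assumption about the shape of the demo, not its actual code; update_in_transaction is a hypothetical helper, and client and collection are expected to behave like their pymongo counterparts on a deployment that supports transactions.

```python
from typing import Any, Dict


def update_in_transaction(
    client: Any,
    collection: Any,
    doc_id: Any,
    new_fields: Dict[str, Any],
    fail: bool = False,
) -> Dict[str, Any]:
    """Apply an update inside a transaction; any exception aborts it."""
    operations = 0
    try:
        with client.start_session() as session:
            # Leaving this block normally commits; leaving it via an
            # exception aborts, so the update below is discarded.
            with session.start_transaction():
                collection.update_one(
                    {"_id": doc_id}, {"$set": new_fields}, session=session
                )
                operations += 1
                if fail:
                    raise RuntimeError("Simulated failure in transaction")
        return {"success": True, "operations": operations}
    except Exception as exc:
        return {"success": False, "error": str(exc), "operations": operations}
```

With fail=True the function reports one attempted operation and a failure, while the stored document keeps its previous values.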

Part 5: The Data Synchronization Challenge: Split Architecture with MongoDB and Elasticsearch

Key Steps

Logging and Setup:
- Configure logging to capture and format logs for the demonstration.
Initialization and Index Creation:
- Initialize MongoDB and Elasticsearch clients, and set up the MongoDB database/collection along with the Elasticsearch index name.
- Check if the Elasticsearch index exists; if not, create it with a mapping that includes fields like document_id, title, content_preview, embedding (with vector search capabilities), and last_updated.
Document Ingestion:
- Insert a new document into MongoDB with its title, content, metadata, status, and timestamps.
- Generate an embedding for the document’s content using a pre-defined get_embedding function.
- Create an Elasticsearch document using the same MongoDB ID, including the generated embedding and a preview of the content.
- Insert the document into Elasticsearch with a refresh to make it immediately searchable.
- Roll back the MongoDB insertion if the Elasticsearch operation fails to maintain consistency.
Document Update:
- Update the document in MongoDB with new title, content, or metadata, while refreshing the updated_at timestamp.
- If the content (or title) is updated, retrieve the updated document, regenerate the embedding, and update the corresponding Elasticsearch document.
- Log any synchronization issues if the Elasticsearch update fails after the MongoDB update.
Document Deletion:
- Delete the document from MongoDB first, then attempt to delete it from Elasticsearch.
- Include an option to simulate an Elasticsearch deletion failure to illustrate the risk of data inconsistency.
- Handle cases where the document is missing in Elasticsearch and log the resulting inconsistency.
Vector Search and Data Consistency Check:
- Generate an embedding for a given search query using the get_embedding function.
- Perform a vector search in Elasticsearch to retrieve documents based on similarity.
- For each result, fetch the full document from MongoDB to verify consistency, highlighting any discrepancies (e.g., ghost documents).
Running the Complete Demo:
- Execute a demonstration sequence that includes:
  a. Document ingestion.
  b. Document update.
  c. An initial vector search.
  d. Document deletion (with an optional simulated failure).
  e. A subsequent vector search to check for synchronization issues.
- Record each step’s status and log observations regarding potential data integrity issues due to synchronization failures.
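The ingestion step with its rollback can be sketched as a compensating delete. The code below is an illustration under assumptions: ingest_document here is a simplified stand-in for the notebook's method, the status value and the 200-character preview length are hypothetical, and the two client arguments are expected to behave like a pymongo collection and an Elasticsearch client.

```python
from datetime import datetime, timezone
from typing import Any, Callable, List


def ingest_document(
    mongo_collection: Any,
    es_client: Any,
    es_index: str,
    title: str,
    content: str,
    embed_fn: Callable[[str], List[float]],
) -> Any:
    """Write to MongoDB first, then Elasticsearch; undo the first write on failure."""
    now = datetime.now(timezone.utc)
    doc = {
        "title": title,
        "content": content,
        "status": "active",  # hypothetical status value
        "created_at": now,
        "updated_at": now,
    }
    inserted_id = mongo_collection.insert_one(doc).inserted_id
    try:
        es_client.index(
            index=es_index,
            id=str(inserted_id),  # reuse the MongoDB ID in Elasticsearch
            document={
                "document_id": str(inserted_id),
                "title": title,
                "content_preview": content[:200],
                "embedding": embed_fn(content),
                "last_updated": now.isoformat(),
            },
            refresh=True,  # make the document searchable immediately
        )
    except Exception:
        # Compensating action: there is no shared transaction across the two
        # systems, so the MongoDB insert has to be removed by hand.
        mongo_collection.delete_one({"_id": inserted_id})
        raise
    return inserted_id
```

The rollback is itself a network call that can fail, which is one reason the split design needs extra recovery logic.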

Implementation of a simple split architecture

Key Functions

__init__
- Role: Initializes the class instance by setting up MongoDB and Elasticsearch clients, defining the MongoDB database/collection and Elasticsearch index name, and ensuring the Elasticsearch index exists (creating it if necessary).
_create_elastic_index
- Role: Creates the Elasticsearch index with the required mapping and settings, including vector search capabilities (e.g., defining a dense vector field with cosine similarity).
ingest_document
- Role: Ingests a new document into the split architecture by first inserting it into MongoDB, generating an embedding for the content via an external get_embedding function, and then indexing the document (with its vector embedding) into Elasticsearch. It includes error handling to roll back the MongoDB insertion if the Elasticsearch indexing fails.
update_document
- Role: Updates an existing document's details in MongoDB (such as title, content, or metadata) and, if necessary, updates the corresponding document in Elasticsearch. When the content is updated, it regenerates the embedding and refreshes the Elasticsearch document accordingly, while logging any synchronization issues if the update process fails.
delete_document
- Role: Deletes a document from both MongoDB and Elasticsearch. It first removes the document from MongoDB and then attempts to delete it from Elasticsearch. The function can simulate an Elasticsearch deletion failure to demonstrate potential synchronization problems, such as data remaining in Elasticsearch after being deleted from MongoDB.
vector_search
- Role: Performs a vector search in Elasticsearch by generating an embedding for the query text (using get_embedding), retrieving matching documents based on vector similarity, and then fetching the full documents from MongoDB to verify consistency across both systems.
run_split_architecture_demo
- Role: Orchestrates a full demonstration of the split architecture challenges. It sequentially runs document ingestion, update, initial vector search, deletion (with an optional simulated failure), and a subsequent vector search to showcase and log synchronization issues (like ghost documents) and overall data consistency challenges.
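The consistency check performed by vector_search can be reduced to a small, self-contained sketch. The check_search_consistency helper and the sample data are hypothetical; plain dictionaries stand in for the two stores.

```python
from typing import Any, Callable, Dict, List, Optional


def check_search_consistency(
    es_hits: List[Dict[str, Any]],
    mongo_lookup: Callable[[str], Optional[Dict[str, Any]]],
) -> Dict[str, List[str]]:
    """Split search hits into those backed by MongoDB and ghost documents."""
    report: Dict[str, List[str]] = {"consistent": [], "ghosts": []}
    for hit in es_hits:
        doc_id = hit["_id"]
        if mongo_lookup(doc_id) is None:
            # Indexed in Elasticsearch but no longer present in MongoDB.
            report["ghosts"].append(doc_id)
        else:
            report["consistent"].append(doc_id)
    return report


# Stand-in data: "b2" was deleted from MongoDB but not from Elasticsearch.
mongo_docs = {"a1": {"title": "Kept document"}}
es_hits = [{"_id": "a1", "_score": 0.92}, {"_id": "b2", "_score": 0.87}]

report = check_search_consistency(es_hits, mongo_docs.get)
print(report)  # {'consistent': ['a1'], 'ghosts': ['b2']}
```

With real clients, es_hits would come from response["hits"]["hits"] and mongo_lookup would wrap a find_one call on the document ID.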

[ ]

INFO:elastic_transport.transport:HEAD http://localhost:9200/document_vectors [status:404 duration:0.007s]
INFO:elastic_transport.transport:PUT http://localhost:9200/document_vectors [status:200 duration:0.115s]
INFO:ai_architecture_demo:Created Elasticsearch index: document_vectors
INFO:ai_architecture_demo:Document inserted in MongoDB with ID: 67e3e2ae1c853f84b3767f39
INFO:elastic_transport.transport:PUT http://localhost:9200/document_vectors/_doc/67e3e2ae1c853f84b3767f39?refresh=true [status:201 duration:0.022s]
INFO:ai_architecture_demo:Document vector indexed in Elasticsearch with ID: 67e3e2ae1c853f84b3767f39
INFO:ai_architecture_demo:Document updated in MongoDB: 67e3e2ae1c853f84b3767f39
INFO:elastic_transport.transport:POST http://localhost:9200/document_vectors/_update/67e3e2ae1c853f84b3767f39?refresh=true [status:200 duration:0.034s]
INFO:ai_architecture_demo:Document vector updated in Elasticsearch: 67e3e2ae1c853f84b3767f39
/var/folders/tw/h5zv0cns7yg3z_ytt6y14d3m0000gn/T/ipykernel_5541/2559789100.py:308: DeprecationWarning: Received 'size' via a specific parameter in the presence of a 'body' parameter, which is deprecated and will be removed in a future version. Instead, use only 'body' or only specific parameters.
  es_response = self.elastic_client.search(
INFO:elastic_transport.transport:POST http://localhost:9200/document_vectors/_search [status:200 duration:0.013s]
INFO:ai_architecture_demo:Document deleted from MongoDB: 67e3e2ae1c853f84b3767f39
WARNING:ai_architecture_demo:Simulating Elasticsearch deletion failure
ERROR:ai_architecture_demo:CRITICAL SYNC ERROR: Document deleted from MongoDB but not from Elasticsearch: Simulated Elasticsearch failure
INFO:elastic_transport.transport:POST http://localhost:9200/document_vectors/_search [status:200 duration:0.011s]

Demo completed with 5 observations
- CRITICAL DATA INTEGRITY ISSUE: Document deleted from MongoDB still appears in Elasticsearch search results
- Data Synchronization Failure: MongoDB and Elasticsearch are out of sync due to partial deletion
- Complex Error Handling Required: Split architecture requires extensive error handling and recovery processes
- Eventual Consistency Challenges: Split architecture can only achieve eventual consistency, not immediate consistency
- Ghost Documents: Documents deleted from MongoDB still exist in Elasticsearch


Part 6: Performance Guidance

MongoDB is a modern database designed from the ground up to handle the challenges of today's AI applications, offering a unified architecture where both operational data and vector embeddings coexist within a single transaction-safe system.

MongoDB transforms the typical AI architecture pattern from a complex, synchronized multi-database solution into a single, cohesive platform that supports the entire AI application lifecycle while maintaining the transactional integrity and operational reliability that production systems demand.