
Building Multimodal RAG Applications with MongoDB and Voyage AI

In this notebook, you will learn how to build multimodal RAG applications using Voyage AI's multimodal embedding models and Google's multimodal LLMs.

Additionally, you will evaluate Voyage AI's VLM-based embedding model against a CLIP-based embedding model on the dataset used in this notebook.

Step 1: Install required libraries

  • pymongo: Python driver for MongoDB
  • voyageai: Python client for Voyage AI
  • google-genai: Python library to access Google's embedding models and LLMs via Google AI Studio
  • google-cloud-storage: Python client for Google Cloud Storage
  • sentence-transformers: Python library to use open-source ML models from Hugging Face
  • PyMuPDF: Python library for analyzing and manipulating PDFs
  • Pillow: A Python imaging library
  • tqdm: Show progress bars for loops in Python
  • tenacity: Python library for easily adding retries to functions
[1]
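The install cell is a single `pip` command; a one-line equivalent of the list above would be:

```shell
pip install --quiet pymongo voyageai google-genai google-cloud-storage \
    sentence-transformers PyMuPDF Pillow tqdm tenacity
```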

Step 2: Set up prerequisites

  • Set the MongoDB connection string: Follow the steps here to get the connection string from the Atlas UI.
  • Set the Voyage AI API key: Follow the steps here to get a Voyage AI API key.
  • Set a Gemini API key: Follow the steps here to get a Gemini API key via Google AI Studio.
  • [In a separate terminal] Set up Application Default Credentials (ADC): Follow the steps here to configure ADC via the Google Cloud CLI.
[2]

MongoDB

[3]
Enter your MongoDB connection string:  ········

Voyage AI

[4]
Enter your Voyage AI API key:  ········

Google

[5]
Enter your Gemini API key:  ········
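The three prompts above can be wired up with `getpass`; a minimal sketch (the env-var fallback and the helper name are assumptions, not from the notebook):

```python
import getpass
import os


def read_secret(env_var: str, prompt: str) -> str:
    """Return a credential from the environment if set, otherwise prompt without echoing."""
    return os.environ.get(env_var) or getpass.getpass(prompt)


# e.g. MONGODB_URI = read_secret("MONGODB_URI", "Enter your MongoDB connection string: ")
```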

Step 3: Read PDF from URL

[6]
[7]

Step 4: Store PDF images in GCS and extract metadata for MongoDB

[8]
[9]
[10]
[11]
[12]
[13]
100%|██████████| 22/22 [00:10<00:00,  2.18it/s]

Step 5: Add embeddings to the MongoDB documents

[14]
[15]
[16]
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
[17]
[18]
[19]
[20]
100%|██████████| 22/22 [00:29<00:00,  1.33s/it]
[21]
dict_keys(['gcs_key', 'width', 'height', 'voyage_embedding', 'clip_embedding'])
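Each document ends up with both embedding fields, matching the `dict_keys` output above. The sketch below injects the two embedders as callables so the shape of the logic is clear; the real calls would be along the lines of `voyageai.Client().multimodal_embed(inputs=[[image]], model="voyage-multimodal-3", input_type="document").embeddings[0]` and `SentenceTransformer("clip-ViT-B-32").encode(image)` (that wiring is an assumption, not shown in this export):

```python
from typing import Callable, Sequence


def attach_embeddings(
    doc: dict,
    image,
    voyage_embed: Callable[[object], Sequence[float]],
    clip_embed: Callable[[object], Sequence[float]],
) -> dict:
    """Add `voyage_embedding` and `clip_embedding` fields to one page document."""
    doc["voyage_embedding"] = list(voyage_embed(image))
    doc["clip_embedding"] = list(clip_embed(image))
    return doc
```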

Step 6: Ingest documents into MongoDB

[22]
[23]
{'ok': 1.0,
 '$clusterTime': {'clusterTime': Timestamp(1743655584, 1),
  'signature': {'hash': b'\xcf1\xccO*\\\xd2\x08\xbf\x147\xe0h\x8b{\xfb \xf5$?',
   'keyId': 7456513059255746561}},
 'operationTime': Timestamp(1743655584, 1)}
[24]
[25]
[26]
DeleteResult({'n': 22, 'electionId': ObjectId('7fffffff0000000000000027'), 'opTime': {'ts': Timestamp(1743655585, 21), 't': 39}, 'ok': 1.0, '$clusterTime': {'clusterTime': Timestamp(1743655585, 22), 'signature': {'hash': b'\x7f\x9f\x93\xc9zJ\x8f\xf9\xafO\xeb.\x04\xf3{t"}\xf5\xe4', 'keyId': 7456513059255746561}}, 'operationTime': Timestamp(1743655585, 21)}, acknowledged=True)
[27]
InsertManyResult([ObjectId('67ee12a1a224539fbe2019e5'), ObjectId('67ee12a1a224539fbe2019e6'), ObjectId('67ee12a1a224539fbe2019e7'), ObjectId('67ee12a1a224539fbe2019e8'), ObjectId('67ee12a1a224539fbe2019e9'), ObjectId('67ee12a1a224539fbe2019ea'), ObjectId('67ee12a1a224539fbe2019eb'), ObjectId('67ee12a1a224539fbe2019ec'), ObjectId('67ee12a1a224539fbe2019ed'), ObjectId('67ee12a1a224539fbe2019ee'), ObjectId('67ee12a1a224539fbe2019ef'), ObjectId('67ee12a1a224539fbe2019f0'), ObjectId('67ee12a1a224539fbe2019f1'), ObjectId('67ee12a1a224539fbe2019f2'), ObjectId('67ee12a1a224539fbe2019f3'), ObjectId('67ee12a1a224539fbe2019f4'), ObjectId('67ee12a1a224539fbe2019f5'), ObjectId('67ee12a1a224539fbe2019f6'), ObjectId('67ee12a1a224539fbe2019f7'), ObjectId('67ee12a1a224539fbe2019f8'), ObjectId('67ee12a1a224539fbe2019f9'), ObjectId('67ee12a1a224539fbe2019fa')], acknowledged=True)
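The outputs above correspond to a collection reset (the `DeleteResult` with `n: 22`) followed by a bulk insert of the 22 page documents. A minimal sketch, assuming the collection handle comes from something like `MongoClient(MONGODB_URI)["db"]["collection"]` (names hypothetical):

```python
def refresh_and_ingest(collection, docs: list):
    """Empty the target collection, then bulk-insert the embedded page documents."""
    collection.delete_many({})          # yields a DeleteResult
    return collection.insert_many(docs)  # yields an InsertManyResult
```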

Step 7: Create a vector search index

[28]
[29]
'vector_index'
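The `'vector_index'` output above comes from creating an Atlas Vector Search index over both embedding fields. A sketch of the index definition; the dimensions (1024 for voyage-multimodal-3, 512 for clip-ViT-B-32) and `cosine` similarity are assumptions based on the models' published sizes, and the actual call would be along the lines of `collection.create_search_index(SearchIndexModel(definition=..., name="vector_index", type="vectorSearch"))` from `pymongo.operations`:

```python
def vector_index_definition(field_dims: dict) -> dict:
    """Build a vectorSearch index definition covering each embedding field."""
    return {
        "fields": [
            {"type": "vector", "path": path, "numDimensions": dims, "similarity": "cosine"}
            for path, dims in field_dims.items()
        ]
    }


definition = vector_index_definition({"voyage_embedding": 1024, "clip_embedding": 512})
```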

Step 8: Retrieve documents using vector search

[30]
[31]
[32]
0.7585833072662354
[image output]

0.7482262253761292
[image output]

0.7399106025695801
[image output]

0.7107774019241333
[image output]

0.69964599609375
[image output]

['multimodal-rag/1.png',
 'multimodal-rag/13.png',
 'multimodal-rag/14.png',
 'multimodal-rag/7.png',
 'multimodal-rag/4.png']
[33]
0.6344423294067383
[image output]

0.6320553421974182
[image output]

0.6312342882156372
[image output]

0.6270501017570496
[image output]

0.6267095804214478
[image output]

['multimodal-rag/1.png',
 'multimodal-rag/7.png',
 'multimodal-rag/14.png',
 'multimodal-rag/8.png',
 'multimodal-rag/5.png']
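The scores and `gcs_key` lists above come from running a `$vectorSearch` aggregation once per embedding field. A sketch of the pipeline builder; `numCandidates` and the exact projection are assumptions consistent with the fields shown:

```python
def vector_search_pipeline(query_vector: list, path: str,
                           index: str = "vector_index", limit: int = 5) -> list:
    """Build a $vectorSearch pipeline returning gcs_key plus the similarity score."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,  # "voyage_embedding" or "clip_embedding"
                "queryVector": query_vector,
                "numCandidates": 150,
                "limit": limit,
            }
        },
        {"$project": {"_id": 0, "gcs_key": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


# results = list(collection.aggregate(vector_search_pipeline(qvec, "voyage_embedding")))
```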

Step 9: Create a multimodal RAG app

[34]
[35]
[36]
[37]
[38]
'DeepSeek-R1 achieves a Pass@1 accuracy of 79.8% on AIME 2024, 97.3% on MATH-500, 90.8% on MMLU, and 71.5% on GPQA Diamond. It outperforms OpenAI-o1-1217 on MATH-500 and Codeforces. It performs slightly better than DeepSeek-V3 on SWE-bench Verified. However, its performance is slightly below that of OpenAI-o1-1217 on benchmarks like MMLU-Pro and GPQA Diamond.\n'
[39]
"Based on the provided context, here's a summary of the Pass@1 accuracy of Deepseek R1 against other models:\n\n*   **DeepSeek-R1 vs. OpenAI-01-mini and OpenAI-01-0912:** DeepSeek-R1 outperforms both OpenAI-01-mini and OpenAI-01-0912 on AIME 2024, MATH-500, and GPOA Diamond benchmarks.\n*   **DeepSeek-R1 vs. Distilled Models:** DeepSeek-R1-Distill-Qwen-7B outperforms Qwen32B-Preview on all evaluation metrics. DeepSeek-R1-14B surpasses QwQ32B-Preview on all evaluation metrics. DeepSeek-R1-Distill-Llama-70B exceeds OpenAI-01-1217 on entire non benchmark tasks.\n*   **DeepSeek-R1 vs. DeepSeek-V3:** DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.\n*   **DeepSeek-R1 vs. OpenAI-1217:** DeepSeek-R1 performs comparably on par with OpenAI-1217, surpassing other models by a large margin."
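The generation step passes the user question plus the retrieved page images to a multimodal Gemini model. A sketch with the client injected; the model name `gemini-2.0-flash` and the prompt wording are assumptions, and a real client would come from `google.genai.Client(api_key=GEMINI_API_KEY)`:

```python
def answer_from_pages(client, question: str, images: list,
                      model: str = "gemini-2.0-flash") -> str:
    """Ask a multimodal Gemini model to answer using only the retrieved page images."""
    contents = [
        "Answer the question based only on the attached document pages.",
        question,
        *images,  # PIL images of the retrieved pages
    ]
    response = client.models.generate_content(model=model, contents=contents)
    return response.text
```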

Step 10: Evaluating retrieval and generation

[40]
[41]
[42]
[43]
[44]
[45]
[ ]
[53]
Model: voyage
100%|██████████| 5/5 [00:25<00:00,  5.20s/it]
MRR: 1.0
Avg. Recall @5: 0.68
Avg. alignment: 3.2

[54]
Model: clip
100%|██████████| 5/5 [00:24<00:00,  5.00s/it]
MRR: 0.8
Avg. Recall @5: 0.56
Avg. alignment: 3.2

[55]
Model: voyage
100%|██████████| 5/5 [00:25<00:00,  5.05s/it]
MRR: 0.8666666666666666
Avg. Recall @5: 0.52
Avg. alignment: 3.8

[56]
Model: clip
100%|██████████| 5/5 [00:22<00:00,  4.58s/it]
MRR: 0.8
Avg. Recall @5: 0.32
Avg. alignment: 2.4

[57]
Model: voyage
100%|██████████| 10/10 [00:49<00:00,  4.97s/it]
MRR: 0.9333333333333332
Avg. Recall @5: 0.6
Avg. alignment: 3.5

[58]
Model: clip
100%|██████████| 10/10 [00:46<00:00,  4.67s/it]
MRR: 0.8
Avg. Recall @5: 0.4400000000000001
Avg. alignment: 2.7
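The MRR and Recall@5 figures above follow the standard definitions over retrieved vs. expected page keys; the alignment score comes from an LLM judge and is not reproduced here. A straightforward implementation of the two retrieval metrics (function names are mine):

```python
def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1/rank of the first relevant hit, or 0.0 if none is retrieved."""
    for rank, key in enumerate(retrieved, start=1):
        if key in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: list, relevant: set, k: int = 5) -> float:
    """Fraction of the relevant keys found in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(runs: list) -> float:
    """Average reciprocal rank over a list of (retrieved, relevant) query runs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```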