Voyage Multimodal 3 Mixed Modality

Tags: advanced_techniques, agents, artificial-intelligence, llms, mongodb-genai-showcase, notebooks, generative-ai, rag


Getting started with Voyage multimodal embeddings

This notebook shows some of the ways you can use the latest multimodal model from Voyage AI.

Let's dive in!

0. Add your API key to Colab secrets

To run this notebook, you'll need a Voyage API key. If you don't have a Voyage API key yet, you can create one here.

(If you're running this notebook locally, you can skip the remainder of this step.)

By default, this notebook uses the Google Colab environment. You can add your API key to the Colab Secrets manager to securely store it:

  1. Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel.

  2. Create a new secret with the name VOYAGE_API_KEY.

  3. Copy/paste your API key into the Value input box of VOYAGE_API_KEY.

  4. Toggle the button on the left to allow notebook access to the secret.

1. Install packages

Now it's time to get started -- let's begin by installing some packages. For this notebook, you'll need just three:

  • voyageai for accessing the multimodal model via the Voyage API
  • pandas for loading some test data we'll use later in the notebook
  • PyMuPDF for taking screenshots of documents, slides, and other visual data embedded inside PDFs

For this notebook, we'll use voyage-multimodal-3 as the embedding model. You can find the full list of available multimodal embedding models in the Voyage AI documentation.


2. Create a synchronous Voyage client

(If you're running this notebook locally, read the API key from an environment variable instead of Colab Secrets.)


3. Generate some vectors over example data

Because Voyage multimodal models support interleaved text and images, each input to the multimodal_embed function is a list of str and/or PIL.Image objects. We can see this clearly in the example below, which generates vectors for four different hypothetical documents:

  1. A single text (text variable)
  2. A single image (image variable)
  3. Interleaved text + image
  4. Interleaved image + text

Let's define the text and image that we'll use first:

'Voyage AI makes best-in-class embedding models and rerankers.'

We can now compile the documents for the Voyage client object and vectorize them:


A call to the multimodal_embed function returns a MultimodalEmbeddings dataclass which contains four components:

  • .embeddings: The computed vectors
  • .text_tokens: The number of text tokens ingested across all inputs
  • .image_pixels: The number of image pixels processed across all inputs
  • .total_tokens: The total token count when images are taken into consideration (one image token is 560 pixels)

We can see the results here across all of our documents:

Number of vectors generated: 4
Number of text tokens ingested: 39
Number of image pixels processed: 196608
Total number of tokens (texts + images): 390
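The total above is consistent with the stated conversion of one image token per 560 pixels:

```python
# Sanity-check the token accounting: one image token per 560 pixels.
text_tokens = 39
image_pixels = 196_608

image_tokens = image_pixels // 560         # 351 image tokens
total_tokens = text_tokens + image_tokens

print(total_tokens)  # → 390, matching the output above
```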

Note that the vectors generated by (3) and (4), the two interleaved inputs, are not the same despite being derived from the same data: the model is sensitive to the ordering of the interleaved content. The cosine similarity between them is still extremely high, as one would expect:

Cosine similarity between the two vectors from interleaved data: 0.9656
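The similarity itself can be computed with a small helper like this (a generic sketch, not tied to the Voyage SDK; the indices in the comment assume the four-document input order above):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# e.g. cosine_similarity(result.embeddings[2], result.embeddings[3])
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors → 1.0
```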

4. Vectorizing charts, graphs, tables, and more

Voyage multimodal models excel at table and figure retrieval, i.e., the ability to match an image containing a figure (a chart, graph, table, etc.) with descriptions, captions, or other textual queries that reference it. Let's run through an example from the CharXiv dataset.


We've pulled the first three figures from the validation subset of CharXiv for use in this section. Let's plot these to see what they look like.

(Output: the three CharXiv figures.)

With these extracted as PIL.Image objects, we can vectorize them directly with the multimodal model -- no need to use layout analyzers or image transcription models:


Let's also define a query string, vectorize it, and see which of the figures the query is most related to:

'3D loss landscapes for different training strategies'
Cosine similarities: [[0.1838631637438084], [0.2137423680833308], [0.42832326084317174]]
Best match for query "3D loss landscapes for different training strategies":

The highest-scoring figure does indeed match the query.

5. One step further: screenshot is all you need!

With Voyage's multimodal models, there is no longer a need for screen parsing models or other complex text extraction pipelines. Simply take a screenshot of the document and convert the resulting image into a single, unified embedding that captures any text, tables, figures, and other visuals that appear in the document.

5a. Convert PDFs to screenshots

We'll start by defining a function to extract every page of a PDF as a screenshot. Specifically, we need to:

  1. Download a PDF as bytes and create an IO object using it.
  2. Use PyMuPDF (fitz) to render each page using the specified zoom.
  3. Convert and return all rendered pages to PIL.Image objects.

5b. A single-page example

Now let's define a query string and a PDF document to embed. We'll use Watson and Crick's seminal 1-page paper on the molecular structure of DNA for this example:

'What is the molecular structure of Deoxyribonucleic Acid (DNA)?'

We can now directly compute the cosine similarity between the query and PDF document using the screenshot above:

0.4985062545747496

The resulting similarity value is quite high, showing that the content of the document itself was indeed embedded in the same space as the query.

5c. Screenshotting multi-page PDFs

We can use the same function for multi-page PDFs; each page is rendered separately as an image. Let's replicate the previous example, this time with a multi-page document. We'll use President Franklin D. Roosevelt's 1941 State of the Union address for this example:

(Output: a screenshot of the first page of the address.)

This is just the first page of the document. Let's also display the full document in one large figure:


Now let's test out voyage-multimodal-3 on this more challenging input. We'll try to see if we can retrieve the page that discusses the following query:

"The consequences of a dictator's peace"

As before, let's vectorize everything. Each page in the document will get its own vector.


Now, let's find the most relevant page. We do this by picking the page number which corresponds to the document page with the highest cosine similarity with the query vector.

(Output: the retrieved page screenshot.)

So how did we do? Recall our query string:

The consequences of a dictator's peace

This page, selected from nearly two dozen by voyage-multimodal-3, does indeed discuss FDR's perspective on this:

No realistic American can expect from a dictator's peace international generosity, or return of true independence, or world disarmament, or freedom of expression, or freedom of religion -- or even good business. Such a peace would bring no security for us or for our neighbors.

Next steps

Try voyage-multimodal-3 on your own data today! The first 200M tokens are on us. If you have any follow-up questions, or if you're interested in fine-tuned embeddings, feel free to reach out to us at contact@voyageai.com.

Feel free to follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.