Voyage Multimodal 3 Mixed Modality
Getting started with Voyage multimodal embeddings
This notebook shows some of the ways you can use the latest multimodal model from Voyage AI.
Let's dive in!
0. Add your API key to Colab secrets
To run this notebook, you'll need a Voyage API key. If you don't have a Voyage API key yet, you can create one here.
(If you're running this notebook locally, you can skip the remainder of this step.)
By default, this notebook uses the Google Colab environment. You can add your API key to the Colab Secrets manager to securely store it:
- Open your Google Colab notebook and click on the 🔑 Secrets tab in the left panel.
- Create a new secret with the name `VOYAGE_API_KEY`.
- Copy/paste your API key into the `Value` input box of `VOYAGE_API_KEY`.
- Toggle the button on the left to allow notebook access to the secret.
1. Install packages
Now it's time to get started -- let's begin by installing some packages. For this notebook, all you'll need is:

- `voyageai` for accessing the multimodal model via the Voyage API
- `pandas` for loading some test data we'll be using later on in the notebook
- `PyMuPDF` for taking screenshots of documents, slides, and other visual data embedded inside PDFs
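Assuming a standard Colab or local Python environment, all three packages can be installed in one go with pip (note that PyMuPDF's package name on PyPI is `PyMuPDF`, while it is imported as `fitz`):

```shell
pip install voyageai pandas PyMuPDF
```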
For this notebook, we'll use voyage-multimodal-3 as the embedding model. You can see a full list of multimodal embedding models available to you on the Voyage AI documentation.
2. Create a synchronous Voyage client
(If you're running this notebook locally, uncomment the cell below and run it in place of the second cell.)
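As a sketch, creating the client might look like the following. The Colab branch uses `google.colab.userdata` to read the secret created in step 0; the variable name `vo` is our own choice:

```python
import os

import voyageai

# On Colab, pull the key from the Secrets manager; locally, rely on the
# VOYAGE_API_KEY environment variable already being set.
try:
    from google.colab import userdata
    os.environ["VOYAGE_API_KEY"] = userdata.get("VOYAGE_API_KEY")
except ImportError:
    pass  # not running on Colab

# The client reads VOYAGE_API_KEY from the environment by default.
vo = voyageai.Client()
```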
3. Generate some vectors over example data
Because Voyage multimodal models support interleaved text and images, each input to the multimodal_embed function is a list of str and/or PIL.Image objects. We can see this clearly in the example below, which generates vectors for four different hypothetical documents:
- A single text (`text` variable)
- A single image (`image` variable)
- Interleaved `text` + `image`
- Interleaved `image` + `text`
Let's define the text and image that we'll use first:
'Voyage AI makes best-in-class embedding models and rerankers.'
We can now compile the documents for the Voyage client object and vectorize them:
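For concreteness, the four inputs might be assembled as follows (a sketch; `image` is assumed to be the `PIL.Image` defined above, and `vo` the Voyage client from step 2):

```python
text = "Voyage AI makes best-in-class embedding models and rerankers."

# Each input is a list of str and/or PIL.Image objects, in reading order.
inputs = [
    [text],          # 1. a single text
    [image],         # 2. a single image
    [text, image],   # 3. interleaved text + image
    [image, text],   # 4. interleaved image + text
]

result = vo.multimodal_embed(inputs=inputs, model="voyage-multimodal-3")
print(len(result.embeddings))  # one vector per input
```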
A call to the multimodal_embed function returns a MultimodalEmbeddings dataclass which contains four components:
- `.embeddings`: The computed vectors.
- `.text_tokens`: The number of text tokens ingested across all inputs.
- `.image_pixels`: The number of image pixels processed across all inputs.
- `.total_tokens`: The total token count when images are taken into consideration (one image token is 560 pixels).
We can see the results here across all of our documents:
Number of vectors generated: 4
Number of text tokens ingested: 39
Number of image pixels processed: 196608
Total number of tokens (texts + images): 390
Note that the vectors generated by the two interleaved inputs (`text` + `image` vs. `image` + `text`) are not the same, despite being derived from the same data: the ordering of the modalities matters. The cosine similarity between them is still extremely high, as one would expect:
Cosine similarity between the two vectors from interleaved data: 0.9656
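Cosine similarity itself is straightforward to compute; a minimal, dependency-free helper might look like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Comparing the two interleaved-input vectors would then be `cosine_similarity(embs[2], embs[3])`, assuming `embs` holds the four vectors returned above.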
4. Vectorizing charts, graphs, tables, and more
Voyage multimodal models excel at table and figure retrieval, i.e. the ability to match an image containing a figure (charts, graphs, tables, etc.) with descriptions, captions, or other textual queries that reference the figure. Let's run through an example from the CharXiv dataset.
We've pulled the first three figures from the validation subset of CharXiv for use in this section. Let's plot these to see what they look like.
With these extracted as PIL.Image objects, we can vectorize them directly with the multimodal model -- no need to use layout analyzers or image transcription models:
Let's also define a query string, vectorize it, and see which of the figures the query is most related to:
'3D loss landscapes for different training strategies'
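The query-versus-figures comparison might be sketched as follows (our own sketch, assuming `figures` is the list of `PIL.Image` objects extracted above and `vo` the Voyage client; `input_type` hints to the model whether an input is a query or a document):

```python
query = "3D loss landscapes for different training strategies"

# Embed each figure image and the query into the same space.
figure_embs = vo.multimodal_embed(
    inputs=[[fig] for fig in figures],
    model="voyage-multimodal-3",
    input_type="document",
).embeddings
query_emb = vo.multimodal_embed(
    inputs=[[query]],
    model="voyage-multimodal-3",
    input_type="query",
).embeddings[0]

# Dot products; for unit-norm embeddings this equals cosine similarity.
sims = [sum(q * f for q, f in zip(query_emb, emb)) for emb in figure_embs]
best = max(range(len(sims)), key=sims.__getitem__)
print(f"Best match: figure {best} (similarity {sims[best]:.4f})")
```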
Cosine similarities: [[0.1838631637438084], [0.2137423680833308], [0.42832326084317174]]
Best match for query "3D loss landscapes for different training strategies":
The highest-similarity figure does indeed match the query.
5. One step further: screenshot is all you need!
With Voyage's multimodal models, there is no longer a need for screen parsing models or other complex text extraction pipelines. Simply take a screenshot of the document and convert the resulting image into a single, unified embedding that captures any text, tables, figures, and other visuals that appear in the document.
5a. Convert PDFs to screenshots
We'll start by defining a function to extract all pages as a screenshot from a PDF. Specifically, we need to:
- Download a PDF as bytes and create an IO object from it.
- Use `PyMuPDF` (`fitz`) to render each page at the specified zoom.
- Convert all rendered pages to `PIL.Image` objects and return them.
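The three steps above might be implemented roughly as follows (a sketch; the function name and default zoom are our own choices):

```python
import io
import urllib.request

import fitz  # PyMuPDF
from PIL import Image

def pdf_to_images(url: str, zoom: float = 1.0) -> list[Image.Image]:
    """Download a PDF and render each page as a PIL.Image."""
    # Step 1: download the PDF as bytes and wrap it in an IO object.
    with urllib.request.urlopen(url) as resp:
        pdf_bytes = io.BytesIO(resp.read())

    # Step 2: render each page at the specified zoom.
    mat = fitz.Matrix(zoom, zoom)  # zoom controls rendering resolution
    pages = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=mat)
            # Step 3: convert the raw pixmap into a PIL.Image.
            pages.append(
                Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            )
    return pages
```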
5b. A single-page example
Now let's define a query string and a PDF document to embed. We'll use Watson and Crick's seminal 1-page paper on the molecular structure of DNA for this example:
'What is the molecular structure of Deoxyribonucleic Acid (DNA)?'
We can now directly compute the cosine similarity between the query and PDF document using the screenshot above:
0.4985062545747496
The resulting similarity value is quite high, showing that the content of the document itself was indeed embedded in the same space as the query.
5c. Screenshotting multi-page PDFs
We can use the same function for multi-page PDFs; each page is rendered separately as an image. Let's replicate the previous example, only for a multi-page document. We'll use President Franklin D. Roosevelt's 1941 State of the Union address for this example:
This is just the first page of the document. Let's also display the full document in one large figure:
Now let's test out voyage-multimodal-3 on this more challenging input. We'll try to see if we can retrieve the page that discusses the following query:
"The consequences of a dictator's peace"
As before, let's vectorize everything. Each page in the document will get its own vector.
Now, let's find the most relevant page. We do this by picking the page number which corresponds to the document page with the highest cosine similarity with the query vector.
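A small helper for this argmax step might look like the following (a sketch; the dot product equals cosine similarity when the embeddings are unit-normalized):

```python
def most_relevant_page(page_embs, query_emb):
    """Return (index, score) of the page vector with the highest dot
    product against the query vector."""
    sims = [sum(q * p for q, p in zip(query_emb, emb)) for emb in page_embs]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims[best]
```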
So how did we do? Recall our query string:
The consequences of a dictator's peace
This page, selected from nearly two dozen by voyage-multimodal-3, does indeed discuss FDR's perspective on this:
No realistic American can expect from a dictator's peace international generosity, or return of true independence, or world disarmament, or freedom of expression, or freedom of religion -- or even good business. Such a peace would bring no security for us or for our neighbors.
Next steps
Try voyage-multimodal-3 on your own data today! The first 200M tokens are on us. If you have any follow-up questions, or if you're interested in fine-tuned embeddings, feel free to reach out to us at contact@voyageai.com.
Feel free to follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.