Voyage Multimodal 3.5 Video
Getting started with voyage-multimodal-3.5
This notebook shows some of the ways you can use the latest multimodal model from Voyage AI.
Let's dive in!
0. Obtain a Voyage API key
To run this notebook, you'll need a Voyage API key. If you don't have an API key yet, you can create one here.
1. Install packages
Now it's time to get started -- let's begin by installing some packages. For this notebook, you'll need:
- voyageai, for accessing the multimodal model via the Voyage API
- ffmpeg, for processing video files
- ffmpeg-python, a Python wrapper for ffmpeg
If you're running this notebook in Google Colab, run the following cell to install ffmpeg:
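The install cell isn't shown in this export; a minimal sketch, assuming the package list above (Colab typically ships with ffmpeg preinstalled, so the apt-get line is a safeguard):

```shell
pip install -q voyageai ffmpeg-python
apt-get -qq update && apt-get -qq install -y ffmpeg
```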
If you're running this notebook locally, do the following:
- Install ffmpeg from the terminal based on the OS you are using.
MacOS
brew install ffmpeg
Linux
sudo apt-get install ffmpeg
Windows
- Download the executable from ffmpeg.org
- Extract the downloaded zip file
- Note the path to the bin folder
- Ensure that ffmpeg is accessible from your notebook. To do this, uncomment the cell below and replace /path/to/ffmpeg with your ffmpeg path:
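The commented-out cell referenced above might look like this sketch, which prepends the ffmpeg bin directory to the process PATH (keep the placeholder path until you substitute your own):

```python
import os

# Replace /path/to/ffmpeg with the path to the bin folder you extracted,
# then prepend it so the ffmpeg executable is found first.
ffmpeg_bin = "/path/to/ffmpeg"
os.environ["PATH"] = ffmpeg_bin + os.pathsep + os.environ["PATH"]
```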
For this notebook, we'll use voyage-multimodal-3.5 as the embedding model. You can see a full list of the multimodal embedding models available to you in our documentation.
2. Create a synchronous Voyage client
Paste your Voyage API key when prompted upon running the cell below:
Enter your Voyage API key: ········
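The cell that produces the prompt above is not included in this export; a minimal sketch using getpass (the variable name vo is an assumption):

```python
import getpass

import voyageai

# Prompt for the key rather than hard-coding it. voyageai.Client also
# falls back to the VOYAGE_API_KEY environment variable if api_key is omitted.
api_key = getpass.getpass("Enter your Voyage API key: ")
vo = voyageai.Client(api_key=api_key)
```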
3. Generate some vectors over example data
voyage-multimodal-3.5 can embed interleaved text and visual inputs, including videos.
Let's assume we have a video, its description, and frames extracted from it. Let's see how to embed the following using voyage-multimodal-3.5:
- Video only
- Individual frames only
- Video description only
- Video and its description
- A frame and its description
The multimodal_embed method accepts video inputs as Video objects. You can create these using the Video class from the voyageai library:
In the above example, we created the Video object from a file-like object/bytes, but you can also create it from a local filesystem path using the from_path method.
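A sketch of the from_path variant (the import path is an assumption inferred from the object repr shown later in this notebook, and the filename is a placeholder):

```python
from voyageai.video_utils import Video

# Create a Video object directly from a local filesystem path.
video = Video.from_path("video.mp4")
```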
Now let's compile the documents to embed.
A call to the multimodal_embed function returns a MultimodalEmbeddings dataclass, which contains five components:
- embeddings: The computed vectors.
- text_tokens: The number of text tokens ingested across all inputs.
- image_pixels: The number of image pixels processed across all inputs.
- video_pixels: The number of video pixels processed across all inputs.
- total_tokens: The total token count when images and videos are taken into consideration (one image token is 560 pixels; one video token is 1120 pixels).

Keep in mind that each input must not exceed 32,000 tokens, and the total number of tokens across all inputs must not exceed 320,000.
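Putting this together, a call covering the five document types listed earlier might look like the sketch below, where video, frames, and description stand in for the objects prepared above (names are assumptions, and vo is the client from step 2):

```python
# One document per input type: video only, an individual frame,
# text only, and two interleaved combinations.
documents = [
    [video],
    [frames[0]],
    [description],
    [video, description],
    [frames[0], description],
]

result = vo.multimodal_embed(
    inputs=documents,
    model="voyage-multimodal-3.5",
    input_type="document",
)
print("Number of vectors generated:", len(result.embeddings))
print("Total number of tokens:", result.total_tokens)
```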
We can see the results here across all our documents:
Number of vectors generated: 5
Number of text tokens ingested: 25
Number of image pixels processed: 131072
Number of video pixels processed: 30005248
Total number of tokens (texts + images + video): 27049
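The reported total is consistent with the pixel-to-token ratios above; a quick check of the arithmetic, assuming floor rounding:

```python
text_tokens = 25
image_pixels = 131072
video_pixels = 30005248

image_tokens = image_pixels // 560    # one image token per 560 pixels
video_tokens = video_pixels // 1120   # one video token per 1120 pixels
total_tokens = text_tokens + image_tokens + video_tokens
print(total_tokens)  # 27049, matching the reported total
```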
To query these documents using, say, vector search, set the input_type to "query" when embedding the search query.
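For instance, you could embed the query with input_type set to "query" and rank documents by cosine similarity. The helper below is plain Python; the commented lines sketch the API call (the query text and variable names are assumptions):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# query = vo.multimodal_embed(
#     inputs=[["How did the Moon form?"]],
#     model="voyage-multimodal-3.5",
#     input_type="query",
# ).embeddings[0]
# ranked = sorted(range(len(result.embeddings)),
#                 key=lambda i: cosine_similarity(query, result.embeddings[i]),
#                 reverse=True)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (parallel vectors)
```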
For the rest of this tutorial, we will focus mainly on video inputs. For more examples of how to embed other input types using voyage-multimodal-3.5, refer to the voyage-multimodal-3 tutorial. Both models use the multimodal_embed method of the API and support the same input formats, with the exception that voyage-multimodal-3.5 supports video inputs.
4. Video optimization
In the previous example, we embedded a short (32-second) low-resolution video that was well within the model's 32,000 token and 20 MB file-size limits. However, in the real world, you'll likely encounter longer videos, with different resolutions, frame rates, etc., which will oftentimes exceed these limits. Let's see how to handle such inputs when working with voyage-multimodal-3.5.
By default, videos are automatically optimized when creating Video objects for vectorization using the voyageai library. This is done by setting the optimize parameter to True when calling the from_file and from_path methods.
Optimization here mainly means resizing the videos and downsampling the frame rate to ensure videos stay within the model's token limits. The complete optimization logic can be found here.
To test this, let's use a video that we know exceeds the model's limits:
Let's get stats like the frame rate, number of pixels, estimated token usage, etc., for the video prior to optimization using ffmpeg.
But first, we need to save the video bytes to a temporary file:
Now let's use the probe method of ffmpeg to extract metadata from the video:
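A sketch of the probe call, assuming the video bytes were written to tmp_path in the previous cell (the stream field names follow ffprobe's metadata):

```python
import ffmpeg

# ffmpeg.probe wraps ffprobe and returns the metadata as a dict.
probe = ffmpeg.probe(tmp_path)
stream = next(s for s in probe["streams"] if s["codec_type"] == "video")
print(stream["width"], stream["height"], stream["nb_frames"], stream["avg_frame_rate"])
```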
The stats for the video before optimization are as follows:
Width of each frame: 1920
Height of each frame: 1080
Number of frames: 6654
Frame rate: 60/1
Video duration (seconds): 110.9
Pixels per frame: 2073600
Total video pixels: 13797734400
Estimated token usage: 12319405
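The derived quantities follow directly from the probe metadata; a quick check of the arithmetic (one video token per 1120 pixels, floor rounding assumed):

```python
def video_stats(width, height, n_frames):
    # Pixel counts and estimated token usage for a video of the given shape.
    pixels_per_frame = width * height
    total_pixels = pixels_per_frame * n_frames
    estimated_tokens = total_pixels // 1120  # one video token per 1120 pixels
    return pixels_per_frame, total_pixels, estimated_tokens

print(video_stats(1920, 1080, 6654))  # (2073600, 13797734400, 12319405)
```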
As seen above, the video in its original form would consume ~12.3M tokens, exceeding the model's 32K token limit by nearly 400x. The video is in full HD resolution (1920 x 1080) with a high frame rate of 60 fps, both of which can be reduced significantly to optimize token consumption.
Now let's create the Video object with optimization enabled by default:
True
32000000
16
28571
As seen above, the optimizer downsampled the frame rate to strategically capture 16 frames across the entire video, while largely retaining the frame resolution.
This aggressive frame sampling is effective because consecutive frames at high frame rates (like 60 fps) are often visually redundant and add little new semantic information. Maintaining a higher frame resolution, however, preserves the visual detail and clarity needed for accurate content understanding that lower resolutions might lose.
5. Semantic video segmentation
In some cases, you might want more control over how frames are selected rather than relying on auto-sampling, especially for longer videos. In such scenarios, you may first want to split the videos into smaller segments and embed each segment separately.
For example, if you have video transcripts/captions, splitting the video based on natural breaks in the transcripts will ensure that related frames stay together, resulting in more focused embeddings. If segments still exceed the 32K token limit, you can apply auto-optimization on top of your initial semantic segmentation for best results.
Let's see how to do this. We'll use the same video as in Section 4, along with its captions.
Caption files typically contain timestamped segments of dialogue or narration from a video. Each entry includes start and end times and the corresponding text for the segment.
An example of an entry in our caption file is as follows:
{'start': 0.0, 'end': 7.166, 'text': 'A fresh take on the origin of Earth’s Moon '}

Now, let's use ffmpeg to split the video based on the caption timestamps.
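The splitting cell is not shown in this export; a sketch with ffmpeg-python, assuming the captions were parsed into a list of dicts like the entry above and the source file is video.mp4 (both names are assumptions):

```python
import os

import ffmpeg

os.makedirs("./video_segments", exist_ok=True)
for i, cap in enumerate(captions):
    out_path = f"./video_segments/segment_{i:03d}.mp4"
    (
        ffmpeg
        .input("video.mp4", ss=cap["start"], to=cap["end"])  # trim to the caption window
        .output(out_path)
        .run(quiet=True, overwrite_output=True)
    )
    print(f"Created {out_path}: {cap['start']:.3f}s - {cap['end']:.3f}s")
```

Re-encoding (rather than stream copying) keeps the cuts frame-accurate at the cost of speed, since stream copy can only cut at keyframes.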
Created ./video_segments/segment_000.mp4: 0.000s - 7.166s
Created ./video_segments/segment_001.mp4: 9.383s - 16.550s
Created ./video_segments/segment_002.mp4: 18.533s - 25.700s
Created ./video_segments/segment_003.mp4: 31.333s - 38.500s
Created ./video_segments/segment_004.mp4: 40.900s - 48.366s
Created ./video_segments/segment_005.mp4: 53.016s - 60.533s
Created ./video_segments/segment_006.mp4: 64.233s - 70.066s
Created ./video_segments/segment_007.mp4: 72.500s - 79.916s
Created ./video_segments/segment_008.mp4: 83.066s - 90.333s
Created ./video_segments/segment_009.mp4: 92.633s - 99.800s
Let's convert each video segment into a Video object:
Note that the video segments are auto-optimized by default to stay within the model's token limits. Segmenting by captions ensures related frames stay together, and the optimization removes any redundant frames.
Now, let's vectorize each video segment separately:
Number of vectors generated: 10
You can also embed the video segments along with their captions to create more contextually aware embeddings:
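A sketch of such an interleaved call, assuming segments holds the Video objects from the previous step, captions holds the parsed caption entries, and vo is the client from step 2:

```python
# Pair each segment with its caption text in a single interleaved input.
inputs = [[segment, cap["text"]] for segment, cap in zip(segments, captions)]

result = vo.multimodal_embed(
    inputs=inputs,
    model="voyage-multimodal-3.5",
    input_type="document",
)
print("Example input:", inputs[0])
print("Number of vectors generated:", len(result.embeddings))
```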
Example input: [<voyageai.video_utils.Video object at 0x7f3c002aec20>, 'A fresh take on the origin of Earth’s Moon ']
Number of vectors generated: 10
Next steps
Try voyage-multimodal-3.5 on your own data today! The first 200M tokens are on us. If you have any follow-up questions, or if you're interested in fine-tuned embeddings, feel free to reach out to us at contact@voyageai.com.
Feel free to follow us on X (Twitter) and LinkedIn, and join our Discord for more updates.