🖼️ Introduction to Multimodal Text Generation
In this notebook, we introduce the features that enable multimodal text generation in Haystack.
- We introduced the `ImageContent` dataclass, which represents the image content of a user `ChatMessage`.
- We developed some image converter components.
- The `OpenAIChatGenerator` was extended to support multimodal messages.
- The `ChatPromptBuilder` was refactored to also work with string templates, making it easier to support multimodal use cases.
In this notebook, we'll introduce all these features, show an application that combines textual retrieval with multimodal generation, and build a multimodal Agent.
Setup Development Environment
Enter OpenAI API key: ··········
Introduction to ImageContent
ImageContent is a new dataclass that stores the image content of a user ChatMessage.
It has the following attributes:
- `base64_image`: A base64 string representing the image.
- `mime_type`: The optional MIME type of the image (e.g. "image/png", "image/jpeg").
- `detail`: Optional detail level of the image (only supported by OpenAI). One of "auto", "high", or "low".
- `meta`: Optional metadata for the image.
Creating an ImageContent Object
Let's start by downloading an image from the web and manually creating an ImageContent object. We'll see more convenient ways to do this later.
--2025-05-14 09:29:45-- https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download Resolving upload.wikimedia.org (upload.wikimedia.org)... 198.35.26.112, 2620:0:863:ed1a::2:b Connecting to upload.wikimedia.org (upload.wikimedia.org)|198.35.26.112|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 202119 (197K) [image/jpeg] Saving to: 'capybara.jpg' capybara.jpg 100%[===================>] 197.38K --.-KB/s in 0.09s 2025-05-14 09:29:45 (2.23 MB/s) - 'capybara.jpg' saved [202119/202119]
ImageContent(base64_image='/9j/4QBoRXhpZgAATU0AKgAAAAgABQEaAAUAAAABAAAASgEbAAUAAAABAAAAUgEoAAMAAAABAAIAAAE7AAIAAAAFAAAAWgITAAMA...', mime_type='image/jpeg', detail='low', meta={})

Nice!
To perform text generation based on this image, we need to pass it in a user message with a prompt. Let's do that.
('The image depicts a capybara, a large rodent, with a small bird standing on '
'its head. The capybara has a brownish fur coat, while the bird has a yellow '
'belly and a grayish-brown back. They are surrounded by grassy vegetation, '
'creating a natural setting.')
Creating an ImageContent Object from URL or File Path
ImageContent features two utility class methods:
- `from_url`: downloads an image file and wraps it in `ImageContent`.
- `from_file_path`: loads an image from disk and wraps it in `ImageContent`.
Using `from_url`, we can simplify the previous example. `mime_type` is automatically inferred.
ImageContent(base64_image='/9j/4QBoRXhpZgAATU0AKgAAAAgABQEaAAUAAAABAAAASgEbAAUAAAABAAAAUgEoAAMAAAABAAIAAAE7AAIAAAAFAAAAWgITAAMA...', mime_type='image/jpeg', detail='low', meta={'content_type': 'image/jpeg', 'url': 'https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download'})

Since we downloaded the image file, we can also see `from_file_path` in action.
In this case, we will also use the size parameter, which resizes the image to fit within the specified dimensions while maintaining aspect ratio. This reduces file size, memory usage, and processing time, which is beneficial when working with models that have resolution constraints or when transmitting images to remote services.
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail='low', meta={'file_path': 'capybara.jpg'})

Image Converters for ImageContent
To perform image conversion in multimodal pipelines, we also introduced two image converters:
- `ImageFileToImageContent`, which converts image files to `ImageContent` objects (similar to `from_file_path`).
- `PDFToImageContent`, which converts PDF files to `ImageContent` objects.
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail='low', meta={'file_path': 'capybara.jpg'})

Let's see a more interesting example. We want our LLM to interpret a figure in this influential paper by Google: Scaling Instruction-Finetuned Language Models.
--2025-05-14 09:31:03-- https://arxiv.org/pdf/2210.11416.pdf Resolving arxiv.org (arxiv.org)... 151.101.67.42, 151.101.3.42, 151.101.195.42, ... Connecting to arxiv.org (arxiv.org)|151.101.67.42|:443... connected. HTTP request sent, awaiting response... 301 Moved Permanently Location: http://arxiv.org/pdf/2210.11416 [following] --2025-05-14 09:31:04-- http://arxiv.org/pdf/2210.11416 Connecting to arxiv.org (arxiv.org)|151.101.67.42|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 1557309 (1.5M) [application/pdf] Saving to: 'flan_paper.pdf' flan_paper.pdf 100%[===================>] 1.48M --.-KB/s in 0.09s 2025-05-14 09:31:04 (16.5 MB/s) - 'flan_paper.pdf' saved [1557309/1557309]
ImageContent(base64_image='/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAx...', mime_type='image/jpeg', detail=None, meta={'file_path': 'flan_paper.pdf', 'page_number': 9})

('The main takeaway of Figure 6 is that Flan-PaLM demonstrates improved '
'performance in zero-shot reasoning tasks when utilizing chain-of-thought '
'(CoT) reasoning, as indicated by higher accuracy across different model '
'sizes compared to PaLM without finetuning. This highlights the importance of '
'instruction finetuning combined with CoT for enhancing reasoning '
'capabilities in models.')
Extended ChatPromptBuilder with String Templates
As we explored multimodal use cases, it became clear that the existing `ChatPromptBuilder` had some limitations. Specifically, we needed a way to pass structured objects like `ImageContent` when building a `ChatMessage`, and to handle a variable number of such objects.
To address this, we are introducing support for string templates in the ChatPromptBuilder. The syntax is pretty simple, as you can see below.
Note the | templatize_part Jinja2 filter: this is used to indicate that the content part is a structured object, not plain text, and needs special treatment.
[ChatMessage(_role=<ChatRole.SYSTEM: 'system'>,
_content=[TextContent(text='You are a joking assistant.')],
_name=None,
_meta={}),
ChatMessage(_role=<ChatRole.USER: 'user'>,
_content=[TextContent(text='Compare these images:'),
ImageContent(base64_image='iVBORw0KGgoAAAANSUhEUgAADwAAAAhwAgMAAADt0CPhAAAADFBMVEVHcEwAAADe3t58fHxUHjQgAAAAAXRSTlMAQObYZgAAIABJ...', mime_type='image/png', detail='low', meta={'content_type': 'image/png', 'url': 'https://1000logos.net/wp-content/uploads/2017/02/Apple-Logosu.png'}),
ImageContent(base64_image='/9j/4AAQSkZJRgABAQEA8ADwAAD/7SnQUGhvdG9zaG9wIDMuMAA4QklNA+0AAAAAABAA8AAAAAEAAQDwAAAAAQABOEJJTQQMAAAA...', mime_type='image/jpeg', detail='low', meta={'content_type': 'image/jpeg', 'url': 'https://upload.wikimedia.org/wikipedia/commons/2/26/Pink_Lady_Apple_%284107712628%29.jpg'})],
_name=None,
_meta={})]
("Sure! Let's dive into these fruity comparisons! \n"
'\n'
"1. **Apple Logo**: This is a stylized logo of an apple. It's simple, iconic, "
"and represents a well-known tech company. It's all about design and branding "
'- who knew a fruit could be so influential in the tech world?\n'
'\n'
"2. **Real Apple**: This is an actual apple, the kind you can bite into! It's "
'delicious, nutritious, and makes a great snack or pie ingredient. Plus, it '
"doesn't need charging!\n"
'\n'
'In short, one is a tech icon, and the other is a snackable delight. Both are '
'essential in their own realms!')
Textual Retrieval and Multimodal Generation
Let's see a more advanced example.
In this case, we have a collection of images from papers about Language Models.
Our goal is to build a system that can:
- Retrieve the most relevant image from this collection based on a user's textual question.
- Use this image, along with the original question, to have an LLM generate an answer.
We start by downloading the images.
['./arxiv/direct_preference_optimization.png', './arxiv/large_language_diffusion_models.png', './arxiv/lora_vs_full_fine_tuning.png', './arxiv/magpie.png', './arxiv/online_ai_feedback.png', './arxiv/reverse_thinking_llms.png', './arxiv/scaling_laws_for_precision.png', './arxiv/spectrum.png', './arxiv/textgrad.png', './arxiv/tulu_3.png', './map.png']
We create an InMemoryDocumentStore and write a Document there for each image: the content is a textual description of the image; the image path is stored in meta.
The content of the Documents here is minimal. You can think of more sophisticated ways to create representative content, such as performing OCR or using a Vision Language Model. We'll explore this direction in the future.
10
We perform text-based retrieval (using BM25) to get the most relevant Document. Then an ImageContent object is created using the image file path. Finally, the ImageContent is passed to the LLM with the user question.
('The image compares two methods in machine learning: Reinforcement Learning '
'from Human Feedback (RLHF) and Direct Preference Optimization (DPO).\n'
'\n'
'### Left Side: RLHF\n'
'- **Process**: \n'
' - Input example: "write me a poem about the history of jazz."\n'
' - Preference data shown as two different responses (y₁ and y₂).\n'
'- **Components**:\n'
' - It includes a "reward model" that labels the quality of outputs and '
'involves a reinforcement learning process.\n'
' - The goal is to derive an LM policy through sampling that improves over '
'time.\n'
'- **Key Terms**: "preference data," "maximum likelihood," "reinforcement '
'learning."\n'
'\n'
'### Right Side: DPO\n'
'- **Process**:\n'
' - Similar input as on the left.\n'
' - Preference data involves determining which response (y₁ or y₂) is '
'preferred without a reward model.\n'
'- **Components**:\n'
' - Focuses directly on optimizing preferences to produce a final language '
'model.\n'
'- **Key Terms**: "preference data," "maximum likelihood," "final LM."\n'
'\n'
'### General Visual Elements:\n'
'- The image utilizes a clear color scheme to differentiate between systems, '
'with RLHF having a pink background and DPO a light blue background.\n'
'- Diagrams include nodes to represent network components and flows '
'indicating processes.')
Here are some other example questions to test:
Multimodal Agent
Let's combine multimodal messages with the Agent component.
We start by creating a weather Tool, based on the python-weather library. The library is asynchronous while the Tool abstraction expects a synchronous invocation method, so we make some adaptations.
Learn more about creating agents in Tutorial: Build a Tool-Calling Agent
Let's test our Tool by invoking it with the required parameter:
{'description': 'Heavy rain, fog',
 'temperature': 14,
 'humidity': 93,
 'precipitation': 0.0,
 'wind_speed': 24,
 'wind_direction': WindDirection.EAST_SOUTHEAST}

We can now define an Agent, provide it with the weather Tool, and see if it can find the weather based on a geographical map.
('The weather in Valencia, Spain is currently overcast with a temperature of '
 '21°C. The humidity is at 64%, and there is no precipitation. Winds are '
 'coming from the east-northeast at a speed of 25 km/h.')
What's next?
We also support image capabilities across a variety of LLM providers, including Amazon Bedrock, Google, Mistral, Ollama, and more.
To learn how to build more advanced multimodal pipelines, with different file formats and multimodal embedding models, check out the Creating Vision+Text RAG Pipelines tutorial.
(Notebook by Stefano Fiorucci)