Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL)
Authored by: Sergio Paniego
🚨 WARNING: This notebook is resource-intensive and requires substantial computational power. If you’re running it in Colab, you’ll need an A100 GPU.
In this recipe, we’ll demonstrate how to fine-tune a Vision Language Model (VLM) using the Hugging Face ecosystem, specifically with the Transformer Reinforcement Learning library (TRL).
🌟 Model & Dataset Overview
We’ll be fine-tuning the Qwen2-VL-7B model on the ChartQA dataset. This dataset includes images of various chart types paired with question-answer pairs—ideal for enhancing the model's visual question-answering capabilities.
📖 Additional Resources
If you’re interested in more VLM applications, check out:
- Multimodal Retrieval-Augmented Generation (RAG) Recipe: where I guide you through building a RAG system using Document Retrieval (ColPali) and Vision Language Models (VLMs).
- Phil Schmid's tutorial: an excellent deep dive into fine-tuning multimodal LLMs with TRL.
- Merve Noyan's smol-vision repository: a collection of engaging notebooks on cutting-edge vision and multimodal AI topics.
1. Install Dependencies
Let’s start by installing the essential libraries we’ll need for fine-tuning! 🚀
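Below is a minimal install cell reflecting the libraries this recipe relies on; the exact package list and versions are an assumption, so adjust them to your environment.

```python
# Core libraries for this recipe (package set is indicative; pin versions for reproducibility)
!pip install -q -U transformers trl datasets peft bitsandbytes accelerate qwen-vl-utils trackio
```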
Log in to Hugging Face to upload your fine-tuned model! 🗝️
You’ll need to authenticate with your Hugging Face account to save and share your model directly from this notebook.
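A minimal sketch of the authentication step using the standard Hub helper:

```python
# Log in to the Hugging Face Hub (paste a token with write access when prompted)
from huggingface_hub import notebook_login

notebook_login()
```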
2. Load Dataset 📁
In this section, we’ll load the HuggingFaceM4/ChartQA dataset. This dataset contains chart images paired with related questions and answers, making it ideal for training on visual question answering tasks.
Next, we’ll generate a system message for the VLM. In this case, we want to create a system that acts as an expert in analyzing chart images and providing concise answers to questions based on them.
We’ll format the dataset into a chatbot structure for interaction. Each interaction will consist of a system message, followed by the image and the user's query, and finally, the answer to the query.
💡For more usage tips specific to this model, check out the Model Card.
For educational purposes, we’ll load only 10% of each split in the dataset. However, in a real-world use case, you would typically load the entire set of samples.
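A minimal loading sketch under that assumption; the variable names and the exact 10% split slicing are my own.

```python
from datasets import load_dataset

system_message = (
    "You are a Vision Language Model specialized in interpreting visual data from chart images.\n"
    "Your task is to analyze the provided chart image and respond to queries with concise answers, "
    "usually a single word, number, or short phrase.\n"
    "The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.\n"
    "Focus on delivering accurate, succinct answers based on the visual information. "
    "Avoid additional explanation unless absolutely necessary."
)

# Load only 10% of each split for this walkthrough
dataset_id = "HuggingFaceM4/ChartQA"
train_dataset, eval_dataset, test_dataset = load_dataset(
    dataset_id, split=["train[:10%]", "val[:10%]", "test[:10%]"]
)
```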
Let’s take a look at the structure of the dataset. It includes an image, a query, a label (which is the answer), and a fourth feature that we’ll be discarding.
Dataset({
    features: ['image', 'query', 'label', 'human_or_machine'],
    num_rows: 2830
})

Now, let’s format the data using the chatbot structure. This will allow us to set up the interactions appropriately for our model.
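A sketch of the formatting helper; `format_data` is a hypothetical name, and indexing the answer as `sample["label"][0]` assumes the label is stored as a single-element list.

```python
def format_data(sample):
    # One system turn, one user turn (image + query), and the ground-truth answer
    return {
        "images": [sample["image"]],
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_message}]},
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": sample["image"]},
                    {"type": "text", "text": sample["query"]},
                ],
            },
            {"role": "assistant", "content": [{"type": "text", "text": sample["label"][0]}]},
        ],
    }

train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

train_dataset[0]
```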
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=308x369>],
 'messages': [{'role': 'system',
   'content': [{'type': 'text',
     'text': 'You are a Vision Language Model specialized in interpreting visual data from chart images.\nYour task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.\nThe charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]},
  {'role': 'user',
   'content': [{'type': 'image',
     'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=308x369>},
    {'type': 'text',
     'text': 'Is the rightmost value of light brown graph 58?'}]},
  {'role': 'assistant', 'content': [{'type': 'text', 'text': 'No'}]}]}

3. Load Model and Check Performance! 🤔
Now that we’ve loaded the dataset, let’s start by loading the model and evaluating its performance using a sample from the dataset. We’ll be using Qwen/Qwen2-VL-7B-Instruct, a Vision Language Model (VLM) capable of understanding both visual data and text.
If you're exploring alternatives, consider these open-source options:
- Meta AI's Llama-3.2-11B-Vision
- Mistral AI's Pixtral-12B
- Allen AI's Molmo-7B-D-0924
Additionally, you can check the Leaderboards, such as the WildVision Arena or the OpenVLM Leaderboard, to find the best-performing VLMs.

Next, we’ll load the model and the processor to prepare for inference.
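A sketch of the loading step, assuming bfloat16 weights and automatic device placement:

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(model_id)
```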
To evaluate the model's performance, we’ll use a sample from the dataset. First, let’s take a look at the internal structure of this sample.
We’ll use the sample without the system message to assess the VLM's raw understanding. Here’s the input we will use:
Now, let’s take a look at the chart corresponding to the sample. Can you answer the query based on the visual information?
Let’s create a method that takes the model, processor, and sample as inputs to generate the model's answer. This will allow us to streamline the inference process and easily evaluate the VLM's performance.
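One way to write such a helper, using `qwen_vl_utils.process_vision_info` to gather the images; the slicing `[1:2]` keeps only the user turn so the system message is left out, as discussed above.

```python
from qwen_vl_utils import process_vision_info

def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Build the text prompt from the user turn only (skip the system message)
    text_input = processor.apply_chat_template(
        sample["messages"][1:2], tokenize=False, add_generation_prompt=True
    )

    # Collect the image(s) referenced in the conversation
    image_inputs, _ = process_vision_info(sample["messages"])

    # Tokenize the text and preprocess the images together
    model_inputs = processor(
        text=[text_input], images=image_inputs, return_tensors="pt"
    ).to(device)

    # Generate, then strip the prompt tokens from the output before decoding
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    trimmed_ids = [
        out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0]

generate_text_from_sample(model, processor, train_dataset[0])  # index stands in for the sample above
```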
While the model successfully retrieves the correct visual information, it struggles to answer the question accurately. This indicates that fine-tuning might be the key to enhancing its performance. Let’s proceed with the fine-tuning process!
Remove Model and Clean GPU
Before we proceed with training the model in the next section, let's clear the current variables and clean the GPU to free up resources.
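A sketch of a cleanup helper; it mirrors the memory report printed later in the notebook.

```python
import gc
import time

import torch

def clear_memory():
    # Drop references to the large objects, then release cached GPU memory
    for var in ["model", "processor", "trainer", "peft_model", "bnb_config"]:
        if var in globals():
            del globals()[var]
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    gc.collect()
    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")

clear_memory()
```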
4. Fine-Tune the Model using TRL
4.1 Load the Quantized Model for Training ⚙️
Next, we’ll load the quantized model using bitsandbytes. If you want to learn more about quantization, check out this blog post or this one.
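A sketch of the 4-bit loading step with a standard NF4 bitsandbytes configuration:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"

# 4-bit NF4 quantization of the frozen base weights (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
processor = AutoProcessor.from_pretrained(model_id)
```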
4.2 Set Up QLoRA and SFTConfig 🚀
Next, we will configure QLoRA for our training setup. QLoRA enables efficient fine-tuning of large language models while significantly reducing the memory footprint compared to traditional methods. Whereas standard LoRA reduces memory usage by training small low-rank adapters on top of the frozen model, QLoRA goes a step further by also quantizing the frozen base model weights to 4-bit precision. This leads to even lower memory requirements and improved training efficiency, making it an excellent choice for optimizing our model's performance without sacrificing quality.
We will use Supervised Fine-Tuning (SFT) to refine our model’s performance on the task at hand. To do this, we'll define the training arguments using the SFTConfig class from the TRL library. SFT allows us to provide labeled data, helping the model learn to generate more accurate responses based on the input it receives. This approach ensures that the model is tailored to our specific use case, leading to better performance in understanding and responding to visual queries.
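A sketch of both configurations; the rank, target modules, and training hyperparameters below are illustrative defaults rather than the exact values used to produce the outputs in this recipe.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA adapter configuration (rank and target modules are tunable)
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Supervised fine-tuning arguments
training_args = SFTConfig(
    output_dir="qwen2-7b-instruct-trl-sft-ChartQA",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    logging_steps=10,
    save_strategy="epoch",
    push_to_hub=True,
    report_to="trackio",  # assumes a transformers version with the trackio integration
)
```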
4.3 Training the Model 🏃
We will log our training progress using trackio. Let’s initialize a trackio run from our notebook to capture essential information during training.
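A minimal initialization sketch; the project name mirrors the output below, and syncing to a Hub Dataset plus a Space dashboard is handled by trackio itself.

```python
import trackio

# Start a trackio run; metrics are synced to a Hugging Face Dataset and served in a Space dashboard
trackio.init(project="qwen2-7b-instruct-trl-sft-ChartQA")
```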
* Trackio project initialized: qwen2-7b-instruct-trl-sft-ChartQA
* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA-trackio-dataset
* Creating new space: https://huggingface.co/spaces/sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA-trackio
* View dashboard by going to: https://huggingface.co/spaces/sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA-trackio
Now, we will define the SFTTrainer, which is a wrapper around the transformers.Trainer class and inherits its attributes and methods. This class simplifies the fine-tuning process by properly initializing the PeftModel when a PeftConfig object is provided. By using SFTTrainer, we can efficiently manage the training workflow and ensure a smooth fine-tuning experience for our Vision Language Model. During inference we defined our own generate_text_from_sample function, which applied the necessary preprocessing before passing the inputs to the model. Here, the SFTTrainer automatically infers that the model is a vision-language model and applies a DataCollatorForVisionLanguageModeling, which converts the inputs to the appropriate format.
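A sketch of the trainer setup, assuming the variables defined in the previous cells:

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    processing_class=processor,
)
```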
Time to Train the Model! 🎉
Let's save the results 💾
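Both the training call and the save/push step, sketched together:

```python
# Train, then save the LoRA adapter locally and push it to the Hub
trainer.train()
trainer.save_model(training_args.output_dir)
trainer.push_to_hub()
```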
5. Testing the Fine-Tuned Model 🔍
Now that we've successfully fine-tuned our Vision Language Model (VLM), it's time to evaluate its performance! In this section, we will test the model using examples from the ChartQA dataset to see how well it answers questions based on chart images. Let's dive in and explore the results! 🚀
Let's clean up the GPU memory to ensure optimal performance 🧹
We will reload the base model using the same pipeline as before.
We will attach the trained adapter to the pretrained model. This adapter contains the fine-tuning adjustments we made during training, allowing the base model to leverage the new knowledge without altering its core parameters. By integrating the adapter, we can enhance the model's capabilities while maintaining its original structure.
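A sketch of attaching the adapter via the PEFT integration in transformers; the repository id below matches the training output_dir pushed to the Hub and should be replaced with your own.

```python
# Load the fine-tuned LoRA adapter on top of the reloaded base model
adapter_path = "sergiopaniego/qwen2-7b-instruct-trl-sft-ChartQA"
model.load_adapter(adapter_path)
```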
We will utilize the previous sample from the dataset that the model initially struggled to answer correctly.
[{'role': 'system',
  'content': [{'type': 'text',
    'text': 'You are a Vision Language Model specialized in interpreting visual data from chart images.\nYour task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.\nThe charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]},
 {'role': 'user',
  'content': [{'type': 'image',
    'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=422x359>},
   {'type': 'text', 'text': 'Is the value of Favorable 38 in 2015?'}]}]

'No, the value of Favorable is not 38 in 2015. According to the chart, the value of Favorable in 2015 is 38.'
Since this sample is drawn from the training set, the model has encountered it during training, which may be seen as a form of cheating. To gain a more comprehensive understanding of the model's performance, we will also evaluate it using an unseen sample.
[{'role': 'system',
  'content': [{'type': 'text',
    'text': 'You are a Vision Language Model specialized in interpreting visual data from chart images.\nYour task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.\nThe charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]},
 {'role': 'user',
  'content': [{'type': 'image',
    'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=850x600>},
   {'type': 'text', 'text': 'What is the value of Slovenia in the graph?'}]}]

'The value of Slovenia in the graph is 1.0.'
The model has successfully learned to respond to the queries as specified in the dataset. We've achieved our goal! 🎉✨
6. Compare Fine-Tuned Model vs. Base Model + Prompting 📊
We have explored how fine-tuning the VLM can be a valuable option for adapting it to our specific needs. Another approach to consider is directly using prompting or implementing a RAG system, which is covered in another recipe.
Fine-tuning a VLM requires significant amounts of data and computational resources, which can incur costs. In contrast, we can experiment with prompting to see if we can achieve similar results without the overhead of fine-tuning.
Let's again clean up the GPU memory to ensure optimal performance 🧹
GPU allocated memory: 0.02 GB
GPU reserved memory: 0.27 GB
🏗️ First, we will load the baseline model following the same pipeline as before.
📜 In this case, we will again use the previous sample, but this time we will include the system message as follows. This addition helps to contextualize the input for the model, potentially improving its response accuracy.
[{'role': 'system',
  'content': [{'type': 'text',
    'text': 'You are a Vision Language Model specialized in interpreting visual data from chart images.\nYour task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.\nThe charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.\nFocus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary.'}]},
 {'role': 'user',
  'content': [{'type': 'image',
    'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=422x359>},
   {'type': 'text', 'text': 'Is the value of Favorable 38 in 2015?'}]}]

Let's see how it performs!
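A sketch of the prompted run; this time the chat template is applied to the full conversation (system plus user turn) instead of skipping the system message, and the sample index is a stand-in for the one shown above.

```python
from qwen_vl_utils import process_vision_info

sample = train_dataset[0]  # stand-in index; pick the sample inspected earlier

# Keep both the system and the user turns in the prompt this time
text_input = processor.apply_chat_template(
    sample["messages"][:2], tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(sample["messages"])
model_inputs = processor(
    text=[text_input], images=image_inputs, return_tensors="pt"
).to("cuda")

generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
trimmed_ids = [out[len(inp):] for inp, out in zip(model_inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed_ids, skip_special_tokens=True)[0])
```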
'Yes'
💡 As we can see, the model generates the correct answer using the pretrained model along with the additional system message, without any training. This approach may serve as a viable alternative to fine-tuning, depending on the specific use case.
7. Continuing the Learning Journey 🧑‍🎓
To further enhance your understanding and skills in working with multimodal models, check out the following resources:
- Multimodal Retrieval-Augmented Generation (RAG) Recipe
- Phil Schmid's tutorial
- Merve Noyan's smol-vision repository
- Quantize Your Qwen2-VL Model with AutoAWQ
- Preference Optimization for Vision Language Models with TRL
- Hugging Face Llama Recipes: SFT for VLM
- Hugging Face Llama Recipes: PEFT Fine-Tuning
- Hugging Face Blog: IDEFICS2
These resources will help you deepen your knowledge and skills in multimodal learning.