Unsloth Gemma3 (4B) Vision GRPO

Gemma3 (4B) Vision GRPO

unsloth-notebooksunslothoriginal_template

alph-notebooks/unsloth-notebooks / Gemma3_(4B)-Vision-GRPO.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

News

Placeholder

Installation

[ ]

Unsloth

[3]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 09-16 13:32:44 [__init__.py:241] Automatically detected platform cuda.
ERROR 09-16 13:32:46 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.5: Fast Gemma3 patching. Transformers: 4.55.4. vLLM: 0.10.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.

model.safetensors:   0%|          | 0.00/4.56G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/210 [00:00<?, ?B/s]

processor_config.json:   0%|          | 0.00/70.0 [00:00<?, ?B/s]

chat_template.json: 0.00B [00:00, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

preprocessor_config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

We now add LoRA adapters for parameter efficient fine-tuning, allowing us to train only 1% of all model parameters efficiently.

[NEW] We also support fine-tuning only the vision component, only the language component, or both. Additionally, you can choose to fine-tune the attention modules, the MLP layers, or both!

[4]

Unsloth: Making `model.base_model.model.model.language_model` require gradients

Data Prep

AI4Math/MathVista is a dataset that involves using images to solve logic and math problems, for this notebook, it will only be math problems with numeric answers for simplicity.

[5]

README.md: 0.00B [00:00, ?B/s]

data/testmini-00000-of-00001-725687bf7a1(…):   0%|          | 0.00/142M [00:00<?, ?B/s]

data/test-00000-of-00002-6b81bd7f7e2065e(…):   0%|          | 0.00/358M [00:00<?, ?B/s]

data/test-00001-of-00002-6a611c71596db30(…):   0%|          | 0.00/386M [00:00<?, ?B/s]

Generating testmini split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5141 [00:00<?, ? examples/s]

Let us see what our data looks like

[6]

[7]

('As shown in the figure, if CB = 4.0, DB = 7.0, and D is the midpoint of AC, then the length of AC is ()',
, '6cm')

The image of our data is a part of the math problem.

To make the rewarding easy later on, let us filter only the numeric answer so that we can create reward function that gives score if reward is float or not.

[8]

Filter:   0%|          | 0/1000 [00:00<?, ? examples/s]

We also resize the images to be 512 by 512 pixels to make the images managable in context length. We also convert them to RGB so they are compatible with TRL's trainer!

[9]

Map:   0%|          | 0/566 [00:00<?, ? examples/s]

Map:   0%|          | 0/566 [00:00<?, ? examples/s]

This is the conversational template that is needed to collate the dataset

[10]

Map:   0%|          | 0/566 [00:00<?, ? examples/s]

Applying Chat Template across the entire dataset

[11]

Map:   0%|          | 0/566 [00:00<?, ? examples/s]

We use a basic formatting functions to see if reasoning starts and ends as well as if the answers were written correctly.

[12]

Here is the first example prompt in the dataset

[13]

"<bos><start_of_turn>user\n<start_of_image>When a spring does work on an object, we cannot find the work by simply multiplying the spring force by the object's displacement. The reason is that there is no one value for the force-it changes. However, we can split the displacement up into an infinite number of tiny parts and then approximate the force in each as being constant. Integration sums the work done in all those parts. Here we use the generic result of the integration.\r\n\r\nIn Figure, a cumin canister of mass $m=0.40 \\mathrm{~kg}$ slides across a horizontal frictionless counter with speed $v=0.50 \\mathrm{~m} / \\mathrm{s}$. It then runs into and compresses a spring of spring constant $k=750 \\mathrm{~N} / \\mathrm{m}$. When the canister is momentarily stopped by the spring, by what distance $d$ is the spring compressed?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>\n"

Train the model

Now set up GRPO Trainer and all configurations!

[14]

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step	reward	reward_std	completion_length	kl
1	0.125000	0.000000	200.000000	0.000000
2	0.072375	0.248112	200.000000	0.000000
3	-0.079000	0.163776	182.500000	0.000005

[15]

Unsloth: Switching to float32 training since model cannot work with float16

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 566 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8
 "-____-"     Trainable parameters = 29,802,496 of 4,329,881,968 (0.69% trained)
`generation_config` default values have been modified to match model-specific defaults: {'top_p': 0.95}. If this is not desired, please set these values explicitly.

-------------------- Question:
<bos><start_of_turn>user
<start_of_image>What is the highest value on the X axis?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
30 
Response:

**Question:** What is the highest value on the X axis?

<REASONING>
The X axis represents the concentration of a substance, measured in micrograms per mL (µg/mL). The scale of the X axis ranges from 0 to 30. The highest value displayed on the X axis is 30. 
</REASONING>

<SOLUTION>
30
(float: [image])
Unsloth: Will smartly offload gradients to save VRAM!

-------------------- Question:
<bos><start_of_turn>user
<start_of_image>Subtract all yellow metallic balls. Subtract all small yellow shiny things. How many objects are left?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
8 
Response:

**Objects in the image:**

*   1 large blue cube
*   1 small blue cube
*   1 large silver ball
*   1 small silver ball
*   1 teal cube
*   1 purple cube
*   1 yellow cube
*   1 purple metallic ball
*   1 yellow metallic ball

**Step 1: Subtract all yellow metallic balls.**

*   There are 2 yellow metallic balls.
*   Subtracting them: 2 - 2 = 0

**Step 2: Subtract all small yellow shiny things.**

*   There is 1 yellow cube.
*   Subtracting it: 1 - 1 = 0

**Step 3: Total objects remaining.**

*   Start with the original number of objects (10).
*   Subtract the removed objects (0 + 0 = 0).
*   Remaining objects: 10 - 0 = 10

<REASONING>
We first identified the yellow metallic balls and removed them. Then we identified and removed all small yellow shiny things. Subtracting each of these categories from the initial count gives us the number of objects remaining.
</REASONING>
<SOLUTION>
10.0
</SOLUTION
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>What is the highest accuracy reported in the whole chart?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
8 
Response:

**Here's how to approach the problem:**

1.  **Observe the chart:** Examine each dataset and note its associated accuracy value.
2.  **Identify the maximum:** Determine the highest accuracy value across all datasets.
3.  **Consider the datasets:** Note the dataset that contributed to the highest accuracy.

<REASONING>
The chart shows the accuracy of different algorithms on various datasets. The highest accuracy is reported for the 'bow' dataset, with an accuracy value of approximately 9.2. This is significantly higher than the other datasets displayed.
</REASONING>

<SOLUTION>
9.2 </SOLUTION>
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>What is the value of the largest bar?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
10000 
Response:

**Image Description:**

The image shows a bar chart with the following categories: "picnic," "fair," "draw," and "candle."  The y-axis represents the "Values" in a logarithmic scale (powers of 10), while the x-axis represents the categories. The heights of the bars are approximately:

*   "picnic": 100,000
*   "fair": 10,000,000
*   "draw": 10,000
*   "candle": 100,000

<REASONING>
The bar chart displays the values for four different activities. To determine the largest bar, we need to compare the heights of each bar. The "fair" bar is significantly taller than the other three, indicating a much larger value. The "picnic" bar is smaller than "fair" and "draw," but larger than "candle." The "draw" and "candle" bars are the smallest.
</REASONING>

<SOLUTION>
10,000,000.0
</SOLUTION>
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>What is the age gap between these two people in image?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
16 
Response:* **Barack Obama** was approximately 52 years old when this photo was taken in 2012.
* **Aung San Suu Kyi** was approximately 66 years old when this photo was taken in 2012.

<REASONING>
Barack Obama was born in 1961, and Aung San Suu Kyi was born in 1945.
The age difference can be calculated as follows:
2012 - 1961 = 51
2012 - 1945 = 67
The difference between their ages is 67 - 51 = 16 years.
</REASONING>
<SOLUTION>
16 years (put a float here)
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>What is the age gap between these two people in image?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
12 
Response:

**Image Description:** The image shows two footballers, one wearing a white shirt with "Air Asia" logo and the other wearing a blue shirt with the "Chelsea" logo. The footballer in the white shirt appears older than the one in the blue shirt.

**Players:**
*   **White Shirt:** Jonathan Walters (Swansea City)
*   **Blue Shirt:** Asmir Begović (Chelsea)

**Age Difference:** Approximately 8 years.

<REASONING>
To determine the age difference, we need to identify the birthdates of both players. As of November 21, 2023, Jonathan Walters is 37 years old (born on May 18, 1986), and Asmir Begović is 39 years old (born on November 29, 1983). The age difference is 39 - 37 = 2 years.
</REASONING>

<SOLUTION>
2 years (float: 2) </SOLUTION
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>For Group C, in which week is the cumulative increase in weight , the highest?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
3 
Response:

**Question:**

For Group C, in which week is the cumulative increase in weight, the highest? Provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION>

**Image:**

The image shows a line graph with cumulative weight increase over 6 weeks for three groups: Group A (42% powder), Group B (21% powder), and Group C (control).

**Analysis:**

The graph shows the cumulative increase in weight in grams for each group over 6 weeks. We need to identify the week with the highest cumulative weight increase for Group C.

Looking at the graph, the cumulative increase for Group C is consistently around 200 grams throughout the 6 weeks, with a slight fluctuation. The highest point for Group C is around week 3.

**<REASONING>**
The graph shows that Group C's cumulative weight increase remains consistently around 200 grams throughout the 6 weeks. There is no week where it exceeds this value. The graph indicates that the increase is relatively stable, not dramatically increasing or decreasing at any specific week.

**<SOLUTION>**
Week 3 (float: [image of week 3 of the graph])
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>An author recorded how many words she wrote in the past 3 days. How many words in total did the author write on Thursday and Friday? (Unit: words), provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
679 
Response:*Assume that the number of words written each day are integers.*

<REASONING>
The author wrote 374 words on Thursday and 305 words on Friday. To find the total number of words written on Thursday and Friday, we need to add the number of words written on each day.
Total words = Words written on Thursday + Words written on Friday
Total words = 374 + 305
Total words = 679
</REASONING>
<SOLUTION>
679.0
</SOLUTION>
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>How many items sold more than 3 units in at least one store?, provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
7 
Response:

**Analysis:**

The graph shows sales statistics for various items across different stores (Wrist and Cult). We want to find the number of items that have sold more than 3 units in *at least one* store.

Let's examine each item individually:

*   **deputy:** Wrist sold 6 units, Cult sold 6 units. Both stores sold more than 3 units.
*   **brave:** Wrist sold 7 units, Cult sold 7 units. Both stores sold more than 3 units.
*   **violin:** Wrist sold 9 units, Cult sold 9 units. Both stores sold more than 3 units.
*   **arch:** Wrist sold 7 units, Cult sold 7 units. Both stores sold more than 3 units.
*   **breeze:** Wrist sold 4 units, Cult sold 4 units. Both stores sold more than 3 units.
*   **above:** Wrist sold 9 units, Cult sold 9 units. Both stores sold more than 3 units.
*   **boom:** Wrist sold 3 units, Cult sold 3 units. Only the Cult store sold more than 3 units.

Therefore, all the items (deputy, brave, violin, arch, breeze, above, boom) have sold more than 3 units in at least one store.

<REASONING>
We need to find the total number of items that have sold more than 3 units in at least one store. We can examine each item in the graph and see which stores have sold more than 3 units. An item sold more than 3 units in at least one store if any of its stores have sold more than 3 units.
</REASONING>

<SOLUTION>
7.0
</SOLUTION>
-------------------- Question:
<bos><start_of_turn>user
<start_of_image>As shown in the figure, E is any point in ▱ABCD, if S~quadrilateral ABCD~ = 6.0, then the area of the shaded part in the figure is (), provide your reasoning between <REASONING> and </REASONING> and then your final answer between <SOLUTION> and (put a float here) </SOLUTION><end_of_turn>
 
Answer:
3 
Response:S~quadrilateral ABCD~ = 6.0 is the area of quadrilateral ABCD.

**REASONING**

The shaded area is the sum of the areas of triangles ABE and CDE. Let the height of the two triangles be $h$.
Let the base of triangle ABE be $b_1$ and the base of triangle CDE be $b_2$.
The area of triangle ABE is $\frac{1}{2} b_1 h$.
The area of triangle CDE is $\frac{1}{2} b_2 h$.
The sum of the areas of these two triangles is $\frac{1}{2} b_1 h + \frac{1}{2} b_2 h = \frac{1}{2} (b_1 + b_2) h$.

Since AB is parallel to CD, the triangles ABE and CDE are similar.
We have $\frac{b_1}{b_2} = \frac{AE}{CE} = \frac{BE}{DE}$.
However, we don't know the lengths of $b_1, b_2, h$. We know that the area of the quadrilateral ABCD is 6.0.
The area of the quadrilateral ABCD is the sum of the areas of triangles ABC and ADC.
Area(ABCD) = Area(ABE) + Area(BCE) + Area(CDE) + Area(DAE)
Also, Area(ABCD) = Area(ABC) + Area(ADC)
We can write Area(ABCD) = Area(ABE) + Area(CDE) + Area(BCE) + Area(DAE)
Since AB || CD, we have Area(ABE) = Area(CDE) and Area(BCE) = Area(DAE).

In the given figure, the shaded area is the sum of the area of triangle ABE and the area of triangle CDE.
Let h1 be the height of ABE from E to AB and h2 be the height of CDE from E to CD.
Then, the area of ABE is $\frac{1}{2} \times AB \times h1$ and the area of CDE is $\frac{1}{2} \times CD \times h2$.
Since AB and CD are parallel, the height of ABE and CDE from E to AB and CD respectively are the same, so $h1 = h2 = h$.
Then the area of ABE is $\frac{1}{2} \times AB \times h$ and the area of CDE is $\frac{1}{2} \times CD \times h$.
Area(shaded) = $\frac{1}{2} \times AB \times h + \frac{1}{2} \times CD \times h = \frac{1}{2} (AB + CD) h$.

The area of quadrilateral ABCD is the area of two triangles, ABC and ADC.
If ABCD is a parallelogram, then AB = CD. In that case, the area of the shaded region is $\frac{1}{2} (AB + CD) h = \frac{1}{2} (2AB) h = AB \times h$.
Since the area of the quadrilateral ABCD is $AB \times h$, we have $AB \times h = 6$.
We are given that $S = 6$.

Since AB || CD, the heights of triangles ABE and CDE are the same, which is the distance between AB and CD.
Let the distance between AB and CD be $h$.
Area(ABE) = $\frac{1}{2} \times AB \times h$
Area(CDE) = $\frac{1}{2} \times CD \times h$
Area(shaded) = Area(ABE) + Area(CDE) = $\frac{1}{2} \times AB \times h + \frac{1}{2} \times CD \times h = \frac{1}{2} (AB + CD) \times h$

Since ABCD is a parallelogram, AB = CD.
Area(shaded) = $\frac{1}{2} (AB + AB) \times h = \frac{1}{2} (2AB) \times h = AB \times h$
The area of parallelogram ABCD is $AB \times h$, which is given as 6.
So, Area(shaded) = 3.

<SOLUTION>
3.0

TrainOutput(global_step=10, training_loss=0.010091520100831985, metrics={'train_runtime': 1497.6475, 'train_samples_per_second': 0.053, 'train_steps_per_second': 0.007, 'total_flos': 0.0, 'train_loss': 0.010091520100831985})

Inference

Let's run the model! You can modify the instruction and input—just leave the output blank.

We'll use the best hyperparameters for inference on Gemma: top_p=0.95, top_k=64, and temperature=1.0.

[25]

Okay, let's analyze the image and measure the length of the nail to the nearest inch.

<REASONING>
The ruler is marked in centimeters (cm). We need to convert the measurement from centimeters to inches. We know that 1 inch is approximately equal to 2.54 centimeters.  We need to find how many inches the nail is long, looking at the ruler markings.  The nail appears to be positioned at the 5cm mark on the ruler.  So we will convert 5cm to inches.
</REASONING>

<SOLUTION>
First, convert 5 cm

Saving, loading finetuned models

To save the final model as LoRA adapters, use Hugging Face’s push_to_hub for online saving, or save_pretrained for local storage.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

[18]

['lora_model/processor_config.json']

Now if you want to load the LoRA adapters we just saved for inference, set False to True:

[27]

Here's an analysis of the volume of the measuring cup:

**<REASONING>**

The measuring cup is a clear plastic container with markings indicating volume measurements. The markings are in milliliters (ml). 

The markings appear to be in increments of 100ml, 200ml, 300ml, 400ml, 500ml, 600ml, 700ml, 800ml, 900ml and 1000ml.  The container is almost full, it is likely it can hold around

Saving to float16 for VLLM

We also support saving to float16 directly. Select merged_16bit for float16. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

[28]