Gemma3N (4B) Audio


Installation

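The installation cells were stripped from this export; a typical Colab setup for Unsloth looks like the following (the extra `--no-deps` pins are an assumption based on other Unsloth notebooks and may not be needed outside Colab):

```shell
pip install unsloth
```

On a fresh Colab runtime this pulls in compatible versions of transformers, trl, peft, and bitsandbytes automatically.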

Unsloth

FastModel supports loading nearly any model now! This includes Vision and Text models!
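The model-loading cell was stripped from this export; a minimal sketch of what it likely contained. The checkpoint name `unsloth/gemma-3n-E4B-it` is an assumption based on the notebook title, and the other values are illustrative defaults:

```python
from unsloth import FastModel

# Assumed checkpoint name; this notebook targets the 4B Gemma 3N model.
model, processor = FastModel.from_pretrained(
    model_name = "unsloth/gemma-3n-E4B-it",
    dtype = None,             # auto-detected; float16 on a Tesla T4 (no bfloat16)
    max_seq_length = 1024,
    load_in_4bit = True,      # 4-bit quantization to fit free-tier VRAM
    full_finetuning = False,  # we attach LoRA adapters later instead
)
```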

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.7.5: Fast Gemma3N patching. Transformers: 4.54.0.dev0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gemma3N does not support SDPA - switching to eager!
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Gemma 3N can process Text, Vision and Audio!

Let's first experience how Gemma 3N handles multimodal inputs. We use Gemma 3N's recommended sampling settings of temperature = 1.0, top_p = 0.95, and top_k = 64, but for this ASR example we use do_sample=False so the transcription is deterministic.
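A sketch of the stripped inference cell, assuming `model` and `processor` from the loading cell above; the audio filename is a hypothetical local file:

```python
# Build a multimodal chat message: one audio clip plus a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "audio_sample.wav"},  # hypothetical file
        {"type": "text",  "text": "Please transcribe this audio."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt = True,
    tokenize = True,
    return_dict = True,
    return_tensors = "pt",
).to(model.device)

# do_sample=False -> greedy decoding for deterministic ASR output.
outputs = model.generate(**inputs, max_new_tokens = 256, do_sample = False)
print(processor.batch_decode(outputs, skip_special_tokens = True)[0])
```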


Let's Evaluate Gemma 3N Baseline Performance on German Transcription

 Ich, ich rechne direkt mich an. Das ist natürlich klar, nur, dass, äh, es politische Interessen gibt im Handel, im Austausch mit Waren, dass es politische Einflüsse gibt. Die Frage ist, die Alternative soll es nicht sein.
Sie direkt mich an. Das finde ich klar, nur dass es politisch Interessen gibt im Handel, im Austausch mit Waren, dass es politische Einflüsse gibt. Die Frage ist, die Alternative soll es nicht sein.<end_of_turn>

Baseline model performance: 24.32% Word Error Rate (WER) on this sample!
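WER is the word-level edit distance between hypothesis and reference, divided by the reference length. The evaluation cell was stripped; a minimal pure-Python sketch (production code would more likely use the jiwer library):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] = edit distance between the ref prefix processed so far
    # and the first j hypothesis words (single-row dynamic programming).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev_diag, d[j] = d[j], min(
                d[j] + 1,                # deletion of the reference word
                d[j - 1] + 1,            # insertion of the hypothesis word
                prev_diag + (r != h),    # substitution (free if words match)
            )
    return d[-1] / len(ref)

print(round(wer("the cat sat", "the cat sits"), 2))  # 0.33
```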

Let's finetune Gemma 3N!

You can finetune the vision, text, and audio parts of the model.

We now add LoRA adapters so we only need to update a small amount of parameters!
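The adapter cell was stripped; a sketch using Unsloth's `get_peft_model`. The layer-selection flags follow Unsloth's multimodal notebooks, and all values are illustrative defaults, not the notebook's exact configuration:

```python
from unsloth import FastModel

model = FastModel.get_peft_model(
    model,
    finetune_vision_layers     = False,  # audio-only task here
    finetune_language_layers   = True,
    finetune_attention_modules = True,
    finetune_mlp_modules       = True,
    r = 16,             # LoRA rank: larger = more capacity, more VRAM
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    random_state = 3407,
)
```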

Unsloth: Making `model.base_model.model.model.language_model` require gradients

Data Prep

We adapt the kadirnar/Emilia-DE-B000000 dataset for our German ASR task using Gemma 3N's multimodal chat format. Each audio-text pair is structured into a conversation with system, user, and assistant roles. The processor then converts this into the final training format:

	<bos><start_of_turn>system
You are an assistant that transcribes speech accurately.<end_of_turn>
<start_of_turn>user
<audio>Please transcribe this audio.<end_of_turn>
<start_of_turn>model
Ich, ich rechne direkt mich an.<end_of_turn>
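The mapping cell was stripped; each dataset row can be wrapped in that chat structure before the processor renders it. A sketch, where the field names `audio` and `text` are assumptions about the dataset schema:

```python
SYSTEM_PROMPT = "You are an assistant that transcribes speech accurately."

def to_conversation(sample: dict) -> dict:
    """Wrap one audio-text pair in Gemma 3N's chat format."""
    return {
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user",
             "content": [{"type": "audio", "audio": sample["audio"]},
                         {"type": "text", "text": "Please transcribe this audio."}]},
            {"role": "assistant",
             "content": [{"type": "text", "text": sample["text"]}]},
        ]
    }

example = to_conversation({"audio": "clip_0.wav",
                           "text": "Ich, ich rechne direkt mich an."})
```

With Hugging Face datasets, this function would be applied via `dataset.map(to_conversation, num_proc=4)`, matching the `Map (num_proc=4)` progress bar below.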

Map (num_proc=4):   0%|          | 0/3000 [00:00<?, ? examples/s]

Train the model

Now let's train our model. We train for one full epoch (num_train_epochs=1) to get a meaningful result.
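The trainer-setup cell was stripped; a sketch using TRL's SFTTrainer, with hyperparameters chosen to match the training log below (batch size 2, gradient accumulation 1, 1 epoch over 3,000 examples = 1,500 steps). The `collator` is a hypothetical multimodal data collator from the data-prep step, and the remaining values are common Unsloth defaults, not confirmed by this export:

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    processing_class = processor.tokenizer,
    data_collator = collator,  # hypothetical multimodal collator
    args = SFTConfig(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 1,
        num_train_epochs = 1,        # one full pass over 3,000 samples
        learning_rate = 2e-4,
        warmup_ratio = 0.03,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)
```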


Let's train the model!

To resume a training run, set trainer.train(resume_from_checkpoint = True)
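Launching training is a single call, assuming the `trainer` constructed above:

```python
trainer_stats = trainer.train()

# To resume an interrupted run from the last saved checkpoint:
# trainer_stats = trainer.train(resume_from_checkpoint = True)
```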

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,000 | Num Epochs = 1 | Total steps = 1,500
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 24,076,288 of 7,874,054,480 (0.31% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
Unsloth: Will smartly offload gradients to save VRAM!

Inference

Let's run the model via Unsloth native inference! According to the Gemma-3 team, the recommended settings for inference are temperature = 1.0, top_p = 0.95, and top_k = 64, but for this ASR example we again use do_sample=False.

Du sprichst direkt mich an. Das finde ich klar, nur, dass, äh, dass politische Interessen gibt im Handel, im Austausch mit Waren, dass es politische Einflüsse gibt. Die Frage ist, die Alternative soll es nicht sein.<end_of_turn>

With only 3,000 German speech samples, we reduced the Word Error Rate (WER) from 24.32% to 16.22%. This is a significant 33.31% relative error rate reduction!
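The relative reduction follows directly from the two WER numbers:

```python
baseline_wer, finetuned_wer = 24.32, 16.22

# Relative reduction = improvement as a fraction of the baseline error.
relative_reduction = (baseline_wer - finetuned_wer) / baseline_wer * 100
print(f"{relative_reduction:.2f}%")  # 33.31%
```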

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
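The saving cell was stripped; a sketch (the folder name matches the one used later in this notebook, and the repo name and token are placeholders you must fill in):

```python
model.save_pretrained("gemma-3N-finetune")      # saves ONLY the LoRA adapters
processor.save_pretrained("gemma-3N-finetune")

# Online save (placeholders - fill in your repo and token):
# model.push_to_hub("your-username/gemma-3N-finetune", token = "...")
# processor.push_to_hub("your-username/gemma-3N-finetune", token = "...")
```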


Now if you want to load the LoRA adapters we just saved for inference, set False to True:
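A sketch of the stripped loading cell, following the notebook's `if False` convention; the folder name assumes the local save above:

```python
if False:  # set to True to load the adapters we just saved
    from unsloth import FastModel
    model, processor = FastModel.from_pretrained(
        model_name = "gemma-3N-finetune",  # the local adapter folder
        load_in_4bit = True,
    )
```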


Saving to float16 for VLLM

We also support saving to float16 directly for deployment! We save it in the folder gemma-3N-finetune. Set if False to if True to let it run!
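A sketch of the stripped cell, using Unsloth's `save_pretrained_merged`, which merges the LoRA weights into the base model before writing float16 weights:

```python
if False:  # set to True to run
    model.save_pretrained_merged("gemma-3N-finetune", processor)
```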


If you want to upload / push to your Hugging Face account, set if False to if True and add your Hugging Face token and upload location!
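A sketch of the stripped upload cell; the repo name and token are placeholders:

```python
if False:  # set to True and fill in your details
    model.push_to_hub_merged(
        "your-username/gemma-3N-finetune", processor,
        token = "...",  # your Hugging Face token
    )
```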


GGUF / llama.cpp Conversion

To save to GGUF / llama.cpp, we now support it natively for all models! For now, you can easily convert to Q8_0, F16, or BF16 precision. Q4_K_M 4-bit quantization will come later!
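A sketch of the stripped conversion cell; the `quantization_type` values mirror the precisions listed above, and the exact parameter name may differ across Unsloth versions:

```python
if False:  # set to True to run
    model.save_pretrained_gguf(
        "gemma-3N-finetune",
        quantization_type = "Q8_0",  # "F16" and "BF16" are also supported
    )
```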


Likewise, if you want to instead push to GGUF to your Hugging Face account, set if False to if True and add your Hugging Face token and upload location!
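A sketch of the stripped upload cell; the repo name and token are placeholders, and the parameter names mirror the local GGUF save above, so check them against your installed Unsloth version:

```python
if False:  # set to True and fill in your details
    model.push_to_hub_gguf(
        "your-username/gemma-3N-finetune",
        quantization_type = "Q8_0",
        token = "...",  # your Hugging Face token
    )
```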


Now, use the gemma-3N-finetune.gguf file or gemma-3N-finetune-Q4_K_M.gguf file in llama.cpp.

And we're done! If you have any questions about Unsloth, want to report bugs, keep up with the latest LLM news, or need help with your projects, feel free to join our Discord channel!

Some other resources:

  1. Train your own reasoning model - Llama GRPO notebook Free Colab
  2. Saving finetunes to Ollama. Free notebook
  3. Llama 3.2 Vision finetuning - Radiography use case. Free Colab
  4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!

Join Discord if you need help + ⭐️ Star us on Github ⭐️