Kaggle Qwen3 (4B) Instruct QAT

To run this notebook, press "Runtime" and then "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to do data prep, how to train, how to run the model, and how to save it.

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. Blog

New in Reinforcement Learning: FP8 RL, Vision RL, Standby, gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]
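The install cell's code isn't captured in this export; a minimal sketch, assuming the standard pip package (the usual Colab pattern, not copied from the notebook):

```python
%%capture
# Install Unsloth; on Colab the prebuilt wheel pulls in the needed dependencies.
!pip install unsloth
```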


[2]
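The loading cell itself isn't shown in this export; a minimal sketch, assuming Unsloth's `FastLanguageModel` API (the exact checkpoint name is an assumption):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",  # assumed checkpoint name
    max_seq_length = 2048,   # context length for this finetune
    load_in_4bit = False,    # QAT fake-quantizes during training, so we load full-precision weights
)
```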
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
WARNING:torchao:Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu126 for torchao version 0.14.0         Please see GitHub issue #2919 for more info
TMA benchmarks will be running without grid constant TMA descriptor.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.8: Fast Qwen3 patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors.index.json: 0.00B [00:00, ?B/s]
model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]
model-00002-of-00002.safetensors:   0%|          | 0.00/3.08G [00:00<?, ?B/s]
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
generation_config.json:   0%|          | 0.00/237 [00:00<?, ?B/s]
tokenizer_config.json: 0.00B [00:00, ?B/s]
vocab.json: 0.00B [00:00, ?B/s]
merges.txt: 0.00B [00:00, ?B/s]
added_tokens.json:   0%|          | 0.00/707 [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]
chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters so we only need to update a small number of parameters!

[3]
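The cell body isn't captured here; a minimal sketch of the usual `get_peft_model` call. A rank of r = 16 reproduces the ~33M trainable parameters reported in the log below; the `qat_scheme` argument and its value are assumptions, so check the current Unsloth docs for the real QAT flag:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank; 16 matches the 33,030,144 trainable parameters logged later
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",  # offloads activations to save VRAM
    random_state = 3407,
    qat_scheme = "fp8-int4",  # hypothetical flag enabling QAT; name and value are assumptions
)
```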
Unsloth 2025.10.8 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Unsloth: Applying QAT to mitigate quantization degradation

Let's check if QAT is applied!

[4]
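The check itself isn't shown; a minimal sketch, assuming torchao's `FakeQuantizedLinear` module type (the import path may vary across torchao versions):

```python
from torchao.quantization.qat.linear import FakeQuantizedLinear  # path may differ by torchao version

if any(isinstance(m, FakeQuantizedLinear) for m in model.modules()):
    print("QAT is applied!")
else:
    print("QAT is NOT applied!")
```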
QAT is applied!

Data Prep

We now use the Qwen-3 format for conversation-style finetunes. We use Maxime Labonne's FineTome-100k dataset in ShareGPT style. Qwen-3 renders multi-turn conversations like below:

	<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hey there!<|im_end|>


We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3 and more.

[5]
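A minimal sketch of this cell; the exact template name registered in Unsloth for Qwen-3 instruct is an assumption:

```python
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "qwen3-instruct",  # assumed template name for Qwen-3 instruct
)
```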
[6]
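This cell loads the dataset; a sketch using FineTome-100k's public Hugging Face ID:

```python
from datasets import load_dataset

dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
```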
README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]
data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]
Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We now use standardize_data_formats to convert the dataset to the correct format for finetuning!

[7]
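A sketch of the call, assuming the helper lives in `unsloth.chat_templates`:

```python
from unsloth.chat_templates import standardize_data_formats

dataset = standardize_data_formats(dataset)
```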
Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/100000 [00:00<?, ? examples/s]

Let's see what row 100 looks like!

[8]
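Presumably just an index into the dataset:

```python
dataset[100]
```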
{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?',
   'role': 'user'},
  {'content': 'In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.',
   'role': 'assistant'}],
 'source': 'infini-instruct-top-500k',
 'score': 4.774171352386475}

We now apply the Qwen-3 chat template to the conversations and save the result to a "text" column.

[9]
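A minimal sketch of the mapping step, assuming the conversations live under the "conversations" key (as row 100 above shows):

```python
def formatting_prompts_func(examples):
    # Render each conversation with the Qwen-3 chat template into one training string.
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False)
             for convo in convos]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched = True)
```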
Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

Let's see how the chat template did!

[10]
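Presumably just printing the rendered text column:

```python
dataset[100]["text"]
```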
'<|im_start|>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<|im_end|>\n<|im_start|>assistant\nIn programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|im_end|>\n'

Train the model

Now let's train our model. We do 30 steps to speed things up, but you can set num_train_epochs = 1 for a full run and set max_steps = None to turn off the step cap.

[11]
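The trainer setup isn't captured; a sketch with TRL's `SFTTrainer` as Unsloth patches it, using the batch size 1, gradient accumulation 4, and 30 total steps reported in the training log below (the remaining hyperparameters are typical defaults, not copied from the notebook):

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 1,   # matches the training log below
        gradient_accumulation_steps = 4,   # effective batch size of 4
        warmup_steps = 5,
        max_steps = 30,                    # short demo run; use num_train_epochs = 1 for a full pass
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)
```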
Unsloth: Tokenizing ["text"] (num_proc=6):   0%|          | 0/100000 [00:00<?, ? examples/s]

We also use Unsloth's train_on_responses_only method to train only on the assistant outputs and ignore the loss on the user's inputs. This helps increase the accuracy of finetunes!

[12]
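A sketch of the call, with the Qwen-3 chat markers as the instruction and response delimiters:

```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|im_start|>user\n",
    response_part = "<|im_start|>assistant\n",
)
```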
Map (num_proc=6):   0%|          | 0/100000 [00:00<?, ? examples/s]

Let's verify that the instruction part is masked out by printing the 100th row again.

[13]
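Decoding the tokenized row back to text shows the full conversation:

```python
tokenizer.decode(trainer.train_dataset[100]["input_ids"])
```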
'<|im_start|>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<|im_end|>\n<|im_start|>assistant\nIn programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|im_end|>\n'

Now let's print the masked-out example; you should see that only the answer is present:

[14]
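A sketch that swaps the masked label positions (value -100, ignored by the loss) for the pad token and then blanks them out, so only the trained-on tokens remain visible:

```python
tokenizer.decode(
    [tokenizer.pad_token_id if t == -100 else t for t in trainer.train_dataset[100]["labels"]]
).replace(tokenizer.pad_token, " ")
```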
'                              In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|im_end|>\n'
[15]
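A sketch of the usual pre-training memory snapshot:

```python
import torch

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
```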
GPU = Tesla T4. Max memory = 14.741 GB.
8.547 GB of memory reserved.

Let's train the model! To resume a training run, set trainer.train(resume_from_checkpoint = True)

[16]
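The training call itself, keeping the returned stats for the timing and memory summary below:

```python
trainer_stats = trainer.train()
```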
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
Unsloth: Will smartly offload gradients to save VRAM!
[17]
392.6031 seconds used for training.
6.54 minutes used for training.
Peak reserved memory = 14.135 GB.
Peak reserved memory for training = 5.588 GB.
Peak reserved memory % of max memory = 95.889 %.
Peak reserved memory for training % of max memory = 37.908 %.

Now that training is complete, let's convert the FakeQuantizedLinear layers back to standard nn.Linear layers. This removes the fake quantization overhead and prepares the model for its final conversion step or for merging LoRA adapters.

[18]
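The conversion cell isn't captured; a manual sketch of the idea, recursively swapping each `FakeQuantizedLinear` for a plain `nn.Linear` that reuses the same parameters (the import path, and whether Unsloth ships a dedicated helper for this, are assumptions):

```python
import torch.nn as nn
from torchao.quantization.qat.linear import FakeQuantizedLinear  # path may differ by torchao version

def revert_fake_quant(module):
    # Recursively replace FakeQuantizedLinear children with plain nn.Linear.
    for name, child in module.named_children():
        if isinstance(child, FakeQuantizedLinear):
            linear = nn.Linear(child.in_features, child.out_features,
                               bias = child.bias is not None,
                               device = child.weight.device, dtype = child.weight.dtype)
            linear.weight = child.weight  # reuse the trained parameters as-is
            if child.bias is not None:
                linear.bias = child.bias
            setattr(module, name, linear)
        else:
            revert_fake_quant(child)

revert_fake_quant(model)
```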

Inference

Let's run the model via Unsloth native inference! According to the Qwen-3 team, the recommended settings for instruct inference are temperature = 0.7, top_p = 0.8, top_k = 20.

For reasoning-style chat inference, use temperature = 0.6, top_p = 0.95, top_k = 20.

[19]
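A sketch of the generation cell with the instruct settings above; the Fibonacci prompt is inferred from the output, not copied from the notebook:

```python
from transformers import TextStreamer

messages = [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]  # assumed prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt = True, return_tensors = "pt",
).to(model.device)

_ = model.generate(
    input_ids = inputs,
    max_new_tokens = 256,
    do_sample = True,
    temperature = 0.7, top_p = 0.8, top_k = 20,  # recommended instruct settings
    streamer = TextStreamer(tokenizer, skip_prompt = True),
)
```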
The sequence you provided is the Fibonacci sequence, where each number is the sum of the two preceding numbers.

1 + 1 = 2
1 + 2 = 3
2 + 3 = 5
3 + 5 = 8
5 + 8 = 13

So, the next number in the sequence is 13.<|im_end|>

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

[20]
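The save cell, matching the file list in the output below:

```python
model.save_pretrained("lora_model")       # saves the LoRA adapters only
tokenizer.save_pretrained("lora_model")
```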
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/vocab.json',
 'lora_model/merges.txt',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now, if you want to load the LoRA adapters we just saved for inference, change False to True:

[21]
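A sketch of the guarded reload cell:

```python
if False:  # change to True to reload the saved adapters for inference
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",
        max_seq_length = 2048,
        load_in_4bit = False,
    )
```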

We can now save and quantize the final model using TorchAO, applying the same configuration used during QAT.

[22]
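The merge logs below point to a 16-bit merged save; a sketch using Unsloth's `save_pretrained_merged` (the TorchAO quantization applied on top of it is sketched in the next section):

```python
# Merge the LoRA adapters into the base weights and save in 16-bit.
model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
```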
Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Unsloth: Copying 2 files from cache to `model`: 100%|██████████| 2/2 [03:20<00:00, 100.23s/it]
Successfully copied all 2 files from cache to `model`
Checking cache directory for required files...
Cache check failed: tokenizer.model not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 15709.00it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [04:31<00:00, 135.70s/it]
Unsloth: Merge process complete. Saved to `/content/model`
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

TorchAO Exporting and Conversion

We also support exporting to TorchAO-quantized checkpoints with custom configs to allow inference in vLLM or other inference engines.

For a deeper dive into TorchAO configuration, you can refer to Hugging Face Transformers official documentation: https://huggingface.co/docs/transformers/main/quantization/torchao

[23]
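A hypothetical sketch of the export, re-loading the merged checkpoint through Transformers with a TorchAO config; the `Int4WeightOnlyConfig` recipe and group size are assumptions, so match them to whatever scheme was used for QAT:

```python
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

quant_config = TorchAoConfig(quant_type = Int4WeightOnlyConfig(group_size = 128))  # assumed recipe
quantized = AutoModelForCausalLM.from_pretrained(
    "model", device_map = "auto", quantization_config = quant_config,
)
# TorchAO tensor subclasses are not safetensors-serializable, so save with pickle serialization.
quantized.save_pretrained("model-torchao", safe_serialization = False)
tokenizer.save_pretrained("model-torchao")
```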

And we're done! If you have questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or just need help or want to join community projects, feel free to join our Discord!

Some other resources:

  1. Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, and AMD or Intel GPUs.
  2. Learn how to do Reinforcement Learning with our RL Guide and notebooks.
  3. Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
  4. Explore our LLM Tutorials Directory to find dedicated guides for each model.
  5. Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.

Join Discord if you need help + ⭐️ Star us on Github ⭐️

This notebook and all Unsloth notebooks are licensed LGPL-3.0