To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!
To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.
You will learn how to do data prep, how to train, how to run the model, & how to save it
News
Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog
You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog
Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog
3x faster LLM training with 30% less VRAM and 500K context. 3x faster • 500K Context
New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL
Visit our docs for all our model uploads and notebooks.
Installation
Unsloth
Goal: To convert unsloth/Qwen3-30B-A3B-Instruct-2507 into a reasoning model via GRPO by using OpenR1's Math dataset.
We first pre fine-tune the model so that GRPO does not have to spend steps learning the output format - this speeds GRPO up.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.2.1: Fast Qwen3_MoE patching. Transformers: 5.1.0.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.0. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
Unsloth: Detected MoE model with 128 experts - enabling LoRA on: ['mlp.experts.gate_up_proj', 'mlp.experts.down_proj']
/usr/local/lib/python3.12/dist-packages/peft/tuners/tuners_utils.py:212: UserWarning: Unsupported layer type '<class 'transformers.models.qwen3_moe.modeling_qwen3_moe.Qwen3MoeExperts'>' encountered, proceed at your own risk.
warnings.warn(f"Unsupported layer type '{type(module)}' encountered, proceed at your own risk.", UserWarning)
Unsloth: Making `model.base_model.model.model` require gradients
GRPO chat template
Since we're using a base model, we should set a chat template. You can make your own chat template as well!
- DeepSeek uses <think> and </think>, but this is not necessary - you can customize it however you like!
- A system_prompt is recommended to at least guide the model's responses.
'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION>'
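A minimal sketch of how the prompt above could be assembled from four marker strings. The variable names (reasoning_start, reasoning_end, solution_start, solution_end) are assumptions chosen for illustration; only the marker text and the resulting prompt are taken from the output above.

```python
# Marker strings that delimit the reasoning trace and the final answer.
# The variable names are illustrative; the tag text comes from the prompt above.
reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

# Build the system prompt from the markers so the tags are defined in one place.
system_prompt = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    f"Place it between {reasoning_start} and {reasoning_end}.\n"
    f"Then, provide your solution between {solution_start}{solution_end}"
)

print(repr(system_prompt))
```

Keeping the markers in variables means a different tag pair (for example <think> and </think>) can be swapped in without touching the rest of the pipeline.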
We create a simple chat template below. Notice add_generation_prompt includes prepending <start_working_out> to guide the model to start its reasoning process.
Let's see how our chat template behaves on an example:
"You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION><|im_end|>What is 1+1?<start_working_out>I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|im_end|>What is 2+2?<start_working_out>"
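The rendering rules can be read off the output above: the system prompt and every assistant turn are followed by <|im_end|>, user turns are emitted as-is, and the generation prompt appends <start_working_out>. The sketch below is a pure-Python stand-in for the notebook's chat template, not the template itself. render_chat is a hypothetical helper, and falling back to the default system prompt when no system message is given is an assumption.

```python
# Values copied from the outputs shown above.
REASONING_START = "<start_working_out>"
REASONING_END = "<end_working_out>"
SOLUTION_START = "<SOLUTION>"
SOLUTION_END = "</SOLUTION>"
EOS_TOKEN = "<|im_end|>"
SYSTEM_PROMPT = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    f"Place it between {REASONING_START} and {REASONING_END}.\n"
    f"Then, provide your solution between {SOLUTION_START}{SOLUTION_END}"
)


def render_chat(messages, add_generation_prompt=False):
    """Hypothetical stand-in for the chat template: returns the rendered string."""
    parts = []
    # Use the caller's system message if present, else the default prompt (assumption).
    if messages and messages[0]["role"] == "system":
        parts.append(messages[0]["content"] + EOS_TOKEN)
        rest = messages[1:]
    else:
        parts.append(SYSTEM_PROMPT + EOS_TOKEN)
        rest = messages
    for message in rest:
        if message["role"] == "user":
            parts.append(message["content"])
        elif message["role"] == "assistant":
            parts.append(message["content"] + EOS_TOKEN)
    # Prepending the reasoning marker nudges the model to start its working out.
    if add_generation_prompt:
        parts.append(REASONING_START)
    return "".join(parts)


example = [
    {"role": "user", "content": "What is 1+1?"},
    {
        "role": "assistant",
        "content": f"{REASONING_START}I think it's 2.{REASONING_END}"
                   f"{SOLUTION_START}2{SOLUTION_END}",
    },
    {"role": "user", "content": "What is 2+2?"},
]
rendered = render_chat(example, add_generation_prompt=True)
print(rendered)
```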
Pre fine-tuning for formatting
We now use a subset of NVIDIA's Open Math Reasoning dataset which was filtered to only include high quality DeepSeek R1 traces.
We'll keep only about 59 examples to first "prime" / pre fine-tune the model to understand our custom GRPO formatting.
We have to format the dataset to follow our GRPO style formatting:
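A hedged sketch of what this formatting step might look like for a single row. The column names problem, generated_solution and expected_answer match the dataset features printed later in this notebook; format_row is a hypothetical helper, and stripping DeepSeek-style <think> tags from the trace is an assumption about how the raw traces are stored.

```python
REASONING_START = "<start_working_out>"
REASONING_END = "<end_working_out>"
SOLUTION_START = "<SOLUTION>"
SOLUTION_END = "</SOLUTION>"
SYSTEM_PROMPT = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    f"Place it between {REASONING_START} and {REASONING_END}.\n"
    f"Then, provide your solution between {SOLUTION_START}{SOLUTION_END}"
)


def format_row(row):
    """Turn one dataset row into a system/user/assistant message list."""
    # Assumption: raw traces may carry <think> tags, which we replace with our own markers.
    thoughts = row["generated_solution"].replace("<think>", "").replace("</think>", "").strip()
    assistant = (
        REASONING_START + thoughts + REASONING_END
        + SOLUTION_START + row["expected_answer"] + SOLUTION_END
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": row["problem"]},
        {"role": "assistant", "content": assistant},
    ]


# Toy row for illustration only; it is not taken from the real dataset.
toy_row = {
    "problem": "What is 1+1?",
    "generated_solution": "<think>\nI think it's 2.\n</think>\n",
    "expected_answer": "2",
}
messages = format_row(toy_row)
print(messages[-1]["content"])
```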
Check to see if it worked:
"You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <start_working_out> and <end_working_out>.\nThen, provide your solution between <SOLUTION></SOLUTION><|im_end|>Given $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<start_working_out>Okay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. 
So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n \\[\n \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n \\[\n \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n \\]\n\n3. Square both sides to eliminate the square root on the left:\n \\[\n (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n \\]\n Simplifying both sides, we get:\n \\[\n x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n \\]\n\n4. Combine like terms on the right side:\n \\[\n x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n \\]\n Simplifying further:\n \\[\n x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n \\]\n\n5. Subtract \\(x^2\\) from both sides:\n \\[\n 165 = -3 + 14\\sqrt{x^2 - 52}\n \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n \\[\n 168 = 14\\sqrt{x^2 - 52}\n \\]\n\n7. Divide both sides by 14:\n \\[\n 12 = \\sqrt{x^2 - 52}\n \\]\n\n8. Square both sides again to eliminate the square root:\n \\[\n 12^2 = x^2 - 52\n \\]\n Simplifying:\n \\[\n 144 = x^2 - 52\n \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n \\[\n 196 = x^2\n \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n \\[\n x = \\sqrt{196} = 14\n \\]\n\n11. 
Verify the solution by substituting \\(x = 14\\) back into the original equation:\n \\[\n \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n \\]\n The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<end_working_out><SOLUTION>14</SOLUTION><|im_end|>"
Let's truncate the pre fine-tuning dataset to max_seq_length/2 since we don't want reasoning traces that are too long.
Note this might take 2 minutes!
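The truncation can be sketched as a length filter: count the tokens of each formatted example and keep only rows at or below half the maximum sequence length. Everything below is illustrative. count_tokens is a whitespace stand-in for the real tokenizer, truncate_dataset is a hypothetical helper, and the max_seq_length value in the toy call is made up; only the Messages and N column names come from this notebook's outputs.

```python
def count_tokens(messages):
    """Whitespace token count; a stand-in for tokenizing the rendered chat."""
    return sum(len(message["content"].split()) for message in messages)


def truncate_dataset(rows, max_seq_length):
    """Keep rows whose token count N is at most max_seq_length / 2."""
    limit = max_seq_length / 2
    kept = []
    for row in rows:
        n_tokens = count_tokens(row["Messages"])
        if n_tokens <= limit:
            # Store the count alongside the row, mirroring the N column.
            kept.append({**row, "N": n_tokens})
    return kept


# Toy rows for illustration: one short example and one overly long one.
rows = [
    {"problem": "short", "Messages": [{"role": "user", "content": "What is 1+1?"}]},
    {"problem": "long", "Messages": [{"role": "user", "content": "word " * 50}]},
]
kept = truncate_dataset(rows, max_seq_length=16)
print([(row["problem"], row["N"]) for row in kept])
```

With a real tokenizer the count step dominates the runtime, which is consistent with the note that this can take a couple of minutes.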
(59, 5)
We then tokenize the messages and convert them to a Hugging Face compatible dataset format:
Dataset({
    features: ['expected_answer', 'problem', 'generated_solution', 'Messages', 'N', 'text', '__index_level_0__'],
    num_rows: 59
})
Let's now pre fine-tune the model so it follows our custom GRPO formatting!
warmup_ratio is deprecated and will be removed in v5.2. Use `warmup_steps` instead.
WARNING:trl.trainer.sft_trainer:You are using a per_device_train_batch_size of 1 with padding-free training. Using a batch size of 1 anihilate the benefits of padding-free training. Please consider increasing the batch size to at least 2.
🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 59 | Num Epochs = 1 | Total steps = 50
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 1,285,029,888 of 31,817,152,512 (4.04% trained)
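The numbers in this banner can be sanity-checked with a little arithmetic. The sketch below recomputes the total batch size, the fraction of an epoch covered by 50 steps, and the trainable-parameter percentage. All inputs are copied from the banner; the variable names are illustrative.

```python
# Values copied from the training banner.
num_examples = 59
total_steps = 50
per_device_batch_size = 1
gradient_accumulation_steps = 1
data_parallel_gpus = 1
trainable_params = 1_285_029_888
total_params = 31_817_152_512

# Total batch size is the product of the three factors (1 x 1 x 1).
total_batch_size = per_device_batch_size * gradient_accumulation_steps * data_parallel_gpus

# 50 steps at batch size 1 see 50 of the 59 examples, so less than one full epoch.
epochs_covered = total_steps * total_batch_size / num_examples

# Share of the model's weights that the LoRA adapters actually train.
trainable_pct = 100 * trainable_params / total_params

print(f"total batch size = {total_batch_size}")
print(f"epochs covered   = {epochs_covered:.3f}")   # 0.847
print(f"trainable        = {trainable_pct:.2f}%")   # 4.04%
```

So although the banner says Num Epochs = 1, the 50-step cap stops training at roughly 0.85 of an epoch, which agrees with the epoch value in the trainer's TrainOutput.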
TrainOutput(global_step=50, training_loss=0.37110782623291017, metrics={'train_runtime': 208.3978, 'train_samples_per_second': 0.24, 'train_steps_per_second': 0.24, 'total_flos': 8509830258413568.0, 'train_loss': 0.37110782623291017, 'epoch': 0.847457627118644})
Let's check if the model has learnt to follow the custom format:
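Besides reading the generation by eye, the format can be checked programmatically. The sketch below is one possible regex, not necessarily what the notebook uses: it looks for the closing reasoning marker followed by a <SOLUTION> block at the end of the completion, optionally followed by the <|im_end|> token. extract_solution is a hypothetical helper. The opening <start_working_out> marker is not required because the generation prompt already supplies it.

```python
import re

REASONING_END = "<end_working_out>"
SOLUTION_START = "<SOLUTION>"
SOLUTION_END = "</SOLUTION>"
EOS_TOKEN = "<|im_end|>"

# Reasoning end marker, anything in between, then a solution block that closes
# the completion (trailing whitespace and one EOS token are tolerated).
match_format = re.compile(
    re.escape(REASONING_END) + r".*?"
    + re.escape(SOLUTION_START) + r"(.+?)" + re.escape(SOLUTION_END)
    + r"\s*(?:" + re.escape(EOS_TOKEN) + r")?\s*$",
    flags=re.DOTALL,
)


def extract_solution(completion):
    """Return the text inside the solution tags, or None if the format is not followed."""
    match = match_format.search(completion)
    return match.group(1).strip() if match else None


good = "I think it's 2.<end_working_out><SOLUTION>2</SOLUTION><|im_end|>"
bad = "The answer is 2."
print(extract_solution(good), extract_solution(bad))
```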
Yes, it followed the formatting! Great! Let's remove some items before the GRPO step:
11925
Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.
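A hedged sketch of a local save wrapper around the options listed above. save_merged is a hypothetical helper. It assumes the model object exposes a save_pretrained_merged(directory, tokenizer, save_method=...) method as the local counterpart of push_to_hub_merged; please check the exact signature against the Unsloth docs before relying on it.

```python
# Save methods named in the text above.
ALLOWED_SAVE_METHODS = ("merged_16bit", "merged_4bit", "lora")


def save_merged(model, tokenizer, out_dir, save_method="merged_16bit"):
    """Validate the save method, then delegate to the model's own save call.

    Assumption: `model` provides save_pretrained_merged(out_dir, tokenizer,
    save_method=...). Verify this against the library documentation.
    """
    if save_method not in ALLOWED_SAVE_METHODS:
        raise ValueError(
            f"Unknown save_method {save_method!r}; expected one of {ALLOWED_SAVE_METHODS}"
        )
    model.save_pretrained_merged(out_dir, tokenizer, save_method=save_method)
    return out_dir
```

Calling save_merged(model, tokenizer, "model") would write float16 weights to a local folder; passing save_method="lora" would keep only the adapters.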
GGUF / llama.cpp Conversion
We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and save to q8_0 by default. Other methods such as q4_k_m are also supported. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our docs page):
- q8_0 - Fast conversion. High resource use, but generally acceptable.
- q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[NEW] To finetune and auto export to Ollama, try our Ollama notebook
Now, use the qwen_finetune.Q8_0.gguf file or qwen_finetune.Q4_K_M.gguf file in llama.cpp.
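A minimal sketch tying the export call to the file names above. gguf_filename and export_gguf are hypothetical helpers. The naming rule (model name, upper-cased quant method, .gguf suffix) is inferred from the two file names mentioned here, and the save_pretrained_gguf(directory, tokenizer, quantization_method=...) call shape is an assumption to verify against the Unsloth docs.

```python
def gguf_filename(model_name, quantization_method):
    """Expected output name, inferred from the examples (e.g. qwen_finetune.Q8_0.gguf)."""
    return f"{model_name}.{quantization_method.upper()}.gguf"


def export_gguf(model, tokenizer, model_name, quantization_methods=("q8_0",)):
    """Export one GGUF file per quantization method and return the expected names.

    Assumption: `model` provides save_pretrained_gguf(directory, tokenizer,
    quantization_method=...).
    """
    expected = []
    for method in quantization_methods:
        model.save_pretrained_gguf(model_name, tokenizer, quantization_method=method)
        expected.append(gguf_filename(model_name, method))
    return expected


print(gguf_filename("qwen_finetune", "q4_k_m"))
```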
And we're done! If you have any questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, feel free to join our Discord!
Some other resources:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!
This notebook and all Unsloth notebooks are licensed LGPL-3.0.



