gpt-oss (120B) A100 Fine-Tuning
To run this, press "Runtime" and press "Run all" on your A100 Google Colab Pro instance!
To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.
You will learn how to prepare data, train, run the model, and save it.
News
Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog
You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog
Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog
3x faster LLM training with 30% less VRAM and 500K context. 3x faster • 500K Context
New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL
Visit our docs for all our model uploads and notebooks.
Installation
Unsloth
We're about to demonstrate the power of the new OpenAI GPT-OSS 120B model through a finetuning example. To use our MXFP4 inference example, use this notebook instead.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.1: Fast Gpt_Oss patching. Transformers: 4.55.4.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gpt_Oss does not support SDPA - switching to fast eager.
We now add LoRA adapters for parameter-efficient finetuning. This lets us train only a small fraction of all parameters (about 0.01% in this run, according to the training log) instead of the full model.
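To see why LoRA trains so few parameters, it helps to count them. The sketch below assumes the standard LoRA formulation, in which a frozen weight matrix of shape (d_out, d_in) is augmented with two low-rank factors of shape (rank, d_in) and (d_out, rank). The layer dimensions and rank are hypothetical values chosen for illustration; they are not the gpt-oss shapes or the settings used in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) linear layer."""
    # Factor A has shape (rank, d_in); factor B has shape (d_out, rank).
    return rank * d_in + d_out * rank


# Hypothetical layer size and rank, for illustration only.
d_in, d_out, rank = 4096, 4096, 8

full = d_in * d_out                          # frozen base weights
added = lora_param_count(d_in, d_out, rank)  # trainable LoRA weights

print(f"frozen weights: {full:,}")
print(f"LoRA weights:   {added:,} ({added / full:.2%} of the layer)")
```

Because the adapter size grows linearly with the rank while the base layer grows with d_in * d_out, the trainable share stays small even for very large models.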
Unsloth: Making `model.base_model.model.model` require gradients
Reasoning Effort
The gpt-oss models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), which is governed by the number of tokens the model uses to think.
The gpt-oss models offer three distinct levels of reasoning effort you can choose from:
- Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
- Medium: A balance between performance and speed.
- High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
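In the transcripts in this notebook, the chosen level appears as a `Reasoning: ...` line inside the system message. The sketch below is a hand-written illustration of that rendering, not the tokenizer's chat template; the function name `build_system_message` is hypothetical, and the exact whitespace the real template emits may differ.

```python
VALID_EFFORTS = ("low", "medium", "high")


def build_system_message(reasoning_effort: str, current_date: str) -> str:
    """Render a system message shaped like the ones shown in this notebook."""
    if reasoning_effort not in VALID_EFFORTS:
        raise ValueError(
            f"reasoning_effort must be one of {VALID_EFFORTS}, got {reasoning_effort!r}"
        )
    lines = [
        "You are ChatGPT, a large language model trained by OpenAI.",
        "Knowledge cutoff: 2024-06",
        f"Current date: {current_date}",
        f"Reasoning: {reasoning_effort}",
        "# Valid channels: analysis, commentary, final. "
        "Channel must be included for every message.",
        "Calls to these tools must go to the commentary channel: 'functions'.",
    ]
    return "<|start|>system<|message|>" + "\n".join(lines) + "<|end|>"


print(build_system_message("low", "2025-09-05"))
```

Only the `Reasoning:` line changes between the three levels; the rest of the system message stays the same.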
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-05 Reasoning: low # Valid channels: analysis, commentary, final. Channel must be included for every message. Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably approximate. Provide solution steps.<|end|><|start|>assistant<|channel|>final<|message|>**Answer Overview** The equation \[
Changing reasoning_effort to medium makes the model think longer. We have to increase max_new_tokens to accommodate the additional generated tokens, but in return the model tends to give better and more accurate answers.
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-05 Reasoning: medium # Valid channels: analysis, commentary, final. Channel must be included for every message. Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably find all real solutions. Solve polynomial degree 5. Might have one rational root? Try rational root test
Lastly, we will test it with reasoning_effort set to high.
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-05 Reasoning: high # Valid channels: analysis, commentary, final. Channel must be included for every message. Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We are told "Solve x^5 + 3x^4 - 10 = 3." It's a polynomial equation: x^5 + 3 x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Thus we need to find
The HuggingFaceH4/Multilingual-Thinking dataset will be utilized as our example. This dataset, available on Hugging Face, contains reasoning chain-of-thought examples derived from user questions that have been translated from English into four other languages. It is also the same dataset referenced in OpenAI's cookbook for fine-tuning. The purpose of using this dataset is to enable the model to learn and develop reasoning capabilities in these four distinct languages.
Dataset({
    features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],
    num_rows: 1000
})
To format our dataset, we will apply our version of the GPT OSS prompt.
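As a rough picture of what this formatting step produces, the sketch below renders one dataset row into Harmony-style text using the column names listed above. It is a simplified stand-in for the tokenizer's chat template: the helper names `render_turn` and `render_example` are hypothetical, the system message is omitted, the way the developer message is assembled from the columns is an assumption, and the row values are shortened for illustration.

```python
from typing import Dict, Optional


def render_turn(role: str, content: str, channel: Optional[str] = None,
                last: bool = False) -> str:
    """Render one turn; the final assistant turn closes with <|return|>."""
    header = f"<|start|>{role}"
    if channel is not None:
        header += f"<|channel|>{channel}"
    closer = "<|return|>" if last else "<|end|>"
    return f"{header}<|message|>{content}{closer}"


def render_example(row: Dict[str, str]) -> str:
    """Turn one dataset row into a single training string."""
    # Assumption: the developer turn combines the reasoning language
    # with the developer instructions, as in the rendered example.
    developer = (
        "# Instructions\n"
        f"reasoning language: {row['reasoning_language']}\n"
        f"{row['developer']}"
    )
    return "".join([
        render_turn("developer", developer),
        render_turn("user", row["user"]),
        render_turn("assistant", row["analysis"], channel="analysis"),
        render_turn("assistant", row["final"], channel="final", last=True),
    ])


# Shortened, illustrative row using the dataset's column names.
row = {
    "reasoning_language": "French",
    "developer": "You are an AI chatbot with a lively and energetic personality.",
    "user": "Can you show me the latest trends on Twitter right now?",
    "analysis": "D'accord, l'utilisateur demande les tendances Twitter.",
    "final": "Hey there! I can't check Twitter in real-time, but here are some tips.",
}
print(render_example(row))
```

The reasoning goes into the `analysis` channel and the user-facing answer into the `final` channel, which is how the model learns to think in the requested language while answering normally.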
Let's take a look at the dataset and check what the first example shows:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-05 Reasoning: medium # Valid channels: analysis, commentary, final. Channel must be included for every message. Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions reasoning language: French You are an AI chatbot with a lively and energetic personality.<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver. Je devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section « En vogue » sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et bienveillant, peut-être ajouter un emoji pour rester léger. Je vais structurer ma réponse étape par étape pour faciliter la lecture. Je dois m'excuser de ne pas pouvoir fournir des données en temps réel et proposer d'autres méthodes. Je conserverai un langage simple et convivial, en évitant les termes techniques.<|end|><|start|>assistant<|channel|>final<|message|>Hey there! While I can't check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends: 1. **Open the "Trending" tab** on the Twitter app or website – it updates constantly! 2. 
**Search for hashtags** like #Trending or #Viral to see what’s blowing up. 3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads). 4. **Check regional trends** – they often differ by location! Remember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the "Distracted Boyfriend" meme revival), celebrity drama, or unexpected events (hello, weather disasters!). Want me to brainstorm *what* might trend next? I’ve got ideas!<|return|>
What is unique about gpt-oss is that it uses the OpenAI Harmony format, which supports conversation structures, reasoning output, and tool calling.
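To make the structure concrete, the sketch below splits Harmony-style text into its turns using only the special tokens visible in this notebook's transcripts (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`, `<|return|>`). It is an illustrative regex-based parser, not an official Harmony implementation, and it does not handle tool-call headers; the names `HarmonyMessage` and `parse_harmony` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class HarmonyMessage(NamedTuple):
    role: str
    channel: Optional[str]
    content: str


# One turn: <|start|>ROLE[<|channel|>CHANNEL]<|message|>CONTENT, closed by
# <|end|>, <|return|>, or the end of the text (truncated generations).
_TURN = re.compile(
    r"<\|start\|>(?P<role>[^<]+?)"
    r"(?:<\|channel\|>(?P<channel>[^<]+?))?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?:<\|end\|>|<\|return\|>|\Z)",
    re.DOTALL,
)


def parse_harmony(text: str) -> List[HarmonyMessage]:
    """Split Harmony-style text into (role, channel, content) turns."""
    return [
        HarmonyMessage(m.group("role"), m.group("channel"), m.group("content"))
        for m in _TURN.finditer(text)
    ]


sample = (
    "<|start|>system<|message|>Reasoning: low<|end|>"
    "<|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>We need solve equation.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>**Answer Overview**"
)
for message in parse_harmony(sample):
    print(message)
```

Separating the `analysis` and `final` channels this way makes it easy to show users only the final answer while keeping the reasoning available for inspection.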
GPU = NVIDIA A100-SXM4-80GB. Max memory = 79.318 GB. 59.27 GB of memory reserved.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 11,943,936 of 116,841,100,608 (0.01% trained)
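The figures in this banner can be checked with a little arithmetic. The sketch below simply recomputes them from the values printed in the log; the variable names are illustrative and nothing here calls Unsloth.

```python
# Values copied from the training banner.
trainable_params = 11_943_936
total_params = 116_841_100_608
per_device_batch = 4
grad_accum_steps = 1
num_gpus = 1
total_steps = 30
num_examples = 1_000

# Share of parameters that receive gradient updates.
trainable_pct = 100 * trainable_params / total_params

# Effective batch size and how much of the dataset the run covers.
total_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = total_batch * total_steps

print(f"trainable: {trainable_pct:.4f}% of all parameters")
print(f"total batch size: {total_batch}")
print(f"examples seen: {examples_seen} of {num_examples}")
```

With 30 steps at a total batch size of 4, this short demonstration run processes 120 of the 1,000 examples, so it covers only part of the dataset.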
Unsloth: Will smartly offload gradients to save VRAM!
587.3618 seconds used for training.
9.79 minutes used for training.
Peak reserved memory = 67.836 GB.
Peak reserved memory for training = 8.566 GB.
Peak reserved memory % of max memory = 85.524 %.
Peak reserved memory for training % of max memory = 10.8 %.
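These statistics follow from three numbers reported earlier: the memory reserved before training (59.27 GB), the peak reserved memory (67.836 GB), and the GPU's maximum memory (79.318 GB). The sketch below reproduces the derived values with plain arithmetic; the variable names are illustrative, and in the notebook the raw figures come from the GPU rather than being typed in.

```python
# Raw figures copied from the logs above.
max_memory_gb = 79.318      # total GPU memory
start_reserved_gb = 59.27   # reserved after loading the model, before training
peak_reserved_gb = 67.836   # peak reserved during training
train_seconds = 587.3618

# Memory attributable to training is the growth over the starting point.
training_gb = round(peak_reserved_gb - start_reserved_gb, 3)
peak_pct = round(100 * peak_reserved_gb / max_memory_gb, 3)
training_pct = round(100 * training_gb / max_memory_gb, 3)
train_minutes = round(train_seconds / 60, 2)

print(f"{train_minutes} minutes used for training.")
print(f"Peak reserved memory for training = {training_gb} GB.")
print(f"Peak reserved memory % of max memory = {peak_pct} %.")
print(f"Peak reserved memory for training % of max memory = {training_pct} %.")
```

Most of the reserved memory is taken by the loaded model itself; the training step adds only about a tenth of the GPU's capacity on top.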
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05
Reasoning: medium
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions
reasoning language: French
You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>src/m^2) + 10101)^{(1: 2(1. The user ID. That sums or // This will beaming to getUserDetails: a and that a function that for the `java.math;
if 'P(N, and b = [0-lc-
To run the finetuned model in a new instance, change if False to if True in the cell below and run it.
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI. Knowledge cutoff: 2024-06 Current date: 2025-09-05 Reasoning: high # Valid channels: analysis, commentary, final. Channel must be included for every message. Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions reasoning language: French You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>documentation, the final answer should not possible values of two possibilities are consistent and B(θ: "E_\math.comparing to a newlines = 1) (n (r} = (i<5 + (e15| \) = new DataFrameSet Wntypot (n) that
Saving to float16 for vLLM or mxfp4
We also support saving to float16 directly. Select merged_16bit for float16, merged_4bit for int4, and mxfp4 for mxfp4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.
[NOTE] Because of the disk limit in Colab, we save the model in mxfp4 (4-bit) format. Alternatively, use the push_to_hub_merged function to upload it instead.
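A back-of-the-envelope estimate shows why the disk limit matters. The sketch below multiplies the parameter count reported in the training log by the bits stored per parameter. It is a rough approximation only: it assumes every parameter is stored at the same width and ignores quantization scale factors, layers kept in higher precision, and file metadata, so real checkpoint sizes will differ.

```python
TOTAL_PARAMS = 116_841_100_608  # parameter count reported in the training log


def checkpoint_size_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = checkpoint_size_gb(TOTAL_PARAMS, 16)
four_bit_gb = checkpoint_size_gb(TOTAL_PARAMS, 4)

print(f"16-bit checkpoint: ~{fp16_gb:.1f} GB")
print(f"4-bit checkpoint:  ~{four_bit_gb:.1f} GB")
```

Under these assumptions a 16-bit copy needs roughly four times the disk space of a 4-bit one, which is why the 4-bit format or a direct upload is the practical choice on Colab.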
GGUF / llama.cpp Conversion
Saving to GGUF / llama.cpp is now supported natively! We clone llama.cpp and save to q8_0 by default. All quantization methods, such as q4_k_m, are supported. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our docs page):
- q8_0 - Fast conversion. High resource use, but generally acceptable.
- q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[NEW] To finetune and auto export to Ollama, try our Ollama notebook
And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!
Some other resources:
- Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
- Learn how to do Reinforcement Learning with our RL Guide and notebooks.
- Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
- Explore our LLM Tutorials Directory to find dedicated guides for each model.
- Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.



