Unsloth Gpt Oss (120B) A100 Fine Tuning

Gpt Oss (120B) A100 Fine Tuning

unsloth-notebooksunslothnb

alph-notebooks/unsloth-notebooks / gpt-oss-(120B)_A100-Fine-tuning.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

To run this, press "Runtime" and press "Run all" on your A100 Google Colab Pro instance!

Join Discord if you need help + ⭐ Star us on Github ⭐

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to do data prep, how to train, how to run the model, & how to save it

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. 3x faster • 500K Context

New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]

Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 120B model through a finetuning example. To use our MXFP4 inference example, use this notebook instead.

[2]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.1: Fast Gpt_Oss patching. Transformers: 4.55.4.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gpt_Oss does not support SDPA - switching to fast eager.

model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 16 files:   0%|          | 0/16 [00:00<?, ?it/s]

model-00006-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00007-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00003-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00008-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00001-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00005-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00004-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00009-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00010-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00011-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00012-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00013-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00014-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00015-of-00016.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00016-of-00016.safetensors:   0%|          | 0.00/2.14G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/16 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/160 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters for parameter efficient finetuning - this allows us to only efficiently train 1% of all parameters.

[3]

Unsloth: Making `model.base_model.model.model` require gradients

Reasoning Effort

The gpt-oss models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency) which by the amount of token the model will use to think.

The gpt-oss models offer three distinct levels of reasoning effort you can choose from:

Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
Medium: A balance between performance and speed.
High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.

[4]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably approximate. Provide solution steps.<|end|><|start|>assistant<|channel|>final<|message|>**Answer Overview**

The equation  

\[

Changing the reasoning_effort to medium will make the model think longer. We have to increase the max_new_tokens to occupy the amount of the generated tokens but it will give better and more correct answer

[5]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably find all real solutions. Solve polynomial degree 5. Might have one rational root? Try rational root test

Lastly we will test it using reasoning_effort to high

[6]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We are told "Solve x^5 + 3x^4 - 10 = 3." It's a polynomial equation: x^5 + 3 x^4 - 10 = 3 => x^5 + 3x^4 -13 =0.

Thus we need to find

Data Prep

The HuggingFaceH4/Multilingual-Thinking dataset will be utilized as our example. This dataset, available on Hugging Face, contains reasoning chain-of-thought examples derived from user questions that have been translated from English into four other languages. It is also the same dataset referenced in OpenAI's cookbook for fine-tuning. The purpose of using this dataset is to enable the model to learn and develop reasoning capabilities in these four distinct languages.

[7]

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/5.29M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
,    features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],
,    num_rows: 1000
,})

To format our dataset, we will apply our version of the GPT OSS prompt

[8]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Let's take a look at the dataset, and check what the 1st example shows

[9]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are an AI chatbot with a lively and energetic personality.<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.

Je devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section « En vogue » sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et bienveillant, peut-être ajouter un emoji pour rester léger. Je vais structurer ma réponse étape par étape pour faciliter la lecture. Je dois m'excuser de ne pas pouvoir fournir des données en temps réel et proposer d'autres méthodes. Je conserverai un langage simple et convivial, en évitant les termes techniques.<|end|><|start|>assistant<|channel|>final<|message|>Hey there!  While I can't check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends:

1. **Open the "Trending" tab** on the Twitter app or website – it updates constantly!  
2. **Search for hashtags** like #Trending or #Viral to see what’s blowing up.  
3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads).  
4. **Check regional trends** – they often differ by location!  

Remember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the "Distracted Boyfriend" meme revival), celebrity drama, or unexpected events (hello, weather disasters!).  

Want me to brainstorm *what* might trend next? I’ve got ideas!<|return|>

What is unique about GPT-OSS is that it uses OpenAI Harmony format which support conversation structures, reasoning output, and tool calling.

Train the model

Now let's train our model. We do 60 steps to speed things up, but you can set num_train_epochs=1 for a full run, and turn off max_steps=None.

[10]

Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/1000 [00:00<?, ? examples/s]

[11]

GPU = NVIDIA A100-SXM4-80GB. Max memory = 79.318 GB.
59.27 GB of memory reserved.

[12]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 11,943,936 of 116,841,100,608 (0.01% trained)

Unsloth: Will smartly offload gradients to save VRAM!

[13]

587.3618 seconds used for training.
9.79 minutes used for training.
Peak reserved memory = 67.836 GB.
Peak reserved memory for training = 8.566 GB.
Peak reserved memory % of max memory = 85.524 %.
Peak reserved memory for training % of max memory = 10.8 %.

Inference

Let's run the model! You can change the instruction and input - leave the output blank!

[14]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>src/m^2) + 10101)^{(1: 2(1. The user ID. That sums or  // This will beaming to getUserDetails: a and that a function that for the `java.math;

        if 'P(N, and b = [0-lc-

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] Currently finetunes can only be loaded via Unsloth in the meantime - we're working on vLLM and GGUF exporting!

[ ]

To run the finetuned model, you can do the below after setting if False to if True in a new instance.

[16]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>documentation, the final answer should not possible values of two possibilities are consistent and B(θ: "E_\math.comparing to a newlines = 1) (n (r} = (i<5 + (e15| \) = new DataFrameSet Wntypot (n) that

Saving to float16 for VLLM or mxfp4

We also support saving to float16 directly. Select merged_16bit for float16, merged_4bit for int4, and mxfp4 for mxfp4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.

[NOTE] Due to the disk limit in Colab, we have to save it using mxfp4 format (4-bit) or we can use 'push_to_hub_merged' function instead.

[ ]

GGUF / llama.cpp Conversion

To save to GGUF / llama.cpp, we support it natively now! We clone llama.cpp and we default save it to q8_0. We allow all methods like q4_k_m. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our docs page):

q8_0 - Fast conversion. High resource use, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[NEW] To finetune and auto export to Ollama, try our Ollama notebook

[18]

And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other resources:

Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
Learn how to do Reinforcement Learning with our RL Guide and notebooks.
Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
Explore our LLM Tutorials Directory to find dedicated guides for each model.
Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.

Join Discord if you need help + ⭐️ Star us on Github ⭐️

This notebook and all Unsloth notebooks are licensed LGPL-3.0