Gpt Oss (120B) A100 Fine Tuning



Installation

[ ]

Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 120B model through a finetuning example. To use our MXFP4 inference example, use this notebook instead.
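
The loading cell itself is not shown in this export. As a minimal sketch, assuming Unsloth's `FastLanguageModel.from_pretrained` is used, it might look like the following. The repository id, `max_seq_length`, and the 4-bit setting are assumptions for illustration, not values confirmed by this notebook.

```python
# Assumed settings: the original loading cell is not shown in this export.
LOAD_KWARGS = dict(
    model_name="unsloth/gpt-oss-120b",  # assumed repository id
    max_seq_length=4096,                # assumed context length
    dtype=None,                         # None lets Unsloth pick the dtype
    load_in_4bit=True,                  # assumed: 4-bit weights to fit one 80 GB GPU
    full_finetuning=False,              # LoRA adapters only, no full finetuning
)


def load_model(**overrides):
    """Load the model and tokenizer. Requires a GPU and the `unsloth` package."""
    from unsloth import FastLanguageModel  # imported lazily: heavy and GPU-only

    return FastLanguageModel.from_pretrained(**{**LOAD_KWARGS, **overrides})


# model, tokenizer = load_model()
```

Keeping the import inside the function means the settings can be inspected or overridden on a machine without a GPU.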

[2]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.1: Fast Gpt_Oss patching. Transformers: 4.55.4.
   \\   /|    NVIDIA A100-SXM4-80GB. Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Gpt_Oss does not support SDPA - switching to fast eager.

We now add LoRA adapters for parameter-efficient finetuning, so that only a small fraction of the parameters is trained (about 0.01% in this run, according to the trainer log further down).

[3]
Unsloth: Making `model.base_model.model.model` require gradients
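
A hedged sketch of how the adapters might be attached, assuming Unsloth's `FastLanguageModel.get_peft_model`. The rank, alpha, and target modules below are illustrative assumptions; the notebook's real values are not shown. The two trainable-parameter numbers are taken from the trainer log later in this notebook.

```python
# Assumed LoRA hyperparameters (illustrative, not confirmed by the notebook).
LORA_KWARGS = dict(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)


def add_lora(model, **overrides):
    """Wrap `model` with LoRA adapters. Requires the `unsloth` package."""
    from unsloth import FastLanguageModel  # imported lazily: heavy and GPU-only

    return FastLanguageModel.get_peft_model(model, **{**LORA_KWARGS, **overrides})


def lora_param_count(d_in, d_out, r):
    """A rank-r adapter on a d_in x d_out weight adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Numbers reported by the trainer log in this notebook.
trainable_params = 11_943_936
total_params = 116_841_100_608
trainable_fraction = trainable_params / total_params  # roughly 1e-4, i.e. 0.01%
```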

Reasoning Effort

The gpt-oss models from OpenAI include a feature that lets users adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency) by changing the number of tokens the model uses to think.


The gpt-oss models offer three distinct levels of reasoning effort you can choose from:

  • Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
  • Medium: A balance between performance and speed.
  • High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
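
The level is passed through the chat template. The sketch below assumes the tokenizer's `apply_chat_template` forwards a `reasoning_effort` keyword to the gpt-oss template; the helper name and the token budgets are hypothetical and only illustrate that higher effort needs a larger `max_new_tokens`.

```python
VALID_EFFORTS = ("low", "medium", "high")

# Assumed, illustrative generation budgets: more effort means more thinking tokens.
MAX_NEW_TOKENS = {"low": 64, "medium": 1024, "high": 2048}


def build_chat_inputs(tokenizer, question, effort="low"):
    """Tokenize a single-turn prompt with the requested reasoning effort."""
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    messages = [{"role": "user", "content": question}]
    return tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,  # end the prompt where the assistant should start
        return_tensors="pt",
        return_dict=True,
        reasoning_effort=effort,     # extra keyword passed on to the chat template
    )


# inputs = build_chat_inputs(tokenizer, "Solve x^5 + 3x^4 - 10 = 3.", "low")
# model.generate(**inputs.to("cuda"), max_new_tokens=MAX_NEW_TOKENS["low"])
```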
[4]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably approximate. Provide solution steps.<|end|><|start|>assistant<|channel|>final<|message|>**Answer Overview**

The equation  

\[

Changing reasoning_effort to medium makes the model think longer. We have to increase max_new_tokens to accommodate the additional generated tokens, but the answer will generally be better and more accurate.

[5]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We need solve equation: x^5 + 3x^4 - 10 = 3 => x^5 + 3x^4 -13 =0. Find real roots? Probably find all real solutions. Solve polynomial degree 5. Might have one rational root? Try rational root test

Lastly, we test with reasoning_effort set to high.

[6]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>We are told "Solve x^5 + 3x^4 - 10 = 3." It's a polynomial equation: x^5 + 3 x^4 - 10 = 3 => x^5 + 3x^4 -13 =0.

Thus we need to find

Data Prep

We use the HuggingFaceH4/Multilingual-Thinking dataset as our example. This dataset, available on Hugging Face, contains chain-of-thought reasoning examples derived from user questions that have been translated from English into four other languages. It is the same dataset referenced in OpenAI's cookbook for fine-tuning. Training on it lets the model learn to reason in these four languages.

[7]
README.md: 0.00B [00:00, ?B/s]
data/train-00000-of-00001.parquet:   0%|          | 0.00/5.29M [00:00<?, ?B/s]
Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]
Dataset({
    features: ['reasoning_language', 'developer', 'user', 'analysis', 'final', 'messages'],
    num_rows: 1000
})

To format our dataset, we apply our version of the GPT-OSS prompt template.

[8]
Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
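
The loading and formatting steps could be sketched as follows, assuming the `datasets` library and a tokenizer whose chat template renders the `messages` column. The helper names and the output column name `text` are assumptions chosen to match the tokenizing log further down.

```python
def load_training_data():
    """Download the example dataset. Requires the `datasets` package."""
    from datasets import load_dataset  # imported lazily: needs network access

    return load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")


def make_formatter(tokenizer):
    """Build a batched map function that renders each conversation to a string."""

    def formatting_prompts_func(examples):
        texts = [
            tokenizer.apply_chat_template(
                convo,
                tokenize=False,               # keep the rendered prompt as text
                add_generation_prompt=False,  # the assistant reply is already included
            )
            for convo in examples["messages"]
        ]
        return {"text": texts}

    return formatting_prompts_func


# dataset = load_training_data()
# dataset = dataset.map(make_formatter(tokenizer), batched=True)
```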

Let's take a look at the dataset and check what the first example shows.

[9]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are an AI chatbot with a lively and energetic personality.<|end|><|start|>user<|message|>Can you show me the latest trends on Twitter right now?<|end|><|start|>assistant<|channel|>analysis<|message|>D'accord, l'utilisateur demande les tendances Twitter les plus récentes. Tout d'abord, je dois vérifier si j'ai accès à des données en temps réel. Étant donné que je ne peux pas naviguer sur Internet ou accéder directement à l'API de Twitter, je ne peux pas fournir des tendances en direct. Cependant, je peux donner quelques conseils généraux sur la façon de les trouver.

Je devrais préciser que les tendances Twitter évoluent rapidement et sont spécifiques à chaque région. Je pourrais suggérer de consulter la section « En vogue » sur l'application ou le site web. Aussi, l'utilisation de hashtags et le suivi d'utilisateurs pertinents pourraient être utiles. Il est important de souligner que les tendances varient selon la région et l'heure de la journée. Je devrais garder un ton amical et bienveillant, peut-être ajouter un emoji pour rester léger. Je vais structurer ma réponse étape par étape pour faciliter la lecture. Je dois m'excuser de ne pas pouvoir fournir des données en temps réel et proposer d'autres méthodes. Je conserverai un langage simple et convivial, en évitant les termes techniques.<|end|><|start|>assistant<|channel|>final<|message|>Hey there!  While I can't check Twitter (X) in real-time or access live data, I can share some tips to help you spot the latest trends:

1. **Open the "Trending" tab** on the Twitter app or website – it updates constantly!  
2. **Search for hashtags** like #Trending or #Viral to see what’s blowing up.  
3. **Follow accounts** that curate trends (e.g., @TrendingNow, @ViralThreads).  
4. **Check regional trends** – they often differ by location!  

Remember, trends are *super fast-moving* and often tied to pop culture, memes, or breaking news. For example, recent trends have included viral challenges (like the "Distracted Boyfriend" meme revival), celebrity drama, or unexpected events (hello, weather disasters!).  

Want me to brainstorm *what* might trend next? I’ve got ideas!<|return|>

What is unique about GPT-OSS is that it uses the OpenAI Harmony format, which supports conversation structures, reasoning output, and tool calling.
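
To make the structure concrete, here is a small sketch that splits a rendered string like the one above into (role, channel, content) triples. It handles only the subset of the format visible in this notebook's outputs (`<|start|>`, an optional `<|channel|>`, `<|message|>`, and `<|end|>` or `<|return|>`); it is not a full Harmony parser, and the function name is hypothetical.

```python
import re

# One message: <|start|>ROLE[<|channel|>CHANNEL]<|message|>CONTENT<|end|> or <|return|>
_MESSAGE = re.compile(
    r"<\|start\|>(?P<role>[^<]+?)"
    r"(?:<\|channel\|>(?P<channel>[^<]+?))?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?:<\|end\|>|<\|return\|>)",
    re.DOTALL,  # message content may span several lines
)


def parse_harmony(text):
    """Return a list of (role, channel, content); channel is None when absent."""
    return [(m["role"], m["channel"], m["content"]) for m in _MESSAGE.finditer(text)]


sample = (
    "<|start|>user<|message|>Hi<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>Greet back.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>Hello!<|return|>"
)
messages = parse_harmony(sample)
```

The `analysis` channel carries the reasoning and the `final` channel carries the user-facing answer, which is why the training data has separate `analysis` and `final` columns.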

Train the model

Now let's train our model. We run only a small number of steps to speed things up (the trainer banner below reports 30 total steps), but for a full run you can set num_train_epochs=1 and disable the step cap with max_steps=None.
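
To see how far a short run gets, the arithmetic can be checked with the figures from the trainer banner in this notebook (batch size per device 4, gradient accumulation 1, one GPU, 1,000 examples, 30 steps). The function names are illustrative only.

```python
import math


def effective_batch_size(per_device, grad_accum, num_gpus=1):
    """Examples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def steps_per_epoch(num_examples, batch):
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch)


batch = effective_batch_size(per_device=4, grad_accum=1, num_gpus=1)
full_epoch_steps = steps_per_epoch(num_examples=1000, batch=batch)
examples_seen = 30 * batch                 # examples covered by the 30-step run
epoch_fraction = examples_seen / 1000      # share of one epoch that the run covers
```

With these numbers a full epoch would take 250 steps, so the 30-step run sees 120 examples, or 12% of the dataset.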

[10]
Unsloth: Tokenizing ["text"] (num_proc=16):   0%|          | 0/1000 [00:00<?, ? examples/s]
[11]
GPU = NVIDIA A100-SXM4-80GB. Max memory = 79.318 GB.
59.27 GB of memory reserved.
[12]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 30
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 11,943,936 of 116,841,100,608 (0.01% trained)
Unsloth: Will smartly offload gradients to save VRAM!
[13]
587.3618 seconds used for training.
9.79 minutes used for training.
Peak reserved memory = 67.836 GB.
Peak reserved memory for training = 8.566 GB.
Peak reserved memory % of max memory = 85.524 %.
Peak reserved memory for training % of max memory = 10.8 %.
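
The figures above are related by simple arithmetic. The sketch below re-derives them from the three values printed in this notebook (maximum memory, memory reserved before training, and peak reserved memory); the variable names are illustrative.

```python
max_memory_gb = 79.318      # GPU capacity reported at startup
start_reserved_gb = 59.27   # reserved before training (model already loaded)
peak_reserved_gb = 67.836   # peak reserved during training

# Memory attributable to training is the peak minus what was already reserved.
training_gb = round(peak_reserved_gb - start_reserved_gb, 3)

peak_pct = round(peak_reserved_gb / max_memory_gb * 100, 3)
training_pct = round(training_gb / max_memory_gb * 100, 3)

training_minutes = round(587.3618 / 60, 2)

print(training_gb, peak_pct, training_pct, training_minutes)
```

So most of the reserved memory is the loaded model itself; the training step adds under 9 GB on top of it.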

Inference

Let's run the model! You can change the instruction and input - leave the output blank!

[14]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: medium

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>src/m^2) + 10101)^{(1: 2(1. The user ID. That sums or  // This will beaming to getUserDetails: a and that a function that for the `java.math;

        if 'P(N, and b = [0-lc-

Saving, loading finetuned models

To save the final model as LoRA adapters, use either Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] For now, finetunes can only be loaded via Unsloth; we're working on vLLM and GGUF exporting!
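
A minimal sketch of both options, using the `save_pretrained` and `push_to_hub` methods named above. The helper name, the `lora_model` directory, and the repository id in the usage comment are placeholders, not values from the notebook.

```python
def save_lora_adapters(model, tokenizer, output_dir="lora_model",
                       hub_repo=None, token=None):
    """Save the LoRA adapters locally and optionally upload them to the Hub."""
    model.save_pretrained(output_dir)      # writes adapter weights and config only
    tokenizer.save_pretrained(output_dir)
    if hub_repo is not None:
        model.push_to_hub(hub_repo, token=token)
        tokenizer.push_to_hub(hub_repo, token=token)
    return output_dir


# save_lora_adapters(model, tokenizer)                            # local save
# save_lora_adapters(model, tokenizer, hub_repo="your_name/lora_model",
#                    token="...")                                 # online save
```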

[ ]

To run the finetuned model in a new instance, change if False to if True in the code below.

[16]
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-09-05

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>developer<|message|># Instructions

reasoning language: French

You are a helpful assistant that can solve mathematical problems.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>documentation, the final answer should not possible values of two possibilities are consistent and B(θ: "E_\math.comparing to a newlines = 1) (n (r} = (i<5 + (e15| \) = new DataFrameSet Wntypot (n) that

Saving to float16 for vLLM or mxfp4

We also support saving to float16 directly. Select merged_16bit for float16, merged_4bit for int4, and mxfp4 for mxfp4. We also allow LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can create a personal token at https://huggingface.co/settings/tokens.

[NOTE] Due to the disk limit in Colab, we have to save in mxfp4 format (4-bit) or use the push_to_hub_merged function instead.
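
As a sketch, the choice between a local merged save and an upload could be wrapped like this, assuming Unsloth's `save_pretrained_merged` and `push_to_hub_merged` methods on the model. The helper name and the `lora` fallback value are assumptions; the other `save_method` values are the ones listed above.

```python
SAVE_METHODS = ("merged_16bit", "merged_4bit", "mxfp4", "lora")


def save_merged(model, tokenizer, output_dir, save_method="mxfp4",
                hub_repo=None, token=None):
    """Save merged weights locally, or upload them when `hub_repo` is given."""
    if save_method not in SAVE_METHODS:
        raise ValueError(f"save_method must be one of {SAVE_METHODS}")
    if hub_repo is None:
        model.save_pretrained_merged(output_dir, tokenizer, save_method=save_method)
        return output_dir
    # Uploading avoids writing the merged weights to the limited local disk.
    model.push_to_hub_merged(hub_repo, tokenizer, save_method=save_method, token=token)
    return hub_repo


# save_merged(model, tokenizer, "finetuned_model", save_method="mxfp4")
```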

[ ]

GGUF / llama.cpp Conversion

We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and save to q8_0 by default. All quantization methods, such as q4_k_m, are allowed. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our Wiki page):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.
  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
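
A hedged sketch of the two export paths, assuming the `save_pretrained_gguf` and `push_to_hub_gguf` methods named above accept a `quantization_method` keyword. The helper name is hypothetical, and this notebook does not confirm that GGUF export already works for gpt-oss finetunes in every Unsloth version.

```python
def export_gguf(model, tokenizer, output_dir, quantization_method="q8_0",
                hub_repo=None, token=None):
    """Convert to GGUF locally, or convert and upload when `hub_repo` is given."""
    if hub_repo is None:
        model.save_pretrained_gguf(
            output_dir, tokenizer, quantization_method=quantization_method
        )
        return output_dir
    model.push_to_hub_gguf(
        hub_repo, tokenizer, quantization_method=quantization_method, token=token
    )
    return hub_repo


# export_gguf(model, tokenizer, "model_gguf", quantization_method="q4_k_m")
```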

[NEW] To finetune and auto export to Ollama, try our Ollama notebook

[18]