GPT OSS BNB (20B) Inference

Installation

[ ]
[ ]
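The install cells above are empty in this export. A typical Colab-style setup looks like the following sketch; the exact packages and version pins are assumptions and may differ from the original notebook:

```shell
# Install Unsloth and common inference dependencies (versions are assumptions)
pip install --upgrade unsloth unsloth_zoo
pip install --upgrade transformers accelerate bitsandbytes
```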

Unsloth

We're about to demonstrate the new OpenAI GPT-OSS 20B model through an inference example. For our MXFP4 version, use this notebook instead.
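The loading cell below is empty in this export. A minimal sketch of loading the BitsAndBytes 4-bit checkpoint with Unsloth follows; the model name and `max_seq_length` are assumptions, not taken from the original cell:

```python
from unsloth import FastLanguageModel

# Load the pre-quantized BnB 4-bit GPT-OSS 20B checkpoint.
# Model name and sequence length are assumptions.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    max_seq_length=1024,
    load_in_4bit=True,  # 4-bit quantization to fit in a T4's ~15 GB of VRAM
)
```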

[ ]
config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]
==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]
generation_config.json:   0%|          | 0.00/131 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/51.0k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/449 [00:00<?, ?B/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Reasoning Effort

The gpt-oss models from OpenAI include a feature that allows users to adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency) by changing the number of tokens the model uses to think.


The gpt-oss models offer three distinct levels of reasoning effort you can choose from:

  • Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
  • Medium: A balance between performance and speed.
  • High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
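The cell below is empty in this export. A hedged sketch of a low-effort generation follows; the prompt and token budget are illustrative, and `reasoning_effort` is assumed to be forwarded as a chat-template kwarg as in the GPT-OSS template:

```python
from transformers import TextStreamer

# Example prompt (an assumption, not from the original cell)
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# reasoning_effort is passed through to the GPT-OSS chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="low",
).to(model.device)

# Stream the response; a small token budget suits low reasoning effort.
_ = model.generate(**inputs, max_new_tokens=128, streamer=TextStreamer(tokenizer))
```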
[ ]

Changing the reasoning_effort to medium will make the model think longer. We also have to increase max_new_tokens to accommodate the larger number of generated tokens, but this yields better, more accurate answers.
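The medium-effort cell is empty in this export. A sketch under the same assumptions as above (the prompt and the larger token budget are illustrative values, not from the original cell):

```python
from transformers import TextStreamer

# Same example prompt as before (an assumption)
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="medium",  # longer chain of thought than "low"
).to(model.device)

# Raise max_new_tokens so the longer reasoning trace is not truncated.
_ = model.generate(**inputs, max_new_tokens=512, streamer=TextStreamer(tokenizer))
```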

[ ]

Lastly, we will test it with reasoning_effort set to high.
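The high-effort cell is also empty in this export. A sketch, again with an illustrative prompt and token budget (high effort produces the longest reasoning traces, so the budget is raised further):

```python
from transformers import TextStreamer

# Same example prompt as before (an assumption)
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="high",  # strongest reasoning, highest latency
).to(model.device)

# High effort generates the most thinking tokens, so give it the largest budget.
_ = model.generate(**inputs, max_new_tokens=1024, streamer=TextStreamer(tokenizer))
```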

[ ]