GPT OSS BNB (20B) Inference
Installation
Unsloth
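Unsloth is available on PyPI; a minimal install sketch (Colab notebooks often pin additional dependencies, so adapt as needed):

```shell
pip install unsloth
```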
This notebook demonstrates inference with OpenAI's new GPT-OSS 20B model, using the BitsAndBytes 4-bit checkpoint. For our MXFP4 version, use this notebook instead.
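A minimal loading sketch with Unsloth's `FastLanguageModel`. The checkpoint name below is an assumption; substitute the BnB 4-bit GPT-OSS checkpoint you actually intend to run:

```python
from unsloth import FastLanguageModel

# Checkpoint name is illustrative; use the BnB 4-bit GPT-OSS checkpoint you want.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    max_seq_length=1024,  # context length for inference
    load_in_4bit=True,    # 4-bit quantization so the 20B model fits a 16 GB GPU
)
```

Loading in 4-bit is what lets the model run on a free-tier Tesla T4, as the log below shows.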
==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Reasoning Effort
The gpt-oss models from OpenAI include a feature that lets users adjust the model's "reasoning effort." This gives you control over the trade-off between the model's performance and its response speed (latency), which is determined by the number of tokens the model uses to think.
The gpt-oss models offer three distinct levels of reasoning effort you can choose from:
- Low: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
- Medium: A balance between performance and speed.
- High: Provides the strongest reasoning performance for tasks that require it, though this results in higher latency.
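Under the hood, the chosen level is written into the model's system prompt as a `Reasoning: <level>` line in the Harmony chat format. A rough illustrative sketch of how such a system message might be rendered (the real template ships with the tokenizer; the function name and exact wording here are assumptions):

```python
def build_system_message(reasoning_effort: str = "medium") -> str:
    """Illustrative Harmony-style system message; the real one comes from
    the tokenizer's chat template."""
    assert reasoning_effort in ("low", "medium", "high")
    return (
        "<|start|>system<|message|>"
        "You are ChatGPT, a large language model trained by OpenAI.\n"
        f"Reasoning: {reasoning_effort}\n"
        "# Valid channels: analysis, commentary, final."
        "<|end|>"
    )

print(build_system_message("high"))
```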
Changing reasoning_effort to medium makes the model think longer. We also have to increase max_new_tokens to accommodate the larger number of generated tokens, but the result is a better, more accurate answer.
Lastly, we will test it with reasoning_effort set to high.
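A hedged sketch of such a run, assuming `model` and `tokenizer` come from the Unsloth loading step and that the GPT-OSS chat template accepts a `reasoning_effort` keyword (the prompt text is only an example):

```python
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]

# Extra kwargs such as reasoning_effort are forwarded to the chat template.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    reasoning_effort="high",  # "low" | "medium" | "high"
).to(model.device)

# High effort emits a longer chain of thought, so budget more new tokens.
_ = model.generate(
    **inputs,
    max_new_tokens=2048,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
```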