Supervised Fine-tuning (SFT) with Unsloth
Fine-tuning requires a GPU. If you don't have one locally, you can run this notebook for free on Google Colab with an NVIDIA T4 GPU instance.
What's in this notebook?
In this notebook you will learn how to perform Supervised Fine-tuning (SFT) using Unsloth for efficient, memory-optimized training. We will use the LFM2.5-1.2B-Instruct model and fine-tune it on the FineTome-100k dataset using LoRA adapters and Unsloth's optimizations for 2x faster training with reduced memory consumption.
We will cover
- Environment setup
- Data preparation
- Model training
- Local inference with your new model
- Saving the model and exporting it into the format you need for deployment
Deployment options
LFM2.5 models are small and efficient, enabling deployment across a wide range of platforms:
| Deployment Target | Use Case |
|---|---|
| 📱 Android | Mobile apps on Android devices |
| 📱 iOS | Mobile apps on iPhone/iPad |
| 🍎 Apple Silicon Mac | Local inference on Mac with MLX |
| 🦙 llama.cpp | Local deployments on any hardware |
| 🦙 Ollama | Local inference with easy setup |
| 🖥️ LM Studio | Desktop app for local inference |
| ⚡ vLLM | Cloud deployments with high throughput |
| ☁️ Modal | Serverless cloud deployment |
| 🏗️ Baseten | Production ML infrastructure |
| 🚀 Fal | Fast inference API |
Installation
Unsloth
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2026.1.2: Fast Lfm2 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
We now add LoRA adapters so we only need to update a small number of parameters!
Unsloth: Making `model.base_model.model.model` require gradients
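To see why LoRA touches so few parameters, here is a minimal sketch of the parameter count for a single adapted linear layer. It relies only on the standard LoRA formulation (two low-rank matrices per adapted weight); the layer size and rank below are hypothetical values chosen for illustration, not the settings used in this notebook.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 16

full_params = d_in * d_out                      # parameters in the frozen weight matrix
adapter_params = lora_param_count(d_in, d_out, rank)

print(f"full: {full_params:,}  adapter: {adapter_params:,}  "
      f"ratio: {adapter_params / full_params:.2%}")
```

The frozen weight keeps all of its parameters, but only the adapter's parameters receive gradients, which is where the memory savings come from.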
Data Prep
We now use the LFM format for conversation-style finetunes. We use Maxime Labonne's FineTome-100k dataset in ShareGPT style. LFM renders multi-turn conversations as shown below:
<|startoftext|><|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hey there!<|im_end|>
'<|startoftext|><|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\nHey there!<|im_end|>\n'
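The layout above can be approximated by hand. The sketch below uses a hypothetical `render_lfm_chat` helper; it is not the tokenizer's actual chat template and only reproduces the token layout shown in this example (no system prompts, tools, or generation prompt).

```python
# Hand-written approximation of the layout shown above. In the notebook the
# tokenizer's own chat template does this work; this helper is hypothetical.
BOS = "<|startoftext|>"

def render_lfm_chat(messages, bos=BOS):
    """Render a list of {'role', 'content'} dicts into one training string."""
    parts = [bos]
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)

conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_lfm_chat(conversation)
print(repr(rendered))
```

Each turn is wrapped in `<|im_start|>` and `<|im_end|>` markers, and the BOS token appears once at the start of the whole conversation.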
We take the first 3000 rows of the dataset.
We now use standardize_data_formats to try to convert the dataset to the correct format for finetuning!
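Conceptually, standardization turns ShareGPT-style turns (`from` / `value`) into the `role` / `content` turns that chat templates expect. The sketch below is a simplified illustration, not Unsloth's implementation; the `standardize_turns` helper and the `ROLE_MAP` table are assumptions based on common ShareGPT exports.

```python
# Conceptual sketch only: maps ShareGPT-style turns to role/content turns.
# ROLE_MAP is an assumption about the speaker names used in ShareGPT exports.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def standardize_turns(conversation):
    """Return the conversation as a list of {'role', 'content'} dicts."""
    standardized = []
    for turn in conversation:
        if "role" in turn and "content" in turn:
            # Already in the target format: copy it through unchanged.
            standardized.append({"role": turn["role"], "content": turn["content"]})
        else:
            role = ROLE_MAP.get(turn["from"], turn["from"])
            standardized.append({"role": role, "content": turn["value"]})
    return standardized

sharegpt_row = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there!"},
]
standardized_row = standardize_turns(sharegpt_row)
print(standardized_row)
```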
Let's see what row 100 looks like!
{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?',
   'role': 'user'},
  {'content': 'In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.',
   'role': 'assistant'}],
 'source': 'infini-instruct-top-500k',
 'score': 4.774171352386475}
We now have to apply the LFM chat template to the conversations and save the result as text. We also remove the BOS token; otherwise we'll get double BOS tokens!
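The BOS removal can be sketched as below, assuming the rendered text starts with `<|startoftext|>` (as in the example earlier) and that tokenization later prepends its own BOS token. `strip_leading_bos` is a hypothetical helper, and the input list stands in for the output of the chat template.

```python
BOS = "<|startoftext|>"

def strip_leading_bos(texts, bos_token=BOS):
    """Drop one leading BOS token from each rendered conversation."""
    # str.removeprefix requires Python 3.9 or newer.
    return [text.removeprefix(bos_token) for text in texts]

# Stand-in for a batch of conversations already rendered by the chat template.
rendered_batch = [
    "<|startoftext|><|im_start|>user\nHello!<|im_end|>\n"
    "<|im_start|>assistant\nHey there!<|im_end|>\n",
]
texts = strip_leading_bos(rendered_batch)
print(repr(texts[0]))
```

Only a prefix is removed, so a conversation that happens to mention the BOS string elsewhere in its content is left untouched.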
'<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\nBoolean operators are logical operators used in programming to manipulate boolean values. They operate on one or more boolean operands and return a boolean result. The three main boolean operators are "AND" (&&), "OR" (||), and "NOT" (!).\n\nThe "AND" operator returns true if both of its operands are true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) and (y < 20) # This expression evaluates to True\n```\n\nThe "OR" operator returns true if at least one of its operands is true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) or (y < 20) # This expression evaluates to True\n```\n\nThe "NOT" operator negates the boolean value of its operand. It returns true if the operand is false, and false if the operand is true. For example:\n\n```python\nx = 5\nresult = not (x > 10) # This expression evaluates to True\n```\n\nOperator precedence refers to the order in which operators are evaluated in an expression. 
It ensures that expressions are evaluated correctly. In most programming languages, logical AND has higher precedence than logical OR. For example:\n\n```python\nresult = True or False and False # This expression is evaluated as (True or (False and False)), which is True\n```\n\nShort-circuit evaluation is a behavior where the second operand of a logical operator is not evaluated if the result can be determined based on the value of the first operand. In short-circuit evaluation, if the first operand of an "AND" operator is false, the second operand is not evaluated because the result will always be false. Similarly, if the first operand of an "OR" operator is true, the second operand is not evaluated because the result will always be true.\n\nIn programming languages that support short-circuit evaluation natively, you can use it to improve performance or avoid errors. For example:\n\n```python\nif x != 0 and (y / x) > 10:\n # Perform some operation\n```\n\nIn languages without native short-circuit evaluation, you can implement your own logic to achieve the same behavior. Here\'s an example in pseudocode:\n\n```\nif x != 0 {\n if (y / x) > 10 {\n // Perform some operation\n }\n}\n```\n\nTruthiness and falsiness refer to how non-boolean values are evaluated in boolean contexts. In many programming languages, non-zero numbers and non-empty strings are considered truthy, while zero, empty strings, and null/None values are considered falsy.\n\nWhen evaluating boolean expressions, truthiness and falsiness come into play. For example:\n\n```python\nx = 5\nresult = x # The value of x is truthy, so result is also truthy\n```\n\nTo handle cases where truthiness and falsiness are implemented differently across programming languages, you can explicitly check the desired condition. 
For example:\n\n```python\nx = 5\nresult = bool(x) # Explicitly converting x to a boolean value\n```\n\nThis ensures that the result is always a boolean value, regardless of the language\'s truthiness and falsiness rules.<|im_end|>\n'
Train the model
Now let's use Hugging Face TRL's SFTTrainer! More docs here: TRL SFT docs. We do 60 steps to speed things up, but for a full run you can set num_train_epochs=1 and max_steps=None.
🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
We also use Unsloth's train_on_responses_only method to compute the loss only on the assistant outputs and ignore the user's inputs. This helps increase the accuracy of finetunes!
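The masking idea can be illustrated with a toy example. The sketch below is not Unsloth's implementation: it works on string tokens instead of token IDs, and `mask_non_assistant` is a hypothetical helper. It relies on one known convention, namely that a label of -100 is ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_non_assistant(tokens, instruction_part, response_part):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(tokens)
    in_response = False
    i = 0
    while i < len(tokens):
        if tokens[i:i + len(response_part)] == response_part:
            in_response = True           # the marker itself stays masked
            i += len(response_part)
        elif tokens[i:i + len(instruction_part)] == instruction_part:
            in_response = False          # a new user turn starts
            i += len(instruction_part)
        else:
            if in_response:
                labels[i] = tokens[i]
            i += 1
    return labels

# Toy tokenization of the two-turn example; real training uses token IDs.
tokens = ["<|im_start|>", "user", "\n", "Hello", "!", "<|im_end|>", "\n",
          "<|im_start|>", "assistant", "\n", "Hey", " there", "!", "<|im_end|>", "\n"]
labels = mask_non_assistant(
    tokens,
    instruction_part=["<|im_start|>", "user", "\n"],
    response_part=["<|im_start|>", "assistant", "\n"],
)
kept = "".join(t for t in labels if t != IGNORE_INDEX)
print(repr(kept))
```

Only the assistant's reply and its closing `<|im_end|>` contribute to the loss; the user turn and the role markers are masked out.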
Let's verify that the instruction part is masked! Let's print the 100th row again.
'<|startoftext|><|im_start|>user\nWhat is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?<|im_end|>\n<|im_start|>assistant\nIn programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|im_end|>\n'
Now let's print the masked out example - you should see only the answer is present:
' In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|im_end|>\n'
GPU = NVIDIA L4. Max memory = 22.161 GB.
2.23 GB of memory reserved.
The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 3,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 11,108,352 of 1,181,448,960 (0.94% trained)
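The numbers in this banner can be checked with a few lines of arithmetic. All values below are copied from the log; the steps-per-epoch estimate is an assumption that each optimizer step consumes exactly one effective batch with no sequence packing.

```python
import math

# Values copied from the training banner above.
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60
num_examples = 3_000
trainable_params = 11_108_352
total_params = 1_181_448_960

# Examples consumed per optimizer step.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# How much of the 3,000-row subset a 60-step run actually sees.
examples_seen = effective_batch_size * max_steps
fraction_of_dataset = examples_seen / num_examples

# Assumption: one effective batch per step, no packing.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)

trainable_percent = 100 * trainable_params / total_params

print(f"effective batch size: {effective_batch_size}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({fraction_of_dataset:.0%})")
print(f"steps for one full epoch: {steps_per_epoch}")
print(f"trainable parameters: {trainable_percent:.2f}%")
```

Under these assumptions the 60-step demo run covers only part of the 3,000 rows, which is why a full run sets `num_train_epochs=1` instead.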
Unsloth: Will smartly offload gradients to save VRAM!
121.0305 seconds used for training.
2.02 minutes used for training.
Peak reserved memory = 3.453 GB.
Peak reserved memory for training = 1.223 GB.
Peak reserved memory % of max memory = 15.581 %.
Peak reserved memory for training % of max memory = 5.519 %.
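The derived statistics follow from three measurements: the memory reserved before training (2.23 GB), the peak reserved memory (3.453 GB), and the GPU's maximum memory (22.161 GB). The sketch below reproduces the arithmetic from the logged values; in the notebook these numbers would come from the GPU runtime, and the variable names here are illustrative.

```python
# Values copied from the logs above.
max_memory_gb = 22.161
start_reserved_gb = 2.23      # reserved before training (model already loaded)
peak_reserved_gb = 3.453      # peak reserved during training
train_seconds = 121.0305

# Memory attributable to training is the peak minus the starting reservation.
training_reserved_gb = round(peak_reserved_gb - start_reserved_gb, 3)
peak_percent = round(peak_reserved_gb / max_memory_gb * 100, 3)
training_percent = round(training_reserved_gb / max_memory_gb * 100, 3)
train_minutes = round(train_seconds / 60, 2)

print(f"{train_minutes} minutes used for training.")
print(f"Peak reserved memory for training = {training_reserved_gb} GB.")
print(f"Peak reserved memory % of max memory = {peak_percent} %.")
print(f"Peak reserved memory for training % of max memory = {training_percent} %.")
```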
The sky appears blue due to a phenomenon called Rayleigh scattering. When sunlight enters the Earth's atmosphere, it interacts with molecules and small particles in the air. Shorter wavelengths of light, such as blue and violet, are scattered more than longer wavelengths like red and orange. This scattering occurs because the shorter wavelengths have a higher frequency and interact more strongly with the air molecules. As a result, the blue light is dispersed in all directions, making the sky appear blue to our eyes. However, violet light has an even shorter wavelength than blue light, which means it is scattered even more. In fact, violet light is scattered about
In the jungle, where the trees sway, A sloth slips by, in a slow parade. With a snout so wide, and eyes so bright, It's a sight to behold, in the dappled light. Its fur is soft, like a cloud in the sky, And its tail is long, like a snake in the night. It moves with a lazy grace, like a leaf on the breeze, And spends its days in a peaceful, relaxed ease. But don't let its slow pace fool you, For this sloth has a heart that's full of love. It loves its jungle home
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')
Now if you want to load the LoRA adapters we just saved for inference, change False to True:
To code up a transformer, you can follow these steps: 1. Define the input and output data structures: Determine the shape of the input data and the expected output data structure. This will help you understand how to transform the input data into the desired output format. 2. Implement the transformation logic: Based on the input data structure, write the code that performs the necessary transformations. This may involve using built-in functions or custom functions to manipulate the data. 3. Handle edge cases: Consider any special cases or edge conditions that may arise during the transformation process. For example, if the input data contains missing values, you
You can also use Hugging Face PEFT's AutoPeftModelForCausalLM. Only use this if you do not have Unsloth installed. It can be very slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.
Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
GGUF / llama.cpp Conversion
Saving to GGUF / llama.cpp is now supported natively! We clone llama.cpp and save to q8_0 by default. Other methods such as q4_k_m are also allowed. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our Wiki page):
- q8_0 - Fast conversion. High resource use, but generally acceptable.
- q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[NEW] To finetune and auto export to Ollama, try our Ollama notebook
Now, use the model-unsloth.gguf file or the model-unsloth-Q4_K_M.gguf file in llama.cpp or in a UI-based system like Jan or Open WebUI. You can install Jan here and Open WebUI here.
And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM news, or need help with a project, feel free to join our Discord!
Some other links:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!


