ERNIE 4 5 21B A3B PT Conversational
To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!
To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.
You will learn how to do data prep, how to train, how to run the model, & how to save it
News
Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog
You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog
Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog
3x faster LLM training with 30% less VRAM and 500K context. 3x faster • 500K Context
New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL
Visit our docs for all our model uploads and notebooks.
Installation
Unsloth
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning. 🦥 Unsloth Zoo will now patch everything to make training faster! ==((====))== Unsloth 2025.11.3: Fast Ernie4_5_Moe patching. Transformers: 4.56.2. \\ /| NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux. O^O/ \_/ \ Torch: 2.9.0+cu126. CUDA: 8.9. CUDA Toolkit: 12.6. Triton: 3.5.0 \ / Bfloat16 = TRUE. FA [Xformers = 0.0.33.post1. FA2 = False] "-____-" Free license: http://github.com/unslothai/unsloth Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors.index.json: 0.00B [00:00, ?B/s]
model-00001-of-00009.safetensors: 0%| | 0.00/4.99G [00:00<?, ?B/s]
model-00002-of-00009.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00003-of-00009.safetensors: 0%| | 0.00/4.99G [00:00<?, ?B/s]
model-00004-of-00009.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00005-of-00009.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00006-of-00009.safetensors: 0%| | 0.00/5.00G [00:00<?, ?B/s]
model-00007-of-00009.safetensors: 0%| | 0.00/4.99G [00:00<?, ?B/s]
model-00008-of-00009.safetensors: 0%| | 0.00/4.99G [00:00<?, ?B/s]
model-00009-of-00009.safetensors: 0%| | 0.00/3.97G [00:00<?, ?B/s]
Loading checkpoint shards: 0%| | 0/9 [00:00<?, ?it/s]
generation_config.json: 0%| | 0.00/226 [00:00<?, ?B/s]
tokenizer_config.json: 0.00B [00:00, ?B/s]
tokenizer.model: 0%| | 0.00/1.61M [00:00<?, ?B/s]
tokenizer.json: 0%| | 0.00/11.2M [00:00<?, ?B/s]
added_tokens.json: 0.00B [00:00, ?B/s]
special_tokens_map.json: 0.00B [00:00, ?B/s]
chat_template.jinja: 0%| | 0.00/754 [00:00<?, ?B/s]
We now add LoRA adapters so we only need to update a small amount of parameters!
Unsloth: Making `model.base_model.model.model` require gradients
Data Prep
We now use the ERNIE-4.5 format for conversation style finetunes. We use Maxime Labonne's FineTome-100k dataset in ShareGPT style. ERNIE-4.5 renders multi turn conversations like below:
<|begin_of_sentence|>User: Hello
Assistant: Hey there!<|end_of_sentence|>
We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3 and more.
README.md: 0%| | 0.00/982 [00:00<?, ?B/s]
data/train-00000-of-00001.parquet: 0%| | 0.00/117M [00:00<?, ?B/s]
Generating train split: 0%| | 0/100000 [00:00<?, ? examples/s]
We now use standardize_data_formats to try converting datasets to the correct format for finetuning purposes!
Unsloth: Standardizing formats (num_proc=12): 0%| | 0/100000 [00:00<?, ? examples/s]
Let's see how row 100 looks like!
{'conversations': [{'content': 'What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?',
, 'role': 'user'},
, {'content': 'In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.',
, 'role': 'assistant'}],
, 'source': 'infini-instruct-top-500k',
, 'score': 4.774171352386475} We now have to apply the chat template for ERNIE-4.5 onto the conversations, and save it to text.
Map: 0%| | 0/100000 [00:00<?, ? examples/s]
Let's see how the chat template did!
'<|begin_of_sentence|>User: What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?\nAssistant: In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|end_of_sentence|></s>' Unsloth: Tokenizing ["text"] (num_proc=16): 0%| | 0/100000 [00:00<?, ? examples/s]
'<|begin_of_sentence|>User: Explain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.\nAssistant: Boolean operators are logical operators used in programming to manipulate boolean values. They operate on one or more boolean operands and return a boolean result. The three main boolean operators are "AND" (&&), "OR" (||), and "NOT" (!).\n\nThe "AND" operator returns true if both of its operands are true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) and (y < 20) # This expression evaluates to True\n```\n\nThe "OR" operator returns true if at least one of its operands is true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) or (y < 20) # This expression evaluates to True\n```\n\nThe "NOT" operator negates the boolean value of its operand. It returns true if the operand is false, and false if the operand is true. For example:\n\n```python\nx = 5\nresult = not (x > 10) # This expression evaluates to True\n```\n\nOperator precedence refers to the order in which operators are evaluated in an expression. It ensures that expressions are evaluated correctly. In most programming languages, logical AND has higher precedence than logical OR. For example:\n\n```python\nresult = True or False and False # This expression is evaluated as (True or (False and False)), which is True\n```\n\nShort-circuit evaluation is a behavior where the second operand of a logical operator is not evaluated if the result can be determined based on the value of the first operand. In short-circuit evaluation, if the first operand of an "AND" operator is false, the second operand is not evaluated because the result will always be false. Similarly, if the first operand of an "OR" operator is true, the second operand is not evaluated because the result will always be true.\n\nIn programming languages that support short-circuit evaluation natively, you can use it to improve performance or avoid errors. For example:\n\n```python\nif x != 0 and (y / x) > 10:\n # Perform some operation\n```\n\nIn languages without native short-circuit evaluation, you can implement your own logic to achieve the same behavior. Here\'s an example in pseudocode:\n\n```\nif x != 0 {\n if (y / x) > 10 {\n // Perform some operation\n }\n}\n```\n\nTruthiness and falsiness refer to how non-boolean values are evaluated in boolean contexts. In many programming languages, non-zero numbers and non-empty strings are considered truthy, while zero, empty strings, and null/None values are considered falsy.\n\nWhen evaluating boolean expressions, truthiness and falsiness come into play. For example:\n\n```python\nx = 5\nresult = x # The value of x is truthy, so result is also truthy\n```\n\nTo handle cases where truthiness and falsiness are implemented differently across programming languages, you can explicitly check the desired condition. For example:\n\n```python\nx = 5\nresult = bool(x) # Explicitly converting x to a boolean value\n```\n\nThis ensures that the result is always a boolean value, regardless of the language\'s truthiness and falsiness rules.<|end_of_sentence|></s>' We also use Unsloth's train_on_completions method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!
Map (num_proc=16): 0%| | 0/100000 [00:00<?, ? examples/s]
Let's verify masking the instruction part is done! Let's print the 100th row again.
'<|begin_of_sentence|>User: What is the modulus operator in programming and how can I use it to calculate the modulus of two given numbers?\nAssistant: In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|end_of_sentence|></s>' Now let's print the masked out example - you should see only the answer is present:
' In programming, the modulus operator is represented by the \'%\' symbol. It calculates the remainder when one number is divided by another. To calculate the modulus of two given numbers, you can use the modulus operator in the following way:\n\n```python\n# Calculate the modulus\nModulus = a % b\n\nprint("Modulus of the given numbers is: ", Modulus)\n```\n\nIn this code snippet, the variables \'a\' and \'b\' represent the two given numbers for which you want to calculate the modulus. By using the modulus operator \'%\', we calculate the remainder when \'a\' is divided by \'b\'. The result is then stored in the variable \'Modulus\'. Finally, the modulus value is printed using the \'print\' statement.\n\nFor example, if \'a\' is 10 and \'b\' is 4, the modulus calculation would be 10 % 4, which equals 2. Therefore, the output of the above code would be:\n\n```\nModulus of the given numbers is: 2\n```\n\nThis means that the modulus of 10 and 4 is 2.<|end_of_sentence|></s>' GPU = NVIDIA L4. Max memory = 22.161 GB. 11.576 GB of memory reserved.
Let's train the model! To resume a training run, set trainer.train(resume_from_checkpoint = True)
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1 \\ /| Num examples = 100,000 | Num Epochs = 1 | Total steps = 60 O^O/ \_/ \ Batch size per device = 4 | Gradient accumulation steps = 2 \ / Data Parallel GPUs = 1 | Total batch size (4 x 2 x 1) = 8 "-____-" Trainable parameters = 177,545,216 of 22,002,983,104 (0.81% trained)
Unsloth: Will smartly offload gradients to save VRAM!
1606.6125 seconds used for training. 26.78 minutes used for training. Peak reserved memory = 16.764 GB. Peak reserved memory for training = 5.188 GB. Peak reserved memory % of max memory = 75.646 %. Peak reserved memory for training % of max memory = 23.41 %.
The sequence is a Fibonacci sequence, where each number is the sum of the two preceding ones. The next numbers in the sequence would be 13, 21, 34, 55, 89, 144, and so on.<|end_of_sentence|></s>
('lora_model/tokenizer_config.json',
, 'lora_model/special_tokens_map.json',
, 'lora_model/chat_template.jinja',
, 'lora_model/tokenizer.model',
, 'lora_model/added_tokens.json',
, 'lora_model/tokenizer.json') Now if you want to load the LoRA adapters we just saved for inference, set False to True:
Saving to float16 for VLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.
GGUF / llama.cpp Conversion
To save to GGUF / llama.cpp, we support it natively now! We clone llama.cpp and we default save it to q8_0. We allow all methods like q4_k_m. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our docs page):
q8_0- Fast conversion. High resource use, but generally acceptable.q4_k_m- Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.q5_k_m- Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[NEW] To finetune and auto export to Ollama, try our Ollama notebook
Likewise, if you want to instead push to GGUF to your Hugging Face account, set if False to if True and add your Hugging Face token and upload location!
And we're done! If you have any questions on Unsloth, we have a Discord channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!
Some other resources:
- Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
- Learn how to do Reinforcement Learning with our RL Guide and notebooks.
- Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
- Explore our LLM Tutorials Directory to find dedicated guides for each model.
- Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.



