Meta Synthetic Data Llama3.1 (8B)


To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to prepare data, train the model, run it, and save it.

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. 3x faster | 500K Context

New in Reinforcement Learning: FP8 RL, Vision RL, Standby, gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]
[ ]

Unsloth

Synthetic-data-kit

[3]
nohup: redirecting stderr to stdout
[4]
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING 04-29 18:42:44 [config.py:2972] Casting torch.bfloat16 to torch.float16.
INFO 04-29 18:43:00 [config.py:2003] Chunked prefill is enabled with max_num_batched_tokens=2048.
E0000 00:00:1745952186.991164    2914 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO 04-29 18:43:16 [weight_utils.py:265] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
INFO 04-29 18:43:35 [gpu_model_runner.py:1347] Model loading took 5.5859 GiB and 20.094828 seconds
INFO 04-29 18:43:49 [backends.py:430] Dynamo bytecode transform time: 13.67 s
INFO 04-29 18:43:53 [backends.py:136] Cache the graph of shape None for later use
INFO 04-29 18:44:31 [backends.py:148] Compiling a graph for general shape takes 41.48 s
INFO 04-29 18:44:51 [kv_cache_utils.py:637] Maximum concurrency for 10,000 tokens per request: 16.71x

Optional: a function to check whether the vLLM server is running. Change False to True and run the cell.
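A minimal sketch of such a check, using only the standard library. It assumes the vLLM server exposes its OpenAI-compatible API at http://localhost:8000 and answers GET /v1/models; the port and the function name is_vllm_server_running are assumptions for illustration, not taken from this notebook.

```python
import http.client
import urllib.request

def is_vllm_server_running(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP error statuses.
        return False

if False:  # change False to True to run the check
    print("vLLM server running:", is_vllm_server_running())
```

If the server was started on a different host or port, pass that base URL instead.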

[6]

Create data directories
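One way to create the folders, sketched with pathlib. The four stage names are the ones that appear in this notebook's output paths (data/output, data/generated, data/cleaned, data/final); the helper name create_data_dirs is hypothetical.

```python
from pathlib import Path

def create_data_dirs(root="data", names=("output", "generated", "cleaned", "final")):
    """Create root/<name> for each pipeline stage and return the paths."""
    paths = [Path(root) / name for name in names]
    for path in paths:
        # parents=True also creates the root; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
    return paths
```

Calling create_data_dirs() with no arguments creates the folders under ./data.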

[5]

Ingest source file

Ingest the source page "https://ai.meta.com/blog/llama-4-multimodal-intelligence/". PDF, DOCX and PPT files and YouTube videos can also be ingested.

[6]
Text successfully extracted to data/output/ai_meta_com.txt

Generate QA pairs

Generate QA pairs with the help of vLLM and Llama-3.1-8B-Instruct-unsloth-bnb-4bit. Set num_pairs to the number of pairs you need.

[9]

Proceeding with process_file call...
Processing 1 chunks to generate QA pairs...
Batch processing complete.
Generated 10 QA pairs total
Saving result to data/generated/ai_meta_com_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/ai_meta_com_qa_pairs.json
Content saved to data/generated/ai_meta_com_qa_pairs.json

Generated content (first 500 chars):
{
  "summary": "Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:\n\nMeta AI has announced the Llama 4 herd, a new era of natively multimodal AI innovation, with the release of three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. These models are designed to enable people to build more personalized multimodal experiences, with Llama 4 Scout and Llama 4 Maverick offering industry-leading performance in image and text understanding, an...

Curate Data Pairs

[10]
Curating generated pairs...
Processing 1 batches of QA pairs...
Batch processing complete.
Rated 10 QA pairs
Retained 10 pairs (threshold: 7.0)
Average score: 8.2
Cleaned content saved to data/cleaned/ai_meta_com_qa_pairs_cleaned.json
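The curation log above amounts to a threshold filter over rated pairs. A small sketch of that idea follows; the "rating" field name, the sample ratings, and the choice to average over all rated pairs are assumptions for illustration, not the tool's actual schema or data.

```python
def curate_pairs(rated_pairs, threshold=7.0):
    """Keep pairs rated at or above the threshold; also return the mean rating."""
    kept = [pair for pair in rated_pairs if pair["rating"] >= threshold]
    # Assumption: the average is taken over every rated pair, not only the kept ones.
    average = sum(pair["rating"] for pair in rated_pairs) / len(rated_pairs) if rated_pairs else 0.0
    return kept, average

# Illustrative ratings only; these are not the scores from the run above.
sample = [
    {"question": "Q1", "answer": "A1", "rating": 9.0},
    {"question": "Q2", "answer": "A2", "rating": 6.5},
    {"question": "Q3", "answer": "A3", "rating": 8.0},
]
kept, average = curate_pairs(sample, threshold=7.0)
print(f"Retained {len(kept)} of {len(sample)} pairs, average score {average:.1f}")
```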

Generated content (first 500 chars):
{
  "summary": "Here is a summary of the document in 3-5 sentences, focusing on the main topic and key concepts:\n\nMeta AI has announced the Llama 4 herd, a new era of natively multimodal AI innovation, with the release of three models: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. These models are designed to enable people to build more personalized multimodal experiences, with Llama 4 Scout and Llama 4 Maverick offering industry-leading performance in image and text understanding, an...

Save to ChatML format

[11]
Converted to ft format and saved to data/final/ai_meta_com_qa_pairs_cleaned_ft.json
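The ft file holds one record per QA pair, each with a system, user and assistant message. A sketch of that conversion in plain Python; the "question" and "answer" input keys and the function name are assumptions, while the system prompt and the example pair are taken from this notebook's output.

```python
import json

def qa_pairs_to_messages(qa_pairs, system_prompt="You are a helpful assistant."):
    """Wrap each QA pair in a three-turn 'messages' record."""
    records = []
    for pair in qa_pairs:
        records.append({
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": pair["question"]},
                {"role": "assistant", "content": pair["answer"]},
            ]
        })
    return records

example = [{"question": "What is the context window of Llama 4 Scout?", "answer": "10M"}]
converted = qa_pairs_to_messages(example)
print(json.dumps(converted, indent=2))
```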

Converted content (first 500 chars):
[
  {
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the name of the first open-weight natively multimodal models with unprecedented context length support in the Llama 4 series?"
      },
      {
        "role": "assistant",
        "content": "Llama 4 Scout and Llama 4 Maverick"
      }
    ]
  },
  {
    "messages": [
      {
        "role": "system",
        "content": ...
[12]
Attempting to terminate the VLLM server

Unsloth

[16]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch SmolVLMForConditionalGeneration forward function.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-29 16:46:02 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.1: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]
generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

[17]
Unsloth 2025.4.1 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
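To see why so few parameters are trained, it helps to count them. A LoRA adapter of rank r on a linear layer with in and out features adds r x (in + out) weights. The sketch below assumes Llama-3.2-3B-style dimensions (hidden size 3072, key/value width 1024, MLP width 8192) and rank 16 on all seven projections; none of these values is stated in the output above, but with the 28 patched layers they reproduce the 24,313,856 trainable parameters that the training banner reports.

```python
def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank) per linear layer."""
    return rank * (in_features + out_features)

# Assumed dimensions (Llama-3.2-3B-style); only LAYERS = 28 comes from the log.
HIDDEN, KV_DIM, MLP_DIM, LAYERS, RANK = 3072, 1024, 8192, 28, 16

per_layer = (
    lora_param_count(HIDDEN, HIDDEN, RANK)     # q_proj
    + lora_param_count(HIDDEN, KV_DIM, RANK)   # k_proj
    + lora_param_count(HIDDEN, KV_DIM, RANK)   # v_proj
    + lora_param_count(HIDDEN, HIDDEN, RANK)   # o_proj
    + lora_param_count(HIDDEN, MLP_DIM, RANK)  # gate_proj
    + lora_param_count(HIDDEN, MLP_DIM, RANK)  # up_proj
    + lora_param_count(MLP_DIM, HIDDEN, RANK)  # down_proj
)
total = per_layer * LAYERS
share = total / 3_000_000_000 * 100  # banner rounds the base model to 3,000,000,000

print(f"{total:,} trainable parameters ({share:.2f}% of the base model)")
```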

Data Prep

We now use the ChatML format for conversation-style finetunes. We use Open Assistant conversations in ShareGPT style. ChatML renders multi-turn conversations like below:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What's the capital of France?<|im_end|>
<|im_start|>assistant
Paris.
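As a rough illustration of the layout, here is a plain-Python renderer. It is a simplified sketch, not the tokenizer's actual Jinja template: it closes every turn with <|im_end|> and can append an open assistant turn, which is the role add_generation_prompt plays at inference time.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render role/content messages as ChatML text."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "<|im_start|>assistant\n"
    return text

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]
print(render_chatml(conversation, add_generation_prompt=True))
```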

[NOTE] To train only on completions (ignoring the user's input), see our docs.

We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old and our own optimized unsloth template.

Normally one has to train <|im_start|> and <|im_end|>. We instead map <|im_end|> to the EOS token and leave <|im_start|> as is. This requires no training of additional tokens.

Note that ShareGPT uses {"from": "human", "value" : "Hi"} and not {"role": "user", "content" : "Hi"}, so we use a mapping to convert between the two.
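The effect of that mapping can be sketched in plain Python. This only illustrates the key and role renaming; it is not how get_chat_template applies it internally, and the helper name sharegpt_to_messages is hypothetical.

```python
# ShareGPT speaker names mapped to role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(conversation):
    """Rename 'from'/'value' to 'role'/'content' and translate speaker names."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

print(sharegpt_to_messages([{"from": "human", "value": "Hi"}]))
```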

For text completions like novel writing, try this notebook.

[18]
Generating train split: 0 examples [00:00, ? examples/s]
Map:   0%|          | 0/10 [00:00<?, ? examples/s]
[19]
[{'content': 'You are a helpful assistant.', 'role': 'system'},
 {'content': 'What is the context window of Llama 4 Scout?', 'role': 'user'},
 {'content': '10M', 'role': 'assistant'}]
[20]
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 29 Apr 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the context window of Llama 4 Scout?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

10M<|eot_id|>

If you're looking to make your own chat template, that is also possible! You must use the Jinja templating regime. We provide our own stripped-down version of the Unsloth template, which we find to be more efficient and which leverages ChatML, Zephyr and Alpaca styles.

More info on chat templates on our wiki page!

[21]


Train the model

Now let's train our model. We do 60 steps to speed things up, but for a full run you can set num_train_epochs=1 and max_steps=None. We also support DPOTrainer and GRPOTrainer for reinforcement learning!
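With only 10 examples, 60 steps means many passes over the data. The sketch below works through the arithmetic using the values the training banner reports (10 examples, batch size 2 per device, 4 gradient accumulation steps, 1 GPU, 60 steps). How the trainer rounds partial accumulation windows is an assumption here, chosen because it reproduces the reported 60 epochs.

```python
import math

def training_schedule(num_examples, per_device_batch, grad_accum, num_gpus, max_steps):
    """Return (total batch size, optimizer steps per epoch, epochs needed)."""
    total_batch = per_device_batch * grad_accum * num_gpus
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # Assumption: steps per epoch round down, with a minimum of one step.
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    epochs = math.ceil(max_steps / steps_per_epoch)
    return total_batch, steps_per_epoch, epochs

print(training_schedule(10, 2, 4, 1, 60))  # (8, 1, 60)
```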

[22]
Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/10 [00:00<?, ? examples/s]
[23]
GPU = NVIDIA L4. Max memory = 22.161 GB.
3.441 GB of memory reserved.
[24]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10 | Num Epochs = 60 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)
[25]
64.247 seconds used for training.
1.07 minutes used for training.
Peak reserved memory = 3.441 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 15.527 %.
Peak reserved memory for training % of max memory = 0.0 %.
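The figures above follow from three measurements: memory reserved before training, peak memory reserved afterwards, and the GPU's total memory. A sketch of the arithmetic with the values from this run (the function name memory_report is hypothetical):

```python
def memory_report(start_reserved_gb, peak_reserved_gb, max_memory_gb):
    """Return (GB used for training, peak % of max, training % of max)."""
    used_for_training = round(peak_reserved_gb - start_reserved_gb, 3)
    peak_pct = round(peak_reserved_gb / max_memory_gb * 100, 3)
    training_pct = round(used_for_training / max_memory_gb * 100, 3)
    return used_for_training, peak_pct, training_pct

# 3.441 GB was reserved before training and at the peak; the L4 has 22.161 GB.
print(memory_report(3.441, 3.441, 22.161))  # (0.0, 15.527, 0.0)
print(round(64.247 / 60, 2))                # 1.07 minutes
```

The training share is 0.0 GB because the peak equals what was already reserved when the model was loaded.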


Inference

Let's run the model! Since we're using ChatML, use apply_chat_template with add_generation_prompt set to True for inference.

[26]
Unsloth: Will map <|im_end|> to EOS = <|eot_id|>.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
['<|im_start|>user\nContinue the fibonacci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>\n<|im_start|>assistant\n10, 17, 34, 55, 89, 144, 233, 377, 610, 987, 1577, 2584, 4181, 6765, 10946, 17711, 28657, 46367, 75025']

You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

[27]
<|im_start|>user
Continue the fibonacci sequence: 1, 1, 2, 3, 5, 8,<|im_end|>
<|im_start|>assistant
8, 13<|im_end|>

Saving, loading finetuned models

To save the final model as LoRA adapters, either use Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

[28]
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
[29]
<|im_start|>user
What is a famous tall tower in Paris?<|im_end|>
<|im_start|>assistant
Eiffel
<|im_end|>assistant

A tower in Paris is called Eiffel.<|im_end|>

You can also use Hugging Face's AutoPeftModelForCausalLM. Only use this if you do not have unsloth installed. It can be hopelessly slow, since 4bit model downloading is not supported, and Unsloth's inference is 2x faster.

[30]

Saving to float16 for vLLM

We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.

[31]

GGUF / llama.cpp Conversion

Saving to GGUF / llama.cpp is now supported natively! We clone llama.cpp and save to q8_0 by default. Other methods such as q4_k_m are also available. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our docs page):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.
  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[32]

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or want to join projects, feel free to join our Discord!

Some other resources:

  1. Looking to use Unsloth locally? Read our Installation Guide for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.
  2. Learn how to do Reinforcement Learning with our RL Guide and notebooks.
  3. Read our guides and notebooks for Text-to-speech (TTS) and vision model support.
  4. Explore our LLM Tutorials Directory to find dedicated guides for each model.
  5. Need help with Inference? Read our Inference & Deployment page for details on using vLLM, llama.cpp, Ollama etc.


This notebook and all Unsloth notebooks are licensed LGPL-3.0