Meta Synthetic Data Llama3 2 (3B)

Installation

[ ]
[ ]

Unsloth

Primary Goal

Our goal is to make Llama 3.2 3B understand the "Byte Latent Transformer: Patches Scale Better Than Tokens" research paper that was published in December 2024.

We'll use https://github.com/meta-llama/synthetic-data-kit to generate question-and-answer pairs fully locally, which we'll then use to finetune Llama 3.2 3B!

[4]
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-01 14:15:23 [__init__.py:239] Automatically detected platform cuda.
config.json:   0%|          | 0.00/890 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]
Unsloth: Using dtype = torch.float16 for vLLM.
Unsloth: vLLM loading unsloth/Llama-3.2-3B-Instruct with actual GPU utilization = 89.39%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 7.19 GB. Also swap space = 2 GB.
vLLM STDOUT: INFO 05-01 14:15:38 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 05-01 14:15:39 [api_server.py:981] vLLM API server version 0.8.2
vLLM STDOUT: INFO 05-01 14:15:39 [api_server.py:982] args: Namespace(subparser='serve', model_tag='unsloth/Llama-3.2-3B-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='unsloth/Llama-3.2-3B-Instruct', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=2048, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=2.0, cpu_offload_gb=0, gpu_memory_utilization=0.8938626454842437, num_gpu_blocks_override=None, max_num_batched_tokens=2048, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=192, max_logprobs=0, disable_log_stats=True, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=2304, 
disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config={"level":3,"splitting_ops":[]}, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, 
enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7fb5b68e1a80>)
vLLM STDOUT: WARNING 05-01 14:15:40 [config.py:2614] Casting torch.bfloat16 to torch.float16.
vLLM STDOUT: INFO 05-01 14:15:54 [config.py:585] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
vLLM STDOUT: WARNING 05-01 14:15:54 [arg_utils.py:1854] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0.
vLLM STDOUT: INFO 05-01 14:15:54 [api_server.py:241] Started engine process with PID 2108
vLLM STDOUT: INFO 05-01 14:16:02 [__init__.py:239] Automatically detected platform cuda.
vLLM STDOUT: INFO 05-01 14:16:03 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='unsloth/Llama-3.2-3B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.2-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":192}, use_cached_outputs=True,
vLLM STDOUT: INFO 05-01 14:16:06 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
vLLM STDOUT: INFO 05-01 14:16:06 [cuda.py:288] Using XFormers backend.
vLLM STDOUT: INFO 05-01 14:16:07 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
vLLM STDOUT: INFO 05-01 14:16:07 [model_runner.py:1110] Starting to load model unsloth/Llama-3.2-3B-Instruct...
vLLM STDOUT: INFO 05-01 14:16:08 [weight_utils.py:265] Using model weights format ['*.safetensors']
vLLM STDOUT: INFO 05-01 14:25:37 [weight_utils.py:281] Time spent downloading weights for unsloth/Llama-3.2-3B-Instruct: 569.324218 seconds
vLLM STDOUT: INFO 05-01 14:26:04 [loader.py:447] Loading weights took 27.20 seconds
vLLM STDOUT: INFO 05-01 14:26:05 [model_runner.py:1146] Model loading took 6.0160 GB and 597.363926 seconds
vLLM STDOUT: INFO 05-01 14:26:17 [backends.py:415] Using cache directory: /root/.cache/vllm/torch_compile_cache/3fbc507988/rank_0_0 for vLLM's torch.compile
vLLM STDOUT: INFO 05-01 14:26:17 [backends.py:425] Dynamo bytecode transform time: 11.58 s
vLLM STDOUT: INFO 05-01 14:26:58 [backends.py:132] Cache the graph of shape None for later use
vLLM STDOUT: INFO 05-01 14:26:58 [backends.py:144] Compiling a graph for general shape takes 36.18 s
vLLM STDOUT: INFO 05-01 14:26:59 [monitor.py:33] torch.compile takes 47.76 s in total
vLLM STDOUT: INFO 05-01 14:27:02 [worker.py:267] Memory profiling takes 56.21 seconds
vLLM STDOUT: INFO 05-01 14:27:02 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.89) = 13.18GiB
vLLM STDOUT: INFO 05-01 14:27:02 [worker.py:267] model weights take 6.02GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 0.90GiB; the rest of the memory reserved for KV Cache is 6.22GiB.
vLLM STDOUT: INFO 05-01 14:27:02 [executor_base.py:111] # cuda blocks: 3637, # CPU blocks: 1170
vLLM STDOUT: INFO 05-01 14:27:02 [executor_base.py:116] Maximum concurrency for 2048 tokens per request: 28.41x
vLLM STDOUT: INFO 05-01 14:27:05 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
vLLM STDOUT: INFO 05-01 14:27:46 [model_runner.py:1570] Graph capturing finished in 41 secs, took 0.17 GiB
vLLM STDOUT: INFO 05-01 14:27:46 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 100.89 seconds
vLLM STDOUT: WARNING 05-01 14:27:47 [config.py:1028] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
vLLM STDOUT: INFO 05-01 14:27:47 [serving_chat.py:115] Using default chat sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
vLLM STDOUT: INFO 05-01 14:27:47 [serving_completion.py:61] Using default completion sampling params from model: {'temperature': 0.6, 'top_p': 0.9}
vLLM STDOUT: INFO 05-01 14:27:47 [api_server.py:1028] Starting vLLM API server on http://0.0.0.0:8000

--- vLLM Server Ready (Detected: 'Starting vLLM API server on') ---
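
The memory figures in the server log above can be cross-checked with a little arithmetic. The sketch below is only a sanity check under stated assumptions: the model shape (28 layers, 8 key/value heads, head dimension 128) comes from the published Llama 3.2 3B config rather than from this notebook, and a block size of 16 tokens is assumed because the log shows block_size=None.

```python
# Cross-check of the vLLM memory log under stated assumptions.
NUM_LAYERS = 28          # Llama 3.2 3B config (assumption, not printed in the log)
NUM_KV_HEADS = 8         # Llama 3.2 3B config (assumption)
HEAD_DIM = 128           # Llama 3.2 3B config (assumption)
BYTES_PER_VALUE = 2      # float16, as reported in the log
BLOCK_SIZE = 16          # tokens per KV-cache block (assumed default)
GIB = 1024 ** 3

# Both keys and values are cached, hence the leading factor of 2.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
bytes_per_block = bytes_per_token * BLOCK_SIZE

gpu_blocks = 3637                                   # "# cuda blocks" from the log
kv_cache_gib = gpu_blocks * bytes_per_block / GIB
max_concurrency = gpu_blocks * BLOCK_SIZE / 2048    # max_model_len = 2048
cpu_blocks = int(2 * GIB // bytes_per_block)        # swap_space = 2 GiB

print(f"KV cache: {kv_cache_gib:.2f} GiB")
print(f"Max concurrency: {max_concurrency:.2f}x")
print(f"CPU blocks: {cpu_blocks}")
```

Under these assumptions the numbers line up with the log: about 6.22 GiB of KV cache, 28.41x concurrency for 2048-token requests, and 1170 CPU blocks.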

Generate QA Pairs + Auto clean data

We now use synthetic-data-kit to generate question-answer pairs:

[5]

Check if it succeeded:

[6]
 VLLM server is running at http://localhost:8000/v1
Available models: {'object': 'list', 'data': [{'id': 
'unsloth/Llama-3.2-3B-Instruct', 'object': 'model', 'created': 1746109670, 
'owned_by': 'vllm', 'root': 'unsloth/Llama-3.2-3B-Instruct', 'parent': None, 
'max_model_len': 2048, 'permission': [{'id': 
'modelperm-4b22dda9834e436cb033f1d00ee42ee8', 'object': 'model_permission', 
'created': 1746109670, 'allow_create_engine': False, 'allow_sampling': True, 
'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 
'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': 
False}]}]}
 Checking VLLM server at http://localhost:8000/v1...
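
If you want to run the same check by hand, here is a minimal sketch using only the standard library. It assumes the server exposes an OpenAI-style /v1/models endpoint at the address shown above. The helper names list_model_ids and fetch_models are hypothetical and not part of synthetic-data-kit.

```python
import json
import urllib.request


def list_model_ids(payload: dict) -> list:
    """Return the model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> dict:
    """GET <base_url>/models and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
        return json.load(response)


# Trimmed copy of the response printed above, so the parser can be tried offline.
sample = {
    "object": "list",
    "data": [
        {
            "id": "unsloth/Llama-3.2-3B-Instruct",
            "object": "model",
            "owned_by": "vllm",
            "max_model_len": 2048,
        }
    ],
}
print(list_model_ids(sample))
```

With the server running, list_model_ids(fetch_models()) should return the same single model id.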

Document Parsing (PDF, CSV, HTML etc.)

Now, let's take the Byte Latent Transformer: Patches Scale Better Than Tokens research paper found at https://arxiv.org/abs/2412.09871 and convert it to Q&A pairs in order to finetune Llama 3.2!

[7]
 Processing https://arxiv.org/html/2412.09871v1...
 Text successfully extracted to data/output/arxiv_org.txt
37 ['data/output/arxiv_org_0.txt', 'data/output/arxiv_org_1.txt', 'data/output/arxiv_org_2.txt']

We get 37 chunks of data. We now call synthetic-data-kit to generate QA pairs for 3 of our chunks.

You can process more chunks, but it'll be much slower!
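
To make the idea of chunking concrete, here is a minimal sketch of a character-based splitter with optional overlap. This is an illustration only: chunk_text is a hypothetical helper, and the chunker actually used in this notebook may split differently (for example by tokens rather than characters).

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split text into pieces of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Smaller chunks mean more calls to the model, which is why processing every chunk takes much longer than processing three.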

--num-pairs sets the approximate number of QA pairs to generate. The actual count may be lower or higher depending on the max_seq_length of the loaded model. For example, if you request 100 pairs, you might only get 10, because the model's maximum sequence length caps how much text it can generate.
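
A rough mental model for this cap: the prompt (instructions plus the chunk) and the generated pairs must fit in the same context window. The sketch below is back-of-the-envelope arithmetic, not how synthetic-data-kit computes anything. The function estimate_max_pairs and the token counts 1500 and 50 are hypothetical values chosen for illustration.

```python
def estimate_max_pairs(max_seq_length: int, prompt_tokens: int, tokens_per_pair: int) -> int:
    """Estimate how many QA pairs fit in the tokens left after the prompt."""
    budget = max(max_seq_length - prompt_tokens, 0)
    return budget // tokens_per_pair


requested = 100
feasible = estimate_max_pairs(
    max_seq_length=2048,   # the max_model_len the server was started with
    prompt_tokens=1500,    # hypothetical: instructions plus one chunk
    tokens_per_pair=50,    # hypothetical: one short question and answer
)
print(min(requested, feasible))
```

With these illustrative numbers, only about 10 of the 100 requested pairs would fit.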

[8]
Processing 1 chunks to generate QA pairs...
Batch processing complete.
Generated 10 QA pairs total
Saving result to data/generated/arxiv_org_0_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/arxiv_org_0_qa_pairs.json
 Generating qa content from data/output/arxiv_org_0.txt...
 Content saved to data/generated/arxiv_org_0_qa_pairs.json
Processing 1 chunks to generate QA pairs...
Batch processing complete.
Generated 12 QA pairs total
Saving result to data/generated/arxiv_org_1_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/arxiv_org_1_qa_pairs.json
 Generating qa content from data/output/arxiv_org_1.txt...
 Content saved to data/generated/arxiv_org_1_qa_pairs.json
Processing 1 chunks to generate QA pairs...
Batch processing complete.
Generated 13 QA pairs total
Saving result to data/generated/arxiv_org_2_qa_pairs.json
Successfully wrote test file to data/generated/test_write.json
Successfully wrote result to data/generated/arxiv_org_2_qa_pairs.json
 Generating qa content from data/output/arxiv_org_2.txt...
 Content saved to data/generated/arxiv_org_2_qa_pairs.json

Optionally, you can clean the data by pruning "bad" or low-quality examples and converting the rest to JSON format for finetuning!

[ ]

We now convert the generated datasets into a QA format so we can load them for finetuning:

[12]
 Converting data/generated/arxiv_org_0_qa_pairs.json to ft format with json storage...
 Converted to ft format and saved to data/final/arxiv_org_0_qa_pairs_ft.json
 Converting data/generated/arxiv_org_1_qa_pairs.json to ft format with json storage...
 Converted to ft format and saved to data/final/arxiv_org_1_qa_pairs_ft.json
 Converting data/generated/arxiv_org_2_qa_pairs.json to ft format with json storage...
 Converted to ft format and saved to data/final/arxiv_org_2_qa_pairs_ft.json

Let's load up the data and see what the synthetic data looks like!

[25]
[27]
{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
              {'content': 'What is the primary unit of computation in the Byte Latent Transformer (BLT) architecture?',
               'role': 'user'},
              {'content': 'Patches', 'role': 'assistant'}]}
[28]
{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
              {'content': 'How do patches in BLT architecture get segmented?',
               'role': 'user'},
              {'content': 'Based on the entropy of the next byte', 'role': 'assistant'}]}
[31]
{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
              {'content': 'What is the main purpose of the formal definition of the patching function?',
               'role': 'user'},
              {'content': 'to formally describe the process of segmenting bytes into patches',
               'role': 'assistant'}]}
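
Each record above is a three-turn conversation: a fixed system prompt, the question as the user turn, and the answer as the assistant turn. A minimal sketch of that mapping is shown here. qa_to_messages is a hypothetical helper written for illustration, not the converter that synthetic-data-kit uses.

```python
def qa_to_messages(question: str, answer: str,
                   system_prompt: str = "You are a helpful assistant.") -> dict:
    """Wrap one QA pair in the chat-message layout shown in the records above."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


record = qa_to_messages(
    "How do patches in BLT architecture get segmented?",
    "Based on the entropy of the next byte",
)
print([message["role"] for message in record["messages"]])
```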

Finally, terminate the vLLM process to free GPU memory for finetuning!

[32]
Attempting to terminate the VLLM server gracefully...
Server terminated gracefully.

Fine-tuning Synthetic Dataset with Unsloth

[33]
==((====))==  Unsloth 2025.4.3: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]
generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]
tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]
tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]
special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

We now add LoRA adapters so we only need to update a small fraction of all parameters (typically 1 to 10%; about 0.81% in this run, per the training log)!

[34]
Unsloth 2025.4.3 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
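
To see where the trainable parameter count comes from, recall that a LoRA adapter of rank r on a linear layer with in_features inputs and out_features outputs adds r * (in_features + out_features) weights. The sketch below applies this to all seven projection matrices in each of the 28 layers. The layer dimensions are taken from the published Llama 3.2 3B config, and rank 16 on all seven projections is an assumption (the LoRA cell itself is not shown here), chosen because it reproduces the 24,313,856 trainable parameters reported in the training banner.

```python
HIDDEN = 3072            # hidden_size (Llama 3.2 3B config, assumption)
INTERMEDIATE = 8192      # intermediate_size (assumption)
KV_DIM = 8 * 128         # num_key_value_heads * head_dim (assumption)
NUM_LAYERS = 28          # matches "patched 28 layers" in the output above
RANK = 16                # assumed LoRA rank

# (in_features, out_features) of every adapted projection in one layer.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```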

Data Prep

We now use the Llama-3.2 format for conversation-style finetunes. The chat template renders conversations as shown below (the "Cutting Knowledge Date" line is included by default):

	<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 May 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

2<|eot_id|>
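
To make the layout above explicit, here is a hand-rolled sketch that reproduces it with plain string formatting. It is an approximation for illustration only. In practice you should rely on the tokenizer's apply_chat_template, which handles details this sketch ignores (such as whitespace trimming and tool calls). The function render_llama32 and its default dates are hypothetical.

```python
def render_llama32(messages, today="01 May 2025", cutoff="December 2023",
                   add_generation_prompt=False):
    """Approximate the Llama 3.2 chat layout shown above."""
    # Assumption: a conversation without a system turn gets an empty one,
    # so the date header is always emitted.
    if not messages or messages[0]["role"] != "system":
        messages = [{"role": "system", "content": ""}] + list(messages)

    parts = ["<|begin_of_text|>"]
    for index, message in enumerate(messages):
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        if index == 0:
            # The date lines sit at the top of the system turn.
            parts.append(f"Cutting Knowledge Date: {cutoff}\nToday Date: {today}\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # Leave an open assistant header so the model writes the reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
]
print(render_llama32(conversation))
```

Passing add_generation_prompt=True with only the system and user turns yields the prompt form used at inference time, ending in an open assistant header.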

[35]
Map:   0%|          | 0/35 [00:00<?, ? examples/s]

See result of the first row:

[36]
{'messages': [{'content': 'You are a helpful assistant.', 'role': 'system'},
              {'content': 'What is the primary unit of computation in the Byte Latent Transformer (BLT) architecture?',
               'role': 'user'},
              {'content': 'Patches', 'role': 'assistant'}],
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 01 May 2025\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWhat is the primary unit of computation in the Byte Latent Transformer (BLT) architecture?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nPatches<|eot_id|>'}

Train the model

Now let's train our model. We cap training at 60 steps to keep the demo short, but you can set num_train_epochs=1 for a full run and disable the step cap by setting max_steps=None.

[54]
Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/35 [00:00<?, ? examples/s]
[38]
GPU = Tesla T4. Max memory = 14.741 GB.
3.441 GB of memory reserved.
[41]
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 35 | Num Epochs = 15 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)
[42]
110.0157 seconds used for training.
1.83 minutes used for training.
Peak reserved memory = 3.441 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 23.343 %.
Peak reserved memory for training % of max memory = 0.0 %.
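
The figures in the training banner and the statistics above can be reproduced with a few lines of arithmetic. This is a sketch that assumes the trainer counts optimizer steps per epoch by floor-dividing the number of batches by the gradient accumulation steps. That bookkeeping is an assumption, chosen because it matches the 15 epochs reported in the banner.

```python
import math

num_examples = 35
per_device_batch = 2
grad_accum = 4
max_steps = 60

# 35 examples in batches of 2 gives 18 batches (the last one is partial).
batches_per_epoch = math.ceil(num_examples / per_device_batch)
# Assumed bookkeeping: floor division, with at least one step per epoch.
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
epochs = math.ceil(max_steps / steps_per_epoch)

trainable_pct = 100 * 24_313_856 / 3_000_000_000   # banner's parameter counts
training_minutes = 110.0157 / 60                   # seconds to minutes
peak_memory_pct = 100 * 3.441 / 14.741             # reserved GB over max GB

print(f"Epochs: {epochs}")
print(f"Trainable: {trainable_pct:.2f}%")
print(f"Training time: {training_minutes:.2f} minutes")
print(f"Peak reserved memory: {peak_memory_pct:.3f}% of max")
```

With only 35 examples, 60 steps corresponds to many passes over the data, which is why the banner reports 15 epochs.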

Inference

Let's run the model! Use apply_chat_template with add_generation_prompt set to True for inference.

[43]
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
A novel architecture that uses patches to encode bytes<|eot_id|>

The model has learned about the research paper!

[45]
Improved inference capabilities, enhanced model robustness, and the ability to scale while maintaining a fixed-inference budget<|eot_id|>

Saving, loading finetuned models

To save the final model as LoRA adapters, use either Hugging Face's push_to_hub for an online save or save_pretrained for a local save.

[NOTE] This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

[46]
('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set False to True:

[49]
It dynamically groups bytes into patches preserving access to the byte-level information<|eot_id|>

Saving to float16 for vLLM

We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving only the LoRA adapters is available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can get a personal access token at https://huggingface.co/settings/tokens.

[ ]

GGUF / llama.cpp Conversion

We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and save to q8_0 by default. All other methods, such as q4_k_m, are allowed too. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to Hugging Face.

Some supported quant methods (full list on our Wiki page):

  • q8_0 - Fast conversion. High resource use, but generally acceptable.
  • q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
  • q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
[53]