Unsloth Gemma3 (270M) Phone Deployment

To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github ⭐

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to do data prep, how to train, how to run the model, & how to save it

News

Introducing Unsloth Studio — a new open source, no-code web UI to train and run LLMs. Blog

Train models, no code needed

Run GGUF models on Mac, Windows & Linux

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]

Unsloth

[2]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.9: Fast Gemma3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
Unsloth: Gemma3 does not support SDPA - switching to fast eager.
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained

model.safetensors:   0%|          | 0.00/536M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/233 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

Unsloth: Applying QAT to mitigate quantization degradation

Data Prep

We now use the Gemma-3 format for conversation-style finetunes, with Maxime Labonne's FineTome-100k dataset in ShareGPT style. Gemma-3 renders multi-turn conversations as shown below:

<bos><start_of_turn>user
Hello!<end_of_turn>
<start_of_turn>model
Hey there!<end_of_turn>
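As a rough illustration of the layout above, here is a hand-written Python sketch of the turn format. It is not the actual template that get_chat_template returns. The function name `render_gemma3` is hypothetical, and folding a system message into the first turn is an assumption based on the rendered sample shown later in this notebook.

```python
def render_gemma3(messages, add_bos=True):
    """Approximate the Gemma-3 turn layout (illustrative only)."""
    turns = list(messages)
    system = ""
    # Assumption: a leading system message is prepended to the first turn.
    if turns and turns[0]["role"] == "system":
        system = turns[0]["content"] + "\n\n"
        turns = turns[1:]

    out = "<bos>" if add_bos else ""
    for i, message in enumerate(turns):
        # Gemma-3 names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        prefix = system if i == 0 else ""
        out += f"<start_of_turn>{role}\n{prefix}{message['content'].strip()}<end_of_turn>\n"
    return out


chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there!"},
]
rendered = render_gemma3(chat)
print(rendered)
```

For real training, rely on the tokenizer's own chat template rather than a reimplementation like this one.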

We use our get_chat_template function to get the correct chat template. We support zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3 and more.

[3]

[4]

README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

We sample 10k rows to speed up training for the first successful phone deployment.

[5]

We now use standardize_data_formats to try converting datasets to the correct format for finetuning purposes!

[6]

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/10000 [00:00<?, ? examples/s]
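To make the idea of standardizing concrete, the sketch below converts ShareGPT-style turns (`from` / `value` keys) into the `role` / `content` layout seen in the row printed below. It is a simplified illustration, not the implementation of standardize_data_formats. The names `ROLE_MAP`, `standardize_turn`, and `standardize_conversation` are hypothetical.

```python
# Hypothetical mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def standardize_turn(turn):
    """Return a {'role', 'content'} dict for one conversation turn."""
    if "role" in turn and "content" in turn:
        # Already in the target layout, so copy it through unchanged.
        return {"role": turn["role"], "content": turn["content"]}
    speaker = turn["from"]
    return {"role": ROLE_MAP.get(speaker, speaker), "content": turn["value"]}


def standardize_conversation(conversation):
    return [standardize_turn(turn) for turn in conversation]


sharegpt = [
    {"from": "human", "value": "Hello!"},
    {"from": "gpt", "value": "Hey there!"},
]
standardized = standardize_conversation(sharegpt)
print(standardized)
```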

Let's see what row 100 looks like!

[7]

{'conversations': [{'content': 'You are a helpful assistant, with no access to external functions.',
   'role': 'system'},
  {'content': '20 // 3 - 4 ** 2 + 5=', 'role': 'user'},
  {'content': "The answer to this expression is -6. \nHere's how to solve it: \n1. First, we need to evaluate the exponentiation operator (**), which has higher precedence than the other operators. So, 4 ** 2 equals 16. \n2. Next, we need to perform the division operation (20 // 3), which gives us 6 (since 20 divided by 3 is 6 with a remainder of 2). \n3. Now we can substitute the values we have so far into the expression: \n6 - 16 + 5 \n4. Finally, we can perform the remaining addition and subtraction operations from left to right: \n6 - 16 = -10 \n-10 + 5 = -5 \nTherefore, the final answer is -6.",
   'role': 'assistant'},
  {'content': 'Can you explain the concept of neural networks?',
   'role': 'user'},
  {'content': 'Sure, I can explain neural networks. Neural networks are a type of machine learning algorithm that are inspired by the structure and function of the human brain. They consist of layers of interconnected nodes, or neurons, each of which receives input data, processes it using a set of weights, and produces an output signal.\n\nThe basic idea behind neural networks is to use these layers of neurons to learn complex patterns in data, such as images or speech, by adjusting the weights between them through a process called training. During training, the network is fed a large amount of labeled examples, and adjusts its internal parameters to minimize the difference between its predictions and the true labels.\n\nOnce trained, a neural network can be used for a wide range of tasks, including classification, regression, and generation. They have been successfully applied to many areas of research and commercial applications, such as image recognition, speech recognition, natural language processing, and robotics.\n\nDo you have any specific questions about neural networks?',
   'role': 'assistant'}],
 'source': 'glaive-function-calling-v2-sharegpt',
 'score': 3.7550501823425293}

We now apply the Gemma-3 chat template to the conversations and save the result in a text field. Because we're finetuning, we remove the <bos> token with removeprefix('<bos>'): the processor adds this token before training, and the model expects only one.

[8]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]
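The effect of removeprefix('<bos>') can be shown in isolation. In the sketch below, `fake_prepend_bos` is a hypothetical stand-in for whatever later prepends the token during tokenization; it is not a real tokenizer call. `str.removeprefix` requires Python 3.9 or newer.

```python
BOS = "<bos>"

# A rendered conversation that still carries the template's leading <bos>.
rendered = "<bos><start_of_turn>user\nHello!<end_of_turn>\n"

# removeprefix only strips the token when it is at the very start of the string.
text = rendered.removeprefix(BOS)


def fake_prepend_bos(s):
    """Hypothetical stand-in for the step that adds <bos> before training."""
    return BOS + s


without_strip = fake_prepend_bos(rendered)  # two <bos> tokens
with_strip = fake_prepend_bos(text)         # exactly one <bos> token

print(without_strip.count(BOS), with_strip.count(BOS))
```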

Let's see how the chat template did! Notice there is no <bos> token, since the tokenizer will add one.

[9]

"<start_of_turn>user\nYou are a helpful assistant, with no access to external functions.\n\n20 // 3 - 4 ** 2 + 5=<end_of_turn>\n<start_of_turn>model\nThe answer to this expression is -6. \nHere's how to solve it: \n1. First, we need to evaluate the exponentiation operator (**), which has higher precedence than the other operators. So, 4 ** 2 equals 16. \n2. Next, we need to perform the division operation (20 // 3), which gives us 6 (since 20 divided by 3 is 6 with a remainder of 2). \n3. Now we can substitute the values we have so far into the expression: \n6 - 16 + 5 \n4. Finally, we can perform the remaining addition and subtraction operations from left to right: \n6 - 16 = -10 \n-10 + 5 = -5 \nTherefore, the final answer is -6.<end_of_turn>\n<start_of_turn>user\nCan you explain the concept of neural networks?<end_of_turn>\n<start_of_turn>model\nSure, I can explain neural networks. Neural networks are a type of machine learning algorithm that are inspired by the structure and function of the human brain. They consist of layers of interconnected nodes, or neurons, each of which receives input data, processes it using a set of weights, and produces an output signal.\n\nThe basic idea behind neural networks is to use these layers of neurons to learn complex patterns in data, such as images or speech, by adjusting the weights between them through a process called training. During training, the network is fed a large amount of labeled examples, and adjusts its internal parameters to minimize the difference between its predictions and the true labels.\n\nOnce trained, a neural network can be used for a wide range of tasks, including classification, regression, and generation. They have been successfully applied to many areas of research and commercial applications, such as image recognition, speech recognition, natural language processing, and robotics.\n\nDo you have any specific questions about neural networks?<end_of_turn>\n"

Train the model

Fine-tuning requires careful experimentation. To avoid wasting hours on a broken pipeline, we start with a short sanity check capped by max_steps (the run logged below shows Total steps = 60). This confirms that training is stable and that the model exports correctly to your phone.

Run this short test first. If the export succeeds, come back and set max_steps = -1 (or num_train_epochs = 1) for the full training run.

[10]

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/10000 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.

[11]

GPU = Tesla T4. Max memory = 14.741 GB.
1.666 GB of memory reserved.

Let's train the model! To resume a training run, call trainer.train(resume_from_checkpoint = True).

[12]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 268,098,176 of 268,098,176 (100.00% trained)

Unsloth: Will smartly offload gradients to save VRAM!
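The numbers in the training banner can be checked with a little arithmetic. The sketch below uses only values printed above; the variable names are illustrative. The steps-per-epoch figure is an approximation, because the trainer's exact rounding may differ. The last lines note that the 536M model.safetensors download is consistent with 2 bytes per parameter, which is an inference rather than something the log states.

```python
import math

num_examples = 10_000
per_device_batch = 8
grad_accum_steps = 4
num_gpus = 1
max_steps = 60  # "Total steps" in the banner

# Effective batch size per optimizer step.
total_batch = per_device_batch * grad_accum_steps * num_gpus

# Approximate optimizer steps needed for one full pass over the data.
steps_per_epoch = math.ceil(num_examples / total_batch)

# How much of the 10k-row sample the short run actually sees.
examples_seen = max_steps * total_batch
fraction_of_epoch = examples_seen / num_examples

# Rough checkpoint size if each parameter takes 2 bytes (assumption).
trainable_params = 268_098_176
bytes_at_2_per_param = trainable_params * 2

print(total_batch, steps_per_epoch, examples_seen, fraction_of_epoch)
print(f"{bytes_at_2_per_param / 1e6:.0f} MB")
```

So a 60-step run covers roughly a fifth of the sampled data, which is why a full run needs max_steps = -1 or num_train_epochs = 1.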

[13]

851.6698 seconds used for training.
14.19 minutes used for training.
Peak reserved memory = 13.57 GB.
Peak reserved memory for training = 11.904 GB.
Peak reserved memory % of max memory = 92.056 %.
Peak reserved memory for training % of max memory = 80.754 %.
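The memory summary above follows from three logged values: the GPU's maximum memory, the memory reserved before training, and the peak reserved afterwards. The sketch below recomputes the derived lines from those numbers. It does not query the GPU, and the variable names are illustrative.

```python
max_memory_gb = 14.741      # "Max memory" from the GPU line
start_reserved_gb = 1.666   # reserved before training started
peak_reserved_gb = 13.57    # peak reserved after training
train_seconds = 851.6698

# Memory attributable to training is the peak minus the pre-training baseline.
training_gb = round(peak_reserved_gb - start_reserved_gb, 3)

peak_pct = round(peak_reserved_gb / max_memory_gb * 100, 3)
training_pct = round(training_gb / max_memory_gb * 100, 3)
train_minutes = round(train_seconds / 60, 2)

print(training_gb, peak_pct, training_pct, train_minutes)
```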

Saving, loading finetuned models

To save the model for phone deployment, we first save the model and tokenizer via save_pretrained.

[14]

('gemma_phone_model/tokenizer_config.json',
 'gemma_phone_model/special_tokens_map.json',
 'gemma_phone_model/chat_template.jinja',
 'gemma_phone_model/tokenizer.model',
 'gemma_phone_model/added_tokens.json',
 'gemma_phone_model/tokenizer.json')

We then export directly from the local folder using Optimum ExecuTorch, following its documentation.

[15]

Writing export_gemma_model.py

[16]

2026-01-02 09:15:27.775125: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1767345327.794751    6153 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1767345327.800610    6153 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1767345327.815563    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345327.815587    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345327.815591    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345327.815594    6153 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu126 for torchao version 0.14.0         Please see GitHub issue #2919 for more info
Exporting Gemma3-270M to ExecuTorch format...
`torch_dtype` is deprecated! Use `dtype` instead!
WARNING:coremltools:scikit-learn version 1.7.1 is not supported. Minimum required version: 0.17. Maximum required version: 1.5.1. Disabling scikit-learn conversion API.
WARNING:coremltools:XGBoost version 3.1.2 has not been tested with coremltools. You may run into unexpected errors. XGBoost 1.4.2 is the most recent version that has been tested.
WARNING:coremltools:TensorFlow version 2.19.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.
WARNING:coremltools:Torch version 2.9.0+cu126 has not been tested with coremltools. You may run into unexpected errors. Torch 2.7.0 is the most recent version that has been tested.
WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLCPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLGPUComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLNeuralEngineComputeDeviceProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLComputePlanProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLModelProxy: No module named 'coremltools.libcoremlpython'
WARNING:coremltools:Failed to load _MLModelAssetProxy: No module named 'coremltools.libcoremlpython'
/usr/local/lib/python3.12/dist-packages/torch/_dynamo/output_graph.py:1711: UserWarning: While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self'].cache.layers[0]", "L['self'].cache.layers[1]", "L['self'].cache.layers[2]", "L['self'].cache.layers[3]", "L['self'].cache.layers[4]", "L['self'].cache.layers[6]", "L['self'].cache.layers[7]", "L['self'].cache.layers[8]", "L['self'].cache.layers[9]", "L['self'].cache.layers[10]", "L['self'].cache.layers[12]", "L['self'].cache.layers[13]", "L['self'].cache.layers[14]", "L['self'].cache.layers[15]", "L['self'].cache.layers[16]"]
  warnings.warn(
I tokenizers:regex.cpp:27] Registering override fallback regex
/usr/local/lib/python3.12/dist-packages/executorch/exir/emit/_emitter.py:1594: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized, unless a pass which sets meta["et_init_buffer"] to True such as InitializedMutableBufferPass is run.
  warnings.warn(
Exported: model.pte (1663.25 MB)

Export complete!

And we have the Gemma3 model.pte file! In this run it is about 1.7 GB (the export log above reports 1663.25 MB).

[17]

-rw-r--r-- 1 root root 1.7G Jan  2 09:20 gemma_output/model.pte
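The export log reports 1663.25 MB while the listing shows 1.7G. The sketch below checks that the two agree under two assumptions that the notebook does not state: the export script's "MB" means 1024 × 1024 bytes, and the listing uses 1024-based units rounded up to one decimal, as GNU `ls -h` is understood to do. The helper `ls_h_gib` is hypothetical.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def ls_h_gib(num_bytes):
    """Size in GiB, rounded up to one decimal (assumed ls -h behaviour)."""
    return math.ceil(num_bytes / GIB * 10) / 10


reported_mb = 1663.25

# Interpretation 1: "MB" means 1024 * 1024 bytes.
size_if_mib = int(reported_mb * MIB)
# Interpretation 2: "MB" means 1,000,000 bytes.
size_if_decimal = int(reported_mb * 1_000_000)

print(ls_h_gib(size_if_mib))      # matches the 1.7G listing
print(ls_h_gib(size_if_decimal))  # would have been listed as 1.6G
```

Under these assumptions, only the 1024-based reading reproduces the 1.7G shown in the listing.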

Test Inference on Exported Model

[18]

Writing test_executorch.py

[19]

2026-01-02 09:25:10.490054: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1767345910.715842    8915 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1767345910.779700    8915 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1767345911.259764    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345911.259808    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345911.259812    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1767345911.259815    8915 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu126 for torchao version 0.14.0         Please see GitHub issue #2919 for more info

⚠️ DISCLAIMER: Python-based perf measurements are approximate and may not match absolute speeds on Android/iOS apps. They are intended for relative comparisons—-e.g. SDPA vs. custom SDPA, FP16 vs. FP32—-so you can gauge performance improvements from each optimization step. For end-to-end, platform-accurate benchmarks, please use the official ExecuTorch apps:
  • iOS:     https://github.com/pytorch/executorch/tree/main/extension/benchmark/apple/Benchmark
  • Android: https://github.com/pytorch/executorch/tree/main/extension/benchmark/android/benchmark

PyTorchObserver {"prompt_tokens": 17, "generated_tokens": 9, "model_load_start_ms": 0, "model_load_end_ms": 0, "inference_start_ms": 1767345947281, "token_encode_end_ms": 1767345947281, "model_execution_start_ms": 1767345956758, "model_execution_end_ms": 1767345956926, "inference_end_ms": 1767345956926, "prompt_eval_end_ms": 1767345955587, "first_token_ms": 1767345955763, "aggregate_sampling_time_ms": 9620, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}
	Prompt Tokens: 17 Generated Tokens: 9
	Model Load Time:		0.000000 (seconds)
	Total inference time:		9.645000 (seconds)		 Rate: 	0.933126 (tokens/second)
		Prompt evaluation:	8.306000 (seconds)		 Rate: 	2.046713 (tokens/second)
		Generated 9 tokens:	1.339000 (seconds)		 Rate: 	6.721434 (tokens/second)
	Time to first generated token:	8.482000 (seconds)
	Sampling time over 26 tokens:	9.620000 (seconds)
Generated: user
What is 2 + 2?
model
2 + 2 = 4
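The human-readable timings above can be reproduced from the PyTorchObserver JSON line. The sketch below parses a copy of that line and recomputes the prompt, generation, and overall rates. How each rate is derived from the timestamps is inferred from the printed figures, not taken from ExecuTorch documentation.

```python
import json

observer_line = (
    '{"prompt_tokens": 17, "generated_tokens": 9, "model_load_start_ms": 0, '
    '"model_load_end_ms": 0, "inference_start_ms": 1767345947281, '
    '"token_encode_end_ms": 1767345947281, "model_execution_start_ms": 1767345956758, '
    '"model_execution_end_ms": 1767345956926, "inference_end_ms": 1767345956926, '
    '"prompt_eval_end_ms": 1767345955587, "first_token_ms": 1767345955763, '
    '"aggregate_sampling_time_ms": 9620, "SCALING_FACTOR_UNITS_PER_SECOND": 1000}'
)
stats = json.loads(observer_line)
scale = stats["SCALING_FACTOR_UNITS_PER_SECOND"]  # timestamps are in milliseconds

start = stats["inference_start_ms"]
prompt_s = (stats["prompt_eval_end_ms"] - start) / scale
generate_s = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / scale
total_s = (stats["inference_end_ms"] - start) / scale
first_token_s = (stats["first_token_ms"] - start) / scale

prompt_rate = stats["prompt_tokens"] / prompt_s
generate_rate = stats["generated_tokens"] / generate_s
overall_rate = stats["generated_tokens"] / total_s

print(f"prompt: {prompt_s:.3f}s at {prompt_rate:.6f} tok/s")
print(f"generate: {generate_s:.3f}s at {generate_rate:.6f} tok/s")
print(f"total: {total_s:.3f}s at {overall_rate:.6f} tok/s, first token after {first_token_s:.3f}s")
```

Prompt evaluation accounts for most of this run, about 8.3 of 9.6 seconds, so the overall rate says little about steady-state generation speed.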

And we're done! If you have questions about Unsloth, find a bug, want to keep up with the latest LLM news, or want to join projects, feel free to join our Discord!

Some other resources:

Train your own reasoning model - Llama GRPO notebook Free Colab
Saving finetunes to Ollama. Free notebook
Llama 3.2 Vision finetuning - Radiography use case. Free Colab
See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!

This notebook and all Unsloth notebooks are licensed LGPL-3.0.