Qwen3 (0.6B) Phone Deployment
To run this, press "Runtime" and then press "Run all" on a free Tesla T4 Google Colab instance!
To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.
You will learn how to do data prep, how to train, how to run the model, and how to save it.
News
Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog
You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog
Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog
3x faster LLM training with 30% less VRAM and 500K context.
New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL
Visit our docs for all our model uploads and notebooks.
Installation
Unsloth
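The install cell's code isn't captured in this export; a minimal sketch, assuming a fresh Colab runtime (the notebook's actual install cell may pin versions or pull extra dependencies):

```python
# Minimal install sketch (assumption: fresh Colab runtime; the real cell
# may pin versions or add extra packages)
%pip install unsloth
```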
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: You selected full finetuning support, but 4bit / 8bit is enabled - disabling LoRA / QLoRA.
==((====))==  Unsloth 2025.12.9: Fast Qwen3 patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.33.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Float16 full finetuning uses more memory since we upcast weights to float32.
model.safetensors: 0%| | 0.00/1.19G [00:00<?, ?B/s]
generation_config.json: 0%| | 0.00/237 [00:00<?, ?B/s]
tokenizer_config.json: 0.00B [00:00, ?B/s]
vocab.json: 0.00B [00:00, ?B/s]
merges.txt: 0.00B [00:00, ?B/s]
tokenizer.json: 0%| | 0.00/11.4M [00:00<?, ?B/s]
added_tokens.json: 0%| | 0.00/707 [00:00<?, ?B/s]
special_tokens_map.json: 0%| | 0.00/614 [00:00<?, ?B/s]
chat_template.jinja: 0.00B [00:00, ?B/s]
Unsloth: Applying QAT to mitigate quantization degradation
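The load cell itself isn't shown in this export; a minimal sketch of what produces the output above, assuming the unsloth/Qwen3-0.6B repo name and a 2048-token sequence length:

```python
from unsloth import FastLanguageModel

# full_finetuning = True matches the "full finetuning" banner above;
# model_name and max_seq_length are assumptions
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name      = "unsloth/Qwen3-0.6B",
    max_seq_length  = 2048,
    full_finetuning = True,
)
```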
Data Prep
Qwen3 has both a reasoning and a non-reasoning mode, so we should use two datasets:
- We use the Open Math Reasoning dataset, which was used to win the AIMO (AI Mathematical Olympiad - Progress Prize 2) challenge! We sample 10% of the verifiable reasoning traces that were generated by DeepSeek R1 and achieved > 95% accuracy.
- We also leverage Maxime Labonne's FineTome-100k dataset in ShareGPT style, which we convert to HuggingFace's normal multiturn format; a loading sketch for both datasets follows.
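Neither load cell survives in this export; a sketch, assuming the unsloth/OpenMathReasoning-mini hub id (its cot split matches the "Generating cot split" output below) and the well-known mlabonne/FineTome-100k id:

```python
from datasets import load_dataset

# Reasoning traces (the "cot" split below generates 19,252 rows)
reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")

# ShareGPT-style chat data (100,000 rows)
non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
```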
README.md: 0%| | 0.00/603 [00:00<?, ?B/s]
data/cot-00000-of-00001.parquet: 0%| | 0.00/106M [00:00<?, ?B/s]
Generating cot split: 0%| | 0/19252 [00:00<?, ? examples/s]
README.md: 0%| | 0.00/982 [00:00<?, ?B/s]
data/train-00000-of-00001.parquet: 0%| | 0.00/117M [00:00<?, ?B/s]
Generating train split: 0%| | 0/100000 [00:00<?, ? examples/s]
Let's see the structure of both datasets:
Dataset({
    features: ['expected_answer', 'problem_type', 'problem_source', 'generation_model', 'pass_rate_72b_tir', 'problem', 'generated_solution', 'inference_mode'],
    num_rows: 19252
})
Dataset({
    features: ['conversations', 'source', 'score'],
    num_rows: 100000
})
We now convert the reasoning dataset into conversational format:
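The mapping cell isn't shown; a minimal sketch, assuming the problem and generated_solution field names visible in the dataset structure above:

```python
def generate_conversation(examples):
    # Pair each problem with its generated solution as a two-turn chat
    problems  = examples["problem"]
    solutions = examples["generated_solution"]
    conversations = [
        [{"role": "user",      "content": problem},
         {"role": "assistant", "content": solution}]
        for problem, solution in zip(problems, solutions)
    ]
    return {"conversations": conversations}

# Render every conversation to one training string via the Qwen3 chat template
reasoning_conversations = tokenizer.apply_chat_template(
    reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
    tokenize = False,
)
```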
Map: 0%| | 0/19252 [00:00<?, ? examples/s]
Let's see the first transformed row:
"<|im_start|>user\nGiven $\\sqrt{x^2+165}-\\sqrt{x^2-52}=7$ and $x$ is positive, find all possible values of $x$.<|im_end|>\n<|im_start|>assistant\n<think>\nOkay, let's see. I need to solve the equation √(x² + 165) - √(x² - 52) = 7, and find all positive values of x. Hmm, radicals can be tricky, but maybe if I can eliminate the square roots by squaring both sides. Let me try that.\n\nFirst, let me write down the equation again to make sure I have it right:\n\n√(x² + 165) - √(x² - 52) = 7.\n\nOkay, so the idea is to isolate one of the radicals and then square both sides. Let me try moving the second radical to the other side:\n\n√(x² + 165) = 7 + √(x² - 52).\n\nNow, if I square both sides, maybe I can get rid of the square roots. Let's do that:\n\n(√(x² + 165))² = (7 + √(x² - 52))².\n\nSimplifying the left side:\n\nx² + 165 = 49 + 14√(x² - 52) + (√(x² - 52))².\n\nThe right side is expanded using the formula (a + b)² = a² + 2ab + b². So the right side becomes 7² + 2*7*√(x² - 52) + (√(x² - 52))², which is 49 + 14√(x² - 52) + (x² - 52).\n\nSo putting it all together:\n\nx² + 165 = 49 + 14√(x² - 52) + x² - 52.\n\nHmm, let's simplify the right side. The x² terms will cancel out, right? Let's subtract x² from both sides:\n\n165 = 49 + 14√(x² - 52) - 52.\n\nSimplify the constants on the right:\n\n49 - 52 is -3, so:\n\n165 = -3 + 14√(x² - 52).\n\nNow, add 3 to both sides to isolate the radical term:\n\n165 + 3 = 14√(x² - 52).\n\nSo 168 = 14√(x² - 52).\n\nDivide both sides by 14:\n\n168 / 14 = √(x² - 52).\n\n12 = √(x² - 52).\n\nNow, square both sides again to eliminate the square root:\n\n12² = x² - 52.\n\n144 = x² - 52.\n\nAdd 52 to both sides:\n\n144 + 52 = x².\n\n196 = x².\n\nSo x = √196 = 14.\n\nBut wait, since the problem states that x is positive, we only take the positive root. So x = 14.\n\nBut hold on, when dealing with squaring equations, sometimes extraneous solutions can come up. I should check if this solution actually satisfies the original equation.\n\nLet's plug x = 14 back into the original equation:\n\n√(14² + 165) - √(14² - 52) = ?\n\nCalculate each term:\n\n14² is 196.\n\nSo first radical: √(196 + 165) = √361 = 19.\n\nSecond radical: √(196 - 52) = √144 = 12.\n\nSo 19 - 12 = 7, which is exactly the right-hand side. So yes, it checks out.\n\nTherefore, the only solution is x = 14. Since the problem says x is positive, we don't have to consider negative roots. So I think that's the answer.\n</think>\n\nTo solve the equation \\(\\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\\) for positive \\(x\\), we proceed as follows:\n\n1. Start with the given equation:\n \\[\n \\sqrt{x^2 + 165} - \\sqrt{x^2 - 52} = 7\n \\]\n\n2. Isolate one of the square roots by moving \\(\\sqrt{x^2 - 52}\\) to the right side:\n \\[\n \\sqrt{x^2 + 165} = 7 + \\sqrt{x^2 - 52}\n \\]\n\n3. Square both sides to eliminate the square root on the left:\n \\[\n (\\sqrt{x^2 + 165})^2 = (7 + \\sqrt{x^2 - 52})^2\n \\]\n Simplifying both sides, we get:\n \\[\n x^2 + 165 = 49 + 14\\sqrt{x^2 - 52} + (x^2 - 52)\n \\]\n\n4. Combine like terms on the right side:\n \\[\n x^2 + 165 = x^2 - 52 + 49 + 14\\sqrt{x^2 - 52}\n \\]\n Simplifying further:\n \\[\n x^2 + 165 = x^2 - 3 + 14\\sqrt{x^2 - 52}\n \\]\n\n5. Subtract \\(x^2\\) from both sides:\n \\[\n 165 = -3 + 14\\sqrt{x^2 - 52}\n \\]\n\n6. Add 3 to both sides to isolate the term with the square root:\n \\[\n 168 = 14\\sqrt{x^2 - 52}\n \\]\n\n7. Divide both sides by 14:\n \\[\n 12 = \\sqrt{x^2 - 52}\n \\]\n\n8. 
Square both sides again to eliminate the square root:\n \\[\n 12^2 = x^2 - 52\n \\]\n Simplifying:\n \\[\n 144 = x^2 - 52\n \\]\n\n9. Add 52 to both sides to solve for \\(x^2\\):\n \\[\n 196 = x^2\n \\]\n\n10. Take the positive square root (since \\(x\\) is positive):\n \\[\n x = \\sqrt{196} = 14\n \\]\n\n11. Verify the solution by substituting \\(x = 14\\) back into the original equation:\n \\[\n \\sqrt{14^2 + 165} - \\sqrt{14^2 - 52} = \\sqrt{196 + 165} - \\sqrt{196 - 52} = \\sqrt{361} - \\sqrt{144} = 19 - 12 = 7\n \\]\n The solution checks out.\n\nThus, the only positive solution is:\n\\[\n\\boxed{14}\n\\]<|im_end|>\n"
Next we take the non-reasoning dataset and convert it to conversational format as well.
We have to use Unsloth's standardize_sharegpt function to fix up the format of the dataset first.
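A sketch of that step (the variable names are assumptions): standardize_sharegpt maps ShareGPT-style {"from": ..., "value": ...} turns into HuggingFace-style {"role": ..., "content": ...} turns, after which the chat template can be applied as before:

```python
from unsloth.chat_templates import standardize_sharegpt

# Normalize ShareGPT turns into role/content dicts
dataset = standardize_sharegpt(non_reasoning_dataset)

non_reasoning_conversations = tokenizer.apply_chat_template(
    dataset["conversations"],
    tokenize = False,
)
```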
Unsloth: Standardizing formats (num_proc=2): 0%| | 0/100000 [00:00<?, ? examples/s]
Let's see the first row:
'<|im_start|>user\nExplain what boolean operators are, what they do, and provide examples of how they can be used in programming. Additionally, describe the concept of operator precedence and provide examples of how it affects the evaluation of boolean expressions. Discuss the difference between short-circuit evaluation and normal evaluation in boolean expressions and demonstrate their usage in code. \n\nFurthermore, add the requirement that the code must be written in a language that does not support short-circuit evaluation natively, forcing the test taker to implement their own logic for short-circuit evaluation.\n\nFinally, delve into the concept of truthiness and falsiness in programming languages, explaining how it affects the evaluation of boolean expressions. Add the constraint that the test taker must write code that handles cases where truthiness and falsiness are implemented differently across different programming languages.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nBoolean operators are logical operators used in programming to manipulate boolean values. They operate on one or more boolean operands and return a boolean result. The three main boolean operators are "AND" (&&), "OR" (||), and "NOT" (!).\n\nThe "AND" operator returns true if both of its operands are true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) and (y < 20) # This expression evaluates to True\n```\n\nThe "OR" operator returns true if at least one of its operands is true, and false otherwise. For example:\n\n```python\nx = 5\ny = 10\nresult = (x > 0) or (y < 20) # This expression evaluates to True\n```\n\nThe "NOT" operator negates the boolean value of its operand. It returns true if the operand is false, and false if the operand is true. For example:\n\n```python\nx = 5\nresult = not (x > 10) # This expression evaluates to True\n```\n\nOperator precedence refers to the order in which operators are evaluated in an expression. It ensures that expressions are evaluated correctly. In most programming languages, logical AND has higher precedence than logical OR. For example:\n\n```python\nresult = True or False and False # This expression is evaluated as (True or (False and False)), which is True\n```\n\nShort-circuit evaluation is a behavior where the second operand of a logical operator is not evaluated if the result can be determined based on the value of the first operand. In short-circuit evaluation, if the first operand of an "AND" operator is false, the second operand is not evaluated because the result will always be false. Similarly, if the first operand of an "OR" operator is true, the second operand is not evaluated because the result will always be true.\n\nIn programming languages that support short-circuit evaluation natively, you can use it to improve performance or avoid errors. For example:\n\n```python\nif x != 0 and (y / x) > 10:\n # Perform some operation\n```\n\nIn languages without native short-circuit evaluation, you can implement your own logic to achieve the same behavior. Here\'s an example in pseudocode:\n\n```\nif x != 0 {\n if (y / x) > 10 {\n // Perform some operation\n }\n}\n```\n\nTruthiness and falsiness refer to how non-boolean values are evaluated in boolean contexts. In many programming languages, non-zero numbers and non-empty strings are considered truthy, while zero, empty strings, and null/None values are considered falsy.\n\nWhen evaluating boolean expressions, truthiness and falsiness come into play. 
For example:\n\n```python\nx = 5\nresult = x # The value of x is truthy, so result is also truthy\n```\n\nTo handle cases where truthiness and falsiness are implemented differently across programming languages, you can explicitly check the desired condition. For example:\n\n```python\nx = 5\nresult = bool(x) # Explicitly converting x to a boolean value\n```\n\nThis ensures that the result is always a boolean value, regardless of the language\'s truthiness and falsiness rules.<|im_end|>\n'
Now let's see how long both datasets are:
19252 100000
The non-reasoning dataset is much longer. Let's assume we want the model to retain some reasoning capabilities, but we specifically want a chat model.
Let's define a ratio of chat-only data to control the mixture of the two datasets.
Let's select 75% reasoning and 25% chat-based data.
We keep all of the reasoning data and sample the chat data down so that it makes up chat_percentage (here 25%, i.e. 100% minus the reasoning share) of the combined dataset, as the sketch below shows:
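A sampling sketch that reproduces the printout below (the pandas approach and the seed are assumptions):

```python
import pandas as pd

chat_percentage = 0.25

# Keep every reasoning row; sample just enough chat rows that they form
# ~25% of the combined data: 19252 * 0.25 / 0.75 = 6417
non_reasoning_subset = pd.Series(non_reasoning_conversations).sample(
    int(len(reasoning_conversations) * (chat_percentage / (1 - chat_percentage))),
    random_state = 2407,
)
print(
    len(reasoning_conversations),
    len(non_reasoning_subset),
    len(non_reasoning_subset) / (len(non_reasoning_subset) + len(reasoning_conversations)),
)
```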
19252 6417 0.2499902606256574
Finally, combine both datasets:
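A combine-and-shuffle sketch, assuming the text collections from the previous steps:

```python
import pandas as pd
from datasets import Dataset

# Concatenate both text collections into a single "text" column and shuffle
data = pd.concat([
    pd.Series(reasoning_conversations),
    pd.Series(non_reasoning_subset),
])
data.name = "text"

combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
combined_dataset = combined_dataset.shuffle(seed = 3407)
```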
Unsloth: Tokenizing ["text"] (num_proc=4): 0%| | 0/25669 [00:00<?, ? examples/s]
🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
GPU = Tesla T4. Max memory = 14.741 GB. 3.518 GB of memory reserved.
Let's train the model! To resume a training run, call trainer.train(resume_from_checkpoint = True).
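A trainer sketch: the batch size of 2, gradient accumulation of 4, and 100 total steps match the banner below; the remaining hyperparameters are assumptions:

```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model         = model,
    tokenizer     = tokenizer,
    train_dataset = combined_dataset,
    args = SFTConfig(
        dataset_text_field          = "text",
        per_device_train_batch_size = 2,    # matches the banner below
        gradient_accumulation_steps = 4,    # effective batch size 2 x 4 = 8
        max_steps                   = 100,  # matches "Total steps = 100"
        warmup_steps                = 5,    # assumption
        learning_rate               = 2e-5, # assumption
        logging_steps               = 1,
        seed                        = 3407,
    ),
)
trainer_stats = trainer.train()
```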
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 25,669 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 751,632,384 of 751,632,384 (100.00% trained)
Unsloth: Will smartly offload gradients to save VRAM!
695.3901 seconds used for training.
11.59 minutes used for training.
Peak reserved memory = 10.562 GB.
Peak reserved memory for training = 7.044 GB.
Peak reserved memory % of max memory = 71.65 %.
Peak reserved memory for training % of max memory = 47.785 %.
We then use ExecuTorch's Qwen3 conversion process.
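A sketch of the two steps implied by the log below: save the finetuned weights, then call ExecuTorch's converter. The directory and output names, and the converter's exact arguments, are assumptions, though pytorch_model matches the "Loading checkpoint from pytorch_model directory" line:

```python
# Save the finetuned model in HuggingFace format for conversion
model.save_pretrained("pytorch_model")
tokenizer.save_pretrained("pytorch_model")

# Convert the HF checkpoint into ExecuTorch's expected layout
!python -m executorch.examples.models.qwen3.convert_weights pytorch_model qwen3_0_6B.pth
```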
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu126 for torchao version 0.14.0 Please see GitHub issue #2919 for more info
I tokenizers:regex.cpp:27] Registering override fallback regex
<frozen runpy>:128: RuntimeWarning: 'executorch.examples.models.qwen3.convert_weights' found in sys.modules after import of package 'executorch.examples.models.qwen3', but prior to execution of 'executorch.examples.models.qwen3.convert_weights'; this may result in unpredictable behaviour
Loading checkpoint...
Loading checkpoint from pytorch_model directory
Converting checkpoint...
Saving checkpoint...
Done.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 348 100 348 0 0 750 0 --:--:-- --:--:-- --:--:-- 750
And finally, we export to a .pte file, which can be used for deployment.
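The export cell itself isn't captured; a rough sketch reconstructed from the logs, assuming ExecuTorch's standard export_llama flags. The output name and the 128-token dynamic sequence bound appear in the logs, and the 348-byte curl download above presumably fetched the params JSON, but every flag here is an assumption:

```python
!python -m executorch.examples.models.llama.export_llama \
    --model qwen3-0_6b \
    --checkpoint qwen3_0_6B.pth \
    --params params.json \
    -kv -X \
    --max_seq_length 128 \
    --output_name qwen3_0.6B_model.pte
```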
Skipping import of cpp extensions due to incompatible torch version 2.9.0+cu126 for torchao version 0.14.0 Please see GitHub issue #2919 for more info
I tokenizers:regex.cpp:27] Registering override fallback regex
[INFO 2026-01-02 09:17:24,012 utils.py:164] NumExpr defaulting to 2 threads.
[INFO 2026-01-02 09:17:25,160 export_llama_lib.py:781] Applying quantizers: []
[INFO 2026-01-02 09:17:26,240 export_llama_lib.py:703] Checkpoint dtype: torch.float32
[INFO 2026-01-02 09:17:26,240 custom_kv_cache.py:367] Replacing KVCache with CustomKVCache. This modifies the model in place.
[INFO 2026-01-02 09:17:26,383 custom_ops.py:34] Looking for libcustom_ops_aot_lib.so in /usr/local/lib/python3.12/dist-packages/executorch/extension/llm/custom_ops
[INFO 2026-01-02 09:17:26,386 custom_ops.py:39] Loading custom ops library: /usr/local/lib/python3.12/dist-packages/executorch/extension/llm/custom_ops/libcustom_ops_aot_lib.so
[INFO 2026-01-02 09:17:26,397 builder.py:192] Model after source transforms: Transformer(
(tok_embeddings): Embedding(151936, 1024)
(layers): ModuleList(
(0-27): 28 x TransformerBlock(
(attention): AttentionMHA(
(q_norm_fn): RMSNorm()
(k_norm_fn): RMSNorm()
(wq): Linear(in_features=1024, out_features=2048, bias=False)
(wk): Linear(in_features=1024, out_features=1024, bias=False)
(wv): Linear(in_features=1024, out_features=1024, bias=False)
(wo): Linear(in_features=2048, out_features=1024, bias=False)
(rope): Rope()
(kv_cache): CustomKVCache()
(SDPA): SDPACustom()
)
(feed_forward): FeedForward(
(w1): Linear(in_features=1024, out_features=3072, bias=False)
(w2): Linear(in_features=3072, out_features=1024, bias=False)
(w3): Linear(in_features=1024, out_features=3072, bias=False)
)
(attention_norm): RMSNorm()
(ffn_norm): RMSNorm()
)
)
(rope): Rope()
(norm): RMSNorm()
(output): Linear(in_features=1024, out_features=151936, bias=False)
)
[INFO 2026-01-02 09:17:26,442 builder.py:218] Exporting with:
[INFO 2026-01-02 09:17:26,446 builder.py:219] inputs: (tensor([[2, 3, 4]]), {'input_pos': tensor([0])})
[INFO 2026-01-02 09:17:26,446 builder.py:220] kwargs: None
[INFO 2026-01-02 09:17:26,446 builder.py:221] dynamic shapes: ({1: Dim('token_dim', min=0, max=128)}, {'input_pos': {0: 1}})
[INFO 2026-01-02 09:18:15,928 builder.py:255] Running canonical pass: RemoveRedundantTransposes
[INFO 2026-01-02 09:18:16,131 export_llama_lib.py:854] Lowering model using following partitioner(s):
[INFO 2026-01-02 09:18:16,131 export_llama_lib.py:856] --> XnnpackDynamicallyQuantizedPartitioner
[INFO 2026-01-02 09:18:16,131 export_llama_lib.py:856] --> XnnpackPartitioner
[INFO 2026-01-02 09:18:16,131 builder.py:341] Using pt2e [] to quantizing the model...
[INFO 2026-01-02 09:18:16,131 builder.py:394] No quantizer provided, passing...
[INFO 2026-01-02 09:18:16,133 builder.py:216] Re-exporting with:
[INFO 2026-01-02 09:18:16,133 builder.py:219] inputs: (tensor([[2, 3, 4]]), {'input_pos': tensor([0])})
[INFO 2026-01-02 09:18:16,133 builder.py:220] kwargs: None
[INFO 2026-01-02 09:18:16,133 builder.py:221] dynamic shapes: ({1: Dim('token_dim', min=0, max=128)}, {'input_pos': {0: 1}})
/usr/local/lib/python3.12/dist-packages/executorch/exir/emit/_emitter.py:1594: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized, unless a pass which sets meta["et_init_buffer"] to True such as InitializedMutableBufferPass is run.
warnings.warn(
[INFO 2026-01-02 09:28:08,838 builder.py:511] Required memory for activation in bytes: [0, 243793984]
modelname: qwen3_0.6B_model
output_file: qwen3_0.6B_model.pte
[INFO 2026-01-02 09:28:15,301 utils.py:141] Saved exported program to qwen3_0.6B_model.pte
And we have the file qwen3_0.6B_model.pte of around 472MB!
-rw-r--r-- 1 root root 472M Jan 2 09:28 qwen3_0.6B_model.pte
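To actually run this on a phone, the .pte file is loaded by ExecuTorch's LLM runner (the Android/iOS demo apps, or the llama_main CLI built from the ExecuTorch repo). A smoke-test sketch, assuming a locally built llama_main binary and the model's tokenizer.json alongside it:

```bash
# Hypothetical local/on-device smoke test with ExecuTorch's llama runner
./llama_main \
    --model_path qwen3_0.6B_model.pte \
    --tokenizer_path tokenizer.json \
    --prompt "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
```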
And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or need help joining projects, feel free to join our Discord channel!
Some other resources:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!
This notebook and all Unsloth notebooks are licensed LGPL-3.0.