GRPO Fine-tuning with Unsloth
Fine-tuning requires a GPU. If you don't have one locally, you can run this notebook for free on Google Colab using a free NVIDIA T4 GPU instance.
What's in this notebook?
In this notebook you will learn how to use Group Relative Policy Optimization (GRPO) for reinforcement learning-based fine-tuning with Unsloth. We will use the LFM2.5-1.2B-Instruct model and train it on mathematical reasoning tasks using the Open R1 Math dataset. GRPO is ideal for verifiable tasks where you can programmatically evaluate the correctness of model outputs, such as math problems, code generation, and structured data tasks.
We will cover
- Environment setup
- Data preparation for GRPO
- Reward function design
- Model training with GRPO
- Local inference with your new model
- Saving and exporting the model into the format you need for deployment
Deployment options
LFM2.5 models are small and efficient, enabling deployment across a wide range of platforms:
| Deployment Target | Use Case |
|---|---|
| 📱 Android | Mobile apps on Android devices |
| 📱 iOS | Mobile apps on iPhone/iPad |
| 🍎 Apple Silicon Mac | Local inference on Mac with MLX |
| 🦙 llama.cpp | Local deployments on any hardware |
| 🦙 Ollama | Local inference with easy setup |
| 🖥️ LM Studio | Desktop app for local inference |
| ⚡ vLLM | Cloud deployments with high throughput |
| ☁️ Modal | Serverless cloud deployment |
| 🏗️ Baseten | Production ML infrastructure |
| 🚀 Fal | Fast inference API |
Installation
Unsloth
Goal: To convert LFM2.5-1.2B-Instruct into a reasoning model via GRPO by using OpenR1's Math dataset.
We first pre fine-tune the model so that GRPO doesn't have to spend steps learning to match the output format - this speeds GRPO up.
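The setup can be sketched roughly like this (the model id, LoRA rank, and target modules are assumptions, not values taken from this notebook - adjust them to your run):

```python
from unsloth import FastLanguageModel

# Assumed model id and hyperparameters -- adjust to match your run.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LiquidAI/LFM2.5-1.2B-Instruct",  # assumed repo id
    max_seq_length = 2048,   # context length used for training
    load_in_4bit = False,    # Unsloth falls back to 16-bit LoRA here
)

# Attach LoRA adapters so only a small set of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,                  # LoRA rank (assumed)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,         # assumed
)
```

This is a setup sketch only; it needs a GPU and the `unsloth` package installed to run.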
(Unsloth 2026.1.2 startup log: fast Lfm2 patching on an NVIDIA L4 (22.161 GB max memory), Torch 2.9.1+cu128, CUDA 12.8, Transformers 4.57.1, Triton 3.5.1, bfloat16 enabled. Since neither QLoRA nor full finetuning was selected, Unsloth switches to 16-bit LoRA. The model and tokenizer files are downloaded on first run, and gradients are enabled on the base model.)
<|startoftext|><|im_start|>user
Solve x^5 + 3x^4 - 10 = 3.<|im_end|>
<|im_start|>assistant
We are given the equation: $$ x^5 + 3x^4 - 10 = 3 $$

### Step 1: Move all terms to one side

$$ x^5 + 3x^4 - 10 - 3 = 0 $$
$$ x^5 + 3x^4 - 13 = 0 $$

Now we have the equation: $$ x^5 + 3x^4 - 13 = 0 $$ This is a fifth-degree polynomial equation, which is generally difficult to solve algebraically. Let's try to find rational roots using the Rational Root Theorem.

### Step 2: Apply Rational Root Theorem

Possible rational roots are factors of the constant term (-13) divided by factors of the leading coefficient (1):

Possible rational roots: $ \pm1, \pm13 $

Try $ x = 1 $: $$ 1^5 + 3(1)^4 - 13 = 1 + 3 - 13 = -9 \neq 0 $$
Try $ x = -1 $: $$ (-1)^5 + 3(-1)^4 - 13 = -1 + 3 - 13 = -11 \neq 0 $$
Try $ x = 13 $: too large, not likely. Try $ x = -13 $: also too large.

So no rational roots. Let's try to estimate the solution numerically.

### Step
GRPO chat template
Since we're using an instruct model, we need to ask the model to output <think> and </think> tokens in the system prompt. This will guide the model's responses and make the final answers easier to parse.
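A minimal sketch of such a system prompt (the exact wording and the way the final answer is delimited are assumptions):

```python
# Hypothetical system prompt: ask the model to reason inside <think> tags
# and then state the final answer, so the answer is easy to parse afterwards.
SYSTEM_PROMPT = (
    "You are given a problem. "
    "Think about the problem between <think> and </think>. "
    "Then give your final answer after the </think> tag."
)

def build_messages(question: str) -> list:
    """Wrap a question in the chat format expected by the tokenizer."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

messages = build_messages("Solve x^5 + 3x^4 - 10 = 3.")
```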
Let's see how the chat template behaves on an example:
Pre fine-tuning for formatting
We now use a subset of NVIDIA's Open Math Reasoning dataset which was filtered to only include high quality DeepSeek R1 traces.
We'll filter down to ~59 examples to first "prime" / pre fine-tune the model so it understands our custom GRPO formatting.
We have to format the dataset to follow our GRPO style formatting:
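A sketch of such a mapping function (the field names "problem", "generations", and "answer" are assumptions about the dataset schema, and the system prompt is a placeholder):

```python
SYSTEM_PROMPT = "Think between <think> and </think>, then answer."  # assumed wording

def to_grpo_format(example: dict) -> dict:
    """Turn one raw row into chat messages that follow our GRPO formatting."""
    reasoning = example["generations"]  # assumed field: an R1 reasoning trace
    answer = example["answer"]          # assumed field: the final answer
    assistant = f"<think>{reasoning}</think>{answer}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},  # assumed field
            {"role": "assistant", "content": assistant},
        ]
    }

row = to_grpo_format({"problem": "2+2?", "generations": "2 plus 2 is 4", "answer": "4"})
```

In the notebook this would be applied with the dataset's `.map(...)` method.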
Check to see if it worked:
Let's truncate the pre fine-tuning dataset to max_seq_length/2 since we don't want too long reasoning traces.
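One way to sketch this filter (max_seq_length = 2048 is an assumption; the default token counter is a crude whitespace proxy for illustration - in the notebook you would count real tokens with the tokenizer):

```python
max_seq_length = 2048  # assumed; must match the value used at model load time

def short_enough(example: dict, count_tokens=lambda s: len(s.split())) -> bool:
    """Keep rows whose full conversation fits in half the context window.

    count_tokens defaults to a whitespace proxy; with a real tokenizer you
    would pass something like: lambda s: len(tokenizer(s).input_ids)."""
    text = " ".join(m["content"] for m in example["messages"])
    return count_tokens(text) <= max_seq_length // 2

example = {"messages": [{"role": "user", "content": "short question"}]}
```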
Note this might take 2 minutes!
We then tokenize the messages and convert it to a Hugging Face compatible dataset format:
Let's now pre fine-tune the model so it follows our custom GRPO formatting!
Let's check if the model has learnt to follow the custom format:
Let's use Unsloth's normal inference (not vLLM):
Yes, it did follow the formatting! Great! Let's remove some objects to free up memory before the GRPO step.
Data Prep
We're using Hugging Face's Open R1 Math dataset. You can also use OpenAI's well-known GSM8K dataset.
Let's look at the first row:
In GSM8K, we notice all answers contain a ####, so we extract the text after it. For the Open R1 dataset, we can skip this step.
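The GSM8K extraction can be sketched as:

```python
def extract_hash_answer(text: str):
    """Return the part after '####' in a GSM8K answer, or None if absent."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

result = extract_hash_answer("She sold 48 + 24 = 72 clips. #### 72")
```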
Let's map the dataset and look at the first row:
We create a regex format to match the reasoning sections and answers:
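Assuming the <think>…</think> formatting described above, the matcher could look like this (the exact pattern is an assumption):

```python
import re

# Group 1: the reasoning trace; group 2: everything after </think> (the answer).
match_format = re.compile(
    r"<think>(.+?)</think>\s*(.+)",
    flags = re.DOTALL,
)

m = match_format.search("<think>1 + 1 = 2</think>\nThe answer is 2.")
```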
We verify it works:
We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:
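A sketch of the exact-format reward, using TRL's convention that each completion is a list of chat messages (the 3-point value comes from the text above; the tag pattern is assumed):

```python
import re

# Assumed format: reasoning inside <think>...</think>, then the answer.
match_format = re.compile(r"<think>(.+?)</think>\s*(.+)", flags=re.DOTALL)

def match_format_exactly(completions, **kwargs):
    """Give 3 points when the whole format matches, 0 otherwise."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        scores.append(3.0 if match_format.search(response) else 0.0)
    return scores

good = match_format_exactly([[{"role": "assistant", "content": "<think>ok</think>4"}]])
```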
If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:
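The partial-credit version can be sketched by counting each tag (the +0.5/-0.5 weights are assumptions):

```python
def match_format_approximately(completions, **kwargs):
    """Reward each tag that appears exactly once; penalize duplicates or omissions."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]
        score = 0.0
        score += 0.5 if text.count("<think>") == 1 else -0.5   # assumed weights
        score += 0.5 if text.count("</think>") == 1 else -0.5
        scores.append(score)
    return scores
```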
Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:
Sometimes the answer is not a single number but a sentence, for example "The solution is $20" -> we extract 20.
We also remove any thousands separators, for example 123,456 -> 123456.
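These three ideas together - extract a number, strip `$` signs and commas, and give partial credit by ratio - can be sketched as follows (all thresholds and point values are assumptions):

```python
import re

def extract_number(text: str):
    """Pull the last number out of a response, so 'The solution is $20' -> 20.0
    and '123,456' -> 123456.0."""
    cleaned = text.replace(",", "").replace("$", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return float(numbers[-1]) if numbers else None

def answer_reward(response: str, true_answer: str) -> float:
    """Full points for an exact match, partial credit when the ratio is close."""
    guess = extract_number(response)
    if guess is None:
        return -2.0                          # assumed penalty: no number at all
    true = float(true_answer)
    if guess == true:
        return 5.0                           # assumed full reward
    if true != 0 and 0.9 <= guess / true <= 1.1:
        return 2.0                           # assumed partial credit for "close"
    return -1.0                              # assumed penalty for a wrong answer
```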
We now prepare our main reward function, which prints out the generated responses and the true answer, along with another reward function that converts the text to a float via float(...) and checks whether it equals the true answer.
Get the top 90% prompt length so we don't accidentally truncate them!
That is, we'll remove the top 10% longest prompts.
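A plain-Python sketch of the 90th-percentile cutoff (nearest-rank quantile is an assumed scheme; the sample lengths are made up):

```python
def length_cutoff(lengths, quantile=0.9):
    """Return the length at the given quantile (nearest-rank, assumed scheme)."""
    ordered = sorted(lengths)
    index = int(quantile * (len(ordered) - 1))
    return ordered[index]

prompt_lengths = [40, 55, 60, 70, 75, 80, 90, 110, 150, 400]   # made-up token counts
cutoff = length_cutoff(prompt_lengths)                 # the 90th-percentile length
kept = [l for l in prompt_lengths if l <= cutoff]      # drops the longest outlier
```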
And let's run the trainer! In the training logs you'll see a table of rewards. The goal is to see the reward column increase!
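In TRL, this step looks roughly like the sketch below. All hyperparameter values are assumptions, and the names passed to `reward_funcs`, `train_dataset`, and `max_prompt_length` are placeholders for the objects built earlier in the notebook:

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate = 5e-6,                # assumed
    num_generations = 4,                 # completions sampled per prompt (assumed)
    max_prompt_length = 512,             # assumed: the 90th-percentile cutoff
    max_completion_length = 1024,        # assumed
    max_steps = 300,                     # assumed; more steps -> better rewards
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,                       # the LoRA-wrapped model from earlier
    processing_class = tokenizer,
    reward_funcs = [                     # placeholder names for our reward functions
        match_format_exactly,
        match_format_approximately,
        check_answer,
    ],
    args = training_args,
    train_dataset = dataset,             # placeholder for the mapped dataset
)
trainer.train()
```

This is a configuration sketch; it requires a GPU and the `trl` package to run.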
You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
| 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
| 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
And now let's try the LoRA we just trained with GRPO - but first, we save the LoRA!
Verify LoRA is actually trained!
Now we load the LoRA and test:
Our reasoning model is much better. It's not always correct, since we only trained it for an hour or so - it will improve if we extend the sequence length and train for longer!
Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. Saving just the LoRA adapters is also available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can create personal tokens at https://huggingface.co/settings/tokens.
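A sketch of the save calls (the directory, repo name, and token are placeholders):

```python
# Merge the LoRA into the base weights and save in float16 (for vLLM).
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")

# Or push straight to the Hugging Face Hub (needs a write token).
model.push_to_hub_merged("your-username/LFM2.5-1.2B-GRPO", tokenizer,
                         save_method="merged_16bit", token="hf_...")
```

Swap `save_method` for `merged_4bit` or `lora` for the other export options.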
GGUF / llama.cpp Conversion
Saving to GGUF / llama.cpp is now supported natively! We clone llama.cpp and save to q8_0 by default. All methods, such as q4_k_m, are allowed. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.
Some supported quant methods (full list on our Wiki page):
- `q8_0` - Fast conversion. High resource use, but generally acceptable.
- `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
- `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
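The GGUF export calls look like the sketch below (the directory, repo name, and token are placeholders):

```python
# Save a GGUF locally; q8_0 is the default, q4_k_m is the recommended method.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")

# Or upload the GGUF to the Hugging Face Hub.
model.push_to_hub_gguf("your-username/LFM2.5-1.2B-GRPO-GGUF", tokenizer,
                       quantization_method="q4_k_m", token="hf_...")
```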
[NEW] To finetune and auto export to Ollama, try our Ollama notebook
Now, use the model-unsloth.gguf file or model-unsloth-Q4_K_M.gguf file in llama.cpp.
And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord channel!
Some other links:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!


