Unsloth HuggingFace Course DeepSeek R1 0528 Qwen3 (8B) GRPO


alph-notebooks/unsloth-notebooks / HuggingFace Course-DeepSeek_R1_0528_Qwen3_(8B)_GRPO.ipynb


To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github ⭐

In this Hugging Face and Unsloth notebook, you will learn to transform DeepSeek-R1-0528-Qwen3-8B into a reasoning model using GRPO.

To install Unsloth on your local device, follow our guide. This notebook is licensed LGPL-3.0.

You will learn how to prepare the data, train the model, run it, and save it.

News

Train MoEs - DeepSeek, GLM, Qwen and gpt-oss 12x faster with 35% less VRAM. Blog

You can now train embedding models 1.8-3.3x faster with 20% less VRAM. Blog

Ultra Long-Context Reinforcement Learning is here with 7x more context windows! Blog

3x faster LLM training with 30% less VRAM and 500K context. 3x faster • 500K Context

New in Reinforcement Learning: FP8 RL • Vision RL • Standby • gpt-oss RL

Visit our docs for all our model uploads and notebooks.

Installation

[ ]

Unsloth

Goal: To convert DeepSeek-R1-0528-Qwen3-8B into a reasoning model via GRPO by using OpenR1's Math dataset.

We also use langid for language detection. Our main goal is to force the model to generate reasoning traces in Indonesian, and we create a reward function using langid to check this.

[ ]

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 75.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for langid (setup.py) ... done

[ ]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-05 23:14:18 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-05 23:14:18 [__init__.py:239] Automatically detected platform cuda.

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}

==((====))==  Unsloth 2025.5.9: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}

Unsloth: Your GPU cannot handle sequence lengths of 256 due to limited GPU memory.
Unsloth: Your GPU can only handle approximately the maximum sequence length of 256.
Unsloth: vLLM loading unsloth/DeepSeek-R1-0528-Qwen3-8B with actual GPU utilization = 69.34%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 256. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 0.0 GB. Also swap space = 0 GB.

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}

INFO 06-05 23:14:32 [config.py:2968] Downcasting torch.float32 to torch.float16.
INFO 06-05 23:14:46 [config.py:717] This model supports multiple tasks: {'classify', 'embed', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
WARNING 06-05 23:14:46 [arg_utils.py:1658] Compute Capability < 8.0 is not supported by the V1 Engine. Falling back to V0. 
Unsloth: vLLM Bitsandbytes config using kwargs = {'load_in_8bit': False, 'load_in_4bit': True, 'bnb_4bit_compute_dtype': 'float16', 'bnb_4bit_quant_storage': 'uint8', 'bnb_4bit_quant_type': 'fp4', 'bnb_4bit_use_double_quant': False, 'llm_int8_enable_fp32_cpu_offload': False, 'llm_int8_has_fp16_weight': False, 'llm_int8_skip_modules': [], 'llm_int8_threshold': 6.0}
INFO 06-05 23:14:46 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='Erland/DeepSeek-R1-0528-Qwen3-8B', speculative_config=None, tokenizer='Erland/DeepSeek-R1-0528-Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=256, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Erland/DeepSeek-R1-0528-Qwen3-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"backend":"inductor","splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"debug":false,"dce":true,"coordinate_descent_tuning":true,"trace.enabled":false,"trace.graph_diagram":false,"triton.cudagraphs":true,"compile_threads":48,"max_autotune":false,"disable_progress":false,"verbose_progress":true,"enable_auto_functionalized_v2":false},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":128}, use_cached_outputs=False,

tokenizer_config.json:   0%|          | 0.00/9.49k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/171 [00:00<?, ?B/s]

INFO 06-05 23:14:50 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-05 23:14:50 [cuda.py:289] Using XFormers backend.
INFO 06-05 23:14:50 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-05 23:14:50 [model_runner.py:1108] Starting to load model Erland/DeepSeek-R1-0528-Qwen3-8B...
INFO 06-05 23:14:51 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 06-05 23:14:51 [weight_utils.py:265] Using model weights format ['*.safetensors']

model-00001-of-00007.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00007.safetensors:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

model-00003-of-00007.safetensors:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

model-00004-of-00007.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00005-of-00007.safetensors:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

model-00006-of-00007.safetensors:   0%|          | 0.00/4.83G [00:00<?, ?B/s]

model-00007-of-00007.safetensors:   0%|          | 0.00/3.46G [00:00<?, ?B/s]

INFO 06-05 23:22:53 [weight_utils.py:281] Time spent downloading weights for Erland/DeepSeek-R1-0528-Qwen3-8B: 481.789170 seconds

model.safetensors.index.json:   0%|          | 0.00/32.9k [00:00<?, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/7 [00:00<?, ?it/s]

INFO 06-05 23:25:08 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-05 23:25:08 [model_runner.py:1140] Model loading took 5.9071 GiB and 617.498293 seconds
INFO 06-05 23:25:19 [worker.py:287] Memory profiling takes 10.35 seconds
INFO 06-05 23:25:19 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.69) = 10.22GiB
INFO 06-05 23:25:19 [worker.py:287] model weights take 5.91GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.70GiB; the rest of the memory reserved for KV Cache is 3.59GiB.
INFO 06-05 23:25:20 [executor_base.py:112] # cuda blocks: 1632, # CPU blocks: 0
INFO 06-05 23:25:20 [executor_base.py:117] Maximum concurrency for 256 tokens per request: 102.00x
INFO 06-05 23:25:20 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.

Capturing CUDA graph shapes:   0%|          | 0/19 [00:00<?, ?it/s]

INFO 06-05 23:26:13 [model_runner.py:1592] Graph capturing finished in 53 secs, took 0.72 GiB
INFO 06-05 23:26:13 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 64.84 seconds
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'pre_feedforward_layernorm']
Unsloth: Just some info: will skip parsing ['post_feedforward_layernorm', 'pre_feedforward_layernorm']

tokenizer_config.json:   0%|          | 0.00/9.49k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

Unsloth 2025.5.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.

GRPO Chat Template

DeepSeek-R1-0528-Qwen3-8B (a Qwen3 model distilled from DeepSeek R1) comes with a chat template that formats the model's input and output as a chat, including the reasoning step. We have to use that chat template because the model was trained with it.

Let's see how our chat template behaves on an example:

[ ]

'You are given a problem.\nThink about the problem and provide your working out.\nYou must think in Bahasa Indonesia.'

[ ]

<｜begin▁of▁sentence｜><｜User｜>What is 1+1?<｜assistant｜>2<｜end▁of▁sentence｜><｜User｜>What is 1+1?<｜assistant｜>2<｜end▁of▁sentence｜>
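To make the rendered string above easier to read, here is a minimal pure-Python sketch of how a list of messages maps onto that layout. It is not the tokenizer's real template (in practice the tokenizer's `apply_chat_template` method does this); `render_chat` is a hypothetical helper, and the special-token strings are simply copied from the outputs shown in this notebook.

```python
# Special tokens copied from the rendered outputs in this notebook.
BOS = "<｜begin▁of▁sentence｜>"
EOS = "<｜end▁of▁sentence｜>"


def render_chat(messages, add_generation_prompt=False):
    """Hypothetical helper: lay out chat messages like the outputs shown here."""
    text = BOS
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            # The system prompt follows the BOS token directly.
            text += content
        elif role == "user":
            text += "<｜User｜>" + content
        elif role == "assistant":
            # A finished assistant turn is closed with the EOS token.
            text += "<｜assistant｜>" + content + EOS
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<｜Assistant｜>"
    return text


messages = [
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
    {"role": "user", "content": "What is 1+1?"},
    {"role": "assistant", "content": "2"},
]
print(render_chat(messages))
```

With `add_generation_prompt=True` and only system and user messages, the sketch ends in an open assistant turn, which is the shape a prompt has during training and inference.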

Data Prep

We're using Hugging Face's Open R1 Math dataset. You can also use OpenAI's well-known GSM8K dataset.

[ ]

README.md:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

en/train-00000-of-00001.parquet:   0%|          | 0.00/5.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14116 [00:00<?, ? examples/s]

Dataset({
    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info'],
    num_rows: 14116
})

Let's look at the first row:

[ ]

'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.'

[ ]

'34'

In GSM8K, every answer is preceded by a #### marker, so we would extract the text after it. For the Open R1 dataset the answer is already clean, so the step below can be skipped.

[ ]

'34'
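As a sketch of the extraction step described above: the helper below returns the text after a `####` marker when one is present and passes the answer through unchanged otherwise. The function name `extract_hash_answer` and the sample GSM8K-style string are assumptions made for illustration.

```python
def extract_hash_answer(text):
    """Return the text after the last '####' marker, or the input unchanged."""
    if "####" not in text:
        # Open R1 style answers are already clean, e.g. '34'.
        return text
    return text.split("####")[-1].strip()


# Made-up GSM8K-style solution string, for illustration only.
print(extract_hash_answer("She sells 16 - 3 - 4 = 9 eggs.\n#### 18"))
print(extract_hash_answer("34"))
```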

Let's map the dataset and look at the first row:

[ ]

Map:   0%|          | 0/14116 [00:00<?, ? examples/s]

{'prompt': [{'content': 'You are given a problem.\nThink about the problem and provide your working out.\nYou must think in Bahasa Indonesia.',
   'role': 'system'},
  {'content': 'In triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.',
   'role': 'user'}],
 'solution': '34',
 'data_source': 'math_dapo',
 'source_prompt': [{'content': 'Solve the following math problem step by step. The last line of your response should be of the form Answer: $Answer (without quotes) where $Answer is the answer to the problem.\n\nIn triangle $ABC$, $\\sin \\angle A = \\frac{4}{5}$ and $\\angle A < 90^\\circ$. Let $D$ be a point outside triangle $ABC$ such that $\\angle BAD = \\angle DAC$ and $\\angle BDC = 90^\\circ$. Suppose that $AD = 1$ and that $\\frac{BD}{CD} = \\frac{3}{2}$. If $AB + AC$ can be expressed in the form $\\frac{a\\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.\n\nRemember to put your answer on its own line after "Answer:".',
   'role': 'user'}],
 'ability': 'MATH',
 'reward_model': {'ground_truth': '34', 'style': 'rule-lighteval/MATH_v2'},
 'extra_info': {'index': '9a9b6eb4-a1cb-49d1-8c1e-62eaf2f74079'},
 'answer': '34'}
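The mapped row above can be produced by a per-row function along these lines. This is a sketch under assumptions: `to_chat_example` is a hypothetical name, and the row is a plain dict standing in for a dataset row; with the `datasets` library such a function would typically be passed to `dataset.map`, which merges the returned keys into each row.

```python
SYSTEM_PROMPT = (
    "You are given a problem.\n"
    "Think about the problem and provide your working out.\n"
    "You must think in Bahasa Indonesia."
)


def to_chat_example(row):
    """Hypothetical mapper: wrap the raw problem in system + user messages."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["prompt"]},
        ],
        # The reward functions compare against this field.
        "answer": row["solution"],
    }


# Stand-in row with only the two fields the mapper reads.
row = {"prompt": "What is 1+1?", "solution": "2"}
example = to_chat_example(row)
print(example["prompt"][0]["role"], example["prompt"][1]["content"], example["answer"])
```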

We create a regex format to match the reasoning sections and answers:

[ ]

re.compile(r'</think>(.*)', re.DOTALL|re.UNICODE)

We verify it works:

[ ]

['Hence, the solution is 2.']

[ ]

['\n\nHence, the solution is 2']
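For reference, the two checks above can be reproduced with the pattern shown earlier. The input strings are assumptions chosen so that the captured groups match the two outputs.

```python
import re

reasoning_end = "</think>"

# Capture everything after the closing reasoning tag; DOTALL lets '.' span newlines.
match_format = re.compile(rf"{reasoning_end}(.*)", re.DOTALL)

first = match_format.findall("Let me think!</think>Hence, the solution is 2.")
second = match_format.findall("<think>Let me think!</think>\n\nHence, the solution is 2")
print(first)
print(second)
```

Because the group is `(.*)` with `re.DOTALL`, leading newlines after `</think>` are kept in the captured text, so any later comparison should strip whitespace first.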

We now create a reward function that checks whether the format is matched exactly, and we reward the completion with 3 points if it is:

[ ]
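A minimal sketch of such an exact-format reward, assuming completions arrive in the conversational shape (a list of message lists, each message a dict with a `content` key). The 3 points on success come from the text above; the 0 on failure and the function name are assumptions.

```python
import re

match_format = re.compile(r"</think>(.*)", re.DOTALL)


def match_format_exactly(completions, **kwargs):
    """Give 3 points when the closing reasoning tag and an answer section are found."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        found = match_format.search(response) is not None
        scores.append(3.0 if found else 0.0)  # 0.0 on failure is an assumption
    return scores


completions = [
    [{"role": "assistant", "content": "<think>1+1=2</think>2"}],
    [{"role": "assistant", "content": "<think>still thinking"}],
]
print(match_format_exactly(completions))
```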

If the exact match fails, we still want to reward the model for following the format partially, by counting how often each tag appears:

[ ]
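The partial-credit idea can be sketched as follows: each tag earns a small bonus when it appears exactly once and a penalty otherwise. The tag names follow the `<think>` / `</think>` markers seen in the outputs; the score values (+0.5 and -1.0) and the function name are illustrative assumptions.

```python
def match_format_approximately(completions, **kwargs):
    """Score each reasoning tag separately: +0.5 if it appears exactly once, else -1.0."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        score = 0.0
        for tag in ("<think>", "</think>"):
            # Missing or repeated tags are both treated as format errors.
            score += 0.5 if response.count(tag) == 1 else -1.0
        scores.append(score)
    return scores


completions = [
    [{"role": "assistant", "content": "<think>a</think>b"}],
    [{"role": "assistant", "content": "<think>a"}],
    [{"role": "assistant", "content": "no tags at all"}],
]
print(match_format_approximately(completions))
```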

We want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

[ ]
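One way to sketch the ratio-based answer reward: an exact string match scores highest, a match after stripping whitespace scores slightly lower, and a numeric guess within 10% or 20% of the true value still earns partial credit. All score values and the function name are assumptions for illustration.

```python
import re

match_format = re.compile(r"</think>(.*)", re.DOTALL)


def check_answer(completions, answer, **kwargs):
    """Reward the text after </think> by how close it is to the true answer."""
    scores = []
    for completion, true_answer in zip(completions, answer):
        match = match_format.search(completion[0]["content"])
        if match is None:
            scores.append(-2.0)  # no answer section at all
            continue
        guess = match.group(1)
        if guess == true_answer:
            scores.append(5.0)
        elif guess.strip() == true_answer.strip():
            scores.append(3.5)
        else:
            try:
                ratio = float(guess) / float(true_answer)
                if 0.9 <= ratio <= 1.1:
                    scores.append(2.0)
                elif 0.8 <= ratio <= 1.2:
                    scores.append(1.5)
                else:
                    scores.append(-2.5)  # numeric but far off
            except (ValueError, ZeroDivisionError):
                scores.append(-4.5)  # not a number, or the true answer is 0
    return scores


def as_completion(text):
    return [{"role": "assistant", "content": text}]


guesses = ["</think>34", "</think> 34 ", "</think>36", "</think>40", "</think>100", "</think>abc", "none"]
print(check_answer([as_completion(g) for g in guesses], ["34"] * len(guesses)))
```

The ratio test is undefined when the true answer is 0, which is why `ZeroDivisionError` is caught alongside `ValueError`.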

Sometimes the answer is not a bare number but a sentence, for example "The solution is $20", from which we extract 20.

We also remove thousands separators, for example the comma in 123,456.

[ ]

['0.34']
['123,456']
['-0.234']
['17']
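A simplified sketch of number extraction follows. The pattern here is an assumption and is not the one used to produce the outputs above; it finds the first signed number in a string and, unlike those outputs, strips commas immediately.

```python
import re

# Optional sign, a digit, then digits or commas, then an optional decimal part.
number_pattern = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_number(text):
    """Return the first number in the text with commas removed, or None."""
    match = number_pattern.search(text)
    if match is None:
        return None
    return match.group(0).replace(",", "")


for sample in ["  0.34  ", "  123,456  ", "  -0.234  ", "17", "The solution is $20"]:
    print(repr(extract_number(sample)))
```

Requiring a leading digit means stray punctuation such as a lone "," or "." is never returned as a number.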

Finally, we try to enforce that the thinking process is in Bahasa Indonesia. This is a simple version of the language consistency reward used in the DeepSeek R1 paper.

[ ]

en
id
zh

[ ]

[-3.0, -3.0]
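A language-consistency reward of this kind can be sketched with the detector passed in as a function, so the example runs without extra packages. With langid installed, the detector could be `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a `(language, score)` pair. The -3.0 penalty matches the output above; the +5.0 reward, the function names, and the toy detector are assumptions.

```python
def language_reward(completions, detect_language, expected_lang="id",
                    reward=5.0, penalty=-3.0):
    """Reward completions whose reasoning is written in the expected language."""
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        # Score only the reasoning: the text before the closing tag.
        reasoning = response.split("</think>")[0].replace("<think>", "").strip()
        if not reasoning:
            scores.append(penalty)
            continue
        detected = detect_language(reasoning)
        scores.append(reward if detected == expected_lang else penalty)
    return scores


def toy_detector(text):
    """Stand-in for a real language detector, for illustration only."""
    return "id" if "saya" in text.lower() else "en"


completions = [
    [{"role": "assistant", "content": "<think>Saya perlu menghitung 1+1.</think>2"}],
    [{"role": "assistant", "content": "<think>I need to compute 1+1.</think>2"}],
]
print(language_reward(completions, toy_detector))
```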

We now prepare our main function, which prints the generated responses and the true answer, along with another reward function that converts the extracted text to a number with float and checks whether it equals the true answer.

[ ]
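The float-based check might look like the sketch below. It pulls the first number after `</think>`, optionally prints a log in the same Question / Answer / Response / Extracted layout as the training output, and compares the values as floats. The pattern, the score values, and the `verbose` flag are assumptions.

```python
import re

# First number that appears after the closing reasoning tag.
number_after_reasoning = re.compile(r"</think>.*?(-?\d[\d,]*(?:\.\d+)?)", re.DOTALL)


def check_numbers(prompts, completions, answer, verbose=False, **kwargs):
    """Compare the extracted number with the true answer as floats."""
    scores = []
    for prompt, completion, true_answer in zip(prompts, completions, answer):
        response = completion[0]["content"]
        match = number_after_reasoning.search(response)
        extracted = match.group(1) if match else None
        if verbose:
            question = prompt[-1]["content"]
            print("*" * 20 + f"Question:\n{question}\nAnswer:\n{true_answer}"
                  f"\nResponse:\n{response}\nExtracted:\n{extracted}")
        if extracted is None:
            scores.append(-2.5)  # nothing numeric after the reasoning
            continue
        try:
            guess = float(extracted.replace(",", ""))
            truth = float(true_answer.strip().replace(",", ""))
            scores.append(3.5 if guess == truth else -1.5)
        except ValueError:
            scores.append(0.0)  # the true answer is not numeric
    return scores


prompts = [[{"role": "user", "content": "What is 30 + 4?"}]] * 3
completions = [
    [{"role": "assistant", "content": "<think>30+4</think>The answer is 34."}],
    [{"role": "assistant", "content": "<think>30+4</think>It is 1,034"}],
    [{"role": "assistant", "content": "<think>30+4"}],
]
print(check_numbers(prompts, completions, ["34", "34", "34"]))
```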

Compute the 90th-percentile prompt length so that we don't accidentally truncate prompts.

That is, we remove the longest 10% of prompts.

[ ]

Map:   0%|          | 0/14116 [00:00<?, ? examples/s]

<｜begin▁of▁sentence｜>You are given a problem.
Think about the problem and provide your working out.
You must think in Bahasa Indonesia.<｜User｜>In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$.<｜Assistant｜>

Map:   0%|          | 0/14116 [00:00<?, ? examples/s]

Max Length =  180
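The filtering step can be sketched with plain Python: take the 90th-percentile token length as the cutoff and keep only prompts at or below it. This uses a nearest-rank percentile, whereas numerical libraries usually interpolate, so exact cutoffs can differ slightly. The token counts and the helper name are made up for illustration.

```python
def length_cutoff(lengths, percent=90):
    """Nearest-rank percentile: smallest length with at least `percent`% of items <= it."""
    ordered = sorted(lengths)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up prompt lengths in tokens.
token_lengths = [12, 15, 18, 20, 22, 25, 30, 41, 57, 180]

cutoff = length_cutoff(token_lengths)
kept = [n for n in token_lengths if n <= cutoff]
print("Max Length =", cutoff)
print(f"kept {len(kept)} of {len(token_lengths)} prompts")
```

The cutoff then serves as the maximum prompt length, and the remainder of the sequence budget is left for the completion.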

Train the model

Now set up GRPO Trainer and all configurations!

[ ]

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step	reward	reward_std	completion_length	kl
1	0.125000	0.000000	200.000000	0.000000
2	0.072375	0.248112	200.000000	0.000000
3	-0.079000	0.163776	182.500000	0.000005

[ ]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 12,728 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 87,293,952/8,000,000,000 (1.09% trained)

********************Question:
In the diagram, each of the three identical circles touch the other two.  The circumference of each circle is 36.  What is the perimeter of the shaded region? [asy]

defaultpen(1);

path p = (1, 0){down}..{-dir(30)}dir(-60){dir(30)}..{dir(-30)}((2, 0) + dir(-120)){-dir(-30)}..{up}(1, 0)--cycle;
fill(p, gray(0.75));

draw(unitcircle);
draw(shift(2 * dir(-60)) * unitcircle);
draw(shift(2) * unitcircle);
[/asy] 
Answer:
18 
Response:
<think>
Pertama, masalahnya adalah: Dalam diagram ini, ada tiga lingkaran identik yang saling menyentuh dengan sama lain. Setiap lingkaran memiliki keliling 36. Jadi, dulu aku perlu mencari keliling lingkaran karena diperlukan untuk mencari perimeter daerah yang diarsir.

Keliling setiap 
Extracted:
,

Unsloth: Will smartly offload gradients to save VRAM!
********************Question:
In a class of 20 students, all but 4 of the students put their names on a typed assignment. If the teacher randomly guesses, what is the probability that she correctly guesses which paper belongs to each of the four remaining students? Express your answer as a common fraction.The answer is in the form \frac{m}{n}, where gcd(m, n) = 1. Please provide the value of m + n. 
Answer:
25 
Response:
<think>
The problem states: In a class of 20 students, all but 4 of the students put their names on a typed assignment. If the teacher randomly guesses, what is the probability that she correctly guesses which paper belongs to each of the four remaining students? I need to find this probability as a fraction in simplest terms and then find the sum of numerator and denominator.

First, I need to understand the situation. There are 20 students, but only 16 of them put their names on their assignments because only 4 did not. The teacher is guessing which paper belongs to which student for the four papers that don't have names. But the problem is about guessing which paper 
Extracted:
20
********************Question:
Find the minimum possible value of
\[\sqrt{58 - 42x} + \sqrt{149 - 140 \sqrt{1 - x^2}}\]where $-1 \le x \le 1.$The answer is in the form k\sqrt{m}+n,. Please provide the value of k + m + n. 
Answer:
110 
Response:
<think>
I need to find the minimum possible value of \(\sqrt{58 - 42x} + \sqrt{149 - 140 \sqrt{1 - x^2}}\) for \(x\) between -1 and 1. The answer should be in the form \(k\sqrt{m} + n\), and then find \(k + m + n\).

First, I notice that there's a square root of \(1 - x^2\), which suggests trigonometric substitution because \(1 - x^2\) often appears as cosine squared if \(x = \sin \theta\) or something. Let me try that.

Set \(x = \sin \theta\), 
Extracted:
58
********************Question:
Find the number of three-element subsets of $\{1, 2, 3,...,13\}$ that contain at least one element that is a multiple of $2$, at least one element that is a multiple of $3$, and at least one element that is a multiple of $5$ such as $\{2,3, 5\}$ or $\{6, 10,13\}$. 
Answer:
63 
Response:
<think>
The problem is to find the number of three-element subsets of {1, 2, 3, ..., 13} that contain at least one multiple of 2, at least one multiple of 3, and at least one multiple of 5. Examples given are {2,3,5} and {6,10,13}. I need to box my final answer.

First, I should figure out what the universal set is. The set is {1,2,3,...,13}, so there are 13 elements.

I need to find three-element subsets that satisfy three conditions: at least one even number (multiple of 2), at least 
Extracted:
1,
********************Question:
The graph of the equation $x^{y}=y^{x}$ in the first quadrant (i.e., the region where $x>0$ and $y>0$) consists of a straight line and a curve. Find the sum of the coordinates of the intersection points of a straight line and a curve, rounded down. 
Answer:
5 
Response:
<think>
Pertama, saya perlu memahami persamaan \( x^y = y^x \) di kuadran pertama, di mana x dan y positif. Soal menyatakan bahwa grafik ini terdiri dari garis lurus dan kurva, dan saya harus mencari jumlah koordinat di titik potong antara garis lurus dan kurva tersebut.

Saya ingat bahwa persamaan \( x^y = y^ x \) sering diselesaikan dengan mengambil logaritma atau mencari hubungan antara x dan y. Misalnya, jika x dan y sama, maka \( x^x = x^x \), yang selalu benar, jadi garis y = x adalah 
Extracted:
,
********************Question:
Triangle $\triangle ABC$ has $AB= 3$, $BC = 4$, and $AC = 5$. Let $M$ and $N$ be the midpoints of $AC$ and $BC$, respectively. If line $AN$ intersects the circumcircle of triangle $\triangle BMC$ at points $X$ and $Y$, then $XY^2 = \frac{m}{n}$ for some relatively prime positive integers $m,n$. Find $m+n$. 
Answer:
933 
Response:
<think>
I am considering a geometry problem. We have a triangle with sides AB=3, BC=4, AC=5. I recognize that this is a right-angled triangle because 3-4-5 is a Pythagorean triple. Let me verify: AB² + BC² = 3² + 4² = 9 + 16 = 25, which is AC². So yes, angle at B is 90 degrees.

Now, M and N are midpoints. M is midpoint of AC, N is midpoint of BC. Since AC is 5, M is midpoint so AM = MC 
Extracted:
.
********************Question:
Find the product of uncommon real roots of the two polynomials $x^4 + 2x^3 - 8x^2 - 6x + 15$ and $x^3 + 4x^2 - x - 10$. 
Answer:
6 
Response:
<think>
Saya harus mencari hasil kali akar real yang tidak umum dari dua polinomial: \( x^4 + 2x^3 - 8x^2 - 6x + 15 \) dan \( x^3 + 4x^2 - x - 10 \). Akar tidak umum berarti akar yang dimiliki oleh salah satu polinomial tetapi bukan akar dari polinomial lainnya.

Pertama, saya perlu mencari akar dari kedua polinomial itu. Mulai dengan polinomial kubik yang lebih sederhana: \( x^3 + 4x^2 - x - 10 \).

Cara memfaktorkan polinomial. Saya akan mencoba mencari akar rasional menggunakan Teore 
Extracted:
4
********************Question:
A Parliament of $2000 MP$s decided to ratify the state budget of 200 expenditure items. Each $MP$ prepared a draft budget with what (s)he thinks the maximum possible allocation for each item so that the total expenditure does not exceed a given ceiling, S. For each item, the Parliament approves the maximum expenditure approved by not less than k $MP$s. What is the least value of k to ensure that the approved total does not exceed S? 
Answer:
1991 
Response:
<think>
Saya perlu memahami masalah ini dengan baik. Ada 2.000 anggota parlemen, dan mereka harus menyetujui anggaran untuk 200 item. Setiap MP membuat rancangan anggaran mereka sendiri dengan alokasi maksimum untuk setiap item sehingga total tidak melebihi S. Kemudian, parlemen menyetujui alokasi maksimum untuk setiap item jika disetujui oleh tidak kurang dari k orang. Pertanyaannya adalah, nilai k terkecil apa yang memastikan bahwa total alokasi yang dis 
Extracted:
.
********************Question:
Let $T$ be the set of all positive integer divisors of $2004^{100}$. What is the largest possible number of elements of a subset $S$ of $T$ such that no element in $S$ divides any other element in $S$? 
Answer:
10201 
Response:
<think>
The problem is to find the largest subset S of the positive integer divisors of \( 2004^{100} \) such that no element in S divides any other element in S. I need to reason step by step and box my final answer.

First, I need to understand what the set T is. T is the set of all positive divisors of \( 2004^{100} \). So, I should start by factoring 2004 into its prime factors to understand the structure of T.

Let me factorize 2004. 2004 divided by 2 is 1002, so 2 is a factor. 1002 divided by 2 is 501, so another 2.  
Extracted:
2004
********************Question:
Evaluate the expression \[(751 - 745) + (748 - 742) + (745 - 739) + (742 - 736) + \cdots + (499 - 493) + (496 - 490).\] 
Answer:
516 
Response:
<think>
Pertama, soalnya adalah untuk mengevaluasi ekspresi ini:  
\[(751 - 745) + (748 - 742) + (745 - 739) + (742 - 736) + \cdots + (499 - 493) + (496 - 490).\]

Saya harus memahami pola dalam ekspresi ini. Setiap suku adalah perbedaan antara dua angka. Mari kita lihat perbedaan antara angka pertama dan angka kedua dalam setiap pasangan.

Untuk (751 - 74 
Extracted:
,
********************Question:
Find the eighth term of the sequence $1440,$ $1716,$ $1848,\ldots,$ whose terms are formed by multiplying the corresponding terms of two arithmetic sequences.
 
Answer:
348 
Response:
<think>
Saya harus mencari suhu utama dari suatu urutan yang diberikan: 1440, 1716, 1848, dan seterusnya. Suku-sukunya terbentuk dengan mengalikan istilah yang sesuai dari dua urutan aritmatika.

Pertama, saya perlu mengingat apa itu urutan aritmatika. Urutan aritmatika adalah urutan bilangan di mana selisih antara dua suku yang berurutan adalah konstan. Jadi, jika saya punya dua urutan aritmatika, misalnya a, a+d, a+2d, ... dan b, b+e, b+2e, ..., maka hasil kali suku ke-n dari kedua urutan tersebut adalah n-th term dari urutan yang diberikan.

Mis 
Extracted:
1440,
********************Question:
Compute the $100^{\text{th}}$ smallest positive integer $n$ that satisfies the following three congruences:

\[
\begin{aligned}
\left\lfloor \frac{n}{8} \right\rfloor &\equiv 3 \pmod{4}, \\
\left\lfloor \frac{n}{32} \right\rfloor &\equiv 2 \pmod{4}, \\
\left\lfloor \frac{n}{256} \right\rfloor &\equiv 1 \pmod{4}.
\end{aligned}
\]

Here, $\lfloor \cdot \rfloor$ denotes the greatest integer function. 
Answer:
6491 
Response:
<think>
Saya perlu mencari bilangan bulat positif ke-100 yang memenuhi tiga kongrens berikut:

1. \(\left\lfloor \frac{n}{8} \right\rfloor \equiv 3 \pmod{4}\)

2. \(\left\lfloor \frac{n}{32} \right\rfloor \equiv 2 
Extracted:
-100
********************Question:
Compute \[\lfloor 1 \rfloor + \lfloor 1.6 \rfloor + \lfloor 2.2 \rfloor + \lfloor 2.8 \rfloor + \dots + \lfloor 99.4 \rfloor + \lfloor 100 \rfloor,\]where the arguments of the floor functions are in arithmetic progression. 
Answer:
8317 
Response:
<think>
Saya perlu menghitung jumlah dari floor function untuk suku-suku dalam barisan arithmetic. Barisan diberikan mulai dari 1, 1.6, 2.2, 2.8, dan seterusnya hingga 100. Suku pertama adalah 1, dan yang terakhir adalah 100. Saya harus mencari pola atau aturan untuk barisan ini.

Pertama, temukan pola barisan itu. Suku pertama: 1, suku kedua: 1.6, suku ketiga: 2.2, suku keempat: 2.8. Perbedaan 
Extracted:
.
********************Question:
\[ \int_{-10}^{10} \left| 4 - \left| 3 - \left| 2 - \left| 1 - |x| \right| \right| \right| \right| \, \mathrm{d}x \] 
Answer:
44 
Response:
<think>
Saya harus mengevaluasi integral tertentu dari -10 hingga 10 dari fungsi mutlak yang bersarang, yaitu |4 - |3 - |2 - |1 - |x||||. Karena ini adalah integral tertentu dengan batas asimtotik, mungkin fungsinya periodik atau memiliki pola yang berulang. Saya perlu memahami perilaku fungsi ini terhadap x.

Pertama, mari kita definisikan f(x) = |4 - |3 - |2 - |1 - |x||||.

Karena ada mutlak, fungsinya akan selalu non-negatif, dan mungkin terdefinisi untuk semua x, karena ada |x| di dalamnya 
Extracted:
-10
********************Question:
The number $734,851,474,594,578,436,096$ is equal to $n^6$ for some positive integer $n$. What is the value of $n$? 
Answer:
3004 
Response:
<think>
Saya diberikan angka 734851474594578436096 dan diinformasikan bahwa ini sama dengan n^6 untuk suatu bilangan bulat positif n. Saya perlu mencari nilai n.

Karena ini adalah sebuaik pangkat 6, dan pangkat 6 adalah hasil kali dari pangkat 2, 3, atau 1 dan pangkat lainnya, mungkin lebih mudah untuk memeriksa apakah ini adalah pangkat kuadrat atau pangkat kubik dan seterusnya, dan kemudian mengambil akar untuk mendapatkan n.

Saya tahu bahwa untuk mencari akar pangkat k, umumnya dengan mengambil akar pangkat dua atau tiga lebih mud 
Extracted:
734851474594578436096
********************Question:
What is the sum of all possible values of $\cos(2\theta)$ if $\cos(2\theta) = 2\cos(\theta)$ for a real number $\theta$? 
Answer:
2 
Response:
<think>
Saya perlu mencari jumlah dari semua nilai mungkin untuk cos(2θ) jika cos(2θ) = 2cos(θ) untuk suatu bilangan real θ. Saya harus berpikir dalam bahasa Indonesia, tetapi solusi ini harus dalam bahasa Inggris karena instruksi untuk jawaban dalam bahasa Inggris.

Pertama, saya ingat identitas trigonometri untuk cos(2θ). Ada beberapa, seperti:

1. cos(2θ) = cos²(θ) - sin²(θ)

2. cos(2θ) = 2cos²(θ) - 1

3. cos(2θ) = 1 - 2sin²(θ)

4. cos(2θ) = (1 - tan²(θ)) / (1 + tan²(θ)), tetapi mungkin tidak 
Extracted:
2
********************Question:
An engineer invested $\$10,\!000$ in a six-month savings certificate that paid a simple annual interest rate of $12\%$. After six months, she invested the total value of her investment in another six-month certificate. After six more months, the investment was worth $\$11,\!130$. If the annual interest rate of the second certificate is $r\%,$ then what is $r?$ 
Answer:
10 
Response:
<think>
Pertama, soalnya: Seorang insinyur menanam modal $10.000 di sertifikat tabungan dengan masa simpan enam bulan dan suku bunga sederhana tahunan 12%. Setelah enam bulan, dia menginvestasikan nilai total investasinya ke dalam sertifikat lain dengan masa simpan enam bulan lagi. Setelah enam bulan lagi, investasinya bernilai $11.130. Suku bunga tahunan untuk sertifikat kedua adalah r%. Kita perlu mencari r.

Pertama, karena ini suku 
Extracted:
,
********************Question:
In a far-off land, three fish can be traded for two loaves of bread, and a loaf of bread can be traded for four bags of rice. How many bags of rice is one fish worth? Provide your answer in the form \(\frac{k}{m}\), where \(k\) and \(m\) are integers, and give the value of \(k + m\). 
Answer:
11 
Response:
<think>
The problem is set in a far-off land where three fish can be traded for two loaves of bread, and a loaf of bread can be traded for four bags of rice. I need to find out how many bags of rice one fish is worth, and express it as a fraction k/m, then find k + m.

First, I need to find the value of one fish in terms of rice. I have two trades: one involving fish and bread, and another involving bread and rice.

Let me denote:

Let B be the number of loaves of bread.

From the first trade: three fish = two loaves of bread. So, 3 fish = 2B.

Therefore, one fish = (2B)/3.

 
Extracted:
,
********************Question:
In triangle $ABC$, $\sin \angle A = \frac{4}{5}$ and $\angle A < 90^\circ$. Let $D$ be a point outside triangle $ABC$ such that $\angle BAD = \angle DAC$ and $\angle BDC = 90^\circ$. Suppose that $AD = 1$ and that $\frac{BD}{CD} = \frac{3}{2}$. If $AB + AC$ can be expressed in the form $\frac{a\sqrt{b}}{c}$ where $a, b, c$ are pairwise relatively prime integers, find $a + b + c$. 
Answer:
34 
Response:
<think>
The problem involves a triangle with given angles and a ratio, and I need to find AB + AC in a specific form. First, since it's a geometry problem with trig functions and a ratio, I should draw a sketch.

Triangle ABC has sin angle A = 4/5 and angle A is less than 90 degrees, so it's acute. Sin A = 4/5, so I can find cos A since it's acute. Cos 
Extracted:
,
********************Question:
Find the number of $10$-digit palindromes that are not divisible by $11$. 
Answer:
0 
Response:
<think>
Saya perlu menemukan berapa banyak palindrom 10-digit yang tidak habis dibagi oleh 11. Palindrom 10-digit berarti angka tersebut sama ketika dibaca dari depan atau belakang. Contohnya, 123454321 adalah palindrom.

Sebuah palindrom 10-digit memiliki digit sebagai berikut: digit pertama (d1), kedua (d2), ketiga (d3), keempat (d4), kelima (d5), keenam (d6), ketujuh (d7), kedelapan (d8), kesembilan (d9), dan kesepuluh (d10). Karena palindrom, d1 = d10, d2 = d9, d3 = d8, d4 = d7, dan d5 = d6.

Jadi, palindrom 10-digit se 
Extracted:
10

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}

TrainOutput(global_step=100, training_loss=0.00025416401336116225, metrics={'train_runtime': 5481.0959, 'train_samples_per_second': 0.073, 'train_steps_per_second': 0.018, 'total_flos': 0.0, 'train_loss': 0.00025416401336116225})

Inference

Now let's try the model we just trained! First, let's try the model without any GRPO training:

[ ]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

' | Socratic\nWhat is the sqrt of 101?\nAlgebra\nQuestion\nVincent A.\nAnswer\n10.049875, or as an irrational number, it is not integer, I see what you mean though.\nExplanation:\nsqrt(100) = 10, sqrt(121)=11, so sqrt(101) is between 10 and 11, and there is not integer between 10 and 11, so it is irrational.\nWe can leave it as sqrt(101) or give the approximate value.\nThe question asks "what is the sqrt of 101", and it is a math problem, so probably they want to know if it is integer or not.\nOr perhaps do some calculations.\nI think the intended answer is that it is irrational and not an integer, but since 10^2=100 and 11^2=121, and 10^2 <101<11^2, so no integer square.\nI could also use calculator, but I think since it\'s a math problem, perhaps we need to show'

And now with the LoRA we just trained with GRPO. First, we save the LoRA:

[ ]

Unrecognized keys in `rope_scaling` for 'rope_type'='yarn': {'attn_factor'}

Verify LoRA is actually trained!

[ ]
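One way to think about this check: in the standard LoRA setup the B matrices start at zero, so an adapter whose B matrices are still all zeros has not learned anything. The sketch below is a plain-Python illustration of that idea, not the notebook's own cell. The weight names and values are hypothetical, and a real check would read the tensors from the saved adapter file instead.

```python
from typing import Iterable, Mapping


def untrained_lora_b_matrices(weights: Mapping[str, Iterable[float]]) -> list[str]:
    """Return the names of LoRA B matrices whose values are all zero."""
    return [
        name
        for name, values in weights.items()
        if "lora_B" in name and all(v == 0 for v in values)
    ]


# Hypothetical flattened weights standing in for tensors from a saved adapter.
example_weights = {
    "layers.0.q_proj.lora_A.weight": [0.01, -0.02],
    "layers.0.q_proj.lora_B.weight": [0.0, 0.003],
    "layers.1.q_proj.lora_B.weight": [0.0, 0.0],
}

# Any name listed here would indicate a B matrix that was never updated.
print(untrained_lora_b_matrices(example_weights))
```

An empty result would suggest that every B matrix moved away from its zero initialization.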

Now we load the LoRA and test. We first test without our custom system prompt; in this setting the LoRA should have little or no effect on the model's original reasoning ability:

[ ]

"<think>\nI have this equation: (x + 2)² = 0. It looks simple, but I need to solve for x. Since it's a squared term equal to zero, that means the thing inside the parentheses must be zero because only zero squared is zero.\n\nSo, if (x + 2)² = 0, then x + 2 must be equal to zero. Because if x + 2 were anything else, say 1, squared is 1, which is not zero. Or -1, squared is also 1, not zero. So only when x + 2 is zero, the square is zero.\n\nSo, x + 2 = 0, which means x = -2.\n\nI think that's it. Let me verify by plugging it back into the equation.\n\nx = -2, so ( -2 + 2 )² = (0)² = 0, which equals 0. Perfect.\n\nI recall that in algebra, this is related to the zero product property or something. Basically, if a product is zero, then one of the factors must be zero. Here, it's not a product, but"

Next, let's test with our system prompt, which should make the model reason in the new language:

[ ]

"<think>\nI need to solve the equation (x + 2)^2 = 0. It's a quadratic equation, but it's squared, so it might be simpler. Let me think about this in Bahasa Indonesia first.\n\nPermulaan: Saya diberikan persamaan (x + 2)^2 = 0, dan saya harus menyelesaikannya. Saya perlu mencari nilai x yang memenuhi persamaan ini.\n\nSaya mulai dengan merasakan apa itu persamaan. Ini adalah persamaan kuadrat karena ada eksponen 2, tapi tidak semua persamaan kuadrat memiliki dua solusi; beberapa mungkin memiliki solusi ganda.\n\nSaya tahu bahwa jika sesuatu dikalikan dengan dirinya sendiri dan hasilnya nol, maka satu di antaranya harus nol. Jadi, untuk (x + 2)^2 = 0, itu berarti x + 2 harus sama dengan 0, karena jika x + 2 tidak"

Let's compare these results with the same system prompt but without our LoRA:

[ ]

'<think>\nBaik, mari kita selesaikan persamaan kuadrat sederhana ini. Soalnya adalah: Solve (x + 2)^2 = 0.\n\nPertama, saya perlu memahami apa yang diminta. Ini adalah persamaan kuadrat yang diatur dalam bentuk kuadrat. Saya harus mencari nilai dari x yang memenuhi persamaan tersebut.\n\nSaya tahu bahwa ketika suatu bilangan kuadrat sama dengan nol, itu berarti bilangan tersebut adalah nol. Jadi, dalam hal ini, (x + 2)^2 = 0 berarti kuadrat dari (x + 2) adalah nol. Maka, untuk kuadrat suatu bilangan sama dengan nol, bilangan tersebut haruslah nol itu sendiri.\n\nOleh karena itu, (x + 2) haruslah sama dengan nol. Jadi, x + 2 = 0.\n\nSekarang, untuk mencari x, saya'

Let's take 20 samples and compare how often the model responds in the correct language with and without our LoRA:

[ ]

Dataset({
    features: ['prompt', 'solution', 'data_source', 'source_prompt', 'ability', 'reward_model', 'extra_info', 'answer'],
    num_rows: 20
})

[ ]

Comparing language usage with and without LoRA on 20 samples:
============================================================

Processed 5/20 samples...

Processed 10/20 samples...

Processed 15/20 samples...

Processed 20/20 samples...

============================================================
RESULTS:
With LoRA - Indonesian responses: 16/20 (80.0%)
Without LoRA - Indonesian responses: 9/20 (45.0%)
Improvement: +7 Indonesian responses with LoRA
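The tally above could be produced along these lines. This is a minimal sketch, not the notebook's own cell: `count_language`, `summarize`, and `toy_detect` are hypothetical names. Since this notebook uses langid for language detection, something like `lambda text: langid.classify(text)[0]` could be passed as `detect`; the toy detector below only keeps the example runnable without extra packages.

```python
from typing import Callable, Sequence


def count_language(
    responses: Sequence[str],
    detect: Callable[[str], str],
    target: str = "id",  # ISO 639-1 code for Indonesian
) -> int:
    """Count the responses whose detected language matches the target."""
    return sum(1 for text in responses if detect(text) == target)


def summarize(with_lora: int, without_lora: int, total: int) -> str:
    """Format the counts in the same style as the results printed above."""
    return "\n".join([
        f"With LoRA - Indonesian responses: {with_lora}/{total} ({100 * with_lora / total:.1f}%)",
        f"Without LoRA - Indonesian responses: {without_lora}/{total} ({100 * without_lora / total:.1f}%)",
        f"Improvement: {with_lora - without_lora:+d} Indonesian responses with LoRA",
    ])


def toy_detect(text: str) -> str:
    # Hypothetical stand-in for a real detector: looks for a few Indonesian words.
    words = set(text.lower().split())
    return "id" if words & {"saya", "adalah", "yang"} else "en"


samples = ["Saya perlu mencari nilai x", "I need to solve for x"]
print(count_language(samples, toy_detect))
print(summarize(16, 9, 20))
```

With only 20 samples the percentages are coarse, so a larger evaluation set would give a more reliable comparison.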

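Our reasoning model is much better at using the target language: 16 of 20 responses were in Indonesian with the LoRA, versus 9 of 20 without it. It's not always correct, since we only trained it for about an hour and a half (100 steps) - it'll be better if we extend the sequence length and train for longer!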

Saving to float16 for vLLM

We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens to get your personal tokens. See our docs for more deployment options.

[ ]
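As a rough sketch of how the save options above fit together, the helper below validates the save method and then calls `save_pretrained_merged` on the model. The helper name `save_merged`, the output directory, and the recording stub are hypothetical. The stub only exists so the example runs without a GPU. The accepted `save_method` values are taken from the text above and may differ between Unsloth versions.

```python
# Save methods named in the text above; assumed, not an exhaustive list.
ALLOWED_SAVE_METHODS = {"merged_16bit", "merged_4bit", "lora"}


def save_merged(model, tokenizer, out_dir: str, save_method: str = "merged_16bit") -> None:
    """Validate the save method, then delegate to the model's own saver."""
    if save_method not in ALLOWED_SAVE_METHODS:
        raise ValueError(f"Unknown save_method: {save_method!r}")
    model.save_pretrained_merged(out_dir, tokenizer, save_method=save_method)


class RecordingModel:
    """Hypothetical stand-in that records calls instead of writing files."""

    def __init__(self) -> None:
        self.calls = []

    def save_pretrained_merged(self, out_dir, tokenizer, save_method):
        self.calls.append((out_dir, save_method))


stub = RecordingModel()
save_merged(stub, tokenizer=None, out_dir="model", save_method="merged_16bit")
print(stub.calls)
```

In a real run, `model` and `tokenizer` would be the objects loaded earlier in the notebook rather than the stub.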

GGUF / llama.cpp Conversion

We now support saving to GGUF / llama.cpp natively! We clone llama.cpp and save to q8_0 by default. Other quantization methods, such as q4_k_m, are also supported. Use save_pretrained_gguf for local saving and push_to_hub_gguf for uploading to HF.

Some supported quant methods (full list on our docs page):

q8_0 - Fast conversion. High resource use, but generally acceptable.
q4_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
q5_k_m - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[NEW] To finetune and auto export to Ollama, try our Ollama notebook

[ ]

Now, use the deepseek_r1_finetune.Q8_0.gguf file or deepseek_r1_finetune.Q4_K_M.gguf file in llama.cpp.
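The two filenames above suggest a simple naming pattern: the model name, the upper-cased quantization method, and a `.gguf` extension. The helper below is a hypothetical illustration of that pattern, inferred only from those two filenames. The actual output name may vary with the Unsloth or llama.cpp version.

```python
def gguf_filename(model_name: str, quantization_method: str) -> str:
    """Build the expected GGUF filename from a model name and quant method."""
    return f"{model_name}.{quantization_method.upper()}.gguf"


for method in ("q8_0", "q4_k_m"):
    print(gguf_filename("deepseek_r1_finetune", method))
```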

And we're done! If you have any questions about Unsloth, find any bugs, want to keep up with the latest LLM news, or want help or to join projects, feel free to join our Discord!

Some other resources:

Train your own reasoning model - Llama GRPO notebook Free Colab
Saving finetunes to Ollama. Free notebook
Llama 3.2 Vision finetuning - Radiography use case. Free Colab
See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!

This notebook and all Unsloth notebooks are licensed LGPL-3.0.