Goal: Teach a model to play sudoku with GRPO using Unsloth and NeMo Gym
Our goal is to teach Qwen2.5-1.5B-Instruct to play sudoku using GRPO on a single GPU!
You will learn how to:
- configure an Unsloth optimized model
- start a NeMo Gym resources server
- train using Unsloth and NeMo Gym
- test and save the trained model
This notebook was developed on 1 H100 GPU through NVIDIA Brev.
If you are using a GPU with less VRAM, adjust the configuration accordingly: for example, reduce the maximum output length, enable quantization, or switch to parameter-efficient finetuning. Unsloth has many examples of low-VRAM training that work with NeMo Gym training environments!
Installation
If you are using Google Colab, please visit Unsloth installation docs rather than the pip install below.
Load the model
In this example, we will do full finetuning, but Unsloth also supports optimized low-precision training (e.g. 4-bit or 8-bit) and parameter-efficient training methods (e.g. LoRA). Check out Unsloth's documentation if you are interested in these methods!
If you want to try out LoRA, uncomment the code below, and make sure that full_finetuning is set to False above. LoRA is a parameter-efficient training method that reduces computational cost by only training a small percentage of the full model parameters.
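As a rough sketch, the model load might look like the following. This assumes Unsloth's FastLanguageModel API; the hyperparameter values are illustrative, not the notebook's exact settings.

```python
# Sketch: loading the model with Unsloth's FastLanguageModel.
# All hyperparameter values here are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,   # reduce on lower-VRAM GPUs
    load_in_4bit=False,    # set True for 4-bit quantized training
    full_finetuning=True,  # set False if you attach LoRA adapters below
)

# Optional LoRA variant (requires full_finetuning=False above):
# model = FastLanguageModel.get_peft_model(
#     model,
#     r=16,            # LoRA rank: higher = more trainable parameters
#     lora_alpha=16,
#     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
#                     "gate_proj", "up_proj", "down_proj"],
# )
```

Since this configuration needs a GPU and the Unsloth package at runtime, treat it as a template to adapt rather than a drop-in cell.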
NeMo Gym resources server setup
NeMo Gym resources servers provide tool implementations and the logic to process actions, update state, emit observations, and calculate rewards for the actions taken.
The reasoning gym resources server is an integration of reasoning gym, a library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). It includes more than 100 tasks across many domains with configurable difficulty, including but not limited to algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games.
The cell below will automatically:
- Clone NeMo Gym (requires Python 3.12+ and uv on the system)
- Set up the virtual environment and install dependencies
- Create the mini sudoku training dataset
- Start the resources server in the background
Google Colab is auto-detected and the uv_pip_set_python=true flag is added when needed.
NeMo Gym starts a head server on port 11000 by default, and the resources server port is selected at random from available ports, unless specified otherwise. We can automatically extract the resources server port using the head server:
Dataset prep
Next, let's load the dataset. The mini sudoku dataset was already generated by the setup cell above, using a script in NeMo Gym. Now load it!
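To make the loading step concrete, here is a minimal stdlib-only sketch of reading a JSONL dataset. The record schema shown (a question/answer pair) is an assumption for illustration; the real file produced by NeMo Gym's generator script may differ.

```python
import json

# Illustrative only: the real dataset is produced by NeMo Gym's generator
# script, and its exact schema may differ from this sketch.
sample = {
    "question": "Solve this 4x4 mini sudoku: ...",
    "answer": "2 1 4 3\n4 3 2 1\n1 2 3 4\n3 4 1 2",
}

# Write one JSON object per line (JSONL), then read it back.
with open("mini_sudoku.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")

with open("mini_sudoku.jsonl") as f:
    dataset = [json.loads(line) for line in f]

print(len(dataset), sorted(dataset[0].keys()))  # 1 ['answer', 'question']
```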
Define reward function
Now let's create a reward function that uses NeMo Gym's verifier.
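To show the shape such a reward function takes, here is a simplified stand-in that grades exact-match correctness in plain Python. The `<answer>` tag convention and the function names are assumptions for illustration; the real notebook delegates grading to NeMo Gym's verifier, which can apply reasoning gym's task-specific scoring.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a model completion.
    Assumes the prompt instructs the model to wrap its answer in
    <answer>...</answer> tags; the real tagging convention is set
    by the notebook's prompt."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answers):
    """Stand-in reward: 1.0 for an exact match with the reference
    solution, 0.0 otherwise. The real notebook instead calls NeMo Gym's
    verifier for this step."""
    return [
        1.0 if extract_answer(c) == a else 0.0
        for c, a in zip(completions, answers)
    ]

rewards = correctness_reward(
    ["I think... <answer>4 3 2 1</answer>", "no tags here"],
    ["4 3 2 1", "4 3 2 1"],
)
print(rewards)  # [1.0, 0.0]
```

GRPO only needs per-completion scalar rewards like this list; the verifier backing the scores can be as simple or as rigorous as the task demands.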
Configure and launch GRPO
In this example, we will train the model using Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. Unsloth also supports GSPO, DAPO, Dr. GRPO, and more!
Below we set the training hyperparameters. We will train for 100 steps and see significant improvements in the model's performance at completing mini sudoku!
During training, you should see the reward rise from around 0.15 to 0.6 over 100 steps as the model learns how to play this version of sudoku.
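A sketch of the GRPO setup is below, assuming TRL's `GRPOConfig`/`GRPOTrainer` interface (which Unsloth patches for speed). All hyperparameter values and the `reward_fn` name are illustrative placeholders, not the notebook's exact settings.

```python
# Sketch of a GRPO run with TRL's GRPOTrainer.
# Hyperparameter values are illustrative, not the notebook's settings.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="grpo_sudoku",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_generations=8,           # completions sampled per prompt (the "group")
    max_completion_length=1024,  # shrink on lower-VRAM GPUs
    max_steps=100,               # matches the 100-step run described above
    logging_steps=1,
)

trainer = GRPOTrainer(
    model=model,              # the Unsloth model loaded earlier
    processing_class=tokenizer,
    args=training_args,
    train_dataset=dataset,    # the mini sudoku dataset
    reward_funcs=[reward_fn], # the reward function from the previous section
)
trainer.train()
```

GRPO scores groups of sampled completions against each other, so `num_generations` trades off reward-signal quality against VRAM and generation time.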
Test the trained model!
Saving to float16 or MXFP4 for vLLM
Unsloth supports saving to float16 directly: select merged_16bit for float16. Unsloth also supports saving in low or mixed precision such as MXFP4, and can fall back to saving LoRA adapters. Use push_to_hub_merged to upload to your Hugging Face account; you can create a personal token at https://huggingface.co/settings/tokens. See our docs for more deployment options.
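As a sketch, the save step might look like this, assuming Unsloth's merged-save helpers; the output directory, repo name, and token are placeholders to substitute with your own.

```python
# Sketch: saving the merged float16 model for vLLM.
# Directory, repo name, and token below are placeholders.
if False:  # flip to True when you actually want to save/upload
    model.save_pretrained_merged(
        "sudoku-grpo-model", tokenizer, save_method="merged_16bit",
    )
    # Or upload straight to the Hugging Face Hub:
    model.push_to_hub_merged(
        "your-username/sudoku-grpo-model", tokenizer,
        save_method="merged_16bit", token="hf_...",
    )
```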
And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM developments, or would like to join community projects, feel free to join our Discord!
Some other resources:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, continued pretraining, conversational finetuning, and more in our documentation!