Goal: Teach a model to play sudoku with GRPO using Unsloth and NeMo Gym
Our goal is to teach Qwen2.5-1.5B-Instruct to play sudoku using GRPO on a single GPU!
You will learn how to:
- configure an Unsloth optimized model
- start a NeMo Gym resources server
- train using Unsloth and NeMo Gym
- test and save the trained model
This notebook was developed on 1 H100 GPU through NVIDIA Brev.
If you are using a GPU with less VRAM, adjust the configuration accordingly: for example, reduce the maximum output length, enable quantization, or switch to parameter-efficient finetuning. Unsloth has many examples of low-VRAM training that work with NeMo Gym training environments!
Installation
If you are using Google Colab, please visit Unsloth installation docs rather than the pip install below.
Load the model
In this example, we will do full finetuning, but Unsloth also supports optimized low-precision training (e.g. 4-bit or 8-bit) and parameter-efficient training methods (e.g. LoRA). Check out Unsloth's documentation if you are interested in these methods!
If you want to try out LoRA, uncomment the code below, and make sure that full_finetuning is set to False above. LoRA is a parameter-efficient training method that reduces computational cost by only training a small percentage of the full model parameters.
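As a rough sketch, the model load might look like the following. This assumes Unsloth's FastLanguageModel API; the hyperparameter values are illustrative, not the notebook's exact settings.

```python
# Sketch: loading the model with Unsloth's FastLanguageModel.
# All hyperparameter values here are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    max_seq_length=2048,   # reduce on lower-VRAM GPUs
    load_in_4bit=False,    # set True for 4-bit quantized training
    full_finetuning=True,  # set False if you attach LoRA adapters below
)

# Optional LoRA variant (requires full_finetuning=False above):
# model = FastLanguageModel.get_peft_model(
#     model,
#     r=16,            # LoRA rank: higher = more trainable parameters
#     lora_alpha=16,
#     target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
#                     "gate_proj", "up_proj", "down_proj"],
# )
```

Since this configuration needs a GPU and the Unsloth package at runtime, treat it as a template to adapt rather than a drop-in cell.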
NeMo Gym resources server setup
NeMo Gym resources servers provide tool implementations and the logic to process actions, update state, emit observations, and calculate rewards for the actions taken.
The reasoning gym resources server is an integration of reasoning gym, a library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). It includes more than 100 tasks across many domains with configurable difficulty, including but not limited to algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games.
The cell below will automatically:
- Clone NeMo Gym (requires Python 3.12+ and uv on the system)
- Set up the virtual environment and install dependencies
- Create the mini sudoku training dataset
- Start the resources server in the background
Google Colab is auto-detected and the uv_pip_set_python=true flag is added when needed.
NeMo Gym starts a head server on port 11000 by default, and the resources server port is selected at random from available ports, unless specified otherwise. We can automatically extract the resources server port using the head server:
Dataset prep
Next, let's load the dataset. The mini sudoku dataset was already generated by the setup cell above, using a script in NeMo Gym. Now load it!
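To make the loading step concrete, here is a minimal stdlib-only sketch of reading a JSONL dataset. The record schema shown (a question/answer pair) is an assumption for illustration; the real file produced by NeMo Gym's generator script may differ.

```python
import json

# Illustrative only: the real dataset is produced by NeMo Gym's generator
# script, and its exact schema may differ from this sketch.
sample = {
    "question": "Solve this 4x4 mini sudoku: ...",
    "answer": "2 1 4 3\n4 3 2 1\n1 2 3 4\n3 4 1 2",
}

# Write one JSON object per line (JSONL), then read it back.
with open("mini_sudoku.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")

with open("mini_sudoku.jsonl") as f:
    dataset = [json.loads(line) for line in f]

print(len(dataset), sorted(dataset[0].keys()))  # 1 ['answer', 'question']
```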
Define reward function
Now let's create a reward function that uses NeMo Gym's verifier.
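To show the shape such a reward function takes, here is a simplified stand-in that grades exact-match correctness in plain Python. The `<answer>` tag convention and the function names are assumptions for illustration; the real notebook delegates grading to NeMo Gym's verifier, which can apply reasoning gym's task-specific scoring.

```python
import re

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a model completion.
    Assumes the prompt instructs the model to wrap its answer in
    <answer>...</answer> tags; the real tagging convention is set
    by the notebook's prompt."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return match.group(1).strip() if match else ""

def correctness_reward(completions, answers):
    """Stand-in reward: 1.0 for an exact match with the reference
    solution, 0.0 otherwise. The real notebook instead calls NeMo Gym's
    verifier for this step."""
    return [
        1.0 if extract_answer(c) == a else 0.0
        for c, a in zip(completions, answers)
    ]

rewards = correctness_reward(
    ["I think... <answer>4 3 2 1</answer>", "no tags here"],
    ["4 3 2 1", "4 3 2 1"],
)
print(rewards)  # [1.0, 0.0]
```

GRPO only needs per-completion scalar rewards like this list; the verifier backing the scores can be as simple or as rigorous as the task demands.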
Configure and launch GRPO
In this example, we will train the model using Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. Unsloth also supports GSPO, DAPO, Dr. GRPO, and more!
Below we set the training hyperparameters. We will train for 100 steps and see significant improvements in the model's performance at completing mini sudoku!
During training, you should see the reward rise from around 0.15 to 0.6 over 100 steps as the model learns how to play this version of sudoku.
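A sketch of the GRPO setup is below, assuming TRL's `GRPOConfig`/`GRPOTrainer` interface (which Unsloth patches for speed). All hyperparameter values and the `reward_fn` name are illustrative placeholders, not the notebook's exact settings.

```python
# Sketch of a GRPO run with TRL's GRPOTrainer.
# Hyperparameter values are illustrative, not the notebook's settings.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="grpo_sudoku",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    num_generations=8,           # completions sampled per prompt (the "group")
    max_completion_length=1024,  # shrink on lower-VRAM GPUs
    max_steps=100,               # matches the 100-step run described above
    logging_steps=1,
)

trainer = GRPOTrainer(
    model=model,              # the Unsloth model loaded earlier
    processing_class=tokenizer,
    args=training_args,
    train_dataset=dataset,    # the mini sudoku dataset
    reward_funcs=[reward_fn], # the reward function from the previous section
)
trainer.train()
```

GRPO scores groups of sampled completions against each other, so `num_generations` trades off reward-signal quality against VRAM and generation time.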
Test the trained model!
Saving to float16 or MXFP4 for vLLM
Unsloth supports saving to float16 directly: select merged_16bit for float16. Unsloth also supports saving in low or mixed precision such as MXFP4, and can fall back to saving LoRA adapters. Use push_to_hub_merged to upload to your Hugging Face account; you can create a personal token at https://huggingface.co/settings/tokens. See our docs for more deployment options.
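As a sketch, the save step might look like this, assuming Unsloth's merged-save helpers; the output directory, repo name, and token are placeholders to substitute with your own.

```python
# Sketch: saving the merged float16 model for vLLM.
# Directory, repo name, and token below are placeholders.
if False:  # flip to True when you actually want to save/upload
    model.save_pretrained_merged(
        "sudoku-grpo-model", tokenizer, save_method="merged_16bit",
    )
    # Or upload straight to the Hugging Face Hub:
    model.push_to_hub_merged(
        "your-username/sudoku-grpo-model", tokenizer,
        save_method="merged_16bit", token="hf_...",
    )
```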
And we're done! If you have any questions about Unsloth, find a bug, want to keep up with the latest LLM developments, or would like to join community projects, feel free to join our Discord!
Some other resources:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, continued pretraining, conversational finetuning, and more in our documentation!