NeMo Gym Multi Environment
Goal: Multi-environment GRPO using Unsloth and NeMo Gym
Modern reinforcement learning often involves training a model for more than one task. For example, we may want to train an agent to do deep research, software engineering, and puzzle solving simultaneously. Each environment can have different dependencies, tools, state, or other complex requirements.
In this notebook, we demonstrate how to train a model with multiple NeMo Gym environments using Unsloth. Our goal is to teach Qwen2.5-1.5B-Instruct to play sudoku AND follow instructions better using GRPO on a single GPU!
You will learn how to:
- configure an Unsloth optimized model
- start multiple NeMo Gym resources servers
- train using Unsloth and multiple NeMo Gym environments
- test and save the trained model
This notebook was developed on 1 H100 GPU through NVIDIA Brev.
If you are using a GPU with lower VRAM, you should adjust configuration parameters accordingly, such as max output length, quantization, or parameter-efficient finetuning. Unsloth has many examples of low-VRAM training that work with NeMo Gym training environments!
Installation
If you are using Google Colab, please visit Unsloth installation docs rather than the pip install below.
Load the model
In this example, we will do full finetuning, but Unsloth supports optimized low precision (e.g. 4 or 8 bit) or parameter-efficient training methods (e.g. LoRA). Check out Unsloth's documentation if you are interested in these methods!
If you want to try out LoRA, uncomment the code below, and make sure that full_finetuning is set to False above. LoRA is a parameter-efficient training method that reduces computational cost by only training a small percentage of the full model parameters.
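As a rough sketch, a LoRA setup along these lines might use the hyperparameters below. All values are illustrative assumptions, not the notebook's exact configuration; they would be passed to Unsloth's `FastLanguageModel.get_peft_model` after loading the model with `full_finetuning=False`.

```python
# Illustrative LoRA hyperparameters (assumed values, not the notebook's config).
# These would be passed as keyword arguments to
# FastLanguageModel.get_peft_model(model, **lora_config).
lora_config = {
    "r": 16,              # rank of the low-rank adapter matrices
    "lora_alpha": 32,     # scaling factor, commonly set to 2 * r
    "lora_dropout": 0.0,  # Unsloth is fastest with dropout disabled
    "target_modules": [   # attention and MLP projection layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}
```

With a rank this small, the trainable adapters amount to only a few percent of the model's parameters, which is where LoRA's memory savings come from.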
NeMo Gym resources server setup
NeMo Gym resources servers provide tool implementations and the logic to process actions, update state, provide observations, and calculate rewards for actions taken.
The reasoning gym resource server is an integration of reasoning gym, which is a library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). It includes more than 100 tasks over many domains with configurable difficulty, including but not limited to algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games.
The instruction following resources server evaluates language model responses against instruction-following criteria using Open-Instruct and IFEval constraints.
If you are using Google Colab, add the flag uv_pip_set_python=true to the ng_run command.
The cell below will automatically:
- Clone NeMo Gym (requires Python 3.12+ and uv on the system)
- Set up the virtual environment and install dependencies
- Create the mini sudoku training dataset
- Download the instruction following dataset
- Start both resources servers in the background
Google Colab is auto-detected and the uv_pip_set_python=true flag is added when needed.
Note: The instruction_following config also starts an agent server alongside the resources server. The agent server is unused during training and can be ignored.
NeMo Gym starts a head server on port 11000 by default, and each resources server's port is selected at random from available ports, unless specified otherwise. We can automatically extract the resources servers' ports using the head server:
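The lookup can be sketched as a small helper. The payload shape below (a "servers" list with "name" and "port" fields) and the /status endpoint are assumptions for illustration, not NeMo Gym's actual head-server API; adapt the parsing to the real response.

```python
def extract_server_ports(status: dict) -> dict:
    """Map each resources server's name to its port.

    Assumes a hypothetical head-server payload shaped like:
      {"servers": [{"name": "reasoning_gym", "port": 12345}, ...]}
    """
    return {s["name"]: s["port"] for s in status.get("servers", [])}

# In the notebook you would fetch this from the head server, e.g. with
#   urllib.request.urlopen("http://localhost:11000/status")   # endpoint is an assumption
status = {
    "servers": [
        {"name": "reasoning_gym", "port": 12345},
        {"name": "instruction_following", "port": 12346},
    ]
}
ports = extract_server_ports(status)
print(ports["reasoning_gym"])  # 12345
```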
Dataset prep
Next, let's create and load the dataset. We can generate a mini sudoku dataset using the create script in NeMo Gym, which uses the reasoning gym library.
Both datasets were created automatically by the setup cell above. Now load them! We will limit each dataset to 1000 samples to create an even task distribution.
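The balancing step amounts to truncating each dataset and mixing the results. A minimal stdlib sketch (the "prompt" and "task" field names are placeholders, not the datasets' real schema):

```python
import random

def balance_and_mix(datasets: dict, n_per_task: int = 1000, seed: int = 0) -> list:
    """Truncate each task's dataset to n_per_task samples, tag every sample
    with its task name, and shuffle the combined result."""
    mixed = []
    for task_name, samples in datasets.items():
        for sample in samples[:n_per_task]:
            mixed.append({"task": task_name, **sample})
    random.Random(seed).shuffle(mixed)
    return mixed

# Synthetic stand-ins for the two loaded datasets.
sudoku = [{"prompt": f"sudoku {i}"} for i in range(1500)]
ifeval = [{"prompt": f"instruction {i}"} for i in range(1200)]
train = balance_and_mix({"mini_sudoku": sudoku, "instruction_following": ifeval})
print(len(train))  # 2000
```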
Define reward function
Now let's create a reward function that uses NeMo Gym's verifiers, routing tasks to resources servers using the server names:
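Conceptually, the reward function dispatches each completion to the verifier of the resources server that produced the sample. The sketch below uses stand-in verifier callables keyed by server name and an assumed "task" field on each sample; the real version would call NeMo Gym's verification endpoints instead.

```python
def make_reward_fn(verifiers: dict):
    """Build a reward function that routes each completion to the verifier
    registered under the sample's server name.

    `verifiers` maps a server name to a callable taking (completion, sample)
    and returning a float reward.
    """
    def reward_fn(completions, samples):
        rewards = []
        for completion, sample in zip(completions, samples):
            verify = verifiers[sample["task"]]  # "task" key is an assumed schema
            rewards.append(float(verify(completion, sample)))
        return rewards
    return reward_fn

# Stand-in verifiers for illustration only; NeMo Gym's servers do the real scoring.
verifiers = {
    "mini_sudoku": lambda c, s: 1.0 if c.strip() == s["answer"] else 0.0,
    "instruction_following": lambda c, s: 1.0 if s["keyword"] in c else 0.0,
}
reward_fn = make_reward_fn(verifiers)
print(reward_fn(
    ["1234", "hello world"],
    [{"task": "mini_sudoku", "answer": "1234"},
     {"task": "instruction_following", "keyword": "world"}],
))  # [1.0, 1.0]
```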
Configure and launch GRPO
In this example, we will train the model using Group Relative Policy Optimization (GRPO), an efficient and effective reinforcement learning algorithm. Unsloth also supports GSPO, DAPO, Dr. GRPO, and more!
Below we set training hyperparameters. We will train for 100 steps and see significant improvements in the model's performance at completing both mini sudoku and diverse instruction-following tasks!
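For reference, a GRPO training configuration along these lines might look like the dict below. All values are illustrative assumptions rather than the notebook's exact settings; in practice they would populate the GRPO trainer's arguments.

```python
# Illustrative GRPO hyperparameters (assumed values, not the notebook's config).
grpo_config = {
    "max_steps": 100,               # the short run described above
    "num_generations": 4,           # completions sampled per prompt (the GRPO "group")
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "max_completion_length": 1024,  # lower this on smaller GPUs
    "temperature": 1.0,             # keep sampling diverse so advantages vary within a group
}
```

GRPO computes advantages relative to the group of completions for each prompt, so num_generations must be at least 2; more generations give a lower-variance baseline at the cost of extra sampling.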
During training, you should see the average reward increase as the model learns to complete sudoku and instruction-following tasks better, though the curve will be noisy because the tasks are diverse. To monitor model improvements more closely, an evaluation dataset can be added to training.
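Because step-level rewards are noisy across mixed tasks, a running mean per task makes the trend easier to read. A small stdlib helper (how you hook it into the trainer's logging is up to you):

```python
from collections import defaultdict

class RewardTracker:
    """Keep a running mean reward per task to smooth noisy step-level rewards."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, task: str, reward: float) -> None:
        self.totals[task] += reward
        self.counts[task] += 1

    def mean(self, task: str) -> float:
        return self.totals[task] / self.counts[task] if self.counts[task] else 0.0

tracker = RewardTracker()
for r in [0.0, 1.0, 1.0, 0.0]:
    tracker.update("mini_sudoku", r)
print(tracker.mean("mini_sudoku"))  # 0.5
```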
Let's start training!
Test the trained model!
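One quick manual check is to prompt the model with a held-out mini sudoku and validate the completed grid. The sketch below assumes the answer is rendered as four lines of four digits; the authoritative scoring lives in the reasoning gym resources server, so this is only a convenience check.

```python
def parse_grid(text: str) -> list:
    """Parse a 4x4 mini sudoku grid of digits from model output.
    Assumes the final answer is rendered as four lines of four digits."""
    rows = []
    for line in text.strip().splitlines():
        digits = [int(ch) for ch in line.split() if ch.isdigit()]
        if len(digits) == 4:
            rows.append(digits)
    return rows[-4:]  # keep the last grid, in case the model reasons first

def is_valid_mini_sudoku(grid: list) -> bool:
    """Check every row, column, and 2x2 box contains 1-4 exactly once."""
    if len(grid) != 4:
        return False
    target = {1, 2, 3, 4}
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all({grid[r][c] for r in range(4)} == target for c in range(4))
    boxes_ok = all(
        {grid[r + dr][c + dc] for dr in range(2) for dc in range(2)} == target
        for r in (0, 2) for c in (0, 2)
    )
    return rows_ok and cols_ok and boxes_ok

model_answer = "1 2 3 4\n3 4 1 2\n2 1 4 3\n4 3 2 1"
print(is_valid_mini_sudoku(parse_grid(model_answer)))  # True
```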
Saving to float16 or MXFP4 for vLLM
Unsloth supports saving to float16 directly: select merged_16bit for float16. Unsloth also supports saving in low or mixed precision formats such as mxfp4, and allows saving LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens. See our docs for more deployment options.
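The options above can be summarized as a small lookup. The save_method strings (merged_16bit, mxfp4, lora) come from the text; the dict itself and the commented calls are illustrative glue, not a prescribed API usage.

```python
# Illustrative mapping from deployment goal to the save_method string
# described above; the dict is just a convenience for this sketch.
save_methods = {
    "vllm_float16": "merged_16bit",  # full merged weights in float16
    "low_precision": "mxfp4",        # mixed/low precision export
    "adapters_only": "lora",         # fall back to saving only LoRA adapters
}

# e.g. (assuming an Unsloth-trained `model` and `tokenizer` are in scope):
# model.save_pretrained_merged("outputs", tokenizer,
#                              save_method=save_methods["vllm_float16"])
# model.push_to_hub_merged("your-username/your-model", tokenizer,
#                          save_method="merged_16bit", token="<your HF token>")
```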
And we're done! If you have any questions about Unsloth, find bugs, want to keep up with the latest LLM developments, or need help with projects, feel free to join our Discord!
Some other resources:
- Train your own reasoning model - Llama GRPO notebook Free Colab
- Saving finetunes to Ollama. Free notebook
- Llama 3.2 Vision finetuning - Radiography use case. Free Colab
- See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our documentation!


