Unsloth OpenEnv Gpt Oss (20B) Reinforcement Learning 2048 Game BF16

OpenEnv Gpt Oss (20B) Reinforcement Learning 2048 Game BF16

unsloth-notebooksunslothnb

alph-notebooks/unsloth-notebooks / OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game_BF16.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

OpenEnv: Agentic Execution Environments

We're using the new OpenEnv library which has over 2000+ environments for RL!

Join Discord if you need help + ⭐ Star us on Github ⭐

To install Unsloth on your local device, follow our guide.

Goal: Make gpt-oss play games with Reinforcement Learning

Our goal is to make OpenAI's open model gpt-oss 20b play the 2048 game with reinforcement learning. We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose.

Installation

We'll be using Unsloth to do RL on GPT-OSS 20B, and OpenEnv for the environment interactions. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!

[ ]

We will then install OpenEnv from source:

[ ]

We'll load GPT-OSS 20B and set some parameters:

max_seq_length = 768 The maximum context length of the model. Increasing it will use more memory.
lora_rank = 4 The larger this number, the smarter the RL process, but the slower and more memory usageload_in_16bit will be faster but will need a 64GB GPU or more (MI300)
offload_embedding = True New Unsloth optimization which moves the embedding to CPU RAM, reducing VRAM by 1GB.

[ ]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.

WARNING:torchao:Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu126 for torchao version 0.14.1             Please see https://github.com/pytorch/ao/issues/2919 for more info

🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.10: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00009.safetensors:   0%|          | 0.00/4.50G [00:00<?, ?B/s]

model-00002-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00003-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00004-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00005-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00006-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00007-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00008-of-00009.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00009-of-00009.safetensors:   0%|          | 0.00/2.75G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/440 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

To do efficient RL, we will use LoRA, which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.

[ ]

Unsloth: Making `model.base_model.model.model` require gradients

2048 game environment with OpenEnv

We first launch an OpenEnv process and import it! This will allows us to see how the 2048 implementation looks like!

[ ]

We'll be using Unsloth's OpenEnv implementation and wrapping the launch_openenv with some setup arguments:

[ ]

Let's see how the current 2048 game state looks like:

[ ]

Unsloth: Creating new OpenEnv process at port = 12724.........

OpenSpielObservation(done=False, reward=None, metadata={}, info_state=[2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], legal_actions=[1, 2, 3], game_phase='initial', current_player_id=0, opponent_last_action=None)

First let's convert the state into a list of list of numbers!

[ ]

([[2, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 4)

We also want to pretty print the game board!

[ ]

┌───┬───┬───┬───┐
│  2│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘

We can see the legal_actions ie what you can take as [0, 1, 2, 3] Let's try doing the action 0.

[ ]

┌───┬───┬───┬───┐
│  2│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘

So it looks like 0 is a move up action! Let's try 1.

[ ]

┌───┬───┬───┬───┐
│  .│  .│  .│  4│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘

1 is a move right action. And 2:

[ ]

┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  4│
└───┴───┴───┴───┘

2 is a move down. And I guess 3 is just move left!

[ ]

┌───┬───┬───┬───┐
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  2│  4│  .│  .│
└───┴───┴───┴───┘

We can also print the game status which indicates if no more moves are possible, and also the possible actions you can take!

[ ]

False
[0, 1, 2, 3]

RL Environment Setup

We'll set up a function to accept some strategy that'll emit an action within 0123 and check the game state.

We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!

[ ]

Let's make a generic strategy to just hit 3. We should expect this generic strategy to fail:

[ ]

(3, False)

To allow longer strategies for GPT-OSS Reinforcement Learning, we shall allow a 5 second timer.

[ ]

Code Execution

To execute and create a new Python function, we first have to check if the function does not call other global variables or cheat. This is called countering reward hacking since we don't want the function to cheat.

For example the below piece of code is fine, since it only imports Python level functions. We use check_python_modules:

[ ]

Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}

For the below piece of code, since we import numpy, we should not allow the execution:

[ ]

Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}

We also disallow global variable access. We'll use Unsloth's create_locked_down_function function

[ ]

name 'np' is not defined

[ ]

Data & RL task setup

We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.

[ ]

Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.

First, let's prompt GPT-OSS without RL and see how it goes:

[ ]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-27

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>Need simple heuristic: pick move that maximizes merged tiles? use quick scan.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    '''Return a move that merges tiles or keeps board unchanged.'''
    import copy, itertools
    def shift_line(line, rev=False):
        if rev: line = line[::-1]
        new = [x for x in line if x]
        merged = []
        i = 0
        while i < len(new):
            if i+1 < len(new) and new[i]==new[i+1]:
                merged.append(new[i]*2); i+=2
            else:
                merged.append(new[i]); i+=1
        merged += [0]*(4-len(merged))
        if rev: merged = merged[::-1]
        return merged
    def play(b, dir):
        if dir==0:   # up
            nb=[[0]*4 for _ in range(4)]
            for c in range(4):
                col=[b[r][c] for r in range(4)]
                nb=[row+[0] for row in nb]
                for r in range(4): nb[r][c]=shift_line(col)[r]
            return nb
        if dir==1:   # right
            nb=[[shift_line(row,True)[::-1] for row in b]]
        if dir==2:   # down
            nb=[[0]*4 for _ in range(4)]
            for c in range(4):
                col=[b[r][c] for r in range(4)][::-1]
                merged=shift_line(col)
                for r in range(4): nb[3-r][c]=merged[r]
            return nb
        if dir==3:   # left
            nb=[[shift_line(row)[::-1] for row in b]]
        return nb
    def score(b):
        s=0
        for r in range(4):
            for c in range(4):
                v=b[r][c]
                if v>0: s+=v*(1 if v==2 else 2)
        return s
    best, best_d=score(board),None
    for d in range(4):
        nb=play(board,d)
        if nb!=board:
            if score(nb)>best:
                best, best_d=score(nb), d
    return str(best_d if best_d is not None else 0)

Reward functions

We now design a extract_function function which simply extracts the function wrapped in 3 back ticks.

And 3 reward functions:

function_works which rewards the model if the strategy is a valid Python function.
no_cheating which checks if the function imported other modules, and if it did, we penalize it.
strategy_succeeds which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.

[ ]

def strategy(board):
    return "0" # Example

Below is our function_works reward function which uses Python's exec but guarded by not allowing leakage of local and global variables. We can also use check_python_modules first to check if there are errors before even executing the function:

[ ]

(False,
, {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
,  'stdlib': [],
,  'non_stdlib': [],
,  'relative_imports': 0})

[ ]

no_cheating checks if the function cheated since it might have imported Numpy or other functions:

[ ]

Next strategy_succeeds checks if the strategy actually allows the game to terminate. Imagine if the strategy simply returned "0" which would fail after a time limit of 10 seconds.

We also add a global PRINTER to print out the strategy and board state.

[ ]

We'll now create the dataset which includes a replica of our prompt. Remember to add a reasoning effort of low! You can choose high reasoning mode, but this'll only work on more memory GPUs like MI300s.

[ ]

{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "0", "1", "2", "3" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n    return "0" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
,   'role': 'user'}],
, 'answer': 0,
, 'reasoning_effort': 'low'}

Train the model

Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go the Unsloth Reinforcement Learning Docs for more options.

We're also using TrackIO which allows you to visualize all training metrics straight inside the notebook fully locally!

[ ]

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step	reward	reward_std	completion_length	kl
1	0.125000	0.000000	200.000000	0.000000
2	0.072375	0.248112	200.000000	0.000000
3	-0.079000	0.163776	182.500000	0.000005

[ ]

And let's train the model! NOTE This might be quite slow! 600 steps takes ~5 hours or longer.

TrackIO might be a bit slow to load - wait 2 minutes until the graphs pop up!

[ ]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)

* Running on public URL: https://e870a3fed110b90ddb.gradio.live
* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio

`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.

* Created new run: dainty-sunset-0
def strategy(board):
    '''Return a move that merges the most tiles (0=up,1=right,2=down,3=left).'''
    size = 4
    moves = [0,1,2,3]
    best = 0
    bestmove = 0
    for m in moves:
        merged = 0
        for i in range(size):
            line = []
            for j in range(size):
                if m == 0: v = board[j][i]
                elif m == 1: v = board[i][size-1-j]
                elif m == 2: v = board[size-1-j][i]
                else: v = board[i][j]
                if v: line.append(v)
            merged_this = 0
            j = 0
            while j < len(line)-1:
                if line[j] == line[j+1]:
                    merged_this += 1
                    j += 2
                else:
                    j += 1
            merged += merged_this
        if merged > best:
            best, bestmove = merged, m
    return str(bestmove)
Steps = 24 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
└───┴───┴───┴───┘
Steps = 13 If Done = False
def strategy(board):
    # Move mapping: 0=up,1=down,2=left,3=right
    def swipe(b,dir):
        n=len(b);m=len(b[0]);res=[[0]*m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                r=0 if dir==3 else (n-1 if dir==1 else 0 if dir==0 else m-1)
                cell=b[i][j]
                if cell==0: continue
                if dir==2: # left
                    pos=0
                    while pos+1<m and res[i][pos+1]==0: pos+=1
                elif dir==3: # right
                    pos=m-1
                    while pos-1>=0 and res[i][pos-1]==0: pos-=1
                elif dir==0: # up
                    pos=0
                    while pos+1<n and res[pos+1][j]==0: pos+=1
                else: # down
                    pos=n-1
                    while pos-1>=0 and res[pos-1][j]==0: pos-=1
                # merge
                if dir==2 and pos>0 and res[i][pos-1]==cell:
                    res[i][pos-1]*=2; pos=pos-1
                elif dir==3 and pos<m-1 and res[i][pos+1]==cell:
                    res[i][pos+1]*=2; pos=pos+1
                elif dir==0 and pos>0 and res[pos-1][j]==cell:
                    res[pos-1][j]*=2; pos=pos-1
                else: pos=pos
                if dir==2: res[i][pos]=cell
                elif dir==3: res[i][pos]=cell
                elif dir==0: res[pos][j]=cell
                else: res[pos][j]=cell
        return res
    def score(b):
        s=0
        for row in b:
            s+=sum(x for x in row if x>0)
        return s
    best=-1;best_dir='0'
    for d in range(4):
        nb=swipe(board,d)
        if nb!=board:
            val=score(nb)
            if val>best:
                best=val;best_dir=str(d)
    return best_dir
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘

Unsloth: Will smartly offload gradients to save VRAM!
Steps = 4 If Done = False
def strategy(board):
    def apply(b, d):
        n=len(b); res=[[0]*n for _ in range(n)]
        for i in range(n):
            if d==0: r=[b[j][i] for j in range(n)][::-1]
            elif d==1: r=[b[n-1-j][i] for j in range(n)]
            elif d==2: r=[b[i][j] for j in range(n)][::-1]
            else: r=[b[i][n-1-j] for j in range(n)]
            merged=[]
            skip=False
            for v in r:
                if v:
                    if not merged or merged[-1]!=v:
                        merged.append(v)
                    else:
                        merged[-1]*=2; merged.append(0)
            merged=[x for x in merged if x]
            merged+= [0]*(n-len(merged))
            if d==0:
                for j,v in enumerate(merged): res[n-1-j][i]=v
            elif d==1:
                for j,v in enumerate(merged): res[j][i]=v
            elif d==2:
                for j,v in enumerate(merged): res[i][j]=v
            else:
                for j,v in enumerate(merged): res[i][n-1-j]=v
        return res
    best=-1; bestm=0
    for m in range(4):
        nb=apply(board,m)
        mx=max(max(row) for row in nb)
        if mx>best:
            best=mx; bestm=m
    return str(bestm)  # 0-up,1-down,2-left,3-right
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
Exception = 'int' object is not iterable
Steps = 2 If Done = False
def strategy(board):
    def compress(row):
        new=[x for x in row if x>0]
        res=[]
        skip=False
        for i,v in enumerate(new):
            if skip: skip=False; continue
            if i+1<len(new) and new[i]==new[i+1]:
                res.append(v*2); skip=True
            else: res.append(v)
        return res+[0]*(4-len(res))
    def move(b,dir):
        nb=[[b[r][c] for r in range(4)] for c in range(4)] if dir==2 else b
        if dir==1: nb=list(reversed(nb))
        res=[[0]*4 for _ in range(4)]
        for r in range(4):
            res[r]=compress(nb[r])
        if dir==1: res=list(reversed(res))
        if dir==3: res=list(zip(*res[::-1]))
        if dir==2: res=list(zip(*res))
        return [list(row) for row in res]
    scores=[]
    for d in "0123":
        nb=move(board, int(d))
        scores.append((sum(sum(row) for row in nb), d))
    return max(scores)[1]
┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    import random, itertools
    # helper to copy and shift board
    def shift(b, dir):
        n=4; res=[[0]*n for _ in range(n)]
        for i in range(n):
            line = [b[i][j] if dir%2==0 else b[j][i] for j in range(n)]
            if dir%2==1: line.reverse()
            new=[x for x in line if x]
            merged=[]
            j=0
            while j<len(new):
                if j+1<len(new) and new[j]==new[j+1]:
                    merged.append(new[j]*2); j+=2
                else:
                    merged.append(new[j]); j+=1
            merged += [0]*(n-len(merged))
            if dir%2==1: merged.reverse()
            for j,x in enumerate(merged):
                if dir%2==0: res[i][j]=x
                else: res[j][i]=x
        return res
    def empty(b): return sum(row.count(0) for row in b)
    moves="0123"
    best=None; bestv=-1
    for m in moves:
        nb=shift(board,int(m))
        v=empty(nb)
        if v>bestv: bestv, best=m, v
    return best if best is not None else random.choice(moves)
Exception = '>' not supported between instances of 'int' and 'str'
Steps = 2 If Done = False
def strategy(board):
    import functools
    # SCORE: sum of board weighted by position (desc to keep high tile in corner)
    def score(b):
        w = [[16,8,4,2],[8,4,2,1],[4,2,1,0.5],[2,1,0.5,0.25]]
        return sum(b[i][j]*w[i][j] for i in range(4) for j in range(4))
    # Apply move
    def move(b, d):
        def compress(row):          # slide and merge
            new=[x for x in row if x]
            merged=[]
            i=0
            while i<len(new):
                if i+1<len(new) and new[i]==new[i+1]:
                    merged.append(new[i]*2); i+=2
                else:
                    merged.append(new[i]); i+=1
            return merged+[0]*(4-len(merged))
        rot = {0:lambda x:x,
               1:lambda x: list(zip(*x[::-1])),
               2:lambda x: [row[::-1] for row in x[::-1]],
               3:lambda x: [list(row) for row in zip(*x)][::-1]}
        inv = {0:lambda x:x,
               1:lambda x: [list(row) for row in zip(*x)][::-1],
               2:lambda x: [row[::-1] for row in x[::-1]],
               3:lambda x: list(zip(*x[::-1]))}
        r = rot[d](b)
        moved = [compress(row) for row in r]
        return inv[d](moved)
    best = -1; best_d=0
    for d in range(4):
        nb = move(board,d)
        s = score(nb)
        if s>best: best,best_d=s,d
    return str(best_d)
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘
Steps = 14 If Done = False
def strategy(board):
    def move(b,dir):
        n=len(b); res=[row[:] for row in b]
        delta=[(0,1),(1,0),(0,-1),(-1,0)]
        dr,dc=delta[dir]
        lined=[]
        for i in range(n):
            line=[]
            for j in range(n):
                r,c=(i+dr*j)%n,(j+dc*i)%n
                if b[r][c]!=0: line.append(b[r][c])
            merged=[]
            skip=False
            for k,x in enumerate(line):
                if skip: skip=False; continue
                if k+1<len(line) and line[k+1]==x:
                    merged.append(x*2); skip=True
                else: merged.append(x)
            merged+= [0]*(n-len(merged))
            for j,v in enumerate(merged):
                r,c=(i+dr*j)%n,(j+dc*i)%n
                res[r][c]=v
        return res
    for d in range(4):
        if move(board,d)!=board:
            return str(d)
    return "0"
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  2│
└───┴───┴───┴───┘
Steps = 8 If Done = False
def strategy(board):
    # 0: up, 1: down, 2: left, 3: right
    N = len(board)
    def can_move(dir):
        if dir==0: # up
            for c in range(N):
                for r in range(1,N):
                    if board[r][c]!=0 and (board[r-1][c]==0 or board[r-1][c]==board[r][c]): return True
        elif dir==1: # down
            for c in range(N):
                for r in range(N-2,-1,-1):
                    if board[r][c]!=0 and (board[r+1][c]==0 or board[r+1][c]==board[r][c]): return True
        elif dir==2: # left
            for r in range(N):
                for c in range(1,N):
                    if board[r][c]!=0 and (board[r][c-1]==0 or board[r][c-1]==board[r][c]): return True
        else: # right
            for r in range(N):
                for c in range(N-2,-1,-1):
                    if board[r][c]!=0 and (board[r][c+1]==0 or board[r][c+1]==board[r][c]): return True
        return False
    for d in range(4):
        if can_move(d):
            return str(d)
    return "0"
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  2│
└───┴───┴───┴───┘
None
Steps = 5 If Done = False
def strategy(board):
    import copy, random
    # Meld canned functions
    def rotate(b):
        return [list(x)[::-1] for x in zip(*b)]
    def move(b, dir):
        if dir==1:
            b=rotate(b)
        elif dir==2:
            b=[row[::-1] for row in b]
        elif dir==3:
            b=[row[::-1] for row in zip(*b[::-1])]
        new=[ [0]*4 for _ in range(4) ]
        for r in range(4):
            tmp=[x for x in b[r] if x]
            merged=[]
            i=0
            while i<len(tmp):
                if i+1<len(tmp) and tmp[i]==tmp[i+1]:
                    merged.append(tmp[i]*2); i+=2
                else:
                    merged.append(tmp[i]); i+=1
            for c,val in enumerate(merged):
                new[r][c]=val
        # rotate back
        if dir==1:
            new=rotate(rotate(rotate(new)))
        elif dir==2:
            new=[row[::-1] for row in new]
        elif dir==3:
            new=[row[::-1] for row in zip(*new[::-1])]
        return new
    # evaluate board
    def score(b):
        s=0
        for r in range(4):
            for c in range(4):
                v=b[r][c]
                if v:
                    # favor higher values and empties
                    s+=v* (1 if c==0 else 0)  # simple heuristic
        return s
    best=0; bests=[-1]*4
    for d in range(4):
        nb=move(board,d)
        if nb!=board:
            bests[d]=score(nb)
    return str(bests.index(max(bests))+0)  # return 0-3
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘

Inference

Now let's try the model we just trained!

[ ]

Saving to float16 or `MXFP4`

We also support saving to float16 directly. Select merged_16bit for float16 or mxfp4 for MXFP4 (OpenAI's GPT-OSS native precision). We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

[ ]

And we're done!

Congratulations you just learned how to do reinforcement learning with GPT-OSS! There were some advanced topics explained in this notebook - to learn more about GPT-OSS and RL, there are more docs in Unsloth's Reinforcement Learning Guide with GPT-OSS

This notebook and all Unsloth notebooks are licensed LGPL-3.0.