OpenEnv Gpt Oss (20B) Reinforcement Learning 2048 Game BF16
Goal: Make gpt-oss play games with Reinforcement Learning
Our goal is to make OpenAI's open-weight model gpt-oss-20b play the game 2048 with reinforcement learning. We want the model to devise a strategy for 2048, then run that strategy until the game is won or lost.
We will then install OpenEnv from source:
We'll load GPT-OSS 20B and set some parameters:
- `max_seq_length = 768` — the maximum context length of the model. Increasing it uses more memory.
- `lora_rank = 4` — the larger this number, the smarter the RL process, but the slower it runs and the more memory it uses.
- `load_in_16bit` — faster, but needs a 64GB+ GPU (e.g. an MI300).
- `offload_embedding = True` — a new Unsloth optimization which moves the embeddings to CPU RAM, reducing VRAM usage by about 1GB.
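Assembled into a loading call, this might look as follows (a sketch only — the checkpoint name and exact Unsloth keyword names are assumptions based on the parameters above):

```python
from unsloth import FastLanguageModel

max_seq_length = 768  # maximum context length; increasing it uses more memory
lora_rank = 4         # higher rank = smarter RL, but slower and more memory

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-BF16",  # assumed checkpoint name
    max_seq_length = max_seq_length,
    load_in_16bit = True,       # faster, but needs a 64GB+ GPU (e.g. MI300)
    offload_embedding = True,   # move embeddings to CPU RAM, saving ~1GB VRAM
)
```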
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
WARNING:torchao:Skipping import of cpp extensions due to incompatible torch version 2.8.0+cu126 for torchao version 0.14.1 Please see https://github.com/pytorch/ao/issues/2919 for more info
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.10: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
To do efficient RL, we use LoRA, which adds only 1 to 5% extra trainable weights to the model for finetuning. This cuts memory usage by over 60% while retaining good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.
Unsloth: Making `model.base_model.model.model` require gradients
2048 game environment with OpenEnv
We first launch an OpenEnv process and import it! This lets us see what the 2048 implementation looks like!
We'll be using Unsloth's OpenEnv implementation and wrapping the launch_openenv with some setup arguments:
Let's see what the current 2048 game state looks like:
Unsloth: Creating new OpenEnv process at port = 12724.........
OpenSpielObservation(done=False, reward=None, metadata={}, info_state=[2.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], legal_actions=[1, 2, 3], game_phase='initial', current_player_id=0, opponent_last_action=None)

First, let's convert the state into a list of lists of numbers!
([[2, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 4)
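The reshape can be done with plain Python. A minimal sketch (assuming the flat `info_state` is read row by row):

```python
def state_to_board(info_state, size=4):
    # info_state is a flat list of 16 floats; cast to int and chunk into rows
    values = [int(v) for v in info_state]
    return [values[i * size:(i + 1) * size] for i in range(size)]

flat = [2.0, 0.0, 2.0, 0.0] + [0.0] * 12
print(state_to_board(flat))
# [[2, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```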
We also want to pretty print the game board!
┌───┬───┬───┬───┐
│ 2│ .│ 2│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
└───┴───┴───┴───┘
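A pretty-printer along these lines could produce the grid above (a sketch, not the notebook's exact helper):

```python
def render_board(board):
    # draw a 4x4 grid with box-drawing characters; 0 renders as '.'
    top = "┌───" + "┬───" * 3 + "┐"
    mid = "├───" + "┼───" * 3 + "┤"
    bot = "└───" + "┴───" * 3 + "┘"
    lines = [top]
    for i, row in enumerate(board):
        cells = "".join(f"│{str(v) if v else '.':>3}" for v in row) + "│"
        lines.append(cells)
        lines.append(mid if i < 3 else bot)
    return "\n".join(lines)

print(render_board([[2, 0, 2, 0], [0]*4, [0]*4, [0]*4]))
```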
We can see legal_actions, i.e. the moves we can take, listed as [0, 1, 2, 3]. Let's try action 0.
┌───┬───┬───┬───┐
│ 2│ .│ 2│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
└───┴───┴───┴───┘
So it looks like 0 is a move up action! Let's try 1.
┌───┬───┬───┬───┐
│ .│ .│ .│ 4│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
└───┴───┴───┴───┘
1 is a move right action. And 2:
┌───┬───┬───┬───┐
│ .│ .│ .│ 2│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ 4│
└───┴───┴───┴───┘
2 is a move down. And I guess 3 is just move left!
┌───┬───┬───┬───┐
│ 2│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ 2│ .│
├───┼───┼───┼───┤
│ 2│ 4│ .│ .│
└───┴───┴───┴───┘
We can also print the game status which indicates if no more moves are possible, and also the possible actions you can take!
False [0, 1, 2, 3]
RL Environment Setup
We'll set up a function that accepts some strategy, which emits an action among "0", "1", "2", "3", and then checks the game state.
We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
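One way to sketch the timed execution is with a worker thread (the notebook's actual helper may differ, e.g. in what it returns):

```python
import threading

def run_strategy(strategy, board, timeout=2.0):
    # run strategy(board) in a thread; give up after `timeout` seconds
    result = {}
    def worker():
        try:
            result["action"] = strategy(board)
        except Exception as e:
            result["error"] = e
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if "action" not in result:
        return None, False   # timed out or raised an exception
    return result["action"], True

board = [[2, 0, 2, 0], [0]*4, [0]*4, [0]*4]
print(run_strategy(lambda b: "3", board))  # ('3', True)
```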
Let's make a trivial strategy that always plays 3. We should expect this strategy to fail:
(3, False)
To allow longer strategies for GPT-OSS Reinforcement Learning, we shall allow a 5 second timer.
Code Execution
Before executing a newly generated Python function, we first check that it does not access global variables or otherwise cheat. This is called countering reward hacking: we don't want the strategy to win by cheating rather than by playing well.
For example, the piece of code below is fine, since it only imports Python standard-library modules. We use check_python_modules:
Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}
The piece of code below imports numpy, so we should not allow it to execute:
Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}
We also disallow global variable access. We'll use Unsloth's create_locked_down_function function:
name 'np' is not defined
60
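The locking idea can be sketched with exec and a fresh namespace, so the generated function cannot see notebook globals like np (again, a simplification of Unsloth's create_locked_down_function):

```python
def create_locked_function(code, name="strategy"):
    # exec the source in a fresh namespace so the generated function
    # cannot see any of the notebook's globals (a simplified lock-down)
    namespace = {}
    exec(code, namespace)
    return namespace[name]

good = "def strategy(board):\n    return sum(sum(row) for row in board)"
bad  = "def strategy(board):\n    return np.sum(board)"  # relies on an outside 'np'

f = create_locked_function(good)
print(f([[2, 0], [2, 56]]))  # 60

try:
    create_locked_function(bad)([[2, 0], [2, 56]])
except NameError as e:
    print(e)  # name 'np' is not defined
```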
Data & RL task setup
We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
First, let's prompt GPT-OSS without RL and see how it goes:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-27
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>Need simple heuristic: pick move that maximizes merged tiles? use quick scan.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
'''Return a move that merges tiles or keeps board unchanged.'''
import copy, itertools
def shift_line(line, rev=False):
if rev: line = line[::-1]
new = [x for x in line if x]
merged = []
i = 0
while i < len(new):
if i+1 < len(new) and new[i]==new[i+1]:
merged.append(new[i]*2); i+=2
else:
merged.append(new[i]); i+=1
merged += [0]*(4-len(merged))
if rev: merged = merged[::-1]
return merged
def play(b, dir):
if dir==0: # up
nb=[[0]*4 for _ in range(4)]
for c in range(4):
col=[b[r][c] for r in range(4)]
nb=[row+[0] for row in nb]
for r in range(4): nb[r][c]=shift_line(col)[r]
return nb
if dir==1: # right
nb=[[shift_line(row,True)[::-1] for row in b]]
if dir==2: # down
nb=[[0]*4 for _ in range(4)]
for c in range(4):
col=[b[r][c] for r in range(4)][::-1]
merged=shift_line(col)
for r in range(4): nb[3-r][c]=merged[r]
return nb
if dir==3: # left
nb=[[shift_line(row)[::-1] for row in b]]
return nb
def score(b):
s=0
for r in range(4):
for c in range(4):
v=b[r][c]
if v>0: s+=v*(1 if v==2 else 2)
return s
best, best_d=score(board),None
for d in range(4):
nb=play(board,d)
if nb!=board:
if score(nb)>best:
best, best_d=score(nb), d
    return str(best_d if best_d is not None else 0)
```
Reward functions
We now design an extract_function function, which simply extracts the function wrapped in 3 backticks.
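A minimal sketch of such an extractor, using a regex over triple-backtick blocks (a hypothetical implementation, not the notebook's exact code):

```python
import re

def extract_function(text):
    # grab the last ``` ... ``` block (optionally tagged ```python) from the completion
    matches = re.findall(r"```(?:python)?\s*\n(.*?)```", text, flags=re.DOTALL)
    return matches[-1].rstrip() if matches else None

completion = 'Here you go:\n```python\ndef strategy(board):\n    return "0"\n```'
print(extract_function(completion))
```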
And 3 reward functions:
- `function_works` — rewards the model if the strategy is a valid Python function.
- `no_cheating` — checks whether the function imported other modules; if it did, we penalize it.
- `strategy_succeeds` — checks whether the game strategy actually succeeds in reaching 2048 after running the auto-generated strategy.
def strategy(board):
return "0" # Example
Below is our function_works reward function, which uses Python's exec, guarded so that local and global variables cannot leak. We can also run check_python_modules first to catch errors before even executing the function:
(False,
 {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
  'stdlib': [],
  'non_stdlib': [],
  'relative_imports': 0})

no_cheating checks whether the function cheated, since it might have imported numpy or other modules:
Next, strategy_succeeds checks whether the strategy actually lets the game terminate. Imagine a strategy that simply returned "0" every time: it would make no progress and would be cut off by the 10-second time limit.
We also add a global PRINTER to print out the strategy and board state.
We'll now create the dataset, which consists of replicas of our prompt. Remember to set the reasoning effort to low! You can choose high reasoning mode, but this will only work on GPUs with more memory, such as the MI300.
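Concretely, the dataset can be built by replicating the prompt row (a sketch; the row fields follow the printed example, and wrapping with datasets.Dataset is an assumption):

```python
PROMPT = (
    'Create a new short 2048 strategy using only native Python code.\n'
    'You are given a list of list of numbers for the current board state.\n'
    'Output one action for "0", "1", "2", "3" on what is the optimal next step.\n'
    'Output your new short function in backticks using the format below:\n'
    '```python\ndef strategy(board):\n    return "0" # Example\n```\n'
    'All helper functions should be inside def strategy. Only output the short function `strategy`.'
)

# replicate the prompt row; 'reasoning_effort' is set to "low"
rows = [
    {"prompt": [{"role": "user", "content": PROMPT}],
     "answer": 0,
     "reasoning_effort": "low"}
    for _ in range(181)
]
print(len(rows))  # 181
# then e.g.: from datasets import Dataset; dataset = Dataset.from_list(rows)
```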
181
{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "0", "1", "2", "3" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n    return "0" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
  'role': 'user'}],
 'answer': 0,
 'reasoning_effort': 'low'}

Train the model
Now we set up the GRPO trainer and all its configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth Reinforcement Learning Docs for more options.
We're also using TrackIO, which lets you visualize all training metrics right inside the notebook, fully locally!
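A sketch of the trainer wiring, assuming TRL's GRPOConfig and GRPOTrainer as patched by Unsloth (the hyperparameter values here are illustrative, not the notebook's exact settings):

```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate = 5e-5,
    per_device_train_batch_size = 2,  # must be a multiple of num_generations
    num_generations = 2,              # completions sampled per prompt
    max_steps = 600,
    report_to = "trackio",            # log metrics locally to TrackIO
    output_dir = "outputs",
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [function_works, no_cheating, strategy_succeeds],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()
```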
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`. We will change the batch size of 1 to the `num_generations` of 2
And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!
You might have to wait 150 to 200 steps before rewards start to move. You'll probably get 0 reward for the first 100 steps. Please be patient!
| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
| 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
| 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
And let's train the model! NOTE This might be quite slow! 600 steps takes ~5 hours or longer.
TrackIO might be a bit slow to load - wait 2 minutes until the graphs pop up!
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 1
\ / Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
"-____-" Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
* Running on public URL: https://e870a3fed110b90ddb.gradio.live
* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.
* Created new run: dainty-sunset-0
def strategy(board):
'''Return a move that merges the most tiles (0=up,1=right,2=down,3=left).'''
size = 4
moves = [0,1,2,3]
best = 0
bestmove = 0
for m in moves:
merged = 0
for i in range(size):
line = []
for j in range(size):
if m == 0: v = board[j][i]
elif m == 1: v = board[i][size-1-j]
elif m == 2: v = board[size-1-j][i]
else: v = board[i][j]
if v: line.append(v)
merged_this = 0
j = 0
while j < len(line)-1:
if line[j] == line[j+1]:
merged_this += 1
j += 2
else:
j += 1
merged += merged_this
if merged > best:
best, bestmove = merged, m
return str(bestmove)
Steps = 24 If Done = False
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ 2│
└───┴───┴───┴───┘
Steps = 13 If Done = False
def strategy(board):
# Move mapping: 0=up,1=down,2=left,3=right
def swipe(b,dir):
n=len(b);m=len(b[0]);res=[[0]*m for _ in range(n)]
for i in range(n):
for j in range(m):
r=0 if dir==3 else (n-1 if dir==1 else 0 if dir==0 else m-1)
cell=b[i][j]
if cell==0: continue
if dir==2: # left
pos=0
while pos+1<m and res[i][pos+1]==0: pos+=1
elif dir==3: # right
pos=m-1
while pos-1>=0 and res[i][pos-1]==0: pos-=1
elif dir==0: # up
pos=0
while pos+1<n and res[pos+1][j]==0: pos+=1
else: # down
pos=n-1
while pos-1>=0 and res[pos-1][j]==0: pos-=1
# merge
if dir==2 and pos>0 and res[i][pos-1]==cell:
res[i][pos-1]*=2; pos=pos-1
elif dir==3 and pos<m-1 and res[i][pos+1]==cell:
res[i][pos+1]*=2; pos=pos+1
elif dir==0 and pos>0 and res[pos-1][j]==cell:
res[pos-1][j]*=2; pos=pos-1
else: pos=pos
if dir==2: res[i][pos]=cell
elif dir==3: res[i][pos]=cell
elif dir==0: res[pos][j]=cell
else: res[pos][j]=cell
return res
def score(b):
s=0
for row in b:
s+=sum(x for x in row if x>0)
return s
best=-1;best_dir='0'
for d in range(4):
nb=swipe(board,d)
if nb!=board:
val=score(nb)
if val>best:
best=val;best_dir=str(d)
return best_dir
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ 2│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ 2│ .│
└───┴───┴───┴───┘
Unsloth: Will smartly offload gradients to save VRAM!
Steps = 4 If Done = False
def strategy(board):
def apply(b, d):
n=len(b); res=[[0]*n for _ in range(n)]
for i in range(n):
if d==0: r=[b[j][i] for j in range(n)][::-1]
elif d==1: r=[b[n-1-j][i] for j in range(n)]
elif d==2: r=[b[i][j] for j in range(n)][::-1]
else: r=[b[i][n-1-j] for j in range(n)]
merged=[]
skip=False
for v in r:
if v:
if not merged or merged[-1]!=v:
merged.append(v)
else:
merged[-1]*=2; merged.append(0)
merged=[x for x in merged if x]
merged+= [0]*(n-len(merged))
if d==0:
for j,v in enumerate(merged): res[n-1-j][i]=v
elif d==1:
for j,v in enumerate(merged): res[j][i]=v
elif d==2:
for j,v in enumerate(merged): res[i][j]=v
else:
for j,v in enumerate(merged): res[i][n-1-j]=v
return res
best=-1; bestm=0
for m in range(4):
nb=apply(board,m)
mx=max(max(row) for row in nb)
if mx>best:
best=mx; bestm=m
return str(bestm) # 0-up,1-down,2-left,3-right
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ 2│
├───┼───┼───┼───┤
│ 2│ .│ .│ .│
└───┴───┴───┴───┘
Exception = 'int' object is not iterable
Steps = 2 If Done = False
def strategy(board):
def compress(row):
new=[x for x in row if x>0]
res=[]
skip=False
for i,v in enumerate(new):
if skip: skip=False; continue
if i+1<len(new) and new[i]==new[i+1]:
res.append(v*2); skip=True
else: res.append(v)
return res+[0]*(4-len(res))
def move(b,dir):
nb=[[b[r][c] for r in range(4)] for c in range(4)] if dir==2 else b
if dir==1: nb=list(reversed(nb))
res=[[0]*4 for _ in range(4)]
for r in range(4):
res[r]=compress(nb[r])
if dir==1: res=list(reversed(res))
if dir==3: res=list(zip(*res[::-1]))
if dir==2: res=list(zip(*res))
return [list(row) for row in res]
scores=[]
for d in "0123":
nb=move(board, int(d))
scores.append((sum(sum(row) for row in nb), d))
return max(scores)[1]
┌───┬───┬───┬───┐
│ .│ .│ .│ 2│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
└───┴───┴───┴───┘
def strategy(board):
import random, itertools
# helper to copy and shift board
def shift(b, dir):
n=4; res=[[0]*n for _ in range(n)]
for i in range(n):
line = [b[i][j] if dir%2==0 else b[j][i] for j in range(n)]
if dir%2==1: line.reverse()
new=[x for x in line if x]
merged=[]
j=0
while j<len(new):
if j+1<len(new) and new[j]==new[j+1]:
merged.append(new[j]*2); j+=2
else:
merged.append(new[j]); j+=1
merged += [0]*(n-len(merged))
if dir%2==1: merged.reverse()
for j,x in enumerate(merged):
if dir%2==0: res[i][j]=x
else: res[j][i]=x
return res
def empty(b): return sum(row.count(0) for row in b)
moves="0123"
best=None; bestv=-1
for m in moves:
nb=shift(board,int(m))
v=empty(nb)
if v>bestv: bestv, best=m, v
return best if best is not None else random.choice(moves)
Exception = '>' not supported between instances of 'int' and 'str'
Steps = 2 If Done = False
def strategy(board):
import functools
# SCORE: sum of board weighted by position (desc to keep high tile in corner)
def score(b):
w = [[16,8,4,2],[8,4,2,1],[4,2,1,0.5],[2,1,0.5,0.25]]
return sum(b[i][j]*w[i][j] for i in range(4) for j in range(4))
# Apply move
def move(b, d):
def compress(row): # slide and merge
new=[x for x in row if x]
merged=[]
i=0
while i<len(new):
if i+1<len(new) and new[i]==new[i+1]:
merged.append(new[i]*2); i+=2
else:
merged.append(new[i]); i+=1
return merged+[0]*(4-len(merged))
rot = {0:lambda x:x,
1:lambda x: list(zip(*x[::-1])),
2:lambda x: [row[::-1] for row in x[::-1]],
3:lambda x: [list(row) for row in zip(*x)][::-1]}
inv = {0:lambda x:x,
1:lambda x: [list(row) for row in zip(*x)][::-1],
2:lambda x: [row[::-1] for row in x[::-1]],
3:lambda x: list(zip(*x[::-1]))}
r = rot[d](b)
moved = [compress(row) for row in r]
return inv[d](moved)
best = -1; best_d=0
for d in range(4):
nb = move(board,d)
s = score(nb)
if s>best: best,best_d=s,d
return str(best_d)
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ 2│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ 2│ .│
└───┴───┴───┴───┘
Steps = 14 If Done = False
def strategy(board):
def move(b,dir):
n=len(b); res=[row[:] for row in b]
delta=[(0,1),(1,0),(0,-1),(-1,0)]
dr,dc=delta[dir]
lined=[]
for i in range(n):
line=[]
for j in range(n):
r,c=(i+dr*j)%n,(j+dc*i)%n
if b[r][c]!=0: line.append(b[r][c])
merged=[]
skip=False
for k,x in enumerate(line):
if skip: skip=False; continue
if k+1<len(line) and line[k+1]==x:
merged.append(x*2); skip=True
else: merged.append(x)
merged+= [0]*(n-len(merged))
for j,v in enumerate(merged):
r,c=(i+dr*j)%n,(j+dc*i)%n
res[r][c]=v
return res
for d in range(4):
if move(board,d)!=board:
return str(d)
return "0"
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ 2│
└───┴───┴───┴───┘
Steps = 8 If Done = False
def strategy(board):
# 0: up, 1: down, 2: left, 3: right
N = len(board)
def can_move(dir):
if dir==0: # up
for c in range(N):
for r in range(1,N):
if board[r][c]!=0 and (board[r-1][c]==0 or board[r-1][c]==board[r][c]): return True
elif dir==1: # down
for c in range(N):
for r in range(N-2,-1,-1):
if board[r][c]!=0 and (board[r+1][c]==0 or board[r+1][c]==board[r][c]): return True
elif dir==2: # left
for r in range(N):
for c in range(1,N):
if board[r][c]!=0 and (board[r][c-1]==0 or board[r][c-1]==board[r][c]): return True
else: # right
for r in range(N):
for c in range(N-2,-1,-1):
if board[r][c]!=0 and (board[r][c+1]==0 or board[r][c+1]==board[r][c]): return True
return False
for d in range(4):
if can_move(d):
return str(d)
return "0"
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ 2│
└───┴───┴───┴───┘
None
Steps = 5 If Done = False
def strategy(board):
import copy, random
# Meld canned functions
def rotate(b):
return [list(x)[::-1] for x in zip(*b)]
def move(b, dir):
if dir==1:
b=rotate(b)
elif dir==2:
b=[row[::-1] for row in b]
elif dir==3:
b=[row[::-1] for row in zip(*b[::-1])]
new=[ [0]*4 for _ in range(4) ]
for r in range(4):
tmp=[x for x in b[r] if x]
merged=[]
i=0
while i<len(tmp):
if i+1<len(tmp) and tmp[i]==tmp[i+1]:
merged.append(tmp[i]*2); i+=2
else:
merged.append(tmp[i]); i+=1
for c,val in enumerate(merged):
new[r][c]=val
# rotate back
if dir==1:
new=rotate(rotate(rotate(new)))
elif dir==2:
new=[row[::-1] for row in new]
elif dir==3:
new=[row[::-1] for row in zip(*new[::-1])]
return new
# evaluate board
def score(b):
s=0
for r in range(4):
for c in range(4):
v=b[r][c]
if v:
# favor higher values and empties
s+=v* (1 if c==0 else 0) # simple heuristic
return s
best=0; bests=[-1]*4
for d in range(4):
nb=move(board,d)
if nb!=board:
bests[d]=score(nb)
return str(bests.index(max(bests))+0) # return 0-3
┌───┬───┬───┬───┐
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ .│ .│ .│
├───┼───┼───┼───┤
│ .│ 2│ .│ .│
├───┼───┼───┼───┤
│ 2│ .│ .│ .│
└───┴───┴───┴───┘
Saving to float16 or MXFP4
We also support saving to float16 directly. Select merged_16bit for float16, or mxfp4 for MXFP4 (GPT-OSS's native precision). We also allow saving just the LoRA adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account; you can get a personal token at https://huggingface.co/settings/tokens.
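For instance (a sketch using Unsloth's saving methods; the repository name and token are placeholders you should replace):

```python
# merged float16 weights
model.save_pretrained_merged("gpt-oss-2048", tokenizer, save_method = "merged_16bit")

# or MXFP4, GPT-OSS's native precision
model.save_pretrained_merged("gpt-oss-2048", tokenizer, save_method = "mxfp4")

# or only the LoRA adapters as a fallback
model.save_pretrained_merged("gpt-oss-2048", tokenizer, save_method = "lora")

# upload to the Hugging Face Hub (token from https://huggingface.co/settings/tokens)
model.push_to_hub_merged("your-username/gpt-oss-2048", tokenizer,
                         save_method = "mxfp4", token = "hf_...")
```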
And we're done!
Congratulations, you just learned how to do reinforcement learning with GPT-OSS! Some advanced topics were covered in this notebook; to learn more about GPT-OSS and RL, see the additional docs in Unsloth's Reinforcement Learning Guide with GPT-OSS.
This notebook and all Unsloth notebooks are licensed LGPL-3.0.


