GPT-OSS (20B) Reinforcement Learning: 2048 Game (DGX Spark)
Goal: Make GPT-OSS play games with Reinforcement Learning
Our goal is to make GPT-OSS play the 2048 game using GRPO, a reinforcement learning algorithm.
We want the model to devise a strategy for playing 2048, which we then run until the game is won or lost. We reward the model if the strategy was good (it won the game) and penalize it with a negative reward if the strategy was bad.
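To make the reward-and-penalty idea concrete: GRPO samples a group of completions for the same prompt and scores each one relative to the rest of its group. Below is a minimal sketch of that normalization. The reward values are hypothetical, and the function name and the small `eps` term are illustrative choices, not taken from this notebook or from any particular library.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group (zero mean, roughly unit spread)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four strategies sampled from the same prompt.
rewards = [2.0, -1.0, -1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Strategies that beat their group's average get a positive advantage and are reinforced; the others are pushed down. If every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.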
Installation
We'll be using Unsloth to do RL on GPT-OSS 20B. Unsloth cuts VRAM usage by 70% and makes reinforcement learning 2 to 6x faster, which lets us fit GPT-OSS RL in a free Google Colab instance.
We'll load GPT-OSS 20B and set some parameters:
- max_seq_length = 768: The maximum context length of the model. Increasing it uses more memory, and 768 was the largest value we found to fit on a free 15GB Tesla T4 machine.
- lora_rank = 4: The larger this number, the smarter the RL process, but the slower it runs and the more memory it uses.
- load_in_4bit = True: Uses quantization to reduce memory usage by 75% without reducing accuracy much. load_in_16bit will be faster but needs an 80GB GPU (H100, B200).
- offload_embedding = True: A new Unsloth optimization that moves the embedding to CPU RAM, reducing VRAM usage by 1GB.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.1: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.
Unsloth: Offloading embeddings to RAM to save 1.08 GB.
To do efficient RL, we will use LoRA, which lets us add only 1 to 5% of extra weights to the model for finetuning. This cuts memory usage by over 60% while retaining good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.
Unsloth: Making `model.base_model.model.model` require gradients
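As a rough worked example of why LoRA adds so few weights: a rank-r adapter on a weight matrix of shape d_out x d_in adds two small matrices with r * (d_in + d_out) parameters in total. The sketch below applies that formula under two assumptions that this notebook does not state: that adapters are attached only to the four attention projections, and that GPT-OSS 20B uses the layer shapes listed in the code.

```python
def lora_params(d_in, d_out, rank):
    """A rank-r adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Assumed GPT-OSS 20B attention shapes (not stated in this notebook).
HIDDEN = 2880
Q_DIM = 64 * 64   # 64 query heads x head_dim 64
KV_DIM = 8 * 64   # 8 key/value heads x head_dim 64
LAYERS = 24
RANK = 4          # lora_rank from the settings above

per_layer = (
    lora_params(HIDDEN, Q_DIM, RANK)     # q_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # k_proj
    + lora_params(HIDDEN, KV_DIM, RANK)  # v_proj
    + lora_params(Q_DIM, HIDDEN, RANK)   # o_proj
)
total = per_layer * LAYERS
print(total)
```

Under these assumptions the total comes to 1,990,656, which agrees with the "Trainable parameters" figure in the training banner later in this notebook.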
2048 game
We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
[NOTE] The original 2048 spawns a 4 tile 10% of the time! We can disable this for harder games. See the Wikipedia page for more details.
┌───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  2│  .│  .│  .│  2│
└───┴───┴───┴───┴───┘
ongoing
GameBoard(size=5, seed=42, target=8, probability_fours=0.1)
We'll use WASD for the action space:
  W
A S D
game.state() also reports success once we reach the target!
┌───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  2│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  4│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘
ongoing
┌───┬───┬───┬───┬───┐
│  4│  .│  2│  .│  .│
├───┼───┼───┼───┼───┤
│  2│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘
ongoing
┌───┬───┬───┬───┬───┐
│  .│  .│  .│  4│  2│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  2│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  4│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘
ongoing
┌───┬───┬───┬───┬───┐
│  4│  .│  .│  4│  4│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  4│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘
ongoing
┌───┬───┬───┬───┬───┐
│  .│  .│  2│  4│  8│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  4│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘
success
If we take an action that is not part of the action space, we get an error and the game stops accepting any more actions.
┌───┬───┬───┐
│  .│  4│  .│
├───┼───┼───┤
│  .│  .│  2│
├───┼───┼───┤
│  .│  .│  .│
└───┴───┴───┘
failed
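The behaviour walked through above (a board you can read, one WASD action at a time, and a state of ongoing, success or failed) can be sketched in plain Python. This is a hypothetical stand-in, not the notebook's actual GameBoard code: the class name `MiniBoard`, its method names and its rule of spawning a tile only after a move that changes the board are all assumptions made for illustration.

```python
import random

class MiniBoard:
    """Hypothetical stand-in for the notebook's GameBoard (not its real code)."""

    MOVES = ("W", "A", "S", "D")

    def __init__(self, size=4, seed=None, target=2048, probability_fours=0.1):
        self.size, self.target = size, target
        self.probability_fours = probability_fours
        self._rng = random.Random(seed)
        self._grid = [[0] * size for _ in range(size)]
        self._state = "ongoing"
        self._spawn()
        self._spawn()

    def board(self):
        return [row[:] for row in self._grid]

    def state(self):
        return self._state

    @staticmethod
    def _slide_left(row):
        """Slide one row left, merging each equal neighbouring pair once."""
        tiles = [v for v in row if v]
        out, i = [], 0
        while i < len(tiles):
            if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
                out.append(tiles[i] * 2)
                i += 2
            else:
                out.append(tiles[i])
                i += 1
        return out + [0] * (len(row) - len(out))

    def _moved(self, key):
        """Return the grid after sliding in direction `key`, without spawning."""
        g = self._grid
        if key in ("W", "S"):          # treat columns as rows
            g = [list(col) for col in zip(*g)]
        if key in ("S", "D"):          # slide toward the far edge
            g = [self._slide_left(r[::-1])[::-1] for r in g]
        else:
            g = [self._slide_left(r) for r in g]
        if key in ("W", "S"):          # transpose back
            g = [list(col) for col in zip(*g)]
        return g

    def _spawn(self):
        empty = [(r, c) for r in range(self.size) for c in range(self.size)
                 if self._grid[r][c] == 0]
        if empty:
            r, c = self._rng.choice(empty)
            four = self._rng.random() < self.probability_fours
            self._grid[r][c] = 4 if four else 2

    def do_action(self, key):
        if self._state != "ongoing":
            return                     # finished games accept no more actions
        if key not in self.MOVES:
            self._state = "failed"     # an invalid action ends the game
            return
        new = self._moved(key)
        if new != self._grid:
            self._grid = new
            self._spawn()
        if any(v >= self.target for row in self._grid for v in row):
            self._state = "success"
        elif all(self._moved(k) == self._grid for k in self.MOVES):
            self._state = "failed"     # no move changes the board any more
```

Keeping the slide-and-merge logic in one helper and handling the other three directions by reversing or transposing avoids writing four nearly identical loops.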
RL Environment Setup
We'll set up a function that accepts a strategy, repeatedly asks it for an action from WASD, and checks the game state after each move.
We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
Let's make a generic strategy to just hit W. We should expect this generic strategy to fail:
Timed out with error = Timed out after 2s
To allow longer strategies for GPT-OSS Reinforcement Learning, we shall allow a 5 second timer.
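A minimal sketch of such a time-limited game loop follows. It assumes a game object with board(), state() and do_action() methods; `ToyGame` and `play_with_deadline` are hypothetical names used only for this illustration. The deadline is checked between moves, which catches a strategy such as always pressing W that never ends the game, but it would not interrupt a single strategy call that hangs; that case needs a signal-based or process-based timeout instead.

```python
import time

def play_with_deadline(strategy, game, seconds=2.0):
    """Call strategy(board) repeatedly until the game ends or time runs out."""
    deadline = time.monotonic() + seconds
    steps = 0
    while game.state() == "ongoing":
        if time.monotonic() > deadline:
            raise TimeoutError(f"Timed out after {seconds}s")
        game.do_action(strategy(game.board()))
        steps += 1
    return steps, game.state()

class ToyGame:
    """Hypothetical one-cell game: pressing "D" three times wins, "W" does nothing."""

    def __init__(self):
        self.position = 0

    def board(self):
        return [[self.position]]

    def state(self):
        return "success" if self.position >= 3 else "ongoing"

    def do_action(self, key):
        if key == "D":
            self.position += 1

finished = play_with_deadline(lambda board: "D", ToyGame())
try:
    play_with_deadline(lambda board: "W", ToyGame(), seconds=0.05)
    message = None
except TimeoutError as error:
    message = str(error)
print(finished, message)
```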
Code Execution
Before we create and execute a new Python function, we first have to check that it does not access global variables or cheat in other ways. This guards against reward hacking: we don't want the model to earn reward by cheating instead of writing a real strategy.
For example, the piece of code below is fine, since it only imports standard-library modules. We check it with check_python_modules:
Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}
For the below piece of code, since we import numpy, we should not allow the execution:
Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}
We also disallow global variable access. We'll use Unsloth's create_locked_down_function for this:
name 'np' is not defined
60
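The idea behind the two outputs above can be illustrated with plain `exec`: if the generated source is executed in a fresh namespace, the resulting function can only see Python's builtins and whatever it defines itself. This is a simplified sketch with a hypothetical helper name, not the implementation of create_locked_down_function, and it is not a security sandbox, since the builtins are still fully available.

```python
import builtins

def make_isolated_function(source, name):
    """Exec `source` in a fresh namespace so the function sees no outer globals."""
    namespace = {"__builtins__": builtins}
    exec(source, namespace)
    return namespace[name]

np = "pretend this is numpy"  # a global the generated code should NOT be able to see

leaky = make_isolated_function("def add(a, b):\n    return np.add(a, b)\n", "add")
try:
    leaky(10, 50)
    message = None
except NameError as error:
    message = str(error)
print(message)

clean = make_isolated_function("def add(a, b):\n    return a + b\n", "add")
print(clean(10, 50))
```

The first function fails because `np` does not exist in its private namespace, while the second one works because it relies only on its own arguments.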
Data & RL task setup
We now have to create a prompt that tells the model to write a strategy for the 2048 game. You can adapt this prompt to train on a different RL task.
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
First, let's prompt GPT-OSS without RL and see how it goes:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-05
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need simple heuristic: return random move or based on empty spaces. Provide function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    # Count empty cells and choose move that reduces board density
    empty_counts = {
        "W": sum(row.count(0) for row in board[1:]),
        "S": sum(row.count(0) for row in board[:-1]),
        "A": sum(col.count(0) for col in zip(*board)[1:]),
        "D": sum(col.count(0) for col in zip(*board)[:-1]),
    }
    # Prefer the direction with the most empty cells
    return max(empty_counts, key=empty_counts.get)
```<|return|>
Reward functions
We now design an extract_function function, which simply extracts the function wrapped in 3 backticks.
And 3 reward functions:
- function_works, which rewards the model if the strategy is a valid Python function.
- no_cheating, which checks whether the function imported other modules and penalizes it if it did.
- strategy_succeeds, which checks whether the auto-generated strategy actually succeeds in reaching 2048 when we run it.
def strategy(board):
    return "W" # Example
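The extraction step that produced the output above can be sketched as follows. This is a hypothetical version (the notebook's own extract_function is not shown in this text): it takes the first complete fenced block, drops an optional `python` language tag, and rejects anything that does not start with the expected function header.

```python
FENCE = "`" * 3  # three backticks, built indirectly so this snippet nests cleanly

def extract_strategy(text):
    """Return the code inside the first complete fence, or None if unusable."""
    if text.count(FENCE) < 2:
        return None
    start = text.find(FENCE) + len(FENCE)
    end = text.find(FENCE, start)
    body = text[start:end].strip()
    if body.startswith("python"):
        body = body[len("python"):].lstrip()
    if not body.startswith("def strategy(board):"):
        return None
    return body

reply = (
    "Here is my strategy:\n"
    f"{FENCE}python\n"
    "def strategy(board):\n"
    '    return "W" # Example\n'
    f"{FENCE}"
)
print(extract_strategy(reply))
```

Returning None for malformed replies gives the reward functions a simple signal that the completion should be penalized rather than executed.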
Below is our function_works reward function. It uses Python's exec, but guards it so that local and global variables cannot leak into the generated function. We can also run check_python_modules first to catch errors before executing the function at all:
(False,
 {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
  'stdlib': [],
  'non_stdlib': [],
  'relative_imports': 0})
no_cheating checks if the function cheated, since it might have imported NumPy or other modules:
Next, strategy_succeeds checks whether the strategy actually lets the game terminate. Imagine a strategy that simply always returns "W": the game would never finish, so it would fail once the time limit of 10 seconds is reached.
We also add a global PRINTER to print out the strategy and board state.
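The outcome-based reward described above can be sketched as a simple mapping from what happened during a playthrough to a number. The reward values below are hypothetical placeholders chosen for illustration (the notebook's real numbers are not shown in this text), and the function names are made up for this sketch.

```python
# Hypothetical reward values; the notebook's actual numbers are not shown here.
def outcome_reward(state=None, timed_out=False, crashed=False):
    """Map the result of one playthrough to a scalar reward."""
    if crashed:
        return -3.0   # the strategy raised an exception
    if timed_out:
        return -1.0   # the strategy never finished the game in time
    if state == "success":
        return 20.0   # the strategy reached the target tile
    return 2.0        # the strategy ran to the end but lost the game

def strategy_succeeds_sketch(results):
    """Return one float per sampled completion, as GRPO-style rewards do."""
    return [outcome_reward(**result) for result in results]

scores = strategy_succeeds_sketch([
    {"state": "success"},
    {"timed_out": True},
    {"crashed": True},
    {"state": "failed"},
])
print(scores)
```

Giving a small positive reward to strategies that run cleanly but lose, and penalties to crashes and timeouts, lets the model first learn to write working code before it learns to write winning code.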
We'll now create the dataset, which consists of repeated copies of our prompt. Remember to set the reasoning effort to low! You can choose high reasoning mode, but this will only work on GPUs with more memory, such as H100s.
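Since every training example is the same prompt, the dataset can be pictured as a list of identical rows. The sketch below uses plain Python lists and only the first sentence of the prompt as a stand-in; the row count of 1,000 is taken from the "Num examples" line in the training banner, and the field names mirror the example row printed below.

```python
# Stand-in for the full prompt shown earlier (first sentence only).
prompt_text = "Create a new short 2048 strategy using only native Python code."

row = {
    "prompt": [{"role": "user", "content": prompt_text}],
    "answer": 0,                 # no reference answer; rewards come from playing the game
    "reasoning_effort": "low",   # keeps completions short enough for small GPUs
}

# Every example is the same prompt, so the list just repeats one row.
dataset_rows = [dict(row) for _ in range(1000)]
print(len(dataset_rows), dataset_rows[0]["reasoning_effort"])
```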
181
{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "W", "A", "S", "D" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n return "W" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
   'role': 'user'}],
 'answer': 0,
 'reasoning_effort': 'low'}
Train the model
Now set up the GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth Reinforcement Learning Docs for more options.
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`. We will change the batch size of 1 to the `num_generations` of 2
And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!
You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
| 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
| 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
Unsloth: Switching to float32 training since model cannot work with float16
And let's train the model!
[NOTE] A free T4 GPU might take 5 minutes for one generation, since it's an old GPU. An A100 or H100 will be much faster!
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998, 'pad_token_id': 200017}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 1,000
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.
None
Steps = 1 State = failed
def strategy(board):
    # simple heuristic: prefer right or down, then left, then up
    for move in "R D L U".split():
        pass
┌───┬───┬───┬───┬───┬───┐
│  2│  2│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┴───┘
Steps = 1 State = failed
def strategy(board):
    # Helper: simulate a move, return new board and score
    def simulate(board, dir):
        n = len(board)
        new = [[0]*n for _ in range(n)]
        score = 0
        for i in range(n):
            # extract line
            if dir == 'A':
                line = [board[i][j] for j in range(n)]
                rev = False
            elif dir == 'D':
                line = [board[i][j] for j in range(n-1, -1, -1)]
                rev = True
            elif dir == 'W':
                line = [board[j][i] for j in range(n)]
                rev = False
            else: # 'S'
                line = [board[j][i] for j in range(n-1, -1, -1)]
                rev = True
            # compress and merge
            new_line = [x for x in line if x != 0]
            merged = []
            j = 0
            while j < len(new_line):
                if j + 1 < len(new_line) and new_line[j] == new_line[j+1]:
                    merged.append(new_line[j]*2)
                    score += new_line[j]*2
                    j += 2
                else:
                    merged.append(new_line[j])
                    j += 1
            # fill with zeros
            merged += [0]*(n-len(merged))
            # place back
            if rev:
                merged = merged[::-1]
            if dir in ('A','D'):
                for j in range(n):
                    new[i][j] = merged[j]
            else:
                for j in range(n):
                    new[j][i] = merged[j]
        return new, score
    best, best_dir = 0, None
    for dir in ('W','A','S','D'):
        _, score = simulate(board, dir)
        if score > best:
            best, best_dir = score, dir
    return best_dir # returns one of 'W','A','S','D'
┌───┬───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  4│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  2│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┴───┘
Unsloth: Will smartly offload gradients to save VRAM!
def strategy(board):
    # helpers
    def move(b, d):
        n = len(b)
        def compress(row):
            new = [x for x in row if x!=0]
            for i in range(len(new)-1):
                if new[i]==new[i+1]:
                    new[i]*=2; new[i+1]=0
            return [x for x in new if x!=0]+[0]*(n-len(new))
        res=[[0]*n for _ in range(n)]
        if d=="W":
            for j in range(n):
                col=[b[i][j] for i in range(n)]
                col=compress(col)
                for i in range(n):
                    res[i][j]=col[i]
        elif d=="S":
            for j in range(n):
                col=[b[i][j] for i in range(n)][::-1]
                col=compress(col)
                col=col[::-1]
                for i in range(n):
                    res[i][j]=col[i]
        elif d=="A":
            for i in range(n):
                row=compress(b[i])
                res[i]=row
        elif d=="D":
            for i in range(n):
                row=compress(b[i][::-1])
                row=row[::-1]
                res[i]=row
        return res
    def score(b):
        return sum(sum(row) for row in b)
    moves="WASD"
    best=None; best_val=-1
    for m in moves:
        nb=move(board, m)
        val=score(nb)
        if val>best_val and any(nb[i][j]!=board[i][j] for i in range(len(nb)) for j in range(len(nb[0]))):
            best_val=val; best=m
    return best if best else "W"
Exception = list index out of range
Timeout
Steps = 475 State = failed
def strategy(board):
    def move_possible(board, direction):
        rows, cols = len(board), len(board[0])
        if direction == 'W':
            for j in range(cols):
                for i in range(1, rows):
                    if board[i][j] != 0:
                        for k in range(i-1, -1, -1):
                            if board[k][j] == 0 or board[k][j] == board[i][j]:
                                return True
                            if board[k][j] != 0:
                                break
        elif direction == 'S':
            for j in range(cols):
                for i in range(rows-2, -1, -1):
                    if board[i][j] != 0:
                        for k in range(i+1, rows):
                            if board[k][j] == 0 or board[k][j] == board[i][j]:
                                return True
                            if board[k][j] != 0:
                                break
        elif direction == 'A':
            for i in range(rows):
                for j in range(1, cols):
                    if board[i][j] != 0:
                        for k in range(j-1, -1, -1):
                            if board[i][k] == 0 or board[i][k] == board[i][j]:
                                return True
                            if board[i][k] != 0:
                                break
        elif direction == 'D':
            for i in range(rows):
                for j in range(cols-2, -1, -1):
                    if board[i][j] != 0:
                        for k in range(j+1, cols):
                            if board[i][k] == 0 or board[i][k] == board[i][j]:
                                return True
                            if board[i][k] != 0:
                                break
        return False
    # Prefer moves that allow a merge as they increase score
    for d in ('W', 'S', 'A', 'D'):
        if move_possible(board, d):
            return d
    # If no merges are possible, pick any direction that moves tiles
    for d in ('W', 'S', 'A', 'D'):
        if any(board[i][j] != 0 for i in range(len(board)) for j in range(len(board[0]))):
            return d
    return 'W'
┌───┬───┬───┬───┬───┬───┐
│  2│ 16│  4│  2│  8│  4│
├───┼───┼───┼───┼───┼───┤
│  8│  2│ 32│  8│128│  8│
├───┼───┼───┼───┼───┼───┤
│ 32│ 64│256│  2│ 64│ 32│
├───┼───┼───┼───┼───┼───┤
│128│  8│ 16│ 64│ 32│  8│
├───┼───┼───┼───┼───┼───┤
│  4│  2│  4│ 16│  8│  4│
├───┼───┼───┼───┼───┼───┤
│ 64│  4│  2│  8│  4│  2│
└───┴───┴───┴───┴───┴───┘
Exception = '>' not supported between instances of 'tuple' and 'float'
def strategy(board):
    import random, copy
    def rotate(b):
        return [[b[3-j][i] for j in range(4)] for i in range(4)]
    def compress(b):
        new = []
        for row in b:
            new_row = [x for x in row if x != 0]
            new_row += [0]*(4-len(new_row))
            new.append(new_row)
        return new
    def merge(b):
        for row in b:
            for i in range(3):
                if row[i]==row[i+1] and row[i]!=0:
                    row[i]*=2
                    row[i+1]=0
    def move(b, dir):
        if dir=="W":
            return merge(rotate(compress(rotate(b))))
        if dir=="S":
            return rotate(merge(compress(rotate(b))))
        if dir=="A":
            return merge(compress(b))
        if dir=="D":
            return rotate(merge(compress(rotate(b)))) # actually reverse
    best_score=0
    best_move=None
    for move_dir in "WASD":
        new_board=move(copy.deepcopy(board), move_dir)
        score=sum(sum(row) for row in new_board)
        if score>best_score:
            best_score=score
            best_move=move_dir
    return best_move
Exception = 'NoneType' object is not iterable
Exception = name 'n' is not defined
Timeout
Timeout
None
Timeout
def strategy(board):
    # Prioritize merges, then favor left/up moves
    rows, cols = len(board), len(board[0]) if board else 0
    # Helper to check if a move is possible
    def can_move(direction):
        if direction == 'W':
            for c in range(cols):
                for r in range(rows-1):
                    if board[r][c] == 0 or board[r][c] == board[r+1][c]:
                        return True
        elif direction == 'A':
            for r in range(rows):
                for c in range(cols-1):
                    if board[r][c] == 0 or board[r][c] == board[r][c+1]:
                        return True
        elif direction == 'S':
            for c in range(cols):
                for r in range(rows-1,0,-1):
                    if board[r][c] == 0 or board[r][c] == board[r-1][c]:
                        return True
        elif direction == 'D':
            for r in range(rows):
                for c in range(cols-1,0,-1):
                    if board[r][c] == 0 or board[r][c] == board[r][c-1]:
                        return True
        return False
    # Generate all moves
    moves = []
    for d in ['W', 'A', 'S', 'D']:
        if can_move(d):
            moves.append(d)
    # If multiple moves, pick one that maximizes the sum of merges
    if not moves:
        return 'W' # fallback
    # Simple heuristic: prefer first move that allows a merge
    return moves[0]
Timeout
Steps = 1512 State = failed
def strategy(board):
    # helper to check possible merge in a row or column
    def can_merge(lst):
        for i in range(len(lst)-1):
            if lst[i] > 0 and lst[i] == lst[i+1]:
                return True
        return False
    # try to move in a direction that creates a merge
    for dir, delta in [("W", (-1,0)), ("A", (0,-1)), ("S", (1,0)), ("D", (0,1))]:
        merged = False
        for i in range(len(board)):
            for j in range(len(board[0])):
                if board[i][j] > 0:
                    ni, nj = i + delta[0], j + delta[1]
                    if 0 <= ni < len(board) and 0 <= nj < len(board[0]):
                        if board[ni][nj] == 0:
                            return dir
                        if board[ni][nj] == board[i][j]:
                            return dir
    # fallback: move down
    return "S"
┌────┬────┬────┬────┬────┬────┐
│ 512│  16│ 256│   4│  64│   8│
├────┼────┼────┼────┼────┼────┤
│ 128│  64│1024│  32│   8│  64│
├────┼────┼────┼────┼────┼────┤
│  64│   8│ 256│ 128│   4│  16│
├────┼────┼────┼────┼────┼────┤
│   4│ 256│  16│   4│  16│   8│
├────┼────┼────┼────┼────┼────┤
│ 128│  64│  32│  16│   8│   4│
├────┼────┼────┼────┼────┼────┤
│  64│  32│  16│   8│   4│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
def strategy(board):
    # Simple greedy: choose direction that keeps tiles sorted in ascending order left-bottom
    best = " "
    best_val = -1
    for d in "WASD":
        # simulate move
        b = [row[:] for row in board]
        # merge function
        def merge(row):
            new = [x for x in row if x != 0]
            res = []
            i = 0
            while i < len(new):
                if i+1 < len(new) and new[i] == new[i+1]:
                    res.append(new[i]*2)
                    i += 2
                else:
                    res.append(new[i])
                    i += 1
            return res + [0]*(len(row)-len(res))
        moved = False
        if d == "W":
            for col in range(4):
                col_vals = [board[r][col] for r in range(4)]
                merged = merge(col_vals)
                for r in range(4):
                    b[r][col] = merged[r]
        elif d == "S":
            for col in range(4):
                col_vals = [board[r][col] for r in range(4)][::-1]
                merged = merge(col_vals)[::-1]
                for r in range(4):
                    b[r][col] = merged[r]
        elif d == "A":
            for r in range(4):
                row_vals = board[r]
                merged = merge(row_vals)
                b[r] = merged
        elif d == "D":
            for r in range(4):
                row_vals = board[r][::-1]
                merged = merge(row_vals)[::-1]
                b[r] = merged
        score = sum(filter(None, [x for row in b for x in row]))
        if score > best_val:
            best_val = score
            best = d
    return best
Timeout
Timeout
Exception = 'str' object is not callable
Timeout
def strategy(board):
    # helper to rotate board
    def rotate(b): return [list(col)[::-1] for col in zip(*b)]
    # helper to move up
    def move_up(b):
        n=len(b)
        new=[[] for _ in range(n)]
        for j in range(n):
            col=[b[i][j] for i in range(n) if b[i][j]!=0]
            merged=[]
            i=0
            while i< len(col):
                if i+1<len(col) and col[i]==col[i+1]:
                    merged.append(col[i]*2)
                    i+=2
                else:
                    merged.append(col[i])
                    i+=1
            new_col=[0]*(n-len(merged))+merged
            for i in range(n):
                new[i][j]=new_col[i]
        return new
    best=None
    best_val=-1
    for dir in ["W","A","S","D"]:
        # move board in given direction
        b=[row[:] for row in board]
        if dir=="W": b=move_up(b)
        elif dir=="S": b=[list(row[::-1]) for row in move_up([row[::-1] for row in b])]
        elif dir=="A": b=[list(row[::-1]) for row in move_up([row[::-1] for row in b])]
        elif dir=="D": b=[list(row[::-1]) for row in b]
        # evaluate
        val=max(max(row) for row in b)
        if val>best_val:
            best_val=val; best=dir
    return best
Exception = list assignment index out of range
Timeout
Exception = list index out of range
def strategy(board):
    import copy
    moves = "WASD"
    best = None
    best_score = -1
    for m in moves:
        b = copy.deepcopy(board)
        if m=="W":
            for c in range(len(b)):
                merged = []
                for r in range(len(b)):
                    val = b[r][c]
                    if val!=0:
                        merged.append(val)
                i=0
                while i+1<len(merged):
                    if merged[i]==merged[i+1]:
                        merged[i]*=2
                        merged.pop(i+1)
                    i+=1
                merged+= [0]*(len(b)-len(merged))
                for r in range(len(b)):
                    b[r][c]=merged[r]
        elif m=="S":
            for c in range(len(b)):
                merged = []
                for r in reversed(range(len(b))):
                    val = b[r][c]
                    if val!=0:
                        merged.append(val)
                i=0
                while i+1<len(merged):
                    if merged[i]==merged[i+1]:
                        merged[i]*=2
                        merged.pop(i+1)
                    i+=1
                merged+= [0]*(len(b)-len(merged))
                for r in range(len(b)):
                    b[r][c]=merged[len(b)-1-r]
        elif m=="A":
            for r in range(len(b)):
                row = b[r]
                merged = [v for v in row if v!=0]
                i=0
                while i+1<len(merged):
                    if merged[i]==merged[i+1]:
                        merged[i]*=2
                        merged.pop(i+1)
                    i+=1
                merged+= [0]*(len(b)-len(merged))
                b[r]=merged
        elif m=="D":
            for r in range(len(b)):
                row = list(reversed(b[r]))
                merged = [v for v in row if v!=0]
                i=0
                while i+1<len(merged):
                    if merged[i]==merged[i+1]:
                        merged[i]*=2
                        merged.pop(i+1)
                    i+=1
                merged+= [0]*(len(b)-len(merged))
                b[r]=list(reversed(merged))
        score=sum(sum(row) for row in b)
        if score>best_score:
            best_score=score; best=m
    return best
Timeout
Timeout
Exception = unsupported operand type(s) for -: 'range' and 'int'
def strategy(board):
    # board is a 4x4 list of ints, 0 for empty
    # Simple greedy: move that merges most tiles
    moves = {}
    dirs = {"W": (-1,0), "A": (0,-1), "S": (1,0), "D": (0,1)}
    for d, (dr,dc) in dirs.items():
        # simulate move
        new_board = [row[:] for row in board]
        merged = 0
        for i in range(4):
            for j in range(4):
                if new_board[i][j]==0: continue
                ni, nj = i+dr, j+dc
                while 0<=ni<4 and 0<=nj<4 and new_board[ni][nj]==0:
                    ni+=dr; nj+=dc
                if 0<=ni<4 and 0<=nj<4 and new_board[ni][nj]==new_board[i][j]:
                    merged+=1
        moves[d]=merged
    # choose direction with most merges, default W
    best = max(moves, key=moves.get)
    return best
Timeout
Timeout
Timeout
Exception = list index out of range
def strategy(board):
    moves = "WASD"
    best = None
    best_score = -1
    for m in moves:
        new_board = [row[:] for row in board]
        if m == "W":
            new_board = _move_up(new_board)
        elif m == "A":
            new_board = _move_left(new_board)
        elif m == "S":
            new_board = _move_down(new_board)
        else: # "D"
            new_board = _move_right(new_board)
        score = sum(sum(row) for row in new_board)
        if score > best_score:
            best_score, best = score, m
    return best
def _compress(line):
    nonzero = [x for x in line if x]
    res = []
    i = 0
    while i < len(nonzero):
        if i + 1 < len(nonzero) and nonzero[i] == nonzero[i+1]:
            res.append(nonzero[i]*2)
            i += 2
        else:
            res.append(nonzero[i])
            i += 1
    return res + [0]*(len(line)-len(res))
def _move_up(b):
    n = len(b)
    res = [[0]*n for _ in range(n)]
    for j in range(n):
        col = [b[i][j] for i in range(n)]
        col = _compress(col)
        for i in range(n):
            res[i][j] = col[i]
    return res
def _move_down(b):
    n = len(b)
    res = [[0]*n for _ in range(n)]
    for j in range(n):
        col = [b[i][j] for i in range(n)][::-1]
        col = _compress(col)
        for i in range(n):
            res[n-1-i][j] = col[i]
    return res
def _move_left(b):
    n = len(b)
    res = [[0]*n for _ in range(n)]
    for i in range(n):
        row = _compress(b[i])
        res[i] = row
    return res
def _move_right(b):
    n = len(b)
    res = [[0]*n for _ in range(n)]
    for i in range(n):
        row = _compress(b[i][::-1])[::-1]
        res[i] = row
    return res
Exception = 'int' object is not subscriptable
Timeout
def strategy(board):
    # helper to apply a move and return new board
    def move(b, dir):
        n = len(b)
        res = [[0]*n for _ in range(n)]
        for x in range(n):
            line = []
            for y in range(n):
                i,j = (y,x) if dir=="D" else (x,y)
                if dir=="A": i=j
                # skip for brevity
    # simplified heuristic: choose direction that increases sum of merged tiles
    best, best_sum = None, -1
    dirs = "WASD"
    for d in dirs:
        new = move(board, d)
        merged = sum(c for r in new for c in r) - sum(c for r in board for c in r)
        if merged > best_sum:
            best_sum, best = merged, d
    return best
Exception = 'NoneType' object is not iterable
Timeout
Timeout
Timeout
def strategy(board):
    import math
    def score(b):
        empty = sum(1 for r in b for v in r if v==0)
        mx = max(max(row) for row in b)
        return empty*10 + mx
    best=None; best_score=-math.inf
    for move in "WASD":
        new=board.copy()
        # simulate simple move logic
        if move=="W":
            for col in range(4):
                col_vals=[r[col] for r in new if r[col]!=0]
                for i,row in enumerate(col_vals):
                    new[i][col]=col_vals[i]
                for i in range(i+1,4):
                    new[i][col]=0
        elif move=="S":
            for col in range(4):
                col_vals=[r[col] for r in new if r[col]!=0]
                for i,row in enumerate(reversed(col_vals)):
                    new[3-i][col]=col_vals[i]
                for i in range(3-i+1,4):
                    new[i][col]=0
        elif move=="A":
            for row in range(4):
                row_vals=[v for v in new[row] if v!=0]
                for i,v in enumerate(row_vals):
                    new[row][i]=row_vals[i]
                for i in range(i+1,4):
                    new[row][i]=0
        elif move=="D":
            for row in range(4):
                row_vals=[v for v in new[row] if v!=0]
                for i,v in enumerate(reversed(row_vals)):
                    new[row][3-i]=row_vals[i]
                for i in range(3-i+1,4):
                    new[row][i]=0
        sc=score(new)
        if sc>best_score:
            best_score=sc; best=move
    return best
Exception = cannot access local variable 'i' where it is not associated with a value
Timeout
Exception = name 'merge' is not defined
Timeout
Timeout
def strategy(board):
    # 4x4 board
    moves = 'W A S D'.split()
    best = None
    best_score = -1
    for m in moves:
        b = [row[:] for row in board] # copy
        for i in range(4):
            line = b[i] if m in 'AD' else [row[i] for row in b]
            merged = []
            skip = False
            for j, v in enumerate(line):
                if v == 0: continue
                if skip:
                    skip = False
                    continue
                if j + 1 < len(line) and line[j+1] == v:
                    merged.append(v*2)
                    skip = True
                else:
                    merged.append(v)
            while len(merged) < 4:
                merged.append(0)
            if m in 'AD':
                for k in range(4): b[i][k] = merged[k]
            else:
                for k in range(4): b[k][i] = merged[k]
        score = sum(sum(row) for row in b)
        if score > best_score:
            best_score = score
            best = m
    return best
Timeout
Timeout
Timeout
def strategy(board):
    # board is a list of lists representing a 4x4 grid.
    # possible moves
    moves = ['W', 'A', 'S', 'D']
    best = None
    best_score = -1
    def score(b):
        s = 0
        for row in b:
            for v in row:
                s += v
        return s
    for m in moves:
        nb = [row[:] for row in board]
        # simulate move m (very naive: just return new board if any merge)
        merged = False
        for i in range(4):
            for j in range(4):
                if nb[i][j] == 0: continue
                for di, dj in ( (-1,0),(1,0),(0,-1),(0,1) ):
                    ni, nj = i+di, j+dj
                    if 0<=ni<4 and 0<=nj<4 and nb[ni][nj]==nb[i][j]:
                        nb[ni][nj] += nb[i][j]
                        nb[i][j] = 0
                        merged = True
        if merged:
            sc = score(nb)
            if sc > best_score:
                best_score, best = sc, m
    return best if best is not None else moves[0]
Timeout
Timeout
Timeout
Exception = cannot access local variable 'val' where it is not associated with a value
None
Timeout
Timeout
Exception = not enough values to unpack (expected 2, got 1)
def strategy(board):
    # evaluate a move by the total sum after the move
    def sim(b, m):
        n = len(b)
        b = [row[:] for row in b]
        moved = False
        if m == 'W':
            for j in range(n):
                col = [b[i][j] for i in range(n)]
                col += [0]*(n-len(col))
                newcol = []
                i = 0
                while i < n:
                    if col[i] == 0:
                        i += 1
                        continue
                    val = col[i]
                    i += 1
                    while i < n and col[i] == 0: i += 1
                    if i < n and col[i] == val:
                        val *= 2
                        i += 1
                    newcol.append(val)
                for i in range(n):
                    b[i][j] = newcol[i] if i < len(newcol) else 0
            moved = True
        # other moves omitted for brevity
        return b if moved else None
    best, best_val = None, -1
    for m in "WASD":
        r = sim(board, m)
        if r:
            val = sum(sum(row) for row in r)
            if val > best_val:
                best_val, best = val, m
    return best if best else "W"
Timeout
Exception = list index out of range
Timeout
Timeout
Timeout
def strategy(board):
Timeout
Exception = strategy.<locals>.rotate() takes 1 positional argument but 2 were given
def strategy(board):
    # helper to simulate a move
    def move(b, direction):
        size = len(b)
        new = [[0]*size for _ in range(size)]
        for i in range(size):
            if direction in ('A','D'):
                line = b[i] if direction=='D' else b[i][::-1]
            else:
                line = [b[j][i] for j in range(size)]
                if direction=='S': line = line[::-1]
            merged = []
            skip = False
            for val in line:
                if val==0: continue
                if merged and merged[-1]==val and not skip:
                    merged[-1] += val
                    skip = True
                else:
                    merged.append(val)
                    skip = False
            for j,v in enumerate(merged):
                new[i if direction=='A' else size-1-i][j if direction=='A' else size-1-j] = v
        return new
    # evaluate each move
    best = None
    best_val = -1
    for dirc in 'WASD':
        new_board = move(board, dirc)
        val = sum(sum(row) for row in new_board)
        if val > best_val:
            best_val = val
            best = dirc
    return best
Timeout
Timeout
Timeout
Timeout
None
Timeout
Timeout
Timeout
None
Exception = unsupported operand type(s) for -: 'list' and 'int'
Timeout
Timeout
def strategy(board):
    # Simple heuristic: move up unless a merge is possible in another direction
    # Check if any pair can merge horizontally or vertically
    for i in range(4):
        for j in range(3):
            if board[i][j] == board[i][j+1]:
                return "A" # left
    for i in range(3):
        for j in range(4):
            if board[i][j] == board[i+1][j]:
                return "W" # up
    return "D" # fallback
Timeout
Exception = list index out of range
def strategy(board):
    def score_for(move):
        B = [row[:] for row in board]
        def slide(row):
            new = [x for x in row if x != 0]
            res = []
            skip = False
            for i, x in enumerate(new):
                if skip:
                    skip = False
                    continue
                if i+1 < len(new) and new[i] == new[i+1]:
                    res.append(x*2)
                    skip = True
                else:
                    res.append(x)
            return res + [0]*(len(row)-len(res))
        if move=='W':
            for i in range(len(B)):
                B[i] = slide(B[i])
        elif move=='S':
            B = B[::-1]
            for i in range(len(B)):
                B[i] = slide(B[i])
            B = B[::-1]
        elif move=='A':
            for row in B:
                row[:] = slide(row)
        elif move=='D':
            for row in B:
                row[:] = slide(row[::-1])[::-1]
        empty = sum(cell==0 for r in B for cell in r)
        return empty
    best=None
    for m in 'WASD':
        if score_for(m)>best[1] if best else -1:
            best=(m,score_for(m))
    return best[0]
Timeout
Timeout
Exception = list assignment index out of range
Timeout
Timeout
def strategy(board):
    '''
    Returns the best next move for a 2048 game using a very small heuristic.
    The heuristic looks at the free spaces after the move and chooses the
    direction that tends to leave the most empty tiles.
    '''
    from functools import lru_cache
    # Flatten the board for easier hashing
    flatten = tuple(tuple(row) for row in board)
    # Helper: simulate a move
    def move(state, direction):
        size = len(state)
        new_state = []
        for row in state:
            merged = []
            for d in row:
                if d != 0:
                    merged.append(d)
            if direction in ('A', 'D'): # horizontal move
                merged = merged[::-1] if direction == 'D' else merged
                i = 0
                while i < len(merged) - 1:
                    if merged[i] == merged[i + 1]:
                        merged[i] *= 2
                        merged.pop(i + 1)
                    i += 1
                merged += [0] * (size - len(merged))
                if direction == 'D':
                    merged = merged[::-1]
                new_state.append(tuple(merged))
            else: # vertical move
                new_state.append(tuple(merged))
        # For vertical moves, reconstruct column-wise
        if direction in ('W', 'S'):
            transposed = list(zip(*new_state))
            new_state = []
            for col in transposed:
                merged = []
                for d in col:
                    if d != 0:
                        merged.append(d)
                merged = merged[::-1] if direction == 'S' else merged
                i = 0
                while i < len(merged) - 1:
                    if merged[i] == merged[i + 1]:
                        merged[i] *= 2
                        merged.pop(i + 1)
                    i += 1
                merged += [0] * (size - len(merged))
                if direction == 'S':
                    merged = merged[::-1]
                new_state.append(tuple(merged))
            new_state = [tuple(row) for row in zip(*new_state)]
        return tuple(tuple(row) for row in new_state)
    # Count empty tiles
    def empty_count(state):
        return sum(1 for row in state for cell in row if cell == 0)
    best_move = None
    best_empty = -1
    for move in ['W', 'A', 'S', 'D']:
        new_board = move(flatten, move)
        e = empty_count(new_board)
        if e > best_empty:
            best_empty = e
            best_move = move
    return best_move
Exception = 'str' object is not callable
Timeout
Timeout
Timeout
def strategy(board):
import copy
# Helper to apply a move and return new board
def move(board, dir):
size = len(board)
def compress(line):
new = [x for x in line if x>0]
res = []
i = 0
while i < len(new):
if i+1 < len(new) and new[i]==new[i+1]:
res.append(new[i]*2)
i += 2
else:
res.append(new[i])
i += 1
res += [0]*(size-len(res))
return res
if dir=='W':
new = [compress(col) for col in zip(*board)]
return [list(row) for row in zip(*new)]
if dir=='A':
return [compress(row) for row in board]
if dir=='S':
rev = [list(reversed(row)) for row in board]
new = [compress(row) for row in rev]
return [list(reversed(row)) for row in new]
if dir=='D':
rev = [list(reversed(row)) for row in board]
new = [compress(row) for row in rev]
return [list(row) for row in new]
best = None
best_score = -1
for d in ['W','A','S','D']:
newboard = move(board, d)
# score: sum of all tiles (higher better)
score = sum(sum(row) for row in newboard)
if score > best_score:
best_score, best = score, d
return best
Timeout
Timeout
Timeout
Timeout
Timeout
def strategy(board):
# helper to simulate a move and compute score
def simulate(move):
n = len(board)
new_board = [[0]*n for _ in range(n)]
for i in range(n):
line = board[i] if move in "WB" else [row[i] for row in board]
if move in "DS": # reverse for down/right
line = line[::-1]
merged = []
skip = False
for v in line:
if v == 0: continue
if merged and merged[-1][0] == v and not skip:
merged[-1] = (merged[-1][0]*2, merged[-1][1]+1)
skip = True
else:
merged.append((v, 0))
skip = False
merged += [(0,0)]*(n-len(merged))
for idx, (v, _) in enumerate(merged):
new_board[i if move in "WD" else idx][idx if move in "WD" else i] = v
return sum(sum(row) for row in new_board)
best_move = None
best_score = -1
for m in "WASD":
try:
score = simulate(m)
if score > best_score:
best_score = score
best_move = m
except:
continue
return best_move or "W"
Timeout
Timeout
Exception = name 'n' is not defined
def strategy(board):
import copy
moves = {'W': (-1,0), 'A': (0,-1), 'S': (1,0), 'D': (0,1)}
def move(b, dir):
size = len(b)
mx, my = moves[dir]
new = [[0]*size for _ in range(size)]
for r in range(size):
line = []
nr = r + mx
for c in range(size):
nc = c + my
if 0 <= nr < size and 0 <= nc < size:
line.append(b[nr][nc])
# compress
res=[]
i=0
while i < len(line):
if i+1<len(line) and line[i]==line[i+1]:
res.append(line[i]*2); i+=2
else:
res.append(line[i]); i+=1
for i,val in enumerate(res):
nr = (r+mx*i if mx else r)
nc = (c+my*i if my else c)
new[nr][nc]=val
return new
def score(b):
s=0
for r in range(len(b)):
for c in range(len(b)):
if b[r][c]>0:
s+=b[r][c]
return s
best=None
best_score=-1
for m in moves:
nb=move(board,m)
s=score(nb)
if s>best_score:
best_score=s; best=m
return best
Exception = list index out of range
Exception = 'NoneType' object is not subscriptable
Exception = name 'col_index' is not defined
def strategy(board):
import copy
moves = "WASD"
best, best_move = -1, "W"
for m in moves:
b = copy.deepcopy(board)
if m == "W":
for i in range(3,-1,-1):
for j in range(4):
if b[i][j] and b[i-1][j] and b[i][j]==b[i-1][j]:
b[i-1][j]*=2; b[i][j]=0
elif m == "S":
for i in range(4):
for j in range(4):
if i<3 and b[i][j] and b[i+1][j] and b[i][j]==b[i+1][j]:
b[i+1][j]*=2; b[i][j]=0
elif m == "A":
for i in range(4):
for j in range(4):
if j<3 and b[i][j] and b[i][j+1] and b[i][j]==b[i][j+1]:
b[i][j+1]*=2; b[i][j]=0
elif m == "D":
for i in range(4):
for j in range(3,-1,-1):
if j>0 and b[i][j] and b[i][j-1] and b[i][j]==b[i][j-1]:
b[i][j-1]*=2; b[i][j]=0
score = sum(sum(row) for row in b)
if score > best:
best, best_move = score, m
return best_move
Timeout
Steps = 1825 State = failed
def strategy(board):
size = len(board)
# Helper to compute score of moves
def score_move(d):
new_board = [row[:] for row in board]
moved = False
if d == "W":
for j in range(size):
col = [new_board[i][j] for i in range(size)]
merged = merge(col)
for i in range(size):
new_board[i][j] = merged[i]
if merged != col:
moved = True
elif d == "S":
for j in range(size):
col = [new_board[i][j] for i in range(size)][::-1]
merged = merge(col)[::-1]
for i in range(size):
new_board[i][j] = merged[i]
if merged[::-1] != col:
moved = True
elif d == "A":
for i in range(size):
row = new_board[i][:]
merged = merge(row)
new_board[i] = merged
if merged != row:
moved = True
elif d == "D":
for i in range(size):
row = new_board[i][::-1]
merged = merge(row)[::-1]
new_board[i] = merged
if merged[::-1] != row:
moved = True
return moved, new_board
def merge(line):
filtered = [x for x in line if x != 0]
merged = []
i = 0
while i < len(filtered):
if i+1 < len(filtered) and filtered[i] == filtered[i+1]:
merged.append(filtered[i]*2)
i += 2
else:
merged.append(filtered[i])
i += 1
merged += [0]*(size-len(merged))
return merged
# Evaluate each direction
best_score = -1
best_dir = "W"
for d in "WASD":
moved, new_board = score_move(d)
if not moved:
continue
# simple heuristic: sum of all tiles
score = sum(sum(row) for row in new_board)
if score > best_score:
best_score = score
best_dir = d
return best_dir
┌────┬────┬────┬────┬────┬────┐
│ 8│ 2│1024│ 2│ 16│ 2│
├────┼────┼────┼────┼────┼────┤
│1024│ 32│ 512│ 32│ 64│ 4│
├────┼────┼────┼────┼────┼────┤
│ 512│ 4│ 128│ 2│ 32│ 16│
├────┼────┼────┼────┼────┼────┤
│ 256│ 64│ 32│ 64│ 16│ 8│
├────┼────┼────┼────┼────┼────┤
│ 64│ 32│ 16│ 2│ 8│ 4│
├────┼────┼────┼────┼────┼────┤
│ 32│ 8│ 2│ 8│ 4│ 2│
└────┴────┴────┴────┴────┴────┘
Timeout
def strategy(board):
# Evaluate score for each move and pick the one with maximal tile value
dirs = {"W": (-1,0), "A": (0,-1), "S": (1,0), "D": (0,1)}
best = None
best_score = -1
for d, (dx, dy) in dirs.items():
new_board = [[0]*4 for _ in range(4)]
moved = False
for i in range(4):
for j in range(4):
ni, nj = i+dx, j+dy
if 0 <= ni < 4 and 0 <= nj < 4:
new_board[ni][nj] = board[i][j]
if new_board[ni][nj] != board[i][j]:
moved = True
if not moved:
continue
score = sum([sum(row) for row in new_board])
if score > best_score:
best_score = score
best = d
return best if best is not None else "W"
Timeout
Timeout
Exception = 'list_reverseiterator' object is not subscriptable
Timeout
def strategy(board):
def score_row(row, dir):
if dir == 'L':
row = row[::-1]
merged = []
skip = False
for val in row:
if val == 0: continue
if skip:
skip = False
continue
if merged and merged[-1] == val:
merged[-1] *= 2
skip = True
else:
merged.append(val)
merged += [0]*(len(row)-len(merged))
if dir == 'L':
merged = merged[::-1]
return merged
def move(board, action):
new_board = [row[:] for row in board]
if action in 'L':
for r in new_board:
new_row = score_row(r, 'L')
for i, val in enumerate(new_row):
r[i] = val
elif action in 'R':
for r in new_board:
new_row = score_row(r, 'R')
for i, val in enumerate(new_row):
r[i] = val
elif action in 'U':
for c in range(4):
col = [new_board[r][c] for r in range(4)]
new_col = score_row(col, 'L')
for r in range(4):
new_board[r][c] = new_col[r]
elif action in 'D':
for c in range(4):
col = [new_board[r][c] for r in range(4)]
new_col = score_row(col, 'R')
for r in range(4):
new_board[r][c] = new_col[r]
return new_board
def empty(board):
return [(r, c) for r in range(4) for c in range(4) if board[r][c] == 0]
actions = 'WASD'
best = None
best_score = -1
for a in actions:
new = move(board, a)
empties = len(empty(new))
merged = sum(1 for r in new for val in r if val >0)
score = empties + merged
if score>best_score:
best_score = score
best = a
return best
Timeout
Timeout
Timeout
Timeout
def strategy(board):
# choose a move that keeps more tiles unchanged
moves = ['W','A','S','D']
best = moves[0]; best_score = -1
for m in moves:
new = board_state_after(board, m)
if new == board:
continue
score = score_board(new)
if score > best_score:
best_score = score; best = m
return best
def board_state_after(board, move):
# simulate move on a copy of the board
from copy import deepcopy
b = deepcopy(board)
n = len(b)
# simple implementation of move logic
def compress(line):
new = [x for x in line if x!=0]
res = []
i=0
while i < len(new):
if i+1<len(new) and new[i]==new[i+1]:
res.append(new[i]*2); i+=2
else:
res.append(new[i]); i+=1
res += [0]*(n-len(res))
return res
if move=='W':
for j in range(n):
col=[b[i][j] for i in range(n)]
col=compress(col)
for i in range(n): b[i][j]=col[i]
elif move=='S':
for j in range(n):
col=[b[i][j] for i in range(n)][::-1]
col=compress(col)[::-1]
for i in range(n): b[i][j]=col[i]
elif move=='A':
for i in range(n):
row=compress(b[i])
b[i]=row
elif move=='D':
for i in range(n):
row=compress(b[i][::-1])[::-1]
b[i]=row
return b
def score_board(board):
# higher score for more homogeneous board
total=0
for row in board:
for v in row:
total+=v
return total
Exception = list assignment index out of range
Timeout
Timeout
def strategy(board):
# simulate four possible moves and choose the one
def move(board, dir):
size = len(board)
def compress(line):
filtered = [x for x in line if x != 0]
merged = []
skip = False
for i in range(len(filtered)):
if skip: skip = False; continue
if i+1 < len(filtered) and filtered[i] == filtered[i+1]:
merged.append(filtered[i]*2)
skip = True
else:
merged.append(filtered[i])
merged += [0]*(size-len(merged))
return merged
new = [[0]*size for _ in range(size)]
if dir == 'W':
for j in range(size):
col = [board[i][j] for i in range(size)]
merged = compress(col)
for i in range(size):
new[i][j] = merged[i]
elif dir == 'S':
for j in range(size):
col = [board[i][j] for i in range(size)][::-1]
merged = compress(col)[::-1]
for i in range(size):
new[i][j] = merged[i]
elif dir == 'A':
for i in range(size):
row = board[i]
merged = compress(row)
new[i] = merged
elif dir == 'D':
for i in range(size):
row = board[i][::-1]
merged = compress(row)[::-1]
new[i] = merged
return new
best = None
best_score = -1
for dir in ('W','A','S','D'):
new = move(board, dir)
score = sum(sum(row) for row in new)
if score > best_score:
best_score = score
best = dir
return best
Timeout
Timeout
Timeout
Timeout
def strategy(board):
def move(board, dir):
import copy
n=len(board)
new=[row[:] for row in board]
if dir=='W':
for j in range(n):
col=[new[i][j] for i in range(n)]
newcol=compress(col)
for i in range(n): new[i][j]=newcol[i]
elif dir=='S':
for j in range(n):
col=[new[i][j] for i in range(n)][::-1]
newcol=compress(col)[::-1]
for i in range(n): new[i][j]=newcol[i]
elif dir=='A':
for i in range(n):
new[i]=compress(new[i])
elif dir=='D':
for i in range(n):
new[i]=compress(new[i])[::-1][::-1]
return new
def compress(line):
filtered=[v for v in line if v>0]
res=[]
i=0
while i<len(filtered):
if i+1<len(filtered) and filtered[i]==filtered[i+1]:
res.append(filtered[i]*2); i+=2
else:
res.append(filtered[i]); i+=1
return res+[0]*(len(line)-len(res))
best=None
best_score=-1
for d in "WASD":
nb=move(board,d)
score=sum(sum(row) for row in nb)
if score>best_score:
best_score=score; best=d
return best
Timeout
Steps = 1264 State = success
def strategy(board):
# board is a 4x4 list of lists
import random
# Directions with priority: diagonal corners
dirs = ['W', 'A', 'S', 'D']
for d in dirs:
new_board = [row[:] for row in board]
if d == 'W':
for j in range(4):
merged = False
for i in range(1, 4):
if new_board[i][j] == new_board[i-1][j] and not merged:
new_board[i-1][j] += new_board[i][j]
new_board[i][j] = 0
merged = True
elif d == 'S':
for j in range(4):
merged = False
for i in range(2, -1, -1):
if new_board[i][j] == new_board[i+1][j] and not merged:
new_board[i+1][j] += new_board[i][j]
new_board[i][j] = 0
merged = True
elif d == 'A':
for i in range(4):
merged = False
for j in range(1, 4):
if new_board[i][j] == new_board[i][j-1] and not merged:
new_board[i][j-1] += new_board[i][j]
new_board[i][j] = 0
merged = True
elif d == 'D':
for i in range(4):
merged = False
for j in range(2, -1, -1):
if new_board[i][j] == new_board[i][j+1] and not merged:
new_board[i][j+1] += new_board[i][j]
new_board[i][j] = 0
merged = True
# measure score: number of non-zero tiles
score = sum(1 for r in new_board for v in r if v != 0)
# choose first direction that reduces empty tiles
if score > sum(1 for r in board for v in r if v != 0):
return d
return random.choice(dirs)
┌────┬────┬────┬────┬────┬────┐
│ .│ .│ .│ 2│ .│ .│
├────┼────┼────┼────┼────┼────┤
│ 4│ .│ .│ .│ .│ .│
├────┼────┼────┼────┼────┼────┤
│ 2│ 32│ 4│ .│ .│ .│
├────┼────┼────┼────┼────┼────┤
│ 64│ 4│ 8│ .│ 16│ 32│
├────┼────┼────┼────┼────┼────┤
│ 32│ 32│ 2│ 2│ 256│ 4│
├────┼────┼────┼────┼────┼────┤
│ 4│ 16│ 4│2048│ 4│ 2│
└────┴────┴────┴────┴────┴────┘
Exception = '>' not supported between instances of 'int' and 'str'
Exception = cannot pickle 'generator' object
Timeout
def strategy(board):
def move(board, direction):
size = len(board)
def compress(line):
new = [x for x in line if x>0]
merged = []
i=0
while i < len(new):
if i+1 < len(new) and new[i]==new[i+1]:
merged.append(new[i]*2)
i+=2
else:
merged.append(new[i])
i+=1
return merged+[0]*(size-len(merged))
new_board=[[0]*size for _ in range(size)]
if direction=='W':
for j in range(size):
col=[board[i][j] for i in range(size)]
col=compress(col)
for i in range(size):
new_board[i][j]=col[i]
elif direction=='S':
for j in range(size):
col=[board[i][j] for i in range(size)][::-1]
col=compress(col)[::-1]
for i in range(size):
new_board[i][j]=col[i]
elif direction=='A':
for i in range(size):
row=compress(board[i])
new_board[i]=row
elif direction=='D':
for i in range(size):
row=compress(board[i][::-1])[::-1]
new_board[i]=row
return new_board
def score(b):
return sum(sum(1 for x in row if x>0) for row in b)
best=None
bestScore=-1
for d in "WASD":
nb=move(board,d)
s=score(nb)
if s>bestScore:
bestScore=s
best=d
return best
Timeout
Timeout
Timeout
Timeout
Timeout
None
Timeout
Exception = list assignment index out of range
Timeout
None
Timeout
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-06
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need a short strategy that picks best move using basic heuristics. Use simple scoring: prefer moving to increase score: choose move that gives largest tile sum after simulation. Write quick function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
import itertools
dirs = ["W","A","S","D"]
best_dir = None
best_val = -1
for d in dirs:
new_board = [row[:] for row in board]
moved = False
if d == "W":
for col in range(4):
temp = [new_board[r][col] for r in range(4) if new_board[r][col]]
merged = []
i=0
while i<len(temp):
if i+1<len(temp) and temp[i]==temp[i+1]:
merged.append(temp[i]*2); i+=2
else:
merged.append(temp[i]); i+=1
for r in range(4):
new_board[r][col]=merged[r] if r<len(merged) else 0
elif d == "S":
for col in range(4):
temp = [new_board[r][col] for r in range(4) if new_board[r][col]]
merged = []
i=len(temp)-1
while i>=0:
if i-1>=0 and temp[i]==temp[i-1]:
merged.insert(0,temp[i]*2); i-=2
else:
merged.insert(0,temp[i]); i-=1
for r in range(4):
new_board[r][col]=merged[r] if r<len(merged) else 0
elif d == "A":
for r in range(4):
temp = [new_board[r][c] for c in range(4) if new_board[r][c]]
merged = []
i=0
while i<len(temp):
if i+1<len(temp) and temp[i]==temp[i+1]:
merged.append(temp[i]*2); i+=2
else:
merged.append(temp[i]); i+=1
for c in range(4):
new_board[r][c]=merged[c] if c<len(merged) else 0
elif d == "D":
for r in range(4):
temp = [new_board[r][c] for c in range(4) if new_board[r][c]]
merged = []
i=len(temp)-1
while i>=0:
if i-1>=0 and temp[i]==temp[i-1]:
merged.insert(0,temp[i]*2); i-=2
else:
merged.insert(0,temp[i]); i-=1
for c in range(4):
new_board[r][c]=merged[c] if c<len(merged) else 0
val = sum(map(sum,new_board))
if val>best_val:
best_val, best_dir = val, d
return best_dir
```<|return|>
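A note on the heuristic in the output above: sliding and merging never changes the total of the tiles (2 + 2 becomes 4), so "largest tile sum after simulation" ties for every direction and the strict `>` comparison always keeps the first one, "W". The sketch below illustrates this on an example board and swaps in an empty-cell count as the score instead. It assumes W/A/S/D mean up/left/down/right, and the helper names (`slide_left`, `apply_move`, `tile_sum`, `empty_cells`) are hypothetical, not part of the notebook's environment.

```python
def slide_left(line):
    """Slide one row to the left, merging each pair of equal neighbours once."""
    tiles = [v for v in line if v != 0]
    out = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            out.append(tiles[i] * 2)  # merge the pair into one tile
            i += 2
        else:
            out.append(tiles[i])
            i += 1
    return out + [0] * (len(line) - len(out))


def apply_move(board, direction):
    """Return a new board after W (up), A (left), S (down) or D (right)."""
    if direction == "A":
        return [slide_left(row) for row in board]
    if direction == "D":
        return [slide_left(row[::-1])[::-1] for row in board]
    if direction not in ("W", "S"):
        raise ValueError(f"unknown direction: {direction!r}")
    columns = [list(col) for col in zip(*board)]  # transpose to work on columns
    if direction == "W":
        moved = [slide_left(col) for col in columns]
    else:
        moved = [slide_left(col[::-1])[::-1] for col in columns]
    return [list(row) for row in zip(*moved)]  # transpose back


def tile_sum(board):
    return sum(sum(row) for row in board)


def empty_cells(board):
    return sum(cell == 0 for row in board for cell in row)


def strategy(board):
    """Pick the legal move that leaves the most empty cells (illustrative only)."""
    best, best_score = None, -1
    for d in "WASD":
        new_board = apply_move(board, d)
        if new_board == board:  # the move changes nothing, so skip it
            continue
        score = empty_cells(new_board)
        if score > best_score:
            best, best_score = d, score
    return best or "W"


board = [
    [2, 2, 0, 0],
    [4, 0, 4, 0],
    [0, 0, 0, 0],
    [8, 0, 0, 8],
]
sums = {d: tile_sum(apply_move(board, d)) for d in "WASD"}
empties = {d: empty_cells(apply_move(board, d)) for d in "WASD"}
print(sums)     # the same total for every direction
print(empties)  # differs by direction, so it can rank the moves
print(strategy(board))
```

Counting empty cells is only one possible replacement; it is used here because it is the simplest score that actually changes between directions.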
Saving to float16 or MXFP4
We also support saving directly to float16 or MXFP4. Select merged_16bit for float16 or mxfp4 for MXFP4 (OpenAI's native precision for GPT-OSS). Saving only the LoRA adapters is available as a fallback. Use push_to_hub_merged to upload to your Hugging Face account. You can create a personal token at https://huggingface.co/settings/tokens.
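As a rough sketch of how these options fit together, the helper below wraps a local save and an optional upload. `export_model` and `DryRunModel` are hypothetical names introduced here. The `save_pretrained_merged` call, the keyword arguments, and the `"lora"` method string are assumptions about Unsloth's saving API and should be checked against the installed version. The stand-in class only records calls, so the sketch runs without a GPU; in the notebook you would pass the trained `model` and `tokenizer` instead.

```python
# Methods named in the text above; "lora" is assumed to be the string
# that selects the adapters-only fallback.
SAVE_METHODS = ("merged_16bit", "mxfp4", "lora")


def export_model(model, tokenizer, out_dir, save_method="merged_16bit",
                 repo_id=None, token=None):
    """Save locally, then optionally upload to the Hugging Face Hub (hypothetical helper)."""
    if save_method not in SAVE_METHODS:
        raise ValueError(f"save_method must be one of {SAVE_METHODS}")
    # Assumed call shape: save_pretrained_merged(directory, tokenizer, save_method=...)
    model.save_pretrained_merged(out_dir, tokenizer, save_method=save_method)
    if repo_id is not None:
        # Assumed call shape: push_to_hub_merged(repo_id, tokenizer, save_method=..., token=...)
        model.push_to_hub_merged(repo_id, tokenizer,
                                 save_method=save_method, token=token)
    return out_dir


class DryRunModel:
    """Stand-in that records calls instead of writing any weights."""

    def __init__(self):
        self.calls = []

    def save_pretrained_merged(self, save_directory, tokenizer,
                               save_method="merged_16bit"):
        self.calls.append(("save", save_directory, save_method))

    def push_to_hub_merged(self, repo_id, tokenizer,
                           save_method="merged_16bit", token=None):
        self.calls.append(("push", repo_id, save_method))


demo = DryRunModel()
# Local MXFP4 export only.
export_model(demo, None, "finetuned_model", save_method="mxfp4")
# float16 export plus upload; the repo id and token are placeholders.
export_model(demo, None, "finetuned_model", save_method="merged_16bit",
             repo_id="your-name/your-model", token="hf_...")
print(demo.calls)
```

Validating `save_method` before calling into the library keeps a typo from surfacing only after a long merge has started.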
And we're done!
Congratulations, you just learned how to do reinforcement learning with GPT-OSS! This notebook covered some advanced topics. To learn more about GPT-OSS and RL, see Unsloth's Reinforcement Learning Guide with GPT-OSS.
This notebook and all Unsloth notebooks are licensed LGPL-3.0.