Gpt Oss (20B) Reinforcement Learning 2048 Game BF16
Goal: Make GPT-OSS play games with Reinforcement Learning
Our goal is to make GPT-OSS play the 2048 game with reinforcement learning, or a variant of it called GRPO.
We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
Installation
We'll be using Unsloth to do RL on GPT-OSS 20B. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!
We'll load GPT-OSS 20B and set some parameters:
max_seq_length = 768The maximum context length of the model. Increasing it will use more memory.lora_rank = 4The larger this number, the smarter the RL process, but the slower and more memory usageload_in_16bitwill be faster but will need a 64GB GPU or more (MI300)offload_embedding = TrueNew Unsloth optimization which moves the embedding to CPU RAM, reducing VRAM by 1GB.
π¦₯ Unsloth: Will patch your computer to enable 2x faster free finetuning. π¦₯ Unsloth Zoo will now patch everything to make training faster! ==((====))== Unsloth 2025.10.4: Fast Gpt_Oss patching. Transformers: 4.56.2. \\ /| Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux. O^O/ \_/ \ Torch: 2.8.0. Triton: 3.4.0 \ / Bfloat16 = TRUE. FA [Xformers = None. FA2 = False] "-____-" Free license: http://github.com/unslothai/unsloth Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored! Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.
Loading checkpoint shards: 0%| | 0/9 [00:00<?, ?it/s]
To do efficient RL, we will use LoRA, which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.
Unsloth: Making `model.base_model.model.model` require gradients
2048 game
We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).
For example let's create a board of size 5 X 5 and set the target to 8 instead of 2048.
[NOTE] 2048 originally spawns a (4) 10% of the time! We can disable this for harder games. See Wikipedia page for more details.
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β 2β .β .β .β 2β βββββ΄ββββ΄ββββ΄ββββ΄ββββ ongoing
GameBoard(size=5, seed=42, target=8, probability_fours=0.1)
We'll use WASD for the action space:
W
A S D
Also game.state() will say success if we succeeded in getting the target!
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β 2β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β 4β .β .β .β .β βββββ΄ββββ΄ββββ΄ββββ΄ββββ ongoing
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β 4β .β 2β .β .β βββββΌββββΌββββΌββββΌββββ€ β 2β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββ΄ββββ΄ββββ΄ββββ΄ββββ ongoing
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β .β .β .β 4β 2β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β 2β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β 4β .β .β .β .β βββββ΄ββββ΄ββββ΄ββββ΄ββββ ongoing
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β 4β .β .β 4β 4β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β 4β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββ΄ββββ΄ββββ΄ββββ΄ββββ ongoing
βββββ¬ββββ¬ββββ¬ββββ¬ββββ β .β .β 2β 4β 8β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β 4β βββββΌββββΌββββΌββββΌββββ€ β .β .β .β .β .β βββββ΄ββββ΄ββββ΄ββββ΄ββββ success
If we do some other action that's not part of the action space, we will get an error, and the game will not accept anymore actions.
βββββ¬ββββ¬ββββ β .β 4β .β βββββΌββββΌββββ€ β .β .β 2β βββββΌββββΌββββ€ β .β .β .β βββββ΄ββββ΄ββββ failed
RL Environment Setup
We'll set up a function to accept some strategy that'll emit an action within WASD and check the game state.
We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!
Let's make a generic strategy to just hit W. We should expect this generic strategy to fail:
Timed out with error = Timed out after 2s
To allow longer strategies for GPT-OSS Reinforcement Learning, we shall allow a 5 second timer.
Code Execution
To execute and create a new Python function, we first have to check if the function does not call other global variables or cheat. This is called countering reward hacking since we don't want the function to cheat.
For example the below piece of code is fine, since it only imports Python level functions. We use check_python_modules:
Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}
For the below piece of code, since we import numpy, we should not allow the execution:
Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}
We also disallow global variable access. We'll use Unsloth's create_locked_down_function function
name 'np' is not defined
60
Data & RL task setup
We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.
Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.
First, let's prompt GPT-OSS without RL and see how it goes:
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>Need simple heuristic: pick move that maximizes merged tiles. Score by sum of resulting board. Implement simulation of each move without random tiles. Return best.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
import copy
dirs = {'W':(-1,0),'S':(1,0),'A':(0,-1),'D':(0,1)}
def mv(b, d):
dr,dc=d
n=len(b)
res=[[0]*n for _ in range(n)]
for i in range(n):
line=[]
for j in range(n):
r,c=i+dr*j, j+dc*i if d in 'W' else j+dc*i
if d in 'WA': r,c=j,i+dc*j
val=b[r][c]
if val: line.append(val)
merged=[]
skip=False
for k,v in enumerate(line):
if skip: skip=False; continue
if k+1<len(line) and line[k+1]==v:
merged.append(v*2); skip=True
else: merged.append(v)
for k,v in enumerate(merged):
if d=='W': res[i][k]=v
elif d=='S': res[i][n-1-k]=v
elif d=='A': res[k][i]=v
else: res[n-1-k][i]=v
return res
best='W'; bestval=-1
for k,v in dirs.items():
try:
val=sum(map(sum,mv(board,v)))
if val>bestval: bestval=val; best=k
except: pass
return best
```<|return|>
Reward functions
We now design a extract_function function which simply extracts the function wrapped in 3 back ticks.
And 3 reward functions:
function_workswhich rewards the model if the strategy is a valid Python function.no_cheatingwhich checks if the function imported other modules, and if it did, we penalize it.strategy_succeedswhich checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.
def strategy(board):
return "W" # Example
Below is our function_works reward function which uses Python's exec but guarded by not allowing leakage of local and global variables. We can also use check_python_modules first to check if there are errors before even executing the function:
(False,
, {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
, 'stdlib': [],
, 'non_stdlib': [],
, 'relative_imports': 0}) no_cheating checks if the function cheated since it might have imported Numpy or other functions:
Next strategy_succeeds checks if the strategy actually allows the game to terminate. Imagine if the strategy simply returned "W" which would fail after a time limit of 10 seconds.
We also add a global PRINTER to print out the strategy and board state.
We'll now create the dataset which includes a replica of our prompt. Remember to add a reasoning effort of low! You can choose high reasoning mode, but this'll only work on more memory GPUs like MI300s.
181
{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "W", "A", "S", "D" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n return "W" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
, 'role': 'user'}],
, 'answer': 0,
, 'reasoning_effort': 'low'} Train the model
Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go the Unsloth Reinforcement Learning Docs for more options.
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`. We will change the batch size of 1 to the `num_generations` of 2
And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!
You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!
| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.000000 | 0.125000 | 0.000000 | 200.000000 | 0.000000 |
| 2 | 0.000000 | 0.072375 | 0.248112 | 200.000000 | 0.000000 |
| 3 | 0.000000 | -0.079000 | 0.163776 | 182.500000 | 0.000005 |
And let's train the model!
NOTE This might be quite slow! 600 steps takes ~5 hours or longer.
The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))== Unsloth - 2x faster free finetuning | Num GPUs used = 1
\\ /| Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \ Batch size per device = 2 | Gradient accumulation steps = 1
\ / Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
"-____-" Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.
Streaming output truncated to the last 5000 lines. ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β 64β 4β 256β 16β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β1024β 4β 32β 8β 64β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 128β 256β 128β 32β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 64β 32β 64β 16β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 16β 8β 16β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 8β 2β 8β 4β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): import random return random.choice(["W", "A", "S", "D"]) Steps = 1582 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 16β 4β 2β 32β 2β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 256β 512β 4β 16β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β2048β 2β 32β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 2β 128β 8β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 4β .β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Exception = name 'random' is not defined Timeout Timeout def strategy(board): import copy dirs = {'W':(-1,0),'S':(1,0),'A':(0,-1),'D':(0,1)} best_move = None best_score = -1 def move(b, d): rows, cols = 4,4 def slide(row): new = [x for x in row if x] res = [] skip=False i=0 while i < len(new): if i+1<len(new) and new[i]==new[i+1]: res.append(new[i]*2) i+=2 else: res.append(new[i]) i+=1 res += [0]*(rows-len(res)) return res b = copy.deepcopy(b) for _ in range(d[0]): b = b[::-1] for _ in range(d[1]): b = [list(x) for x in zip(*b[::-1])] for i in range(rows): b[i] = slide(b[i]) for _ in range(-d[0]): b = b[::-1] for _ in range(-d[1]): b = [list(x) for x in zip(*b[::-1])] return b for m, (dy,dx) in dirs.items(): nb = move(board,(dy,dx)) score = sum(sum(row) for row in nb) if score>best_score: best_score=score best_move=m return best_move Timeout Timeout Timeout Timeout Timeout def strategy(board): import random moves = ["W", "A", "S", "D"] # simple heuristic: prefer moves that increase score locally best = None best_score = -1 def simulate(board, move): # copy board import copy, math b = copy.deepcopy(board) # apply move logic simplified: just return new board # without full game logic, just random for illustration return b, random.randint(0, 4) for m in moves: _, s = simulate(board, m) if s > best_score: best_score = s best = m return best or random.choice(moves) Steps = 1310 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β 128β 2β 32β 64β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β 32β 256β 4β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 4β 16β 8β2048β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β .β .β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout Timeout def strategy(board): import copy, itertools, random def merge(row): nonlocal score new = [x for x in row if x] merged = [] i = 0 while i < len(new): if i+1 < len(new) and new[i]==new[i+1]: merged.append(new[i]*2) score += new[i]*2 i += 2 else: merged.append(new[i]) i += 1 merged += [0]*(len(row)-len(merged)) return merged def transform(b, dir): new = copy.deepcopy(b) if dir=='W': for c in range(4): col=[new[r][c] for r in range(4)] col=merge(col) for r in range(4): new[r][c]=col[r] if dir=='S': for c in range(4): col=[new[r][c] for r in range(4)][::-1] col=merge(col) col=col[::-1] for r in range(4): new[r][c]=col[r] if dir=='A': for r in range(4): new[r]=merge(new[r]) if dir=='D': for r in range(4): new[r]=merge(new[r][::-1])[::-1] return new def board_score(b): return max(max(row) for row in b) best=None; best_score=-1 dirs='WASD' for d in dirs: new=transform(board,d) if new!=board: s=board_score(new) if s>best_score: best_score=s; best=d return best if best else random.choice('WASD') Timeout Steps = 1129 State = success def strategy(board): import random return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β .β .β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 32β 2β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β2048β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 8β 64β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 32β 16β 8β 4β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout def strategy(board): # Simple heuristic: always push left return "A" Timeout Timeout Timeout Timeout Timeout def strategy(board): return "W" # simple fixed strategy Timeout Timeout Timeout Timeout Timeout def strategy(board): return "W" # Basic placeholder strategy: always move up Timeout Timeout Steps = 1223 State = success def strategy(board): # Simple heuristic: always try to move up if possible, otherwise random legal move from random import choice moves = ['W', 'A', 'S', 'D'] return choice(moves) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 4β .β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β .β 16β 64β 8β 32β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β 64β 32β2048β 128β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 32β 16β 4β 16β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout def strategy(board): """Very simple 2048 strategy: always try to move up if tiles can combine, else down, left, right.""" # helper to check if any move is possible in a given direction def can_move(b, dir): n = len(b) # simulate slide in given direction, return True if any change temp = [row[:] for row in b] moved = False if dir == 'W': for c in range(n): col = [temp[r][c] for r in range(n)] merged, _ = slide(col) if merged != col: moved = True for r in range(n): temp[r][c] = merged[r] elif dir == 'S': for c in range(n): col = [temp[r][c] for r in range(n)][::-1] merged, _ = slide(col) merged = merged[::-1] if merged != [temp[r][c] for r in range(n)]: moved = True for r in range(n): temp[n-1-r][c] = merged[r] elif dir == 'A': for r in range(n): row = temp[r][:] merged, _ = slide(row) if merged != row: moved = True temp[r] = merged elif dir == 'D': for r in range(n): row = temp[r][::-1] merged, _ = slide(row) merged = merged[::-1] if merged != temp[r]: moved = True temp[r] = merged return moved def slide(line): """Slide non-zeros left and merge.""" new = [x for x in line if x != 0] merged = [] skip = False i = 0 while i < len(new): if i+1 < len(new) and new[i] == new[i+1]: merged.append(new[i]*2) i += 2 else: merged.append(new[i]) i += 1 merged += [0]*(len(line)-len(merged)) return merged, merged != line for d in ['W','S','A','D']: if can_move(board, d): return d return 'W' # fallback Steps = 205 State = failed βββββ¬ββββ¬ββββ¬ββββ¬ββββ¬ββββ β 2β 8β 4β 2β 8β 2β βββββΌββββΌββββΌββββΌββββΌββββ€ β 64β 2β 32β 4β 64β 32β βββββΌββββΌββββΌββββΌββββΌββββ€ β 4β 8β 4β 64β 32β 16β βββββΌββββΌββββΌββββΌββββΌββββ€ β 2β 4β 2β 32β 4β 8β βββββΌββββΌββββΌββββΌββββΌββββ€ β 4β 2β 4β 16β 8β 2β βββββΌββββΌββββΌββββΌββββΌββββ€ β 2β 4β 2β 8β 2β 4β βββββ΄ββββ΄ββββ΄ββββ΄ββββ΄ββββ Timeout Timeout Steps = 988 State = failed def strategy(board): # Helper to simulate a move def move(board, key): def compress(row): new = [x for x in row if x] for i in range(len(new)-1): if new[i]==new[i+1]: new[i]*=2 new[i+1]=0 return [x for x in new if x] def transpose(b): return [list(i) for i in zip(*b)] def reverse(b): return [row[::-1] for row in b] def slide_left(b): return [compress(r) + [0]*(len(b[0])-len(compress(r))) for r in b] def apply(b, k): n=len(b) if k=='W':return transpose(slide_left(transpose(b))) if k=='S':return transpose(reverse(slide_left(reverse(transpose(b))))) if k=='A':return slide_left(b) if k=='D':return reverse(slide_left(reverse(b))) return apply(board, key) moves = "WASD" for m in moves: if move(board, m)!=board: return m return "W" βββββ¬ββββ¬ββββ¬ββββ¬ββββ¬ββββ β 4β512β 4β 32β 4β 2β βββββΌββββΌββββΌββββΌββββΌββββ€ β512β128β 64β 4β 64β 32β βββββΌββββΌββββΌββββΌββββΌββββ€ β128β 32β 8β256β 4β 16β βββββΌββββΌββββΌββββΌββββΌββββ€ β 64β 4β 32β 64β 32β 8β βββββΌββββΌββββΌββββΌββββΌββββ€ β 32β 16β 64β 32β 8β 4β βββββΌββββΌββββΌββββΌββββΌββββ€ β 16β 8β 2β 8β 4β 2β βββββ΄ββββ΄ββββ΄ββββ΄ββββ΄ββββ Timeout def strategy(board): return "W" Timeout Timeout Timeout Exception = list index out of range Timeout def strategy(board): # Simple strategy: always move up unless the board is empty. return "W" Timeout Timeout Steps = 1095 State = success def strategy(board): import random return random.choice(["W","A","S","D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β 2β 4β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β 16β 8β 64β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β2048β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β 8β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 32β 4β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1285 State = success def strategy(board): # Simple heuristic: try to move left if possible, otherwise random from random import choice moves = ["W", "A", "S", "D"] return choice(moves) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β 8β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 128β 16β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 2β 16β 4β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 4β .β 8β 64β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 16β 4β 4β 16β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 8β2048β 256β 2β 16β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): import random moves = "WASD" return random.choice(moves) Steps = 1181 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 4β .β .β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 128β 4β 16β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 8β 32β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 16β2048β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 4β 2β 16β 8β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout Timeout def strategy(board): # Define move application def move(mat, dir): def compress(row): new = [x for x in row if x != 0] res = [] skip = False for i in range(len(new)): if skip: skip = False continue if i+1 < len(new) and new[i] == new[i+1]: res.append(new[i]*2) skip = True else: res.append(new[i]) res += [0]*(len(row)-len(res)) return res def transpose(m): return [list(row) for row in zip(*m)] m = [r[:] for r in mat] if dir == 'W': # up m = transpose(m) m = [compress(row) for row in m] m = transpose(m) elif dir == 'S': # down m = transpose(m) m = [compress(row[::-1])[::-1] for row in m] m = transpose(m) elif dir == 'A': # left m = [compress(row) for row in m] elif dir == 'D': # right m = [compress(row[::-1])[::-1] for row in m] return m # Evaluate board by total merge value def score(mat): return sum(sum(row) for row in mat) best_move = None best_score = -1 for move_char in 'WASD': new_board = move(board, move_char) sc = score(new_board) if sc > best_score: best_score = sc best_move = move_char return best_move Timeout Timeout Steps = 1264 State = success def strategy(board): import random return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 4β 64β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β2048β 8β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 128β 64β 32β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 32β 2β 8β 64β 8β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1157 State = success def strategy(board): import random moves = ['W', 'A', 'S', 'D'] # Simple heuristic: choose a random valid move return random.choice(moves) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 2β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β 32β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 4β 64β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 16β2048β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 4β 32β 128β 16β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout def strategy(board): import random moves = ["W", "A", "S", "D"] return random.choice(moves) Steps = 1490 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 4β .β 16β 64β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 4β 16β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 128β2048β 16β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 32β 512β 4β 32β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 8β 2β 8β 2β 4β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1072 State = failed def strategy(board): import copy, random def move(b, d): # simple 2048 move implementation def slide(arr): arr = [x for x in arr if x!=0] res=[] skip=False for i in range(len(arr)): if skip: skip=False; continue if i+1<len(arr) and arr[i]==arr[i+1]: res.append(arr[i]*2); skip=True else: res.append(arr[i]) res+= [0]*(len(b)-len(res)) return res N=len(b) new=[[0]*N for _ in range(N)] if d=='W': for c in range(N): col=[b[r][c] for r in range(N)] col=slide(col) for r in range(N): new[r][c]=col[r] elif d=='S': for c in range(N): col=[b[r][c] for r in range(N)][::-1] col=slide(col)[::-1] for r in range(N): new[r][c]=col[r] elif d=='A': for r in range(N): row=slide(b[r]) new[r]=row elif d=='D': for r in range(N): row=slide(b[r][::-1])[::-1] new[r]=row return new directions='WASD' for d in directions: if move(board,d)!=board: return d return random.choice(directions) βββββ¬ββββ¬ββββ¬ββββ¬ββββ¬ββββ β256β 16β256β 64β 4β 2β βββββΌββββΌββββΌββββΌββββΌββββ€ β 2β512β 4β 16β 8β 4β βββββΌββββΌββββΌββββΌββββΌββββ€ β512β 8β256β 2β128β 32β βββββΌββββΌββββΌββββΌββββΌββββ€ β 64β 2β 64β 32β 64β 16β βββββΌββββΌββββΌββββΌββββΌββββ€ β 8β 4β 2β 8β 16β 4β βββββΌββββΌββββΌββββΌββββΌββββ€ β 4β 2β 8β 2β 4β 2β βββββ΄ββββ΄ββββ΄ββββ΄ββββ΄ββββ Timeout Timeout def strategy(board): import copy import random def move(b, dir): def slide_line(line): new = [i for i in line if i != 0] res = [] skip = False for i in range(len(new)): if skip: skip = False continue if i+1 < len(new) and new[i] == new[i+1]: res.append(new[i]*2) skip = True else: res.append(new[i]) res += [0]*(4-len(res)) return res def rotate(b, times): for _ in range(times): b = [list(row) for row in zip(*b[::-1])] return b rot_map = {'W':0,'D':1,'S':2,'A':3} b = rotate(copy.deepcopy(b), rot_map[dir]) for i in range(4): b[i] = slide_line(b[i]) b = rotate(b, (4-rot_map[dir])%4) return b def empty_count(b): return sum(row.count(0) for row in b) moves = ["W","A","S","D"] best = None best_empty = -1 for m in moves: new = move(board, m) if new != board: e = empty_count(new) if e > best_empty: best_empty = e best = m return best if best else random.choice(moves) Timeout Timeout Steps = 1372 State = success def strategy(board): import random # choose a random valid move from 'W', 'A', 'S', 'D' return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β 32β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β .β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 2β 32β 2β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 32β2048β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 256β 16β 4β 16β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β 4β 256β 2β 8β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1228 State = success def strategy(board): from random import choice # Possible moves moves = ['W', 'A', 'S', 'D'] # Simple heuristic: prefer moves that generate a merge or create a larger tile def can_merge(b, move): # Simulate a move and check if any tiles merge # For brevity, just return False for this example return False # Filter moves that could potentially merge valid_moves = [m for m in moves if can_merge(board, m)] if valid_moves: return choice(valid_moves) # Fallback: random move return choice(moves) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 32β 128β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β2048β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 16β 8β 128β 32β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 4β 2β 8β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout def strategy(board): import random # Simple heuristic: try moves in random order until one changes the board moves = ['W','A','S','D'] random.shuffle(moves) def can_move(b, dir): # simulate a move and check if board changes def rotate(b, times): for _ in range(times): b = [list(row) for row in zip(*b[::-1])] return b n = len(b) def merge_line(line): new = [x for x in line if x] merged = [] skip = False for i in range(len(new)): if skip: skip = False; continue if i+1 < len(new) and new[i] == new[i+1]: merged.append(new[i]*2) skip = True else: merged.append(new[i]) merged += [0]*(n-len(merged)) return merged def move(b, dir): if dir=='W': b = rotate(b,3) elif dir=='S': b = rotate(b,1) elif dir=='D': b = rotate(b,2) new_b = [] for row in b: new_b.append(merge_line(row)) # rotate back for _ in range((4-({'W':3,'S':1,'D':2,'A':0}[dir]))%4): new_b = [list(row) for row in zip(*new_b[::-1])] return new_b return move(b, dir) != b for m in moves: if can_move(board, m): return m return random.choice(moves) Steps = 1173 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 4β 2β 8β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β2048β 4β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 128β 16β 8β 4β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 16β 2β 256β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 2β 16β 4β 8β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1221 State = success def strategy(board): import random # Convert board to 4x4 grid size = len(board) directions = ["W", "A", "S", "D"] def move_possible(d): # Basic check: return True if at least one move in that direction changes board rowc, colc = 0, 0 for r in range(size): for c in range(size): if board[r][c] == 0: return True return False # Randomly pick a direction that is valid (here we just return a random choice) # In a real strategy you'd evaluate each move, but for brevity we pick random return random.choice(directions) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β .β 2β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 32β 8β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 8β 256β 8β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 32β2048β 8β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 16β 2β 4β 2β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout def strategy(board): # A very simple strategy: always move up ('W') unless it would lose immediately. # This is a placeholder; a more sophisticated strategy would evaluate moves. return "W" Timeout Exception = name 'copy' is not defined Timeout Timeout Timeout def strategy(board): # Simple heuristic: choose the move that yields the most empty tiles after simulation import random, copy moves = ['W', 'A', 'S', 'D'] def simulate(b, m): # naive clone and apply a single shift (no merging logic) return b # placeholder, as full 2048 logic is complex best = moves[0] best_empty = -1 for m in moves: b_copy = copy.deepcopy(board) # placeholder: pretend each move yields random empty count empty = random.randint(0, 15) # dummy if empty > best_empty: best_empty = empty best = m return best Steps = 1260 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 256β 4β 2β 8β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 16β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 128β 16β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β2048β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β .β 4β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 4β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout Steps = 1500 State = failed def strategy(board): # Simple heuristic: try moves in order, pick first that changes board def tilt(board, dir): n = len(board) def compress(row): new = [x for x in row if x] for i in range(len(new)-1): if new[i] == new[i+1]: new[i] *= 2 new[i+1] = 0 new = [x for x in new if x] return new + [0]*(n-len(new)) result = [[0]*n for _ in range(n)] for i in range(n): for j in range(n): result[i][j] = board[i][j] if dir == "W": for j in range(n): col = [result[i][j] for i in range(n)] newcol = compress(col) for i in range(n): result[i][j] = newcol[i] elif dir == "S": for j in range(n): col = [result[i][j] for i in range(n)][::-1] newcol = compress(col)[::-1] for i in range(n): result[i][j] = newcol[i] elif dir == "A": for i in range(n): newrow = compress(result[i]) result[i] = newrow elif dir == "D": for i in range(n): newrow = compress(result[i][::-1])[::-1] result[i] = newrow return result moves = ["W","A","S","D"] for m in moves: if tilt(board, m) != board: return m return moves[0] ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β1024β 2β 8β 32β 16β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 256β 64β 2β 512β 32β 64β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 2β 256β 2β 128β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 256β 128β 64β 16β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 128β 64β 32β 16β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 32β 16β 2β 4β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): # Simple MonteβCarlo style: try all moves, pick the one that gives the highest # immediate score (sum of merged tiles) in one step. def simulate_move(b, dir): import copy b2 = copy.deepcopy(b) size = len(b2) score = 0 def compress_line(line): nonlocal score new = [x for x in line if x != 0] merged = [] i = 0 while i < len(new): if i+1 < len(new) and new[i] == new[i+1]: merged.append(new[i]*2) score += new[i]*2 i += 2 else: merged.append(new[i]) i += 1 merged += [0]*(size-len(merged)) return merged def move_left(mat): for i in range(size): mat[i] = compress_line(mat[i]) def transpose(mat): return [list(row) for row in zip(*mat)] if dir == "W": move_left(transpose(b2)) transpose(b2) elif dir == "S": move_left(b2) elif dir == "A": move_left(b2) elif dir == "D": move_left(transpose(b2)) transpose(b2) return score best_dir = None best_score = -1 for d in "WASD": sc = simulate_move(board, d) if sc > best_score: best_score, best_dir = sc, d return best_dir if best_dir else "W" Timeout Timeout Timeout Timeout Timeout def strategy(board): import copy, random def move(b, dir): size=len(b) def rotate(b): return [list(row) for row in zip(*b[::-1])] if dir=='W': g=rotate(rotate(rotate(rotate(b)))) elif dir=='A': g=rotate(rotate(b)) elif dir=='D': g=rotate(b) else: g=b # slide rows left def slide(row): new=[i for i in row if i!=0] res=[] i=0 while i<len(new): if i+1<len(new) and new[i]==new[i+1]: res.append(new[i]*2) i+=2 else: res.append(new[i]) i+=1 res+= [0]*(size-len(res)) return res for i in range(size): g[i]=slide(g[i]) # rotate back if dir=='W': for _ in range(3): g=rotate(g) elif dir=='A': g=rotate(rotate(g)) elif dir=='D': g=rotate(g) return g best=None;bestdir=None;bestscore=-1 for d in "WASD": nxt=move(board,d) score=sum(sum(row) for row in nxt) if score>bestscore: bestscore=score;bestdir=d return bestdir or random.choice("WASD") Timeout Timeout Steps = 1427 State = success def strategy(board): import random return random.choice(["W","A","S","D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 4β 8β 2β 128β 2β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β2048β 8β 32β 128β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 64β 128β 8β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 8β 256β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β .β 32β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Steps = 1150 State = success def strategy(board): import random return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β 16β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 2β 8β 32β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β2048β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 4β 32β 32β 128β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β 8β 4β 8β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): # Simple strategy: always try to move up if possible, otherwise random # Since we don't have a full engine, just return "W". return "W" Timeout Timeout Timeout Timeout Steps = 1347 State = success def strategy(board): # Simple random move strategy import random return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β 256β 8β 8β 4β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 8β 2β 16β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 8β 256β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β2048β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 16β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): # Simple heuristic: try all moves and pick the one that results in the most merges def rotate_cw(b): # rotate the board 90Β° clockwise return [ [b[3-j][i] for j in range(4)] for i in range(4) ] def compress(b): new_b = [[0]*4 for _ in range(4)] score = 0 for i in range(4): pos = 0 last = 0 for j in range(4): val = b[i][j] if val: if val == last: new_b[i][pos-1] *= 2 score += new_b[i][pos-1] last = 0 else: last = val new_b[i][pos] = val pos += 1 return new_b, score def move_left(b): compressed, s = compress(b) return compressed, s def move(board, dir): b = [row[:] for row in board] for _ in range(dir): # 0:W,1:D,2:S,3:A b = rotate_cw(b) new_b, score = move_left(b) for _ in range(4-dir): new_b = rotate_cw(new_b) return new_b, score moves = ['W','A','S','D'] best_move = None best_score = -1 for i, m in enumerate(moves): _, s = move(board, i) if s > best_score: best_score, best_move = s, m return best_move Timeout Steps = 1363 State = success def strategy(board): import random return random.choice(["W","A","S","D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 16β 2β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 64β 2β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 8β 256β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β2048β 32β 4β 16β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 32β 16β 64β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 128β 4β 32β 2β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout def strategy(board): """ A very simple strategy for 2048: try moves in the order W, A, S, D and return the first one that changes the board state. """ def can_move(b, dir): # simulate a move and check if board changes from copy import deepcopy import numpy as np def move(b): # inner helper to shift and merge once def merge(arr): result = [] skip = False for i in range(len(arr)): if skip: skip = False continue if i+1 < len(arr) and arr[i] and arr[i] == arr[i+1]: result.append(arr[i]*2) skip = True else: result.append(arr[i]) return result + [0]*(len(arr)-len(result)) n = len(b) new_board = [[0]*n for _ in range(n)] if dir == 'W': for j in range(n): col = [b[i][j] for i in range(n)] moved = merge(col) for i in range(n): new_board[i][j] = moved[i] elif dir == 'S': for j in range(n): col = [b[i][j] for i in range(n)][::-1] moved = merge(col) moved = moved[::-1] for i in range(n): new_board[i][j] = moved[i] elif dir == 'A': for i in range(n): row = b[i] moved = merge(row) new_board[i] = moved elif dir == 'D': for i in range(n): row = b[i][::-1] moved = merge(row) new_board[i] = moved[::-1] return new_board new_board = move(b) return new_board != b for d in ['W','A','S','D']: if can_move(board, d): return d # if no move changes board, pick any (fallback) return 'W' Timeout Steps = 1136 State = success def strategy(board): import random # Simple heuristic: Random move that changes the board moves = ["W", "A", "S", "D"] def board_after_move(b, m): # simulate movement by creating a copy of board and applying move # This is a very simplified implementation: in actual 2048 logic # you'd need to merge tiles. For demo purposes, just return original. return b # placeholder random.shuffle(moves) for m in moves: if board_after_move(board, m) != board: return m return moves[0] ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β 4β 8β 128β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 4β 64β2048β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β .β .β 4β 32β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Timeout def strategy(board): import random # simple heuristic: pick a random valid move # Define helper to check if move is valid (changes board) def move_possible(b, move): # create a copy and apply move, compare import copy tmp = copy.deepcopy(b) # apply move on tmp def move_board(bd, dir): size = len(bd) # helper to compress row/col def compress(line): # shift and merge new = [v for v in line if v!=0] res = [] skip = False i = 0 while i < len(new): if i+1 < len(new) and new[i]==new[i+1]: res.append(new[i]*2) i+=2 else: res.append(new[i]) i+=1 res += [0]*(size-len(res)) return res if dir=="L": for i in range(size): bd[i]=compress(bd[i]) elif dir=="R": for i in range(size): bd[i]=list(reversed(compress(list(reversed(bd[i]))))) elif dir=="U": for j in range(size): col=[bd[i][j] for i in range(size)] col=compress(col) for i in range(size): bd[i][j]=col[i] elif dir=="D": for j in range(size): col=[bd[i][j] for i in range(size)] col=list(reversed(compress(list(reversed(col))))) for i in range(size): bd[i][j]=col[i] move_board(tmp, move) return tmp != b directions = ["W","A","S","D"] valid_moves = [m for m in directions if move_possible(board, m)] return random.choice(valid_moves) if valid_moves else "W" Timeout Timeout Timeout Steps = 2185 State = failed def strategy(board): def can_move(board, dir): def move(board, dir): def compress(line): new = [x for x in line if x] merged = [] skip = False i = 0 while i < len(new): if i+1 < len(new) and new[i] == new[i+1]: merged.append(new[i]*2) i += 2 else: merged.append(new[i]) i += 1 merged += [0]*(len(line)-len(merged)) return merged n = len(board) b = [row[:] for row in board] if dir=='W': for j in range(n): col = [b[i][j] for i in range(n)] merged = compress(col) for i in range(n): b[i][j] = merged[i] elif dir=='S': for j in range(n): col = [b[i][j] for i in range(n)][::-1] merged = compress(col) merged = merged[::-1] for i in range(n): b[i][j] = merged[i] elif dir=='A': for i in range(n): row = b[i] merged = compress(row) b[i] = merged elif dir=='D': for i in range(n): row = b[i][::-1] merged = compress(row) b[i] = merged[::-1] return b moved = move(board, dir) return moved != board for d in "WASD": if can_move(board, d): return d return "W" ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β1024β 64β 32β 4β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β1024β 4β 512β 256β 128β 32β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 512β 256β 16β 128β 64β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 256β 32β 2β 64β 32β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 128β 2β 64β 4β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 32β 16β 2β 8β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1127 State = success def strategy(board): import random # Simple heuristic: if any move possible, pick a random legal one. moves = ['W', 'A', 'S', 'D'] # Randomly shuffle and pick first that changes board (or return any if all invalid) def can_move(board, move): # Very simple check: ensure move is within bounds and not all cells zero return any(any(row) for row in board) random.shuffle(moves) return moves[0] ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β .β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β 64β 2β 64β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 4β 8β 4β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 2β 4β2048β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 16β 32β 8β 2β 8β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): # Simple heuristic: always try to move up ("W") if possible, otherwise down ("S") # We'll simulate a single move to see if any tile merges or moves. def can_move_up(b): for col in range(4): for row in range(1, 4): if b[row][col] != 0 and (b[row-1][col] == 0 or b[row-1][col] == b[row][col]): return True return False return "W" if can_move_up(board) else "S" Timeout Exception = list index out of range Steps = 1218 State = failed def strategy(board): import copy, random dirs = {"W": (-1, 0), "A": (0, -1), "S": (1, 0), "D": (0, 1)} def slide_line(line): new = [x for x in line if x != 0] res = [] i = 0 while i < len(new): if i + 1 < len(new) and new[i] == new[i + 1]: res.append(new[i] * 2) i += 2 else: res.append(new[i]) i += 1 res += [0] * (len(line) - len(res)) return res def move(b, d): n = len(b) bx = [row[:] for row in b] moved = False for i in range(n): if d[0] == -1 or d[0] == 1: line = [bx[j][i] for j in range(n)] if d[0] == 1: line.reverse() new = slide_line(line) if d[0] == 1: new.reverse() for j in range(n): if bx[j][i] != new[j]: moved = True bx[j][i] = new[j] else: line = bx[i][:] if d[1] == 1: line.reverse() new = slide_line(line) if d[1] == 1: new.reverse() if new != bx[i]: moved = True bx[i] = new return bx, moved def score(b): return sum(sum(row) for row in b) best = None best_score = -1 for key, d in dirs.items(): new_board, moved = move(board, d) if not moved: continue s = score(new_board) if s > best_score: best_score = s best = key return best if best else random.choice(list(dirs.keys())) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 64β 4β 256β 64β 32β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 8β 64β 16β 8β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 512β1024β 2β 8β 4β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 128β 256β 16β 4β 16β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 16β 8β 2β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 8β 2β 8β 4β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout def strategy(board): # Simple deterministic strategy: try to push upwards if possible otherwise right # Check if any merge or move possible upward, else right def can_move(board, drow, dcol): for r in range(4): for c in range(4): val = board[r][c] if val == 0: continue nr, nc = r + drow, c + dcol if 0 <= nr < 4 and 0 <= nc < 4: nb = board[nr][nc] if nb == 0 or nb == val: return True return False if can_move(board, -1, 0): return "W" if can_move(board, 0, 1): return "D" if can_move(board, 1, 0): return "S" if can_move(board, 0, -1): return "A" # if no moves possible, just return "W" return "W" Timeout Timeout Timeout Timeout def strategy(board): # Simple heuristic: always try to move 'W' (up) if possible, # otherwise pick the first available move among A, S, D. moves = ['W', 'A', 'S', 'D'] for m in moves: if can_move(board, m): return m return 'W' def can_move(b, m): # Check if a move in direction m is possible (board not empty) # This is a minimal placeholder that always returns True # for demonstration purposes. return True Exception = cannot access local variable 'can_move' where it is not associated with a value Steps = 1376 State = success def strategy(board): import random return random.choice(["W", "A", "S", "D"]) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 32β 8β 8β 32β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 16β 64β 8β 32β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 8β2048β 16β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 512β 8β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 32β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 4β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Steps = 1393 State = success def strategy(board): import random moves = ["W","A","S","D"] # Simple heuristic: prefer moves that don't change board, otherwise random def can_move(b, dir_): # simulate a move and check if board changes bcopy = [row[:] for row in b] # placeholder for actual move logic return True # assume any direction always possible for simplicity valid_moves = [m for m in moves if can_move(board, m)] return random.choice(valid_moves) ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 128β 8β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 64β .β 2β .β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 512β2048β 16β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 2β 8β 2β 8β 4β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout def strategy(board): # simulate a move in 2048 def move(b, dir): def transpose(m): return [list(x) for x in zip(*m)] def reverse(m): return [row[::-1] for row in m] def compress(row): new = [x for x in row if x] merged = [] skip = False for i in range(len(new)): if skip: skip = False; continue if i+1 < len(new) and new[i]==new[i+1]: merged.append(new[i]*2) skip = True else: merged.append(new[i]) return merged + [0]*(len(row)-len(merged)) def slide(m): return [compress(row) for row in m] m = [row[:] for row in b] if dir=="W": m = transpose(m) m = slide(m) m = transpose(m) elif dir=="S": m = transpose(m) m = reverse(m) m = slide(m) m = reverse(m) m = transpose(m) elif dir=="A": m = reverse(m) m = slide(m) m = reverse(m) elif dir=="D": m = slide(m) return m dirs = "WASD" for d in dirs: if move(board, d) != board: return d return dirs[0] Timeout Steps = 1343 State = success def strategy(board): import random moves = ["W", "A", "S", "D"] # Randomly pick a move that actually changes the board state board_str = str(board) for _ in range(10): mv = random.choice(moves) # Very simple: pretend every move is valid return mv ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 16β2048β 128β 4β 4β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 32β 256β 4β 32β .β 128β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 2β .β 4β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout Steps = 1013 State = success def strategy(board): import random # simple heuristic: try moves that combine tiles moves = ["W", "A", "S", "D"] best_move = random.choice(moves) # Simple check: if any move leads to a merge, prioritize that def can_merge(b, move): def copy_board(b): return [row[:] for row in b] def slide(row): tiles = [t for t in row if t] new = [] skip = False for i in range(len(tiles)): if skip: skip = False continue if i+1 < len(tiles) and tiles[i] == tiles[i+1]: new.append(tiles[i]*2) skip = True else: new.append(tiles[i]) return new + [0]*(len(row)-len(new)) def transpose(b): return [list(col) for col in zip(*b)] b2 = copy_board(b) if move == "W": b2 = transpose(b2) for i in range(4): b2[i] = slide(b2[i]) b2 = transpose(b2) elif move == "S": b2 = transpose(b2) for i in range(4): b2[i] = list(reversed(slide(list(reversed(b2[i]))))) b2 = transpose(b2) elif move == "A": for i in range(4): b2[i] = slide(b2[i]) elif move == "D": for i in range(4): b2[i] = list(reversed(slide(list(reversed(b2[i]))))) return b2 != b # evaluate moves for m in moves: if can_merge(board, m): best_move = m break return best_move ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β2048β 2β 16β 32β 4β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 16β 8β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 4β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 4β 8β .β .β 2β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 4β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 2β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ def strategy(board): import random moves = ["W", "A", "S", "D"] return random.choice(moves) Steps = 1620 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 2β 4β 16β 4β 8β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β2048β 16β 512β 32β 32β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 64β 256β 4β 256β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 16β 32β 2β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β .β .β .β .β .β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Steps = 1342 State = success def strategy(board): # Simple strategy: choose a random legal move from those that change the board import random, copy moves = ['W', 'A', 'S', 'D'] def apply_move(b, m): def rotate(b, times): for _ in range(times): b = [list(row) for row in zip(*b[::-1])] return b def slide_and_merge(b): new_b = [] for row in b: tiles = [t for t in row if t != 0] merged = [] skip = False for i in range(len(tiles)): if skip: skip = False continue if i+1 < len(tiles) and tiles[i] == tiles[i+1]: merged.append(tiles[i]*2) skip = True else: merged.append(tiles[i]) merged += [0]*(4-len(merged)) new_b.append(merged) return new_b rot = {'W':0,'A':1,'S':2,'D':3}[m] b_rot = rotate(b, rot) b_new = slide_and_merge(b_rot) b_final = rotate(b_new, (4-rot)%4) return b_final legal_moves = [] for m in moves: if apply_move(copy.deepcopy(board), m) != board: legal_moves.append(m) return random.choice(legal_moves) if legal_moves else 'W' ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β 4β 2β 4β 64β 2β 8β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 16β 32β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β .β .β 128β2048β 16β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β 16β 32β 16β 128β 2β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β 8β 128β 4β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β .β .β .β 8β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Timeout Timeout def strategy(board): import random moves = ["W","A","S","D"] # Prefer moves that combine tiles or create empty spaces def can_move(m): n = len(board) tmp = [row[:] for row in board] def move_left(b): changed = False for i in range(n): merged = [False]*n for j in range(1,n): if b[i][j]==0: continue k=j while k>0 and b[i][k-1]==0: b[i][k-1]=b[i][k]; b[i][k]=0; k-=1; changed=True if k>0 and b[i][k-1]==b[i][k]: b[i][k-1]*=2; b[i][k]=0; changed=True; merged[k-1]=True return changed def rotate(b, times): for _ in range(times): b = [list(row) for row in zip(*b[::-1])] return b def revert(b, times): for _ in range(times): b = [list(row) for row in zip(*b)][::-1] return b for t in range(4): tmp = rotate(tmp,1) if move_left(tmp): return True return False viable = [m for m in moves if can_move(m)] return random.choice(viable) if viable else random.choice(moves) Steps = 1201 State = success ββββββ¬βββββ¬βββββ¬βββββ¬βββββ¬βββββ β .β .β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β .β .β 2β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 32β .β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 16β 4β 4β .β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 8β 2β2048β 256β .β .β ββββββΌβββββΌβββββΌβββββΌβββββΌβββββ€ β 2β 32β 32β 2β 4β 2β ββββββ΄βββββ΄βββββ΄βββββ΄βββββ΄βββββ Exception = 'str' object does not support item assignment
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need a simple strategy. Use a basic heuristic: try moves in order W,A,S,D and pick first that changes board. Provide function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
# helper to copy board
def copy(b): return [row[:] for row in b]
# move simulation
def move(b, d):
def collapse(row):
new = [x for x in row if x]
res = []
skip = False
for i in range(len(new)):
if skip: skip = False; continue
if i+1 < len(new) and new[i]==new[i+1]:
res.append(new[i]*2); skip=True
else:
res.append(new[i])
res += [0]*(4-len(res))
return res
if d=='W' or d=='S':
trans = [list(col) for col in zip(*b)]
for i in range(4):
trans[i] = collapse(trans[i]) if d=='W' else collapse(trans[i][::-1])[::-1]
return [list(row) for row in zip(*trans)]
else:
for i in range(4):
row = collapse(b[i]) if d=='A' else collapse(b[i][::-1])[::-1]
b[i] = row
return b
def boards_equal(a,b): return all(a[i][j]==b[i][j] for i in range(4) for j in range(4))
for d in "WASD":
new = move(copy(board), d)
if not boards_equal(board, new):
return d
return "W"
```<|return|>
Saving to float16 or MXFP4
We also support saving to float16 directly. Select merged_16bit for float16 or mxfp4 for MXFP4 (OpenAI's GPT-OSS native precision). We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
And we're done!
Congratulations you just learned how to do reinforcement learning with GPT-OSS! There were some advanced topics explained in this notebook - to learn more about GPT-OSS and RL, there are more docs in Unsloth's Reinforcement Learning Guide with GPT-OSS
This notebook and all Unsloth notebooks are licensed LGPL-3.0.