Unsloth Gpt Oss (20B) Reinforcement Learning 2048 Game BF16

Goal: Make GPT-OSS play games with Reinforcement Learning

Our goal is to make GPT-OSS play the 2048 game with reinforcement learning, specifically a variant of it called GRPO.

We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose. We then reward the model if it created a good strategy (winning the game), and we'll penalize it (negative reward) if the strategy was a bad one.
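To make the reward idea concrete, here is a minimal sketch of the group-relative scoring that GRPO is built around: each completion's reward is compared with the other completions sampled for the same prompt. The function name and the `eps` value are illustrative assumptions, not part of Unsloth or TRL, and real trainers may differ in details such as how the standard deviation is computed.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    # eps (an assumed value) avoids division by zero when all rewards are equal
    return [(reward - mean) / (std + eps) for reward in rewards]

# Four hypothetical strategies: one wins, one is penalized, two are neutral
advantages = group_relative_advantages([1.0, -1.0, 0.0, 0.0])
```

A strategy that beats its group gets a positive advantage, and one that does worse gets a negative advantage. If every completion in the group earns the same reward, all advantages are zero and there is nothing to learn from that group.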

Installation

We'll be using Unsloth to do RL on GPT-OSS 20B. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!

[ ]

We'll load GPT-OSS 20B and set some parameters:

max_seq_length = 768: The maximum context length of the model. Increasing it will use more memory.
lora_rank = 4: The larger this number, the smarter the RL process, but the slower it runs and the more memory it uses.
load_in_16bit: Faster, but needs a GPU with 64GB of memory or more (for example an MI300).
offload_embedding = True: A new Unsloth optimization that moves the embedding to CPU RAM, reducing VRAM by 1GB.

[ ]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.10.4: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Num GPUs = 1. Max memory: 79.318 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.

Loading checkpoint shards:   0%|          | 0/9 [00:00<?, ?it/s]

To do RL efficiently, we will use LoRA, which adds only 1 to 5% extra weights to the model for finetuning. This cuts memory usage by over 60% while retaining good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.

[ ]

Unsloth: Making `model.base_model.model.model` require gradients
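As a rough sketch of why LoRA adds so few weights: a rank-r adapter on a weight matrix of shape (d_in, d_out) adds two small matrices with r * (d_in + d_out) parameters in total. The projection shapes and the layer count below are assumptions (they are not stated in this notebook), and so is the choice to adapt only the four attention projections.

```python
def lora_param_count(shapes, rank):
    """Each adapted (d_in, d_out) matrix gains A (d_in x r) and B (r x d_out)."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# ASSUMED shapes for the q, k, v and o attention projections of one layer,
# and an ASSUMED number of layers. Neither is stated in the notebook.
attention_shapes = [(2880, 4096), (2880, 512), (2880, 512), (4096, 2880)]
num_layers = 24

trainable = num_layers * lora_param_count(attention_shapes, rank=4)
print(f"{trainable:,}")
```

Under these assumptions the total comes to 1,990,656, which is the same number of trainable parameters that the training log further down reports. That agreement suggests, but does not prove, that this is how the adapters are laid out.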

2048 game

We used GPT-5 to create a variant of the 2048 game. It should output the current game board state, and allow us to advance the game board state with 1 action (up, down, left, right).

[ ]
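The game implementation itself is not shown here, so the following is only a minimal sketch of how one action could advance a board. It assumes the board is a list of lists with 0 for an empty cell, and all function names are hypothetical. It does not spawn a new tile after the move.

```python
def merge_left(row):
    """Slide non-zero tiles to the left, merging each equal pair once."""
    tiles = [value for value in row if value != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # two equal tiles become one doubled tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    # Pad with empty cells so the row keeps its original length
    return merged + [0] * (len(row) - len(merged))

def transpose(board):
    return [list(column) for column in zip(*board)]

def move(board, direction):
    """Return a new board after sliding every tile in `direction`."""
    if direction == "left":
        return [merge_left(row) for row in board]
    if direction == "right":
        return [merge_left(row[::-1])[::-1] for row in board]
    if direction == "up":
        # Columns become rows, so "up" is a left move on the transposed board
        return transpose(move(transpose(board), "left"))
    if direction == "down":
        return transpose(move(transpose(board), "right"))
    raise ValueError(f"Unknown direction: {direction!r}")
```

Only the left move needs real logic. The other three directions reuse it by reversing or transposing the board, which keeps the merge rule in one place.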

For example, let's create a board of size 5 x 5 and set the target to 8 instead of 2048.

[NOTE] The original 2048 spawns a 4 tile 10% of the time! We can disable this for harder games. See the Wikipedia page for more details.

[ ]

┌───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  2│  .│  .│  .│  2│
└───┴───┴───┴───┴───┘ ongoing

[ ]

GameBoard(size=5, seed=42, target=8, probability_fours=0.1)
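The probability_fours=0.1 setting can be pictured with a small sketch of tile spawning. This is an assumption about how such a parameter might be used, not the notebook's GameBoard code, and seeding it with 42 will not reproduce the boards printed here.

```python
import random

def spawn_tile(board, rng, probability_fours=0.1):
    """Place a 2 (or sometimes a 4) on a random empty cell. Returns False if the board is full."""
    empty = [
        (r, c)
        for r, row in enumerate(board)
        for c, value in enumerate(row)
        if value == 0
    ]
    if not empty:
        return False
    r, c = rng.choice(empty)
    # With probability `probability_fours` the new tile is a 4, otherwise a 2
    board[r][c] = 4 if rng.random() < probability_fours else 2
    return True

rng = random.Random(42)
board = [[0] * 5 for _ in range(5)]
spawn_tile(board, rng)
```

Setting probability_fours to 0.0 in this sketch means only 2 tiles ever appear.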

We'll use WASD for the action space:

   W
A  S  D

Also, game.state() will report success once we reach the target!

[ ]

┌───┬───┬───┬───┬───┐
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  2│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  4│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘ ongoing

[ ]

┌───┬───┬───┬───┬───┐
│  4│  .│  2│  .│  .│
├───┼───┼───┼───┼───┤
│  2│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘ ongoing

[ ]

┌───┬───┬───┬───┬───┐
│  .│  .│  .│  4│  2│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  2│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  4│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘ ongoing

[ ]

┌───┬───┬───┬───┬───┐
│  4│  .│  .│  4│  4│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  4│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘ ongoing

[ ]

┌───┬───┬───┬───┬───┐
│  .│  .│  2│  4│  8│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  4│
├───┼───┼───┼───┼───┤
│  .│  .│  .│  .│  .│
└───┴───┴───┴───┴───┘ success

If we take an action that's not part of the action space, we will get an error, and the game will not accept any more actions.

[ ]

┌───┬───┬───┐
│  .│  4│  .│
├───┼───┼───┤
│  .│  .│  2│
├───┼───┼───┤
│  .│  .│  .│
└───┴───┴───┘ failed
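The three states seen so far (ongoing, success, failed) can be sketched from the board alone. This is a hypothetical helper that assumes 0 marks an empty cell. It does not model the case above, where the notebook's game is marked failed because of an invalid action.

```python
def board_state(board, target):
    """Classify a board as "success", "ongoing" or "failed"."""
    if any(value >= target for row in board for value in row):
        return "success"
    if any(value == 0 for row in board for value in row):
        return "ongoing"  # an empty cell means some move is still possible
    n_rows, n_cols = len(board), len(board[0])
    for r in range(n_rows):
        for c in range(n_cols):
            # Two equal neighbours can still merge, so the game is not over
            if c + 1 < n_cols and board[r][c] == board[r][c + 1]:
                return "ongoing"
            if r + 1 < n_rows and board[r][c] == board[r + 1][c]:
                return "ongoing"
    return "failed"
```

For example, a full 2 x 2 board with no equal neighbours and no tile at the target would be classified as failed.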

RL Environment Setup

We'll set up a function that accepts a strategy, asks it for an action within WASD, and checks the game state.

We'll also add a timer so the strategy runs for 2 seconds at most. Otherwise it might never terminate!

[ ]

Let's make a generic strategy to just hit W. We should expect this generic strategy to fail:

[ ]

Timed out with error = Timed out after 2s
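Always pressing W stalls as soon as the board stops changing, so the game loop itself never ends. A minimal sketch of a deadline-based loop is below. It assumes a hypothetical `step` callable that plays one move and returns the game state, and it is not the timer used in this notebook. Note that it only checks the clock between steps, so it cannot interrupt a single call that never returns.

```python
import time

def run_until_done(step, seconds):
    """Call step() until it stops returning "ongoing" or the deadline passes."""
    deadline = time.monotonic() + seconds
    steps = 0
    while time.monotonic() < deadline:
        state = step()
        steps += 1
        if state != "ongoing":
            return steps, state
    raise TimeoutError(f"Timed out after {seconds}s")

# A stand-in for the "always W" strategy: the game never leaves "ongoing"
try:
    run_until_done(lambda: "ongoing", seconds=0.05)
except TimeoutError as error:
    print(f"Timed out with error = {error}")
```

Interrupting a strategy that hangs inside one call would need something stronger, such as running it in a separate process.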

To allow longer strategies for GPT-OSS Reinforcement Learning, we will allow a 5-second timer.

[ ]

Code Execution

Before we create and execute a new Python function, we first have to check that it does not access global variables or otherwise cheat. This is a countermeasure against reward hacking, since we don't want the function to cheat.

For example, the piece of code below is fine, since it only imports standard-library Python modules. We check it with check_python_modules:

[ ]

Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}

The piece of code below imports numpy, so we should not allow it to execute:

[ ]

Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}

We also disallow global variable access. We'll use Unsloth's create_locked_down_function:

[ ]

name 'np' is not defined
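The idea behind that error can be sketched with plain exec: if the function is defined against a fresh globals dictionary that only contains the builtins, names from the notebook's global scope are simply not visible to it. This is a hypothetical stand-in, not the implementation of create_locked_down_function, and it is not a security sandbox.

```python
import builtins

def make_isolated_function(source, name):
    """Define `name` from `source` with a globals dict holding only builtins."""
    isolated_globals = {"__builtins__": builtins}
    exec(source, isolated_globals)
    return isolated_globals[name]

np = "a global defined at notebook level"  # visible here, but not to `add`

source = "def add(a, b):\n    return np.add(a, b)\n"
add = make_isolated_function(source, "add")
try:
    add(1, 2)
except NameError as error:
    message = str(error)
```

Imports written inside the function body still work in this sketch, which is why the separate import check is still needed.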

[ ]

Data & RL task setup

We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize the prompt for a different RL task.

[ ]

Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.

First, let's prompt GPT-OSS without RL and see how it goes:

[ ]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>Need simple heuristic: pick move that maximizes merged tiles. Score by sum of resulting board. Implement simulation of each move without random tiles. Return best.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    import copy
    dirs = {'W':(-1,0),'S':(1,0),'A':(0,-1),'D':(0,1)}
    def mv(b, d):
        dr,dc=d
        n=len(b)
        res=[[0]*n for _ in range(n)]
        for i in range(n):
            line=[]
            for j in range(n):
                r,c=i+dr*j, j+dc*i if d in 'W' else j+dc*i
                if d in 'WA': r,c=j,i+dc*j
                val=b[r][c]
                if val: line.append(val)
            merged=[]
            skip=False
            for k,v in enumerate(line):
                if skip: skip=False; continue
                if k+1<len(line) and line[k+1]==v:
                    merged.append(v*2); skip=True
                else: merged.append(v)
            for k,v in enumerate(merged):
                if d=='W': res[i][k]=v
                elif d=='S': res[i][n-1-k]=v
                elif d=='A': res[k][i]=v
                else: res[n-1-k][i]=v
        return res
    best='W'; bestval=-1
    for k,v in dirs.items():
        try:
            val=sum(map(sum,mv(board,v)))
            if val>bestval: bestval=val; best=k
        except: pass
    return best
```<|return|>

Reward functions

We now design an extract_function function, which simply extracts the function wrapped in three backticks.

And 3 reward functions:

function_works which rewards the model if the strategy is a valid Python function.
no_cheating which checks if the function imported other modules, and if it did, we penalize it.
strategy_succeeds which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.

[ ]

def strategy(board):
    return "W" # Example
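Extraction like this needs only a few string operations. The helper below is a hypothetical sketch rather than the notebook's extract_function. It assumes the first fenced block in the response holds the function and that the function is named strategy.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def extract_fenced_function(text):
    """Return the code of the first fenced block if it defines `strategy`, else None."""
    start = text.find(FENCE)
    if start == -1:
        return None
    end = text.find(FENCE, start + len(FENCE))
    if end == -1:
        return None
    body = text[start + len(FENCE):end]
    # Skip the rest of the opening fence line, e.g. a "python" language tag
    first_newline = body.find("\n")
    if first_newline == -1:
        return None
    code = body[first_newline + 1:].strip()
    return code if code.startswith("def strategy(") else None
```

Returning None for anything unexpected lets the reward functions treat a missing or malformed block as a failed completion.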

Below is our function_works reward function. It uses Python's exec, but guards it so that local and global variables cannot leak into the function. We can also run check_python_modules first to catch errors before even executing the function:

[ ]

(False,
 {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
  'stdlib': [],
  'non_stdlib': [],
  'relative_imports': 0})
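The general shape of a reward function can be sketched as a function that takes a batch of completions and returns one score per completion. The sketch below assumes each completion is a list holding one message dictionary with a "content" key, and that the content is already the extracted code. The function name and the reward values of 1.0 and -1.0 are hypothetical and are not the ones used in this notebook.

```python
import ast

def syntax_reward(completions, **kwargs):
    """Score each completion by whether its code parses as Python."""
    scores = []
    for completion in completions:
        code = completion[0]["content"]
        try:
            ast.parse(code)
        except SyntaxError:
            scores.append(-1.0)  # hypothetical penalty for invalid Python
        else:
            scores.append(1.0)  # hypothetical reward for valid Python
    return scores

scores = syntax_reward([
    [{"content": "def strategy(board):\n    return 'W'"}],
    [{"content": "def a"}],
])
```

The second completion is the kind of input that triggers the SyntaxError shown in the output above.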

[ ]

no_cheating checks whether the function cheated, for example by importing NumPy or other non-standard modules:

[ ]

Next, strategy_succeeds checks whether the strategy actually lets the game terminate. Imagine a strategy that simply returns "W": it would never finish the game and would fail once the time limit of 10 seconds is reached.

We also add a global PRINTER to print out the strategy and board state.

[ ]

We'll now create the dataset, which contains replicas of our prompt. Remember to set the reasoning effort to low! You can choose high reasoning mode, but this will only work on GPUs with more memory, like MI300s.

[ ]

{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "W", "A", "S", "D" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n    return "W" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
   'role': 'user'}],
 'answer': 0,
 'reasoning_effort': 'low'}
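A dataset with this row layout can be sketched as a plain list of dictionaries. The prompt string here is shortened as a stand-in for the full prompt, and the count of 1,000 rows is taken from the training log further down. A list like this could then be wrapped with datasets.Dataset.from_list, though that step is left out so the sketch has no dependencies.

```python
# Shortened stand-in for the full prompt shown earlier (assumption for brevity)
prompt = "Create a new short 2048 strategy using only native Python code."

rows = [
    {
        "prompt": [{"role": "user", "content": prompt}],
        "answer": 0,
        "reasoning_effort": "low",
    }
    for _ in range(1000)
]
```

Every row is identical because the task never changes. The variety during training comes from the model sampling different strategies for the same prompt.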

Train the model

Now let's set up the GRPO Trainer and all its configurations! We also support GSPO, GAPO, Dr GRPO and more! Go to the Unsloth Reinforcement Learning Docs for more options.

[ ]

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2
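The message above describes a batch-size adjustment, which can be sketched as rounding up to the nearest multiple of num_generations. Only the change from 1 to 2 is confirmed by the log. Treating it as a general round-up rule is an assumption, and the function name is hypothetical.

```python
import math

def align_batch_size(per_device_train_batch_size, num_generations):
    """Round the batch size up to the nearest multiple of num_generations."""
    groups = max(1, math.ceil(per_device_train_batch_size / num_generations))
    return groups * num_generations

aligned = align_batch_size(1, 2)
```

The requirement exists because GRPO compares completions within a group, so every batch should hold whole groups of num_generations completions.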

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step	reward	reward_std	completion_length	kl
1	0.125000	0.000000	200.000000	0.000000
2	0.072375	0.248112	200.000000	0.000000
3	-0.079000	0.163776	182.500000	0.000005

[ ]

And let's train the model!

NOTE: This might be quite slow! 600 steps take ~5 hours or longer.

[ ]

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)
`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.

Streaming output truncated to the last 5000 lines.
┌────┬────┬────┬────┬────┬────┐
│   2│  64│   4│ 256│  16│   2│
├────┼────┼────┼────┼────┼────┤
│1024│   4│  32│   8│  64│   4│
├────┼────┼────┼────┼────┼────┤
│   8│ 128│ 256│ 128│  32│   8│
├────┼────┼────┼────┼────┼────┤
│   2│  64│  32│  64│  16│   2│
├────┼────┼────┼────┼────┼────┤
│  64│  16│   8│  16│   8│   4│
├────┼────┼────┼────┼────┼────┤
│  16│   8│   2│   8│   4│   2│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    import random
    return random.choice(["W", "A", "S", "D"])
Steps = 1582 State = success
┌────┬────┬────┬────┬────┬────┐
│  16│   4│   2│  32│   2│   4│
├────┼────┼────┼────┼────┼────┤
│   2│ 256│ 512│   4│  16│   .│
├────┼────┼────┼────┼────┼────┤
│2048│   2│  32│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   2│ 128│   8│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   4│   .│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│  64│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Exception = name 'random' is not defined
Timeout
Timeout
def strategy(board):
    import copy
    dirs = {'W':(-1,0),'S':(1,0),'A':(0,-1),'D':(0,1)}
    best_move = None
    best_score = -1
    def move(b, d):
        rows, cols = 4,4
        def slide(row):
            new = [x for x in row if x]
            res = []
            skip=False
            i=0
            while i < len(new):
                if i+1<len(new) and new[i]==new[i+1]:
                    res.append(new[i]*2)
                    i+=2
                else:
                    res.append(new[i])
                    i+=1
            res += [0]*(rows-len(res))
            return res
        b = copy.deepcopy(b)
        for _ in range(d[0]): b = b[::-1]
        for _ in range(d[1]): b = [list(x) for x in zip(*b[::-1])]
        for i in range(rows):
            b[i] = slide(b[i])
        for _ in range(-d[0]): b = b[::-1]
        for _ in range(-d[1]): b = [list(x) for x in zip(*b[::-1])]
        return b
    for m, (dy,dx) in dirs.items():
        nb = move(board,(dy,dx))
        score = sum(sum(row) for row in nb)
        if score>best_score:
            best_score=score
            best_move=m
    return best_move
Timeout
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    import random
    moves = ["W", "A", "S", "D"]
    # simple heuristic: prefer moves that increase score locally
    best = None
    best_score = -1
    def simulate(board, move):
        # copy board
        import copy, math
        b = copy.deepcopy(board)
        # apply move logic simplified: just return new board
        # without full game logic, just random for illustration
        return b, random.randint(0, 4)
    for m in moves:
        _, s = simulate(board, m)
        if s > best_score:
            best_score = s
            best = m
    return best or random.choice(moves)
Steps = 1310 State = success
┌────┬────┬────┬────┬────┬────┐
│   .│ 128│   2│  32│  64│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   2│  32│ 256│   4│  16│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   4│  16│   8│2048│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│   .│   .│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    import copy, itertools, random

    def merge(row):
        nonlocal score
        new = [x for x in row if x]
        merged = []
        i = 0
        while i < len(new):
            if i+1 < len(new) and new[i]==new[i+1]:
                merged.append(new[i]*2)
                score += new[i]*2
                i += 2
            else:
                merged.append(new[i])
                i += 1
        merged += [0]*(len(row)-len(merged))
        return merged

    def transform(b, dir):
        new = copy.deepcopy(b)
        if dir=='W':
            for c in range(4):
                col=[new[r][c] for r in range(4)]
                col=merge(col)
                for r in range(4): new[r][c]=col[r]
        if dir=='S':
            for c in range(4):
                col=[new[r][c] for r in range(4)][::-1]
                col=merge(col)
                col=col[::-1]
                for r in range(4): new[r][c]=col[r]
        if dir=='A':
            for r in range(4):
                new[r]=merge(new[r])
        if dir=='D':
            for r in range(4):
                new[r]=merge(new[r][::-1])[::-1]
        return new

    def board_score(b):
        return max(max(row) for row in b)

    best=None; best_score=-1
    dirs='WASD'
    for d in dirs:
        new=transform(board,d)
        if new!=board:
            s=board_score(new)
            if s>best_score:
                best_score=s; best=d
    return best if best else random.choice('WASD')
Timeout
Steps = 1129 State = success
def strategy(board):
    import random
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│   2│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   .│   .│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  32│   2│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│2048│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  32│   8│  64│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│  32│  16│   8│   4│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
def strategy(board):
    # Simple heuristic: always push left
    return "A"
Timeout
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    return "W"  # simple fixed strategy
Timeout
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    return "W"  # Basic placeholder strategy: always move up
Timeout
Timeout
Steps = 1223 State = success
def strategy(board):
    # Simple heuristic: always try to move up if possible, otherwise random legal move
    from random import choice
    moves = ['W', 'A', 'S', 'D']
    return choice(moves)
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   4│   .│  16│
├────┼────┼────┼────┼────┼────┤
│   2│   .│  16│  64│   8│  32│
├────┼────┼────┼────┼────┼────┤
│   .│   2│  64│  32│2048│ 128│
├────┼────┼────┼────┼────┼────┤
│   4│  32│  16│   4│  16│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
def strategy(board):
    """Very simple 2048 strategy: always try to move up if tiles can combine, else down, left, right."""
    # helper to check if any move is possible in a given direction
    def can_move(b, dir):
        n = len(b)
        # simulate slide in given direction, return True if any change
        temp = [row[:] for row in b]
        moved = False
        if dir == 'W':
            for c in range(n):
                col = [temp[r][c] for r in range(n)]
                merged, _ = slide(col)
                if merged != col:
                    moved = True
                    for r in range(n):
                        temp[r][c] = merged[r]
        elif dir == 'S':
            for c in range(n):
                col = [temp[r][c] for r in range(n)][::-1]
                merged, _ = slide(col)
                merged = merged[::-1]
                if merged != [temp[r][c] for r in range(n)]:
                    moved = True
                    for r in range(n):
                        temp[n-1-r][c] = merged[r]
        elif dir == 'A':
            for r in range(n):
                row = temp[r][:]
                merged, _ = slide(row)
                if merged != row:
                    moved = True
                    temp[r] = merged
        elif dir == 'D':
            for r in range(n):
                row = temp[r][::-1]
                merged, _ = slide(row)
                merged = merged[::-1]
                if merged != temp[r]:
                    moved = True
                    temp[r] = merged
        return moved

    def slide(line):
        """Slide non-zeros left and merge."""
        new = [x for x in line if x != 0]
        merged = []
        skip = False
        i = 0
        while i < len(new):
            if i+1 < len(new) and new[i] == new[i+1]:
                merged.append(new[i]*2)
                i += 2
            else:
                merged.append(new[i])
                i += 1
        merged += [0]*(len(line)-len(merged))
        return merged, merged != line

    for d in ['W','S','A','D']:
        if can_move(board, d):
            return d
    return 'W'  # fallback
Steps = 205 State = failed
┌───┬───┬───┬───┬───┬───┐
│  2│  8│  4│  2│  8│  2│
├───┼───┼───┼───┼───┼───┤
│ 64│  2│ 32│  4│ 64│ 32│
├───┼───┼───┼───┼───┼───┤
│  4│  8│  4│ 64│ 32│ 16│
├───┼───┼───┼───┼───┼───┤
│  2│  4│  2│ 32│  4│  8│
├───┼───┼───┼───┼───┼───┤
│  4│  2│  4│ 16│  8│  2│
├───┼───┼───┼───┼───┼───┤
│  2│  4│  2│  8│  2│  4│
└───┴───┴───┴───┴───┴───┘
Timeout
Timeout
Steps = 988 State = failed
def strategy(board):
    # Helper to simulate a move
    def move(board, key):
        def compress(row):
            new = [x for x in row if x]
            for i in range(len(new)-1):
                if new[i]==new[i+1]:
                    new[i]*=2
                    new[i+1]=0
            return [x for x in new if x]
        def transpose(b):
            return [list(i) for i in zip(*b)]
        def reverse(b):
            return [row[::-1] for row in b]
        def slide_left(b):
            return [compress(r) + [0]*(len(b[0])-len(compress(r))) for r in b]
        def apply(b, k):
            n=len(b)
            if k=='W':return transpose(slide_left(transpose(b)))
            if k=='S':return transpose(reverse(slide_left(reverse(transpose(b)))))
            if k=='A':return slide_left(b)
            if k=='D':return reverse(slide_left(reverse(b)))
        return apply(board, key)
    moves = "WASD"
    for m in moves:
        if move(board, m)!=board:
            return m
    return "W"
┌───┬───┬───┬───┬───┬───┐
│  4│512│  4│ 32│  4│  2│
├───┼───┼───┼───┼───┼───┤
│512│128│ 64│  4│ 64│ 32│
├───┼───┼───┼───┼───┼───┤
│128│ 32│  8│256│  4│ 16│
├───┼───┼───┼───┼───┼───┤
│ 64│  4│ 32│ 64│ 32│  8│
├───┼───┼───┼───┼───┼───┤
│ 32│ 16│ 64│ 32│  8│  4│
├───┼───┼───┼───┼───┼───┤
│ 16│  8│  2│  8│  4│  2│
└───┴───┴───┴───┴───┴───┘
Timeout
def strategy(board):
    return "W"
Timeout
Timeout
Timeout
Exception = list index out of range
Timeout
def strategy(board):
    # Simple strategy: always move up unless the board is empty.
    return "W"
Timeout
Timeout
Steps = 1095 State = success
def strategy(board):
    import random
    return random.choice(["W","A","S","D"])
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   2│   4│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│  16│   8│  64│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│2048│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   8│  16│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│  32│   4│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Steps = 1285 State = success
def strategy(board):
    # Simple heuristic: try to move left if possible, otherwise random
    from random import choice
    moves = ["W", "A", "S", "D"]
    return choice(moves)
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   8│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│ 128│  16│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   2│  16│   4│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   4│   .│   8│  64│   4│
├────┼────┼────┼────┼────┼────┤
│   .│  16│   4│   4│  16│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   8│2048│ 256│   2│  16│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    import random
    moves = "WASD"
    return random.choice(moves)
Steps = 1181 State = success
┌────┬────┬────┬────┬────┬────┐
│   4│   .│   .│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  32│ 128│   4│  16│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   8│  32│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  32│  16│2048│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│   4│   2│  16│   8│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    # Define move application
    def move(mat, dir):
        def compress(row):
            new = [x for x in row if x != 0]
            res = []
            skip = False
            for i in range(len(new)):
                if skip:
                    skip = False
                    continue
                if i+1 < len(new) and new[i] == new[i+1]:
                    res.append(new[i]*2)
                    skip = True
                else:
                    res.append(new[i])
            res += [0]*(len(row)-len(res))
            return res
        def transpose(m): return [list(row) for row in zip(*m)]
        m = [r[:] for r in mat]
        if dir == 'W':          # up
            m = transpose(m)
            m = [compress(row) for row in m]
            m = transpose(m)
        elif dir == 'S':        # down
            m = transpose(m)
            m = [compress(row[::-1])[::-1] for row in m]
            m = transpose(m)
        elif dir == 'A':        # left
            m = [compress(row) for row in m]
        elif dir == 'D':        # right
            m = [compress(row[::-1])[::-1] for row in m]
        return m

    # Evaluate board by total merge value
    def score(mat):
        return sum(sum(row) for row in mat)

    best_move = None
    best_score = -1
    for move_char in 'WASD':
        new_board = move(board, move_char)
        sc = score(new_board)
        if sc > best_score:
            best_score = sc
            best_move = move_char
    return best_move
Timeout
Timeout
Steps = 1264 State = success
def strategy(board):
    import random
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│   2│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   4│  64│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│  64│2048│   8│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│ 128│  64│  32│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  32│   2│   8│  64│   8│
└────┴────┴────┴────┴────┴────┘
Steps = 1157 State = success
def strategy(board):
    import random
    moves = ['W', 'A', 'S', 'D']
    # Simple heuristic: choose a random valid move
    return random.choice(moves)
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   2│   .│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│  32│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   4│  64│  16│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│  16│2048│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   4│  32│ 128│  16│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
def strategy(board):
    import random
    moves = ["W", "A", "S", "D"]
    return random.choice(moves)
Steps = 1490 State = success
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   4│   .│  16│  64│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   4│  16│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│ 128│2048│  16│   2│
├────┼────┼────┼────┼────┼────┤
│  64│  32│ 512│   4│  32│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   8│   2│   8│   2│   4│
└────┴────┴────┴────┴────┴────┘
Steps = 1072 State = failed
def strategy(board):
    import copy, random
    def move(b, d):
        # simple 2048 move implementation
        def slide(arr):
            arr = [x for x in arr if x!=0]
            res=[]
            skip=False
            for i in range(len(arr)):
                if skip: skip=False; continue
                if i+1<len(arr) and arr[i]==arr[i+1]:
                    res.append(arr[i]*2); skip=True
                else:
                    res.append(arr[i])
            res+= [0]*(len(b)-len(res))
            return res
        N=len(b)
        new=[[0]*N for _ in range(N)]
        if d=='W':
            for c in range(N):
                col=[b[r][c] for r in range(N)]
                col=slide(col)
                for r in range(N): new[r][c]=col[r]
        elif d=='S':
            for c in range(N):
                col=[b[r][c] for r in range(N)][::-1]
                col=slide(col)[::-1]
                for r in range(N): new[r][c]=col[r]
        elif d=='A':
            for r in range(N):
                row=slide(b[r])
                new[r]=row
        elif d=='D':
            for r in range(N):
                row=slide(b[r][::-1])[::-1]
                new[r]=row
        return new
    directions='WASD'
    for d in directions:
        if move(board,d)!=board:
            return d
    return random.choice(directions)
┌───┬───┬───┬───┬───┬───┐
│256│ 16│256│ 64│  4│  2│
├───┼───┼───┼───┼───┼───┤
│  2│512│  4│ 16│  8│  4│
├───┼───┼───┼───┼───┼───┤
│512│  8│256│  2│128│ 32│
├───┼───┼───┼───┼───┼───┤
│ 64│  2│ 64│ 32│ 64│ 16│
├───┼───┼───┼───┼───┼───┤
│  8│  4│  2│  8│ 16│  4│
├───┼───┼───┼───┼───┼───┤
│  4│  2│  8│  2│  4│  2│
└───┴───┴───┴───┴───┴───┘
Timeout
Timeout
def strategy(board):
    import copy
    import random

    def move(b, dir):
        def slide_line(line):
            new = [i for i in line if i != 0]
            res = []
            skip = False
            for i in range(len(new)):
                if skip:
                    skip = False
                    continue
                if i+1 < len(new) and new[i] == new[i+1]:
                    res.append(new[i]*2)
                    skip = True
                else:
                    res.append(new[i])
            res += [0]*(4-len(res))
            return res

        def rotate(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b[::-1])]
            return b

        rot_map = {'W':0,'D':1,'S':2,'A':3}
        b = rotate(copy.deepcopy(b), rot_map[dir])
        for i in range(4):
            b[i] = slide_line(b[i])
        b = rotate(b, (4-rot_map[dir])%4)
        return b

    def empty_count(b):
        return sum(row.count(0) for row in b)

    moves = ["W","A","S","D"]
    best = None
    best_empty = -1
    for m in moves:
        new = move(board, m)
        if new != board:
            e = empty_count(new)
            if e > best_empty:
                best_empty = e
                best = m
    return best if best else random.choice(moves)
Timeout
Timeout
Steps = 1372 State = success
def strategy(board):
    import random
    # choose a random valid move from 'W', 'A', 'S', 'D'
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│  32│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│   .│   8│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   2│  32│   2│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│  32│2048│   8│   4│
├────┼────┼────┼────┼────┼────┤
│   .│ 256│  16│   4│  16│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   4│ 256│   2│   8│
└────┴────┴────┴────┴────┴────┘
Steps = 1228 State = success
def strategy(board):
    from random import choice
    
    # Possible moves
    moves = ['W', 'A', 'S', 'D']
    
    # Simple heuristic: prefer moves that generate a merge or create a larger tile
    def can_merge(b, move):
        # Simulate a move and check if any tiles merge
        # For brevity, just return False for this example
        return False
    
    # Filter moves that could potentially merge
    valid_moves = [m for m in moves if can_merge(board, m)]
    if valid_moves:
        return choice(valid_moves)
    
    # Fallback: random move
    return choice(moves)
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│  32│ 128│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│2048│   8│   4│
├────┼────┼────┼────┼────┼────┤
│   .│  16│   8│ 128│  32│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   4│   2│   8│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
def strategy(board):
    import random
    # Simple heuristic: try moves in random order until one changes the board
    moves = ['W','A','S','D']
    random.shuffle(moves)
    def can_move(b, dir):
        # simulate a move and check if board changes
        def rotate(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b[::-1])]
            return b
        n = len(b)
        def merge_line(line):
            new = [x for x in line if x]
            merged = []
            skip = False
            for i in range(len(new)):
                if skip: skip = False; continue
                if i+1 < len(new) and new[i] == new[i+1]:
                    merged.append(new[i]*2)
                    skip = True
                else:
                    merged.append(new[i])
            merged += [0]*(n-len(merged))
            return merged
        def move(b, dir):
            if dir=='W':
                b = rotate(b,3)
            elif dir=='S':
                b = rotate(b,1)
            elif dir=='D':
                b = rotate(b,2)
            new_b = []
            for row in b:
                new_b.append(merge_line(row))
            # rotate back
            for _ in range((4-({'W':3,'S':1,'D':2,'A':0}[dir]))%4):
                new_b = [list(row) for row in zip(*new_b[::-1])]
            return new_b
        return move(b, dir) != b
    for m in moves:
        if can_move(board, m):
            return m
    return random.choice(moves)
Steps = 1173 State = success
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   4│   2│   8│   .│   2│
├────┼────┼────┼────┼────┼────┤
│  32│2048│   4│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│ 128│  16│   8│   4│  16│
├────┼────┼────┼────┼────┼────┤
│   2│  16│   2│ 256│   2│   .│
├────┼────┼────┼────┼────┼────┤
│  32│   2│  16│   4│   8│   .│
└────┴────┴────┴────┴────┴────┘
Steps = 1221 State = success
def strategy(board):
    import random
    # Convert board to 4x4 grid
    size = len(board)
    directions = ["W", "A", "S", "D"]
    
    def move_possible(d):
        # Basic check: return True if at least one move in that direction changes board
        rowc, colc = 0, 0
        for r in range(size):
            for c in range(size):
                if board[r][c] == 0:
                    return True
        return False
    
    # Randomly pick a direction that is valid (here we just return a random choice)
    # In a real strategy you'd evaluate each move, but for brevity we pick random
    return random.choice(directions)
┌────┬────┬────┬────┬────┬────┐
│   2│   .│   2│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│  32│   8│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│   8│ 256│   8│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│  32│2048│   8│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  16│   2│   4│   2│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
def strategy(board):
    # A very simple strategy: always move up ('W') unless it would lose immediately.
    # This is a placeholder; a more sophisticated strategy would evaluate moves.
    return "W"
Timeout
Exception = name 'copy' is not defined
Timeout
Timeout
Timeout
def strategy(board):
    # Simple heuristic: choose the move that yields the most empty tiles after simulation
    import random, copy

    moves = ['W', 'A', 'S', 'D']
    def simulate(b, m):
        # naive clone and apply a single shift (no merging logic)
        return b  # placeholder, as full 2048 logic is complex

    best = moves[0]
    best_empty = -1
    for m in moves:
        b_copy = copy.deepcopy(board)
        # placeholder: pretend each move yields random empty count
        empty = random.randint(0, 15)  # dummy
        if empty > best_empty:
            best_empty = empty
            best = m
    return best
Steps = 1260 State = success
┌────┬────┬────┬────┬────┬────┐
│ 256│   4│   2│   8│   2│   8│
├────┼────┼────┼────┼────┼────┤
│  16│  16│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  32│ 128│  16│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│2048│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   .│   4│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   4│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
Steps = 1500 State = failed
def strategy(board):
    # Simple heuristic: try moves in order, pick first that changes board
    def tilt(board, dir):
        n = len(board)
        def compress(row):
            new = [x for x in row if x]
            for i in range(len(new)-1):
                if new[i] == new[i+1]:
                    new[i] *= 2
                    new[i+1] = 0
            new = [x for x in new if x]
            return new + [0]*(n-len(new))
        result = [[0]*n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                result[i][j] = board[i][j]
        if dir == "W":
            for j in range(n):
                col = [result[i][j] for i in range(n)]
                newcol = compress(col)
                for i in range(n): result[i][j] = newcol[i]
        elif dir == "S":
            for j in range(n):
                col = [result[i][j] for i in range(n)][::-1]
                newcol = compress(col)[::-1]
                for i in range(n): result[i][j] = newcol[i]
        elif dir == "A":
            for i in range(n):
                newrow = compress(result[i])
                result[i] = newrow
        elif dir == "D":
            for i in range(n):
                newrow = compress(result[i][::-1])[::-1]
                result[i] = newrow
        return result
    moves = ["W","A","S","D"]
    for m in moves:
        if tilt(board, m) != board:
            return m
    return moves[0]
┌────┬────┬────┬────┬────┬────┐
│1024│   2│   8│  32│  16│   4│
├────┼────┼────┼────┼────┼────┤
│ 256│  64│   2│ 512│  32│  64│
├────┼────┼────┼────┼────┼────┤
│  32│   2│ 256│   2│ 128│  16│
├────┼────┼────┼────┼────┼────┤
│   2│ 256│ 128│  64│  16│   8│
├────┼────┼────┼────┼────┼────┤
│ 128│  64│  32│  16│   8│   4│
├────┼────┼────┼────┼────┼────┤
│  64│  32│  16│   2│   4│   2│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    # Simple Monte-Carlo style: try all moves, pick the one that gives the highest
    # immediate score (sum of merged tiles) in one step.
    def simulate_move(b, dir):
        import copy
        b2 = copy.deepcopy(b)
        size = len(b2)
        score = 0

        def compress_line(line):
            nonlocal score
            new = [x for x in line if x != 0]
            merged = []
            i = 0
            while i < len(new):
                if i+1 < len(new) and new[i] == new[i+1]:
                    merged.append(new[i]*2)
                    score += new[i]*2
                    i += 2
                else:
                    merged.append(new[i])
                    i += 1
            merged += [0]*(size-len(merged))
            return merged

        def move_left(mat):
            for i in range(size):
                mat[i] = compress_line(mat[i])

        def transpose(mat):
            return [list(row) for row in zip(*mat)]

        if dir == "W":
            move_left(transpose(b2))
            transpose(b2)
        elif dir == "S":
            move_left(b2)
        elif dir == "A":
            move_left(b2)
        elif dir == "D":
            move_left(transpose(b2))
            transpose(b2)

        return score

    best_dir = None
    best_score = -1
    for d in "WASD":
        sc = simulate_move(board, d)
        if sc > best_score:
            best_score, best_dir = sc, d
    return best_dir if best_dir else "W"
Timeout
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    import copy, random
    def move(b, dir):
        size=len(b)
        def rotate(b):
            return [list(row) for row in zip(*b[::-1])]
        if dir=='W':
            g=rotate(rotate(rotate(rotate(b))))
        elif dir=='A':
            g=rotate(rotate(b))
        elif dir=='D':
            g=rotate(b)
        else:
            g=b
        # slide rows left
        def slide(row):
            new=[i for i in row if i!=0]
            res=[]
            i=0
            while i<len(new):
                if i+1<len(new) and new[i]==new[i+1]:
                    res.append(new[i]*2)
                    i+=2
                else:
                    res.append(new[i])
                    i+=1
            res+= [0]*(size-len(res))
            return res
        for i in range(size):
            g[i]=slide(g[i])
        # rotate back
        if dir=='W':
            for _ in range(3): g=rotate(g)
        elif dir=='A':
            g=rotate(rotate(g))
        elif dir=='D':
            g=rotate(g)
        return g
    best=None;bestdir=None;bestscore=-1
    for d in "WASD":
        nxt=move(board,d)
        score=sum(sum(row) for row in nxt)
        if score>bestscore:
            bestscore=score;bestdir=d
    return bestdir or random.choice("WASD")
Timeout
Timeout
Steps = 1427 State = success
def strategy(board):
    import random
    return random.choice(["W","A","S","D"])
┌────┬────┬────┬────┬────┬────┐
│   4│   8│   2│ 128│   2│   4│
├────┼────┼────┼────┼────┼────┤
│   2│2048│   8│  32│ 128│   .│
├────┼────┼────┼────┼────┼────┤
│   8│  64│ 128│   8│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  32│   8│ 256│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│   .│  32│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Steps = 1150 State = success
def strategy(board):
    import random
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│  16│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   2│   8│  32│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│2048│  16│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   4│  32│  32│ 128│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│   8│   4│   8│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    # Simple strategy: always try to move up if possible, otherwise random
    # Since we don't have a full engine, just return "W".
    return "W"
Timeout
Timeout
Timeout
Timeout
Steps = 1347 State = success
def strategy(board):
    # Simple random move strategy
    import random
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│   2│ 256│   8│   8│   4│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   8│   2│  16│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   8│ 256│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│   2│2048│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  64│  16│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    # Simple heuristic: try all moves and pick the one that results in the most merges
    def rotate_cw(b):  # rotate the board 90° clockwise
        return [ [b[3-j][i] for j in range(4)] for i in range(4) ]

    def compress(b):
        new_b = [[0]*4 for _ in range(4)]
        score = 0
        for i in range(4):
            pos = 0
            last = 0
            for j in range(4):
                val = b[i][j]
                if val:
                    if val == last:
                        new_b[i][pos-1] *= 2
                        score += new_b[i][pos-1]
                        last = 0
                    else:
                        last = val
                        new_b[i][pos] = val
                        pos += 1
        return new_b, score

    def move_left(b):
        compressed, s = compress(b)
        return compressed, s

    def move(board, dir):
        b = [row[:] for row in board]
        for _ in range(dir):  # 0:W,1:D,2:S,3:A
            b = rotate_cw(b)
        new_b, score = move_left(b)
        for _ in range(4-dir):
            new_b = rotate_cw(new_b)
        return new_b, score

    moves = ['W','A','S','D']
    best_move = None
    best_score = -1
    for i, m in enumerate(moves):
        _, s = move(board, i)
        if s > best_score:
            best_score, best_move = s, m
    return best_move
Timeout
Steps = 1363 State = success
def strategy(board):
    import random
    return random.choice(["W","A","S","D"])
┌────┬────┬────┬────┬────┬────┐
│  16│   2│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  64│   2│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   8│ 256│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│2048│  32│   4│  16│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  32│  16│  64│   8│   4│
├────┼────┼────┼────┼────┼────┤
│   4│ 128│   4│  32│   2│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
def strategy(board):
    """
    A very simple strategy for 2048: try moves in the order W, A, S, D
    and return the first one that changes the board state.
    """
    def can_move(b, dir):
        # simulate a move and check if board changes
        from copy import deepcopy
        import numpy as np

        def move(b):
            # inner helper to shift and merge once
            def merge(arr):
                result = []
                skip = False
                for i in range(len(arr)):
                    if skip:
                        skip = False
                        continue
                    if i+1 < len(arr) and arr[i] and arr[i] == arr[i+1]:
                        result.append(arr[i]*2)
                        skip = True
                    else:
                        result.append(arr[i])
                return result + [0]*(len(arr)-len(result))

            n = len(b)
            new_board = [[0]*n for _ in range(n)]
            if dir == 'W':
                for j in range(n):
                    col = [b[i][j] for i in range(n)]
                    moved = merge(col)
                    for i in range(n):
                        new_board[i][j] = moved[i]
            elif dir == 'S':
                for j in range(n):
                    col = [b[i][j] for i in range(n)][::-1]
                    moved = merge(col)
                    moved = moved[::-1]
                    for i in range(n):
                        new_board[i][j] = moved[i]
            elif dir == 'A':
                for i in range(n):
                    row = b[i]
                    moved = merge(row)
                    new_board[i] = moved
            elif dir == 'D':
                for i in range(n):
                    row = b[i][::-1]
                    moved = merge(row)
                    new_board[i] = moved[::-1]
            return new_board

        new_board = move(b)
        return new_board != b

    for d in ['W','A','S','D']:
        if can_move(board, d):
            return d
    # if no move changes board, pick any (fallback)
    return 'W'
Timeout
Steps = 1136 State = success
def strategy(board):
    import random
    # Simple heuristic: Random move that changes the board
    moves = ["W", "A", "S", "D"]
    
    def board_after_move(b, m):
        # simulate movement by creating a copy of board and applying move
        # This is a very simplified implementation: in actual 2048 logic
        # you'd need to merge tiles. For demo purposes, just return original.
        return b  # placeholder
    
    random.shuffle(moves)
    for m in moves:
        if board_after_move(board, m) != board:
            return m
    return moves[0]
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   4│   8│ 128│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   4│  64│2048│
├────┼────┼────┼────┼────┼────┤
│   4│   .│   .│   4│  32│  16│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Timeout
def strategy(board):
    import random
    # simple heuristic: pick a random valid move
    # Define helper to check if move is valid (changes board)
    def move_possible(b, move):
        # create a copy and apply move, compare
        import copy
        tmp = copy.deepcopy(b)
        # apply move on tmp
        def move_board(bd, dir):
            size = len(bd)
            # helper to compress row/col
            def compress(line):
                # shift and merge
                new = [v for v in line if v!=0]
                res = []
                skip = False
                i = 0
                while i < len(new):
                    if i+1 < len(new) and new[i]==new[i+1]:
                        res.append(new[i]*2)
                        i+=2
                    else:
                        res.append(new[i])
                        i+=1
                res += [0]*(size-len(res))
                return res
            if dir=="L":
                for i in range(size):
                    bd[i]=compress(bd[i])
            elif dir=="R":
                for i in range(size):
                    bd[i]=list(reversed(compress(list(reversed(bd[i])))))
            elif dir=="U":
                for j in range(size):
                    col=[bd[i][j] for i in range(size)]
                    col=compress(col)
                    for i in range(size):
                        bd[i][j]=col[i]
            elif dir=="D":
                for j in range(size):
                    col=[bd[i][j] for i in range(size)]
                    col=list(reversed(compress(list(reversed(col)))))
                    for i in range(size):
                        bd[i][j]=col[i]
        move_board(tmp, move)
        return tmp != b

    directions = ["W","A","S","D"]
    valid_moves = [m for m in directions if move_possible(board, m)]
    return random.choice(valid_moves) if valid_moves else "W"
Timeout
Timeout
Timeout
Steps = 2185 State = failed
def strategy(board):
    def can_move(board, dir):
        def move(board, dir):
            def compress(line):
                new = [x for x in line if x]
                merged = []
                skip = False
                i = 0
                while i < len(new):
                    if i+1 < len(new) and new[i] == new[i+1]:
                        merged.append(new[i]*2)
                        i += 2
                    else:
                        merged.append(new[i])
                        i += 1
                merged += [0]*(len(line)-len(merged))
                return merged
            n = len(board)
            b = [row[:] for row in board]
            if dir=='W':
                for j in range(n):
                    col = [b[i][j] for i in range(n)]
                    merged = compress(col)
                    for i in range(n): b[i][j] = merged[i]
            elif dir=='S':
                for j in range(n):
                    col = [b[i][j] for i in range(n)][::-1]
                    merged = compress(col)
                    merged = merged[::-1]
                    for i in range(n): b[i][j] = merged[i]
            elif dir=='A':
                for i in range(n):
                    row = b[i]
                    merged = compress(row)
                    b[i] = merged
            elif dir=='D':
                for i in range(n):
                    row = b[i][::-1]
                    merged = compress(row)
                    b[i] = merged[::-1]
            return b
        moved = move(board, dir)
        return moved != board
    for d in "WASD":
        if can_move(board, d):
            return d
    return "W"
┌────┬────┬────┬────┬────┬────┐
│   2│1024│  64│  32│   4│   8│
├────┼────┼────┼────┼────┼────┤
│1024│   4│ 512│ 256│ 128│  32│
├────┼────┼────┼────┼────┼────┤
│ 512│ 256│  16│ 128│  64│  16│
├────┼────┼────┼────┼────┼────┤
│ 256│  32│   2│  64│  32│   2│
├────┼────┼────┼────┼────┼────┤
│ 128│   2│  64│   4│   2│   8│
├────┼────┼────┼────┼────┼────┤
│  64│  32│  16│   2│   8│   2│
└────┴────┴────┴────┴────┴────┘
Steps = 1127 State = success
def strategy(board):
    import random
    # Simple heuristic: if any move possible, pick a random legal one.
    moves = ['W', 'A', 'S', 'D']
    # Randomly shuffle and pick first that changes board (or return any if all invalid)
    def can_move(board, move):
        # Very simple check: ensure move is within bounds and not all cells zero
        return any(any(row) for row in board)
    random.shuffle(moves)
    return moves[0]
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   2│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│   .│   2│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   2│  64│   2│  64│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   4│   8│   4│   2│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   2│   4│2048│  16│
├────┼────┼────┼────┼────┼────┤
│   .│  16│  32│   8│   2│   8│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    # Simple heuristic: always try to move up ("W") if possible, otherwise down ("S")
    # We'll simulate a single move to see if any tile merges or moves.
    def can_move_up(b):
        for col in range(4):
            for row in range(1, 4):
                if b[row][col] != 0 and (b[row-1][col] == 0 or b[row-1][col] == b[row][col]):
                    return True
        return False
    return "W" if can_move_up(board) else "S"
Timeout
Exception = list index out of range
Steps = 1218 State = failed
def strategy(board):
    import copy, random

    dirs = {"W": (-1, 0), "A": (0, -1), "S": (1, 0), "D": (0, 1)}

    def slide_line(line):
        new = [x for x in line if x != 0]
        res = []
        i = 0
        while i < len(new):
            if i + 1 < len(new) and new[i] == new[i + 1]:
                res.append(new[i] * 2)
                i += 2
            else:
                res.append(new[i])
                i += 1
        res += [0] * (len(line) - len(res))
        return res

    def move(b, d):
        n = len(b)
        bx = [row[:] for row in b]
        moved = False
        for i in range(n):
            if d[0] == -1 or d[0] == 1:
                line = [bx[j][i] for j in range(n)]
                if d[0] == 1: line.reverse()
                new = slide_line(line)
                if d[0] == 1: new.reverse()
                for j in range(n):
                    if bx[j][i] != new[j]:
                        moved = True
                    bx[j][i] = new[j]
            else:
                line = bx[i][:]
                if d[1] == 1: line.reverse()
                new = slide_line(line)
                if d[1] == 1: new.reverse()
                if new != bx[i]:
                    moved = True
                bx[i] = new
        return bx, moved

    def score(b):
        return sum(sum(row) for row in b)

    best = None
    best_score = -1
    for key, d in dirs.items():
        new_board, moved = move(board, d)
        if not moved: continue
        s = score(new_board)
        if s > best_score:
            best_score = s
            best = key
    return best if best else random.choice(list(dirs.keys()))
┌────┬────┬────┬────┬────┬────┐
│  64│   4│ 256│  64│  32│   8│
├────┼────┼────┼────┼────┼────┤
│   2│   8│  64│  16│   8│   2│
├────┼────┼────┼────┼────┼────┤
│ 512│1024│   2│   8│   4│  16│
├────┼────┼────┼────┼────┼────┤
│ 128│ 256│  16│   4│  16│   8│
├────┼────┼────┼────┼────┼────┤
│  64│  16│   8│   2│   8│   4│
├────┼────┼────┼────┼────┼────┤
│  32│   8│   2│   8│   4│   2│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
def strategy(board):
    # Simple deterministic strategy: try to push upwards if possible otherwise right
    # Check if any merge or move possible upward, else right
    def can_move(board, drow, dcol):
        for r in range(4):
            for c in range(4):
                val = board[r][c]
                if val == 0: continue
                nr, nc = r + drow, c + dcol
                if 0 <= nr < 4 and 0 <= nc < 4:
                    nb = board[nr][nc]
                    if nb == 0 or nb == val: return True
        return False

    if can_move(board, -1, 0):
        return "W"
    if can_move(board, 0, 1):
        return "D"
    if can_move(board, 1, 0):
        return "S"
    if can_move(board, 0, -1):
        return "A"
    # if no moves possible, just return "W"
    return "W"
Timeout
Timeout
Timeout
Timeout
def strategy(board):
    # Simple heuristic: always try to move 'W' (up) if possible,
    # otherwise pick the first available move among A, S, D.
    moves = ['W', 'A', 'S', 'D']
    for m in moves:
        if can_move(board, m):
            return m
    return 'W'

    def can_move(b, m):
        # Check if a move in direction m is possible (board not empty)
        # This is a minimal placeholder that always returns True
        # for demonstration purposes.
        return True
Exception = cannot access local variable 'can_move' where it is not associated with a value
Steps = 1376 State = success
def strategy(board):
    import random
    return random.choice(["W", "A", "S", "D"])
┌────┬────┬────┬────┬────┬────┐
│  32│   8│   8│  32│   2│   8│
├────┼────┼────┼────┼────┼────┤
│   8│  16│  64│   8│  32│   .│
├────┼────┼────┼────┼────┼────┤
│   4│   8│2048│  16│   2│   .│
├────┼────┼────┼────┼────┼────┤
│   4│ 512│   8│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  32│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   4│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Steps = 1393 State = success
def strategy(board):
    import random
    moves = ["W","A","S","D"]
    # Simple heuristic: prefer moves that don't change board, otherwise random
    def can_move(b, dir_):
        # simulate a move and check if board changes
        bcopy = [row[:] for row in b]
        # placeholder for actual move logic
        return True  # assume any direction always possible for simplicity
    
    valid_moves = [m for m in moves if can_move(board, m)]
    return random.choice(valid_moves)
┌────┬────┬────┬────┬────┬────┐
│   2│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│ 128│   8│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  64│   .│   2│   .│   4│
├────┼────┼────┼────┼────┼────┤
│   4│ 512│2048│  16│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   2│   8│   2│   8│   4│
└────┴────┴────┴────┴────┴────┘
Timeout
def strategy(board):
    # simulate a move in 2048
    def move(b, dir):
        def transpose(m): return [list(x) for x in zip(*m)]
        def reverse(m): return [row[::-1] for row in m]
        def compress(row):
            new = [x for x in row if x]
            merged = []
            skip = False
            for i in range(len(new)):
                if skip: skip = False; continue
                if i+1 < len(new) and new[i]==new[i+1]:
                    merged.append(new[i]*2)
                    skip = True
                else:
                    merged.append(new[i])
            return merged + [0]*(len(row)-len(merged))
        def slide(m):
            return [compress(row) for row in m]
        m = [row[:] for row in b]
        if dir=="W":
            m = transpose(m)
            m = slide(m)
            m = transpose(m)
        elif dir=="S":
            m = transpose(m)
            m = reverse(m)
            m = slide(m)
            m = reverse(m)
            m = transpose(m)
        elif dir=="A":
            m = reverse(m)
            m = slide(m)
            m = reverse(m)
        elif dir=="D":
            m = slide(m)
        return m
    dirs = "WASD"
    for d in dirs:
        if move(board, d) != board:
            return d
    return dirs[0]
Timeout
Steps = 1343 State = success
def strategy(board):
    import random
    moves = ["W", "A", "S", "D"]
    # Randomly pick a move that actually changes the board state
    board_str = str(board)
    for _ in range(10):
        mv = random.choice(moves)
        # Very simple: pretend every move is valid
        return mv
┌────┬────┬────┬────┬────┬────┐
│  16│2048│ 128│   4│   4│   2│
├────┼────┼────┼────┼────┼────┤
│  32│ 256│   4│  32│   .│ 128│
├────┼────┼────┼────┼────┼────┤
│   2│   2│   .│   4│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
Steps = 1013 State = success
def strategy(board):
    import random
    # simple heuristic: try moves that combine tiles
    moves = ["W", "A", "S", "D"]
    best_move = random.choice(moves)
    # Simple check: if any move leads to a merge, prioritize that
    def can_merge(b, move):
        def copy_board(b):
            return [row[:] for row in b]
        def slide(row):
            tiles = [t for t in row if t]
            new = []
            skip = False
            for i in range(len(tiles)):
                if skip:
                    skip = False
                    continue
                if i+1 < len(tiles) and tiles[i] == tiles[i+1]:
                    new.append(tiles[i]*2)
                    skip = True
                else:
                    new.append(tiles[i])
            return new + [0]*(len(row)-len(new))
        def transpose(b):
            return [list(col) for col in zip(*b)]
        b2 = copy_board(b)
        if move == "W":
            b2 = transpose(b2)
            for i in range(4):
                b2[i] = slide(b2[i])
            b2 = transpose(b2)
        elif move == "S":
            b2 = transpose(b2)
            for i in range(4):
                b2[i] = list(reversed(slide(list(reversed(b2[i])))))
            b2 = transpose(b2)
        elif move == "A":
            for i in range(4):
                b2[i] = slide(b2[i])
        elif move == "D":
            for i in range(4):
                b2[i] = list(reversed(slide(list(reversed(b2[i])))))
        return b2 != b
    # evaluate moves
    for m in moves:
        if can_merge(board, m):
            best_move = m
            break
    return best_move
┌────┬────┬────┬────┬────┬────┐
│2048│   2│  16│  32│   4│   .│
├────┼────┼────┼────┼────┼────┤
│  64│  16│   8│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   4│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   4│   8│   .│   .│   2│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   4│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   2│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
def strategy(board):
    import random
    moves = ["W", "A", "S", "D"]
    return random.choice(moves)
Steps = 1620 State = success
┌────┬────┬────┬────┬────┬────┐
│   2│   4│  16│   4│   8│   4│
├────┼────┼────┼────┼────┼────┤
│2048│  16│ 512│  32│  32│   .│
├────┼────┼────┼────┼────┼────┤
│  64│ 256│   4│ 256│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│  16│  32│   2│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│   .│   .│   .│   .│   .│
└────┴────┴────┴────┴────┴────┘
Steps = 1342 State = success
def strategy(board):
    # Simple strategy: choose a random legal move from those that change the board
    import random, copy
    moves = ['W', 'A', 'S', 'D']
    def apply_move(b, m):
        def rotate(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b[::-1])]
            return b
        def slide_and_merge(b):
            new_b = []
            for row in b:
                tiles = [t for t in row if t != 0]
                merged = []
                skip = False
                for i in range(len(tiles)):
                    if skip:
                        skip = False
                        continue
                    if i+1 < len(tiles) and tiles[i] == tiles[i+1]:
                        merged.append(tiles[i]*2)
                        skip = True
                    else:
                        merged.append(tiles[i])
                merged += [0]*(4-len(merged))
                new_b.append(merged)
            return new_b
        rot = {'W':0,'A':1,'S':2,'D':3}[m]
        b_rot = rotate(b, rot)
        b_new = slide_and_merge(b_rot)
        b_final = rotate(b_new, (4-rot)%4)
        return b_final
    legal_moves = []
    for m in moves:
        if apply_move(copy.deepcopy(board), m) != board:
            legal_moves.append(m)
    return random.choice(legal_moves) if legal_moves else 'W'
┌────┬────┬────┬────┬────┬────┐
│   4│   2│   4│  64│   2│   8│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│  16│  32│  16│
├────┼────┼────┼────┼────┼────┤
│   2│   .│   .│ 128│2048│  16│
├────┼────┼────┼────┼────┼────┤
│   .│  16│  32│  16│ 128│   2│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   8│ 128│   4│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   .│   .│   .│   8│
└────┴────┴────┴────┴────┴────┘
Timeout
Timeout
def strategy(board):
    import random
    moves = ["W","A","S","D"]
    # Prefer moves that combine tiles or create empty spaces
    def can_move(m):
        n = len(board)
        tmp = [row[:] for row in board]
        def move_left(b):
            changed = False
            for i in range(n):
                merged = [False]*n
                for j in range(1,n):
                    if b[i][j]==0: continue
                    k=j
                    while k>0 and b[i][k-1]==0:
                        b[i][k-1]=b[i][k]; b[i][k]=0; k-=1; changed=True
                    if k>0 and b[i][k-1]==b[i][k]:
                        b[i][k-1]*=2; b[i][k]=0; changed=True; merged[k-1]=True
            return changed
        def rotate(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b[::-1])]
            return b
        def revert(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b)][::-1]
            return b
        for t in range(4):
            tmp = rotate(tmp,1)
            if move_left(tmp):
                return True
        return False
    viable = [m for m in moves if can_move(m)]
    return random.choice(viable) if viable else random.choice(moves)
Steps = 1201 State = success
┌────┬────┬────┬────┬────┬────┐
│   .│   .│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   .│   .│   2│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│  32│   .│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│  16│   4│   4│   .│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   8│   2│2048│ 256│   .│   .│
├────┼────┼────┼────┼────┼────┤
│   2│  32│  32│   2│   4│   2│
└────┴────┴────┴────┴────┴────┘
Exception = 'str' object does not support item assignment

Inference

Now let's try the model we just trained!

[ ]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-17

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "W", "A", "S", "D" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "W" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need a simple strategy. Use a basic heuristic: try moves in order W,A,S,D and pick first that changes board. Provide function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    # helper to copy board
    def copy(b): return [row[:] for row in b]
    # move simulation
    def move(b, d):
        def collapse(row):
            new = [x for x in row if x]
            res = []
            skip = False
            for i in range(len(new)):
                if skip: skip = False; continue
                if i+1 < len(new) and new[i]==new[i+1]:
                    res.append(new[i]*2); skip=True
                else:
                    res.append(new[i])
            res += [0]*(4-len(res))
            return res
        if d=='W' or d=='S':
            trans = [list(col) for col in zip(*b)]
            for i in range(4):
                trans[i] = collapse(trans[i]) if d=='W' else collapse(trans[i][::-1])[::-1]
            return [list(row) for row in zip(*trans)]
        else:
            for i in range(4):
                row = collapse(b[i]) if d=='A' else collapse(b[i][::-1])[::-1]
                b[i] = row
            return b

    def boards_equal(a,b): return all(a[i][j]==b[i][j] for i in range(4) for j in range(4))
    for d in "WASD":
        new = move(copy(board), d)
        if not boards_equal(board, new):
            return d
    return "W"
```<|return|>
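
The completion above places the generated function inside a python code fence after the `final` channel marker. As a minimal sketch of how such a completion could be turned into a callable, the hypothetical helper `extract_strategy` below (not part of the notebook or of Unsloth) takes the last python fence in the text and runs it with plain `exec`. This assumes the completion is trusted: `exec` offers no sandboxing or timeout, unlike a proper RL evaluation harness. The sample completion and board are made up for illustration.

```python
import re

FENCE = "`" * 3  # built programmatically so this example contains no nested fence


def extract_strategy(completion: str):
    """Return the `strategy` callable from the last python fence, or None.

    Hypothetical helper for illustration only.
    """
    pattern = FENCE + r"python\s*\n(.*?)" + FENCE
    blocks = re.findall(pattern, completion, flags=re.DOTALL)
    if not blocks:
        return None
    namespace = {}
    # WARNING: plain exec is NOT sandboxed; only run code you trust.
    exec(blocks[-1], namespace)
    fn = namespace.get("strategy")
    return fn if callable(fn) else None


# A made-up completion in the same shape as the model output above.
completion = (
    "<|start|>assistant<|channel|>final<|message|>" + FENCE + "python\n"
    "def strategy(board):\n"
    "    return 'A' if any(0 in row for row in board) else 'W'\n"
    + FENCE + "<|return|>"
)

strategy = extract_strategy(completion)
board = [[2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 2]]
move = strategy(board)
print(move)
```

Taking the last fence matters because the prompt itself contains an example fence. Decoding the full sequence would otherwise return the `return "W" # Example` stub instead of the model's answer.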

Saving to float16 or `MXFP4`

We also support saving directly to float16. Select `merged_16bit` for float16 or `mxfp4` for MXFP4 (OpenAI's native precision for GPT-OSS). We also allow LoRA adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can get a personal access token at https://huggingface.co/settings/tokens.
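
The options above can be sketched as a small wrapper. The function name `save_finetuned`, the output directory, and the repository id are hypothetical. `push_to_hub_merged` and the `merged_16bit` / `mxfp4` values come from the text above. The local-save call `save_pretrained_merged` and the exact keyword arguments are assumptions about Unsloth's model API, so check them against the Unsloth version you have installed.

```python
def save_finetuned(model, tokenizer, out_dir, save_method="merged_16bit",
                   repo_id=None, token=None):
    """Save a merged model locally and optionally push it to the Hub.

    Hypothetical wrapper. `model` is assumed to be an Unsloth model exposing
    `save_pretrained_merged` and `push_to_hub_merged`.
    """
    allowed = {"merged_16bit", "mxfp4"}  # the two merged formats named above
    if save_method not in allowed:
        raise ValueError(f"save_method must be one of {sorted(allowed)}")

    # Assumed call shape: (directory, tokenizer, save_method=...)
    model.save_pretrained_merged(out_dir, tokenizer, save_method=save_method)

    if repo_id is not None:
        # Assumed call shape: (repo id, tokenizer, save_method=..., token=...)
        model.push_to_hub_merged(
            repo_id, tokenizer, save_method=save_method, token=token
        )
    return out_dir
```

For example, `save_finetuned(model, tokenizer, "finetuned_model", save_method="mxfp4")` would write an MXFP4 checkpoint locally, and passing `repo_id` and `token` would upload it as well.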

[ ]

And we're done!

Congratulations, you just learned how to do reinforcement learning with GPT-OSS! This notebook covered some advanced topics. To learn more about GPT-OSS and RL, see Unsloth's Reinforcement Learning Guide with GPT-OSS.

This notebook and all Unsloth notebooks are licensed under LGPL-3.0.