Unsloth OpenEnv Gpt Oss (20B) Reinforcement Learning 2048 Game

OpenEnv Gpt Oss (20B) Reinforcement Learning 2048 Game

unsloth-notebooksunslothnb

alph-notebooks/unsloth-notebooks / OpenEnv_gpt_oss_(20B)_Reinforcement_Learning_2048_Game.ipynb

Export

Run Notebooks

Contents

No cells yet

Add cells to see them here

OpenEnv: Agentic Execution Environments

We're using the new OpenEnv library which has over 2000+ environments for RL!

To run this, press "Runtime" and press "Run all" on a free Tesla T4 Google Colab instance!

Join Discord if you need help + ⭐ Star us on Github ⭐

To install Unsloth on your local device, follow our guide.

Goal: Make gpt-oss play games with Reinforcement Learning

Our goal is to make OpenAI's open model gpt-oss 20b play the 2048 game with reinforcement learning. We want the model to devise a strategy to play 2048, and we will run this strategy until we win or lose.

Installation

We'll be using Unsloth to do RL on GPT-OSS 20B, and OpenEnv for the environment interactions. Unsloth saves 70% VRAM usage and makes reinforcement learning 2 to 6x faster!

[1]

We will then install OpenEnv from source:

[3]

We'll load GPT-OSS 20B and set some parameters:

max_seq_length = 768 The maximum context length of the model. Increasing it will use more memory.
lora_rank = 4 The larger this number, the smarter the RL process, but the slower and more memory usageload_in_16bit will be faster but will need a 64GB GPU or more (MI300)
offload_embedding = True New Unsloth optimization which moves the embedding to CPU RAM, reducing VRAM by 1GB.

[4]

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.11.3: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.5.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gpt_oss won't work! Using float32.

model.safetensors.index.json: 0.00B [00:00, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.37G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.16G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/165 [00:00<?, ?B/s]

Unsloth: Offloading embeddings to RAM to save 1.08 GB.

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/27.9M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/446 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

To do efficient RL, we will use LoRA, which allows us to only add 1 to 5% of extra weights to the model for finetuning purposes. This allows us to save memory usage by over 60%, and yet it retains good accuracy. Read Unsloth's GPT-OSS RL Guide for more details.

[5]

Unsloth: Making `model.base_model.model.model` require gradients

2048 game environment with OpenEnv

We first launch an OpenEnv process and import it! This will allows us to see how the 2048 implementation looks like!

[6]

We'll be using Unsloth's OpenEnv implementation and wrapping the launch_openenv with some setup arguments:

[7]

Let's see how the current 2048 game state looks like:

[8]

Unsloth: Creating new OpenEnv process at port = 12724.....................

OpenSpielObservation(done=False, reward=None, metadata={}, info_state=[0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], legal_actions=[0, 1, 2, 3], game_phase='initial', current_player_id=0, opponent_last_action=None)

First let's convert the state into a list of list of numbers!

[9]

([[0, 0, 0, 2], [0, 0, 2, 0], [0, 0, 0, 0], [0, 0, 0, 0]], 4)

We also want to pretty print the game board!

[10]

[11]

┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘

We can see the legal_actions ie what you can take as [0, 1, 2, 3] Let's try doing the action 0.

[12]

┌───┬───┬───┬───┐
│  .│  .│  2│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘

So it looks like 0 is a move up action! Let's try 1.

[13]

┌───┬───┬───┬───┐
│  .│  .│  .│  4│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
└───┴───┴───┴───┘

1 is a move right action. And 2:

[14]

┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  4│
├───┼───┼───┼───┤
│  2│  2│  .│  2│
└───┴───┴───┴───┘

2 is a move down. And I guess 3 is just move left!

[15]

┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  4│  .│  .│  .│
├───┼───┼───┼───┤
│  4│  2│  .│  2│
└───┴───┴───┴───┘

We can also print the game status which indicates if no more moves are possible, and also the possible actions you can take!

[16]

False
[0, 1, 3]

RL Environment Setup

We'll set up a function to accept some strategy that'll emit an action within 0123 and check the game state.

We'll also add a timer to only execute the strategy for 2 seconds maximum, otherwise it might never terminate!

[17]

Let's make a generic strategy to just hit 3. We should expect this generic strategy to fail:

[18]

(3, False)

To allow longer strategies for GPT-OSS Reinforcement Learning, we shall allow a 5 second timer.

[19]

Code Execution

To execute and create a new Python function, we first have to check if the function does not call other global variables or cheat. This is called countering reward hacking since we don't want the function to cheat.

For example the below piece of code is fine, since it only imports Python level functions. We use check_python_modules:

[20]

Only Python imports? True
{'stdlib': ['math', 'typing'], 'non_stdlib': [], 'relative_imports': 0}

For the below piece of code, since we import numpy, we should not allow the execution:

[21]

Only Python imports? False
{'stdlib': [], 'non_stdlib': ['numpy'], 'relative_imports': 0}

We also disallow global variable access. We'll use Unsloth's create_locked_down_function function

[22]

name 'np' is not defined

[23]

Data & RL task setup

We now have to create a prompt to tell the model to create a strategy for the 2048 game. You can customize this to some other task for another RL task.

[24]

Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.

First, let's prompt GPT-OSS without RL and see how it goes:

[25]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-25

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to provide a short function. Probably simple heuristic: choose move with lowest collision? Use sum? Just a placeholder.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    scores = [0]*4
    for move in range(4):
        r = 0
        for row in board:
            for val in row:
                r += val * (move==0 and 1 or move==1 and 1 or move==2 and 1 or move==3 and 1)
        scores[move] = r
    return str(scores.index(max(scores)))
```
<|return|>

Reward functions

We now design a extract_function function which simply extracts the function wrapped in 3 back ticks.

And 3 reward functions:

function_works which rewards the model if the strategy is a valid Python function.
no_cheating which checks if the function imported other modules, and if it did, we penalize it.
strategy_succeeds which checks if the game strategy actually succeeds in attaining 2048 after running the auto-generated strategy.

[26]

def strategy(board):
    return "0" # Example

Below is our function_works reward function which uses Python's exec but guarded by not allowing leakage of local and global variables. We can also use check_python_modules first to check if there are errors before even executing the function:

[27]

(False,
, {'error': "SyntaxError: expected '(' (<unknown>, line 1)",
,  'stdlib': [],
,  'non_stdlib': [],
,  'relative_imports': 0})

[28]

no_cheating checks if the function cheated since it might have imported Numpy or other functions:

[29]

Next strategy_succeeds checks if the strategy actually allows the game to terminate. Imagine if the strategy simply returned "0" which would fail after a time limit of 10 seconds.

We also add a global PRINTER to print out the strategy and board state.

[30]

We'll now create the dataset which includes a replica of our prompt. Remember to add a reasoning effort of low! You can choose high reasoning mode, but this'll only work on more memory GPUs like MI300s.

[31]

{'prompt': [{'content': 'Create a new short 2048 strategy using only native Python code.\nYou are given a list of list of numbers for the current board state.\nOutput one action for "0", "1", "2", "3" on what is the optimal next step.\nOutput your new short function in backticks using the format below:\n```python\ndef strategy(board):\n    return "0" # Example\n```\nAll helper functions should be inside def strategy. Only output the short function `strategy`.',
,   'role': 'user'}],
, 'answer': 0,
, 'reasoning_effort': 'low'}

Train the model

Now set up GRPO Trainer and all configurations! We also support GSPO, GAPO, Dr GRPO and more! Go the Unsloth Reinforcement Learning Docs for more options.

We're also using TrackIO which allows you to visualize all training metrics straight inside the notebook fully locally!

[32]

Unsloth: We now expect `per_device_train_batch_size` * `gradient_accumulation_steps` * `world_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the reward column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

Step	reward	reward_std	completion_length	kl
1	0.125000	0.000000	200.000000	0.000000
2	0.072375	0.248112	200.000000	0.000000
3	-0.079000	0.163776	182.500000	0.000005

[33]

Unsloth: Switching to float32 training since model cannot work with float16

And let's train the model! NOTE This might be quite slow! 600 steps takes ~5 hours or longer.

TrackIO might be a bit slow to load - wait 2 minutes until the graphs pop up!

[34]

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 600
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 1,990,656 of 20,916,747,840 (0.01% trained)

* Running on public URL: https://93f7f59fd9e1813cea.gradio.live
* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio
* Created new run: dainty-sunset-0

`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.

def strategy(board):
    # Simulate each move and choose one maximizing the largest tile after the move
    moves = ["0", "1", "2", "3"]  # 0:left, 1:right, 2:up, 3:down
    best_move, best_max = None, -1
    
    for m in moves:
        nxt = [row[:] for row in board]
        if m == "0":
            for row in nxt:
                l = [x for x in row if x]
                l += [0]*(4-len(l))
                for i in range(4):
                    row[i] = l[i]
        elif m == "1":
            for row in nxt:
                r = [x for x in row if x][::-1]
                r += [0]*(4-len(r))
                for i in range(4):
                    row[i] = r[::-1][i]
        elif m == "2":
            cols = list(zip(*nxt))
            for i, col in enumerate(cols):
                l = [x for x in col if x]
                l += [0]*(4-len(l))
                for j in range(4):
                    nxt[j][i] = l[j]
        else:  # m == "3"
            cols = list(zip(*nxt))
            for i, col in enumerate(cols):
                r = [x for x in col if x][::-1]
                r += [0]*(4-len(r))
                for j in range(4):
                    nxt[j][i] = r[::-1][j]
        max_tile = max(max(row) for row in nxt)
        if max_tile > best_max:
            best_max, best_move = max_tile, m
    return best_move
Steps = 1 If Done = False
┌───┬───┬───┬───┐
│  2│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 13 If Done = False
def strategy(board):
    # simulate a move in the given direction and return new board
    def move(b, d):
        w, h = len(b), len(b[0])
        res = [[0]*h for _ in range(w)]
        for i in range(w):
            line = [b[i][j] for j in range(h) if b[i][j]]
            if d == 1: line = line[::-1]
            merged = []
            skip = False
            for k in range(len(line)):
                if skip:
                    skip = False
                    continue
                if k+1 < len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2)
                    skip = True
                else:
                    merged.append(line[k])
            if d == 1: merged = merged[::-1]
            for j in range(len(merged)):
                res[i][j] = merged[j]
        return res

    # evaluate a board by its total sum
    def score(b):
        return sum(sum(row) for row in b)

    best_dir = None
    best_score = -1
    for d in range(4):
        new_board = move(board, d)
        s = score(new_board)
        if s > best_score:
            best_score = s
            best_dir = str(d)
    return best_dir
┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Unsloth: Will smartly offload gradients to save VRAM!

Exception = list indices must be integers or slices, not list
def strategy(board):
    size=len(board)
    def add_row(row):
        return [x for x in row if x]
    def compress(row):
        row+=[0]*(size-len(row))
        return row
    def merge(row):
        new=[]
        skip=False
        for i,x in enumerate(row):
            if skip:
                skip=False
                continue
            if i+1<len(row) and row[i]==row[i+1]:
                new.append(x*2)
                skip=True
            else:
                new.append(x)
        return new
    best, best_dir=None,None
    for dir in "0123":
        vec=[]
        if dir=="0":
            for r in board:vec+=add_row(r)
        elif dir=="1":
            for c in range(size):
                vec+=add_row([board[r][c] for r in range(size)])
        elif dir=="2":
            for r in range(size):
                vec+=add_row(board[size-1-r][::-1])
        else:
            for c in range(size):
                vec+=add_row([board[r][c] for r in range(size-1,-1,-1)])
        if vec:
            new=merge(compress(vec))
            score=sum(new)
            if best is None or score>best:
                best,best_dir=score,dir
    return best_dir
Steps = 2 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 3 If Done = False
def strategy(board):
    def move(board, dir):
        n=len(board); m=n
        def slide(line):
            new=[x for x in line if x>0]
            res=[]
            i=0
            while i<len(new):
                if i+1<len(new) and new[i]==new[i+1]:
                    res.append(new[i]*2)
                    i+=2
                else:
                    res.append(new[i]); i+=1
            result=res+[0]*(m-len(res))
            return result
        new=[[0]*m for _ in range(n)]
        if dir==0: # up
            for j in range(m):
                col=[board[i][j] for i in range(n)]
                new_col=slide(col)
                for i in range(n): new[i][j]=new_col[i]
        elif dir==1: # down
            for j in range(m):
                col=[board[i][j] for i in range(n)][::-1]
                new_col=slide(col)
                for i in range(n): new[n-1-i][j]=new_col[i]
        elif dir==2: # left
            for i in range(n):
                new[i]=slide(board[i])
        else: # right
            for i in range(n):
                rev=board[i][::-1]
                new_rev=slide(rev)
                new[i]=new_rev[::-1]
        return new

    def score(b):
        s=0
        for i in range(len(b)):
            for j in range(len(b)):
                s+=b[i][j]* (1 if (i+j)%2==0 else 0.5)
        return s

    best=0; bestS=score(move(board,best))
    for d in [1,2,3]:
        s=score(move(board,d))
        if s>bestS:
            bestS=s; best=d
    return str(best)
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 11 If Done = False
def strategy(board):
    # board is a 4x4 list of lists
    # try to move to the best direction by evaluating a simple heuristic
    def can_move(d):
        b = board
        for i in range(4):
            for j in range(4):
                val = b[i][j]
                if val == 0:
                    continue
                if d == 0 and j > 0 and b[i][j-1] in (0, val):
                    return True
                if d == 1 and j < 3 and b[i][j+1] in (0, val):
                    return True
                if d == 2 and i > 0 and b[i-1][j] in (0, val):
                    return True
                if d == 3 and i < 3 and b[i+1][j] in (0, val):
                    return True
        return False

    # Prefer moves that merge the largest tile or keep the board from splitting
    best = None
    for d in range(4):
        if can_move(d):
            best = d
            break
    return str(best if best is not None else 0)
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 3 If Done = False
def strategy(board):
    # Simple heuristic: always move up if possible, else right
    for i, row in enumerate(board):
        if row[i] == 0:
            return "0"  # move up
    return "3"  # otherwise, move down
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    def move(b, dir):  # 0=up,1=down,2=left,3=right
        size = len(b)
        new, changed = [ [0]*size for _ in range(size) ], False
        for i in range(size):
            line = []
            for j in range(size):
                val = b[j][i] if dir==0 else b[j][i] if dir==1 else b[i][j] if dir==2 else b[i][j]
                if val:
                    line.append(val)
            merged = []
            k=0
            while k < len(line):
                if k+1 < len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2); k+=2
                else:
                    merged.append(line[k]); k+=1
            for k in range(len(merged)):
                if dir==0:
                    new[merged.index(merged[k])][i] = merged[k]
                elif dir==1:
                    new[size-1-merged.index(merged[k])][i] = merged[k]
                elif dir==2:
                    new[i][merged.index(merged[k])] = merged[k]
                elif dir==3:
                    new[i][size-1-merged.index(merged[k])] = merged[k]
        return new
    best, best_dir = 0, 0
    for d in range(4):
        moved = move(board, d)
        score = sum(sum(row) for row in moved)
        if score > best:
            best, best_dir = score, d
    return str(best_dir)
Steps = 10 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
Exception = list index out of range
None
Steps = 19 If Done = False
def strategy(board):
    # Board is 4x4 list of lists; 0 denotes empty
    # Return direction: 0=up, 1=down, 2=left, 3=right
    def weight(b):
        w = 0
        for i in range(4):
            for j in range(4):
                if b[i][j]:
                    w += (b[i][j] == 2) * (i + j)
        return w
    best_dir = 0
    best_w = -1
    dirs = [(0,1,2,3)]  # dummy to avoid unused list
    for d in ["up","down","left","right"]:
        new = board[::-1] if d=="up" else board
        # simulate simple move
        # (in practice would use a copy and slide logic)
        if weight(new) > best_w:
            best_w = weight(new)
            best_dir = {"up":0,"down":1,"left":2,"right":3}[d]
    return str(best_dir)
┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 11 If Done = False
def strategy(board):
    # Helper to simulate move and count merges
    def slide(row):
        new = [x for x in row if x]
        i = 0
        while i < len(new)-1:
            if new[i]==new[i+1]:
                new[i]*=2
                del new[i+1]
            i+=1
        return new+[0]*(len(row)-len(new))

    def can_move(b):
        for r in range(4):
            for c in range(4):
                if board[r][c]==0: return True
                if r<3 and board[r][c]==board[r+1][c]: return True
                if c<3 and board[r][c]==board[r][c+1]: return True
        return False

    if not can_move(board): return "0"
    best, best_score = -1, -1
    for action in "0123":
        new_board = [row[:] for row in board]
        if action=="0":  # up
            for c in range(4):
                col = [new_board[r][c] for r in range(4)]
                col = slide(col)
                for r in range(4): new_board[r][c]=col[r]
        elif action=="1":  # down
            for c in range(4):
                col = [new_board[r][c] for r in range(4)][::-1]
                col = slide(col)
                col = col[::-1]
                for r in range(4): new_board[r][c]=col[r]
        elif action=="2":  # left
            for r in range(4):
                new_board[r] = slide(new_board[r])
        else:  # right
            for r in range(4):
                new_board[r] = slide(new_board[r][::-1])[::-1]
        score = sum(sum(row) for row in new_board)
        if score>best_score:
            best_score=score; best=action
    return best
┌───┬───┬───┬───┐
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘
Steps = 4 If Done = False
def strategy(board):
    # Count empty tiles (0) for each direction
    best_dir, best_score = None, -1
    dirs = {'0': (0, -1), '1': (0, 1), '2': (-1, 0), '3': (1, 0)}
    n = len(board)
    for d, (dx, dy) in dirs.items():
        # simulate move
        temp = [[0]*n for _ in range(n)]
        moved = False
        for i in range(n):
            for j in range(n):
                val = board[i][j]
                if val == 0: continue
                x, y = i, j
                while 0 <= x + dx < n and 0 <= y + dy < n and (temp[x+dx][y+dy] == 0 or temp[x+dx][y+dy] == val):
                    if temp[x+dx][y+dy] == val:
                        temp[x+dx][y+dy] *= 2
                        moved = True
                        break
                    x += dx; y += dy
                if not (0 <= x + dx < n and 0 <= y + dy < n) and not moved:
                    temp[x][y] = val
        # evaluate heuristic: fewer empty tiles after move
        empty = sum(row.count(0) for row in temp)
        if empty > best_score:
            best_score = empty
            best_dir = d
    return best_dir
┌───┬───┬───┬───┐
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  4│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    # prioritize moves that merge tiles
    best = 0
    best_score = -1
    for move in range(4):  # 0: up, 1: down, 2: left, 3: right
        new_board = []
        for row in board:
            new_row = row[:]
            # shift non-zero tiles
            merged = []
            for val in new_row:
                if val != 0:
                    merged.append(val)
            merged += [0] * (len(row) - len(merged))
            # merge adjacent equal values
            i = 0
            while i < len(merged) - 1:
                if merged[i] == merged[i + 1] and merged[i] != 0:
                    merged[i] += merged[i + 1]
                    merged[i + 1] = 0
                    i += 1
                i += 1
            new_row = merged
            new_board.append(new_row)
        # evaluate score: higher empty tiles = better
        empties = sum(row.count(0) for row in new_board)
        if empties > best_score:
            best_score = empties
            best = move
    return str(best)
Steps = 2 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 3 If Done = False
def strategy(board):
    # Quick heuristic: move right if possible, else down, else left, else up
    rows, cols = len(board), len(board[0])
    # check if can move right
    for i in range(rows):
        for j in range(cols-1):
            if board[i][j] == 0 and board[i][j+1] == 0:
                return "0"  # right
    # down
    for i in range(rows-1):
        for j in range(cols):
            if board[i][j] == 0 and board[i+1][j] == 0:
                return "1"  # down
    # left
    for i in range(rows):
        for j in range(1, cols):
            if board[i][j] == 0 and board[i][j-1] == 0:
                return "2"  # left
    return "3"  # up
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  4│  .│
└───┴───┴───┴───┘
Steps = 2 If Done = False
def strategy(board):
    # board is a 2D list of ints, 0 means empty.
    
    def move(board, d):
        # simulate moving all tiles in direction d (0=up,1=down,2=left,3=right)
        n = len(board)
        new_board = [[0]*n for _ in range(n)]
        for i in range(n):
            if d == 2 or d == 3:  # horizontal
                line = board[i] if d == 2 else board[i][::-1]
            else:  # vertical
                line = [board[j][i] for j in range(n)]
                if d == 3:  # right -> reverse
                    line = line[::-1]

            merged = []
            prev = None
            for v in line:
                if v == 0: continue
                if prev is None:
                    prev = v
                elif prev == v:
                    merged.append(prev*2)
                    prev = None
                else:
                    merged.append(prev)
                    prev = v
            if prev is not None:
                merged.append(prev)

            # pad with zeros
            while len(merged) < n:
                merged.append(0)
            # write back
            if d == 2 or d == 3:
                new_board[i] = merged if d == 2 else merged[::-1]
            else:
                for j in range(n):
                    new_board[j][i] = merged[j] if d == 0 else merged[::-1][j]
        return new_board

    best_dir = None
    best_score = -1
    for d in range(4):
        new = move(board, d)
        # simple score: sum of tiles
        score = sum(sum(row) for row in new)
        if score > best_score:
            best_score = score
            best_dir = d
    return str(best_dir)
┌───┬───┬───┬───┐
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 7 If Done = False
def strategy(board):
    moves = []
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val is None:
                continue
            moves.append((i, j, val))
    # Simple heuristic: move the largest tile towards the top-left corner
    best = None
    best_score = -1
    for m in moves:
        # The score is the sum of the tile value and its distance from (0,0)
        score = m[2] - (m[0] + m[1])
        if score > best_score:
            best_score = score
            best = str(m[1])  # use column index as action
    if best is None:
        return "0"
    return best
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘
def strategy(board):
    import random
    # try each move, keep the one with largest score (sum of tiles)
    best = None
    best_score = -1
    for move in "0123":
        b = [row[:] for row in board]
        # apply move
        def compress(row):
            nonz = [x for x in row if x]
            res = []
            i=0
            while i < len(nonz):
                if i+1 < len(nonz) and nonz[i]==nonz[i+1]:
                    res.append(nonz[i]*2); i+=2
                else:
                    res.append(nonz[i]); i+=1
            return res + [0]*(len(row)-len(res))
        if move=="0": # left
            for r in b:
                r[:]=compress(r)
        elif move=="1": # down
            for c in range(len(b[0])):
                col=[b[r][c] for r in range(len(b))]
                col=compress(col)
                for r in range(len(b)):
                    b[r][c]=col[r]
        elif move=="2": # right
            for r in b:
                r[:] = compress(r[::-1])[::-1]
        else: # up
            for c in range(len(b[0])):
                col=[b[r][c] for r in range(len(b))]
                col=compress(col[::-1])[::-1]
                for r in range(len(b)):
                    b[r][c]=col[r]
        score=sum(sum(row) for row in b)
        if score>best_score:
            best_score=score; best=move
    return "0" if best is None else best
Steps = 17 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
└───┴───┴───┴───┘
Steps = 12 If Done = False
def strategy(board):
    # simple priority: if up possible, else left, else down, else right
    def can(move):
        if move == 0:  # up
            for j in range(4):
                for i in range(1,4):
                    if board[i][j] and board[i-1][j] == 0: return True
        elif move == 1:  # down
            for j in range(4):
                for i in range(3):
                    if board[i][j] and board[i+1][j] == 0: return True
        elif move == 2:  # left
            for i in range(4):
                for j in range(1,4):
                    if board[i][j] and board[i][j-1] == 0: return True
        elif move == 3:  # right
            for i in range(4):
                for j in range(3):
                    if board[i][j] and board[i][j+1] == 0: return True
        return False
    for m in range(4):
        if can(m): return str(m)
    return "0"
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 6 If Done = False
def strategy(board):
    # board: list of 4 lists, each with 4 ints
    moves = []
    # simulate a move by sliding
    def slide(b, dir):
        result = [[0]*4 for _ in range(4)]
        for i in range(4):
            line = []
            for j in range(4):
                if dir == 0:  # up
                    val = b[j][i]
                elif dir == 1:  # left
                    val = b[i][j]
                elif dir == 2:  # down
                    val = b[3-j][i]
                else:  # right
                    val = b[i][3-j]
                if val: line.append(val)
            if not line: 
                return None
            merged = []
            skip = False
            for k in range(len(line)):
                if skip: 
                    skip = False
                    continue
                if k+1 < len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2)
                    skip = True
                else:
                    merged.append(line[k])
            # pad with zeros
            merged += [0]*(4-len(merged))
            # place back
            for j in range(4):
                if dir == 0:
                    result[j][i] = merged[j]
                elif dir == 1:
                    result[i][j] = merged[j]
                elif dir == 2:
                    result[3-j][i] = merged[j]
                else:
                    result[i][3-j] = merged[j]
        return result

    best_score = -1
    best_dir = 0
    for d in range(4):
        new_board = slide(board, d)
        if new_board is None:
            continue
        # heuristic: sum of squares, encouraging high values
        score = sum(sum(v*v for v in row) for row in new_board)
        if score > best_score:
            best_score = score
            best_dir = d
    return str(best_dir)
┌───┬───┬───┬───┐
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 8 If Done = False
def strategy(board):
    # Calculate scores for each direction based on simple heuristic
    def score_move(dx, dy):
        new_board = [[0]*4 for _ in range(4)]
        for i in range(4):
            for j in range(4):
                val = board[i][j]
                if val == 0:
                    continue
                x, y = i, j
                while True:
                    nx, ny = x+dx, y+dy
                    if 0 <= nx < 4 and 0 <= ny < 4:
                        if board[nx][ny] == 0 or board[nx][ny] == val:
                            x, y = nx, ny
                            if board[nx][ny] == val:
                                val *= 2
                        else:
                            break
                    else:
                        break
                new_board[x][y] = val
        # Heuristic: prefer moves that do not increase empty tiles
        empty = sum(1 for i in range(4) for j in range(4) if new_board[i][j] == 0)
        return empty

    moves = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # Up, Right, Down, Left
    scores = [score_move(dx, dy) for dx, dy in moves]
    best = min(range(4), key=lambda i: scores[i])  # choose move with fewest empty tiles
    return str(best)
┌───┬───┬───┬───┐
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    # Find any move that merges or moves a tile
    def can_move(board, dir):
        n = len(board)
        if dir == 0:  # up
            for j in range(n):
                prev = -1
                for i in range(n):
                    v = board[i][j]
                    if v == 0: continue
                    if v == prev: return True
                    prev = v
        if dir == 1:  # down
            for j in range(n):
                prev = -1
                for i in range(n-1, -1, -1):
                    v = board[i][j]
                    if v == 0: continue
                    if v == prev: return True
                    prev = v
        if dir == 2:  # left
            for i in range(n):
                prev = -1
                for j in range(n):
                    v = board[i][j]
                    if v == 0: continue
                    if v == prev: return True
                    prev = v
        if dir == 3:  # right
            for i in range(n):
                prev = -1
                for j in range(n-1, -1, -1):
                    v = board[i][j]
                    if v == 0: continue
                    if v == prev: return True
                    prev = v
        return False

    for d in range(4):
        if can_move(board, d):
            return str(d)
    return "0"
Steps = 18 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
None
Steps = 10 If Done = False
def strategy(board):
    # Evaluate simple heuristic: compare sums for each move
    def move_score(b, dir):
        moves = []
        size = len(b)
        for i in range(size):
            for j in range(size):
                if b[i][j] == 0: continue
                r, c = i, j
                value = b[i][j]
                while True:
                    if dir == 0:   # up
                        if r == 0 or b[r-1][c] != 0 and b[r-1][c] != value: break
                        r -= 1
                    elif dir == 1: # down
                        if r == size-1 or b[r+1][c] != 0 and b[r+1][c] != value: break
                        r += 1
                    elif dir == 2: # left
                        if c == 0 or b[r][c-1] != 0 and b[r][c-1] != value: break
                        c -= 1
                    else:          # right
                        if c == size-1 or b[r][c+1] != 0 and b[r][c+1] != value: break
                        c += 1
                if b[r][c] == 0:
                    moves.append((r, c, value))
        return len(moves) + sum(v for _,_,v in moves)

    scores = [move_score(board, d) for d in range(4)]
    return str(scores.index(max(scores)))
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    # board is a list of list of ints, 4x4, 0 for empty
    # Simpler heuristic: move toward the corners
    # Preference: up (0) > left (1) > down (2) > right (3)
    # Count empty cells
    empty = sum(1 for row in board for v in row if v == 0)
    # If many empties, favor up to increase space
    if empty >= 12:
        return "0"  # up
    # Otherwise, move towards the largest tile
    max_val = max(max(row) for row in board)
    for i, row in enumerate(board):
        for j, val in enumerate(row):
            if val == max_val:
                if i < 2:
                    return "0"  # up
                if j < 2:
                    return "1"  # left
                if i > 1:
                    return "2"  # down
                if j > 1:
                    return "3"  # right
    return "1"  # default to left
Steps = 4 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 58 If Done = False
def strategy(board):
    # Simple heuristic: if any row has two candidates with same sum, prioritize that
    best=None
    for i in range(4):
        for j in range(4):
            if board[i][j]==0: continue
            for d in ("0","1","2","3"):
                if d=="0":
                    ni,nj=i,i+1
                elif d=="1":
                    ni,nj=i,i-1
                elif d=="2":
                    ni,nj=i+1,j
                else:
                    ni,nj=i-1,j
                if 0<=ni<4 and 0<=nj<4 and board[ni][nj]==0:
                    return d
    return "0"
┌───┬───┬───┬───┐
│  .│  .│  4│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
└───┴───┴───┴───┘
Steps = 2 If Done = False
def strategy(board):
    # evaluate all four moves and pick the one with the highest expected score
    def move(board, dir):
        # apply move using 2048 rules
        size = len(board)
        def compress(row):
            new = [x for x in row if x != 0]
            i = 0
            merged = []
            while i < len(new):
                if i+1 < len(new) and new[i]==new[i+1]:
                    merged.append(new[i]*2)
                    i += 2
                else:
                    merged.append(new[i])
                    i += 1
            merged += [0]*(size-len(merged))
            return merged
        if dir == 0:  # up
            res = [[0]*size for _ in range(size)]
            for j in range(size):
                col = [board[i][j] for i in range(size)]
                newcol = compress(col)
                for i in range(size):
                    res[i][j] = newcol[i]
            return res
        if dir == 1:  # down
            res = [[0]*size for _ in range(size)]
            for j in range(size):
                col = [board[i][j] for i in range(size)][::-1]
                newcol = compress(col)[::-1]
                for i in range(size):
                    res[i][j] = newcol[i]
            return res
        if dir == 2:  # left
            res = []
            for row in board:
                newrow = compress(row)
                res.append(newrow)
            return res
        if dir == 3:  # right
            res = []
            for row in board:
                newrow = compress(row[::-1])[::-1]
                res.append(newrow)
            return res
    best_score = -1
    best_dir = 0
    for d in range(4):
        new = move(board, d)
        score = sum(max(row) for row in new)
        if score > best_score:
            best_score = score
            best_dir = d
    return str(best_dir)
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
└───┴───┴───┴───┘
Steps = 4 If Done = False
def strategy(board):
    def move(board, dir):
        def rotate(b, times):
            for _ in range(times):
                b = [list(row) for row in zip(*b[::-1])]
            return b
        if dir==0: board=rotate(board,0)
        if dir==1: board=rotate(board,1)
        if dir==2: board=rotate(board,2)
        if dir==3: board=rotate(board,3)
        new=[]
        for row in board:
            merged=[]
            skip=0
            for i in range(len(row)):
                if skip: skip=0; continue
                if row[i]!=0:
                    val=row[i]
                    if i+1<len(row) and row[i+1]==val:
                        val*=2; skip=1
                    merged.append(val)
            merged+=[0]*(len(row)-len(merged))
            new.append(merged)
        return rotate(new, (4-dir)%4)

    best_move=None; best_score=-1
    for m in range(4):
        nxt=move(board,m)
        score=sum(sum(row) for row in nxt)
        if score>best_score:
            best_score=score; best_move=m
    return str(best_move)
┌───┬───┬───┬───┐
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 2 If Done = False
def strategy(board):
    # scores: 0=up,1=down,2=left,3=right
    def mutate(b, move):
        import copy
        b = copy.deepcopy(b)
        for i in range(4):
            line = []
            for j in range(4):
                if move in (0,2):
                    val = b[j][i] if move == 0 else b[i][j]
                else:
                    val = b[i][j] if move == 1 else b[j][i]
                if val: line.append(val)
            merged = []
            skip = False
            for k in range(len(line)):
                if skip:
                    skip=False
                    continue
                if k+1 < len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2)
                    skip=True
                else:
                    merged.append(line[k])
            merged += [0]*(4-len(merged))
            for k in range(4):
                if move in (0,2):
                    if move == 0:
                        b[k][i] = merged[k]
                    else:
                        b[i][k] = merged[k]
                else:
                    if move == 1:
                        b[i][k] = merged[k]
                    else:
                        b[k][i] = merged[k]
        return b
    best = -1
    bestmove = 0
    for m in range(4):
        nb = mutate(board, m)
        val = sum(sum(row) for row in nb)
        if val > best:
            best = val
            bestmove = m
    return str(bestmove)
┌───┬───┬───┬───┐
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    # simple heuristic: move that preserves most tiles
    best = None
    max_keep = -1
    for move in range(4):
        b = [row[:] for row in board]
        # apply move
        if move==0:  # up
            for c in range(4):
                col=[b[r][c] for r in range(4) if b[r][c]!=0]
                for i in range(len(col)-1):
                    if col[i]==col[i+1]:
                        col[i]+=col.pop(i+1)
                for r in range(4): b[r][c]=col[r] if r<len(col) else 0
        elif move==1:  # down
            for c in range(4):
                col=[b[r][c] for r in range(4) if b[r][c]!=0]
                for i in range(len(col)-1,0,-1):
                    if col[i]==col[i-1]:
                        col[i-1]+=col.pop(i)
                for r in range(3,-1,-1): b[r][c]=col[3-r] if r>=4-len(col) else 0
        elif move==2:  # left
            for r in range(4):
                row=[b[r][c] for c in range(4) if b[r][c]!=0]
                for i in range(len(row)-1):
                    if row[i]==row[i+1]:
                        row[i]+=row.pop(i+1)
                for c in range(4): b[r][c]=row[c] if c<len(row) else 0
        else:  # right
            for r in range(4):
                row=[b[r][c] for c in range(4) if b[r][c]!=0]
                for i in range(len(row)-1,0,-1):
                    if row[i]==row[i-1]:
                        row[i-1]+=row.pop(i)
                for c in range(3,-1,-1): b[r][c]=row[3-c] if c>=4-len(row) else 0
        # count non-zero tiles
        keep=sum(v!=0 for row in b for v in row)
        if keep>max_keep:
            max_keep=keep
            best=move
    return str(best)
Exception = list index out of range
Steps = 5 If Done = False
def strategy(board):
    # Scan for any possible merge, prefer larger tiles
    best = None
    for i in range(4):
        for j in range(4):
            tile = board[i][j]
            if tile == 0: continue
            # Check right
            if j < 3 and board[i][j+1] == tile:
                return "2"
            # Check left
            if j > 0 and board[i][j-1] == tile:
                return "3"
            # Check down
            if i < 3 and board[i+1][j] == tile:
                return "1"
            # Check up
            if i > 0 and board[i-1][j] == tile:
                return "0"
    # If no merge, push towards corner
    if board[0][0] == 0: return "0"
    if board[0][3] == 0: return "1"
    if board[3][0] == 0: return "2"
    return "3"
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 19 If Done = False
def strategy(board):
    # helper to simulate a move
    def move(b, dir):
        n=len(b)
        res=[[0]*n for _ in range(n)]
        for i in range(n):
            line=[]
            for j in range(n):
                if dir==0:  # left
                    if j>=i: line.append(b[i][j])
                elif dir==1:  # right
                    if j<=n-i-1: line.append(b[i][j])
                elif dir==2:  # up
                    if i>=j: line.append(b[i][j])
                else:  # down
                    if i<=n-j-1: line.append(b[i][j])
            merged=[]
            k=0
            while k<len(line):
                if k+1<len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2); k+=2
                else:
                    merged.append(line[k]); k+=1
            # fill result
            for j in range(n):
                if dir==0:
                    if j<i: res[i][j]=merged[j] if j< len(merged) else 0
                elif dir==1:
                    if j>=n-i: res[i][j]=merged[j-(n-i)] if j-(n-i)<len(merged) else 0
                elif dir==2:
                    if j<i: res[j][i]=merged[j] if j<len(merged) else 0
                else:
                    if j>=n-i: res[j][i]=merged[j-(n-i)] if j-(n-i)<len(merged) else 0
        return res

    # evaluate move
    best=None; best_score=-1; best_move=None
    for m in range(4):
        new=move(board,m)
        # simple score: sum of tiles
        score=sum(sum(row) for row in new)
        if score>best_score:
            best_score=score; best_move=str(m)
    return best_move
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 6 If Done = False
def strategy(board):
    import copy
    def simulate(b,dir):
        n=len(b); new=[[0]*n for _ in range(n)]
        for i in range(n):
            line=[b[i][j] for j in range(n) if b[i][j]!=0]
            if dir==0:  # up
                line=line[::-1]
            if dir==1:  # right
                pass
            # (simplified: only count empty after move)
            return line
    best=0; max_empty=-1
    for d in range(4):
        line=simulate(board,d)
        empty=sum(1 for v in line if v==0)
        if empty>max_empty:
            max_empty=empty; best=d
    return str(best)
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
└───┴───┴───┴───┘
None
Steps = 3 If Done = False
def strategy(board):
    import random
    moves = []
    for i in range(4):
        for j in range(4):
            if board[i][j] == 0:
                return "0"
    return random.choice(["0","1","2","3"])
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 17 If Done = False
def strategy(board):
    """
    Very simple strategy: choose the move that merges the most tiles.
    """
    import operator

    def count_merges(dir_move):
        rows = len(board)
        cols = len(board[0])
        count = 0
        for i in range(rows):
            for j in range(cols):
                if board[i][j] == 0:
                    continue
                ni, nj = i, j
                if dir_move == 0:  # up
                    while ni > 0 and board[ni-1][nj] == 0:
                        ni -= 1
                elif dir_move == 1:  # down
                    while ni < rows-1 and board[ni+1][nj] == 0:
                        ni += 1
                elif dir_move == 2:  # left
                    while nj > 0 and board[ni][nj-1] == 0:
                        nj -= 1
                else:  # right
                    while nj < cols-1 and board[ni][nj+1] == 0:
                        nj += 1
                if ni != i or nj != j:
                    count += 1
        return count

    best_move = 0
    best_score = -1
    for m in range(4):
        score = count_merges(m)
        if score > best_score:
            best_score = score
            best_move = m
    return str(best_move)
┌───┬───┬───┬───┐
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  4│  .│
└───┴───┴───┴───┘
Steps = 28 If Done = False
def strategy(board):
    """
    Mine: A tiny 2048 strategy that plays a deterministic
    first opportunity that keeps tile movement to the top‑left.
    The strategy selects an action that moves tiles towards the
    upper left corner when possible.
    """
    # board is a 4x4 list of lists
    # Directions: 0=up, 1=down, 2=left, 3=right
    import copy

    def can_move(b, d):
        # simulate move and see if board changes
        new = copy.deepcopy(b)
        if d == 0:   # up
            for c in range(4):
                col = [new[r][c] for r in range(4)]
                merged = merge(col)
                for r in range(4):
                    new[r][c] = merged[r]
        elif d == 1: # down
            for c in range(4):
                col = [new[r][c] for r in reversed(range(4))]
                merged = merge(col)
                for i, r in enumerate(reversed(range(4))):
                    new[r][c] = merged[i]
        elif d == 2: # left
            for r in range(4):
                row = new[r]
                merged = merge(row)
                new[r] = merged
        else:        # right
            for r in range(4):
                row = list(reversed(new[r]))
                merged = merge(row)
                new[r] = list(reversed(merged))
        return new != b

    def merge(line):
        # compress non-zeros
        filtered = [x for x in line if x]
        merged = []
        i = 0
        while i < len(filtered):
            if i+1 < len(filtered) and filtered[i] == filtered[i+1]:
                merged.append(filtered[i]*2)
                i += 2
            else:
                merged.append(filtered[i])
                i += 1
        merged += [0]*(len(line)-len(merged))
        return merged

    # try up, left, down, right in that order
    for d in [0, 2, 1, 3]:
        if can_move(board, d):
            return str(d)
    return "0"  # default if no moves
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  2│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
def strategy(board):
    import random, copy
    moves = ["0","1","2","3"]
    best = None
    best_score = -1
    for m in moves:
        new_b = copy.deepcopy(board)
        # simulate move (placeholder: no actual logic, just random)
        score = sum(sum(row) for row in new_b) + random.random()
        if score > best_score:
            best_score = score
            best = m
    return best
Steps = 11 If Done = False
┌───┬───┬───┬───┐
│  .│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 3 If Done = False
def strategy(board):
    # Simple heuristic: try to move left if possible, otherwise random.
    from random import choice
    # Check if any left move is valid
    for r in range(len(board)):
        for c in range(1, len(board[0])):
            if board[r][c] == 0 and board[r][c-1] != 0:
                return "0"
    return str(choice([0, 1, 2, 3]))
┌───┬───┬───┬───┐
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  2│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 10 If Done = False
def strategy(board):
    # Utility functions
    def move(board, dir):
        # Merge logic for one move
        n = len(board)
        def compress(row):
            new_row = [x for x in row if x != 0]
            merged = []
            i = 0
            while i < len(new_row):
                if i+1 < len(new_row) and new_row[i] == new_row[i+1]:
                    merged.append(new_row[i]*2)
                    i += 2
                else:
                    merged.append(new_row[i])
                    i += 1
            merged += [0]*(n-len(merged))
            return merged

        if dir == 0:            # up
            for j in range(n):
                col = [board[i][j] for i in range(n)]
                new_col = compress(col)
                for i in range(n):
                    board[i][j] = new_col[i]
        elif dir == 1:          # down
            for j in range(n):
                col = [board[i][j] for i in range(n)][::-1]
                new_col = compress(col)[::-1]
                for i in range(n):
                    board[i][j] = new_col[i]
        elif dir == 2:          # left
            for i in range(n):
                row = board[i]
                board[i] = compress(row)
        else:                   # right
            for i in range(n):
                row = board[i][::-1]
                new_row = compress(row)[::-1]
                board[i] = new_row

    def empty_cells(b):
        return sum(cell == 0 for row in b for cell in row)

    best_dir, best_score = None, -1
    for d in range(4):
        b_copy = [row[:] for row in board]
        move(b_copy, d)
        score = empty_cells(b_copy)
        if score > best_score:
            best_score = score
            best_dir = str(d)
    return best_dir
┌───┬───┬───┬───┐
│  2│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  2│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘
Steps = 1 If Done = False
def strategy(board):
    from copy import deepcopy
    def move(board, dir):
        N=len(board); new=[[0]*N for _ in range(N)]
        for i in range(N):
            line=[]
            for j in range(N):
                if dir==0:  # left
                    val=board[i][j]
                elif dir==1:  # right
                    val=board[i][N-1-j]
                elif dir==2:  # up
                    val=board[j][i]
                else:  # down
                    val=board[N-1-j][i]
                if val:
                    line.append(val)
            merged=[]
            skip=False
            for k in range(len(line)):
                if skip: 
                    skip=False; continue
                if k+1<len(line) and line[k]==line[k+1]:
                    merged.append(line[k]*2)
                    skip=True
                else:
                    merged.append(line[k])
            for idx, val in enumerate(merged):
                if dir==0: new[i][idx]=val
                elif dir==1: new[i][N-1-idx]=val
                elif dir==2: new[idx][i]=val
                else: new[N-1-idx][i]=val
        return new
    best=None; bestScore=-1; bestDir=None
    for d in range(4):
        new=move(board,d)
        if any(any(cell!=0 for cell in row) for row in new):
            score=sum(sum(row) for row in new)
            if score>bestScore:
                bestScore=score; bestDir=d
    return str(bestDir) if bestDir is not None else "0"
┌───┬───┬───┬───┐
│  2│  .│  .│  2│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
├───┼───┼───┼───┤
│  .│  .│  .│  .│
└───┴───┴───┴───┘

Inference

Now let's try the model we just trained!

[35]

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-25

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|><|start|>user<|message|>Create a new short 2048 strategy using only native Python code.
You are given a list of list of numbers for the current board state.
Output one action for "0", "1", "2", "3" on what is the optimal next step.
Output your new short function in backticks using the format below:
```python
def strategy(board):
    return "0" # Example
```
All helper functions should be inside def strategy. Only output the short function `strategy`.<|end|><|start|>assistant<|channel|>analysis<|message|>We need to write a short function that, given board (list of lists), outputs optimal move. Use simple heuristic: count possible moves, prioritize. Just code minimal logic.

Let's produce function.<|end|><|start|>assistant<|channel|>final<|message|>```python
def strategy(board):
    # Board is a list of 4 lists of 4 ints, 0 means empty.
    # Simple heuristic: try moves that merge at least one pair and avoid wall moves.
    dirs = ["0","1","2","3"]  # up, down, left, right
    def can_merge(d):
        n = len(board)
        for i in range(n):
            for j in range(n):
                if board[i][j]==0: continue
                # check neighbor in direction d
                if d==0:  # up
                    if i>0 and board[i-1][j]==board[i][j]:
                        return True
                if d==1:  # down
                    if i<n-1 and board[i+1][j]==board[i][j]:
                        return True
                if d==2:  # left
                    if j>0 and board[i][j-1]==board[i][j]:
                        return True
                if d==3:  # right
                    if j<n-1 and board[i][j+1]==board[i][j]:
                        return True
        return False

    for d in range(4):
        if can_merge(d):
            return str(d)
    return "0"
```<|return|>

Saving to float16 or `MXFP4`

We also support saving to float16 directly. Select merged_16bit for float16 or mxfp4 for MXFP4 (OpenAI's GPT-OSS native precision). We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

[36]

And we're done!

Congratulations you just learned how to do reinforcement learning with GPT-OSS! There were some advanced topics explained in this notebook - to learn more about GPT-OSS and RL, there are more docs in Unsloth's Reinforcement Learning Guide with GPT-OSS

This notebook and all Unsloth notebooks are licensed LGPL-3.0.