
Cline Prompt Learning Optimization on SWE-bench

This notebook demonstrates how we used Prompt Learning to optimize Cline's performance on the SWE-bench dataset. Cline is a popular and powerful open-source coding agent. We look to improve its performance on SWE-bench by optimizing its rules, which are user-specified instructions that Cline appends to its system prompt.

More on Cline

More on Prompt Learning

Plan Mode (for now)

This is a primitive stage of optimization. We are just looking at Plan Mode for Cline, which generates a plan for a given query, referencing the files in the codebase. We then use an LLM-as-Judge evaluator to grade the generated plan. The reported accuracies should therefore be taken lightly: they are not a perfect reflection of Cline's performance, because Cline is not actually editing the codebase, and we are not running the SWE-bench tests to verify whether its edits are correct.

We are still working on running Cline in Act Mode, allowing it to actually edit the codebase. Then we can use the tests in SWE-bench to compute a firm accuracy of whether Cline made the right edits. Stay tuned.

Setup

Please visit README.md and complete all of the setup steps before running this notebook!

Important Note

Running this notebook can be computationally intensive and expensive as it involves multiple API calls to Claude for each SWE-bench instance. Consider adjusting the training and test set sizes based on your requirements and budget constraints.

[ ]
[3]
2025-10-02 11:20:03,823 - phoenix.config - INFO - 📋 Ensuring phoenix working directory: /Users/priyanjindal/.phoenix
2025-10-02 11:20:03,831 - phoenix.inferences.inferences - INFO - Dataset: phoenix_inferences_cb9a1318-712e-423a-b9f9-d3ea8d0c37a6 initialized
[ ]

Configuration

  • OPTIMIZATION_LOOPS: number of Prompt Learning loops, i.e. how many times you want to optimize your prompt.
  • TRAIN_SIZE: size of the training set.
  • TEST_SIZE: size of the test set.
  • MAX_WORKERS: SWE-bench is set up to run in parallel with however many workers you specify. Set this relative to your machine and your Claude rate limits.
  • RULES: the base starting ruleset. We suggest keeping the rule regarding resume_task, as we've noticed that using the resume_task tool leads to unstable behavior.
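The configuration described above might look like the following. The specific values and the rule text are illustrative assumptions, not the notebook's actual settings; tune them to your machine, budget, and Claude rate limits.

```python
# Hypothetical configuration values -- adjust to your budget and rate limits.
OPTIMIZATION_LOOPS = 3   # number of Prompt Learning optimization iterations
TRAIN_SIZE = 20          # rows of SWE-bench Lite used to optimize the ruleset
TEST_SIZE = 20           # held-out rows used to measure optimized rulesets
MAX_WORKERS = 4          # parallel Cline servers; keep within Claude rate limits

# Base starting ruleset. Keeping a rule about resume_task is recommended,
# since the resume_task tool has shown unstable behavior.
RULES = [
    "Do not use the resume_task tool.",
]
```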
[1]

Train/Test Datasets

This code splits SWE-bench Lite into train/test splits.

The train set will be used to optimize the ruleset, while the test set will be used to measure the success of optimized rulesets.
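A minimal sketch of such a split, using placeholder instance IDs in place of real SWE-bench Lite rows; the function name and seed are assumptions, not the notebook's actual code.

```python
import random

def split_instances(instances, train_size, test_size, seed=42):
    """Shuffle instances deterministically and carve out disjoint
    train/test splits of the requested sizes."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:train_size + test_size]

# Placeholder IDs standing in for SWE-bench Lite instance rows.
train_set, test_set = split_instances(
    [f"instance_{i}" for i in range(100)], train_size=20, test_size=20
)
```

Because the shuffle is seeded, the same split is reproduced on every run, so train and test accuracies across optimization loops are comparable.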

[4]

Upload Datasets to Arize

[6]

Created train dataset with ID: RGF0YXNldDozMTc1MzM6bEpOcw==
Created test dataset with ID: RGF0YXNldDozMTc1MzQ6TUs4OA==

This helper function will help us convert our Cline runs on SWE-bench into data we can evaluate.

[ ]

Helper: Running Cline on a dataset

This helper function runs Cline on a dataset. It is meant to be used to run Cline on either your train or test split.

It runs Cline in parallel, spinning up MAX_WORKERS Cline servers at a time, each running on a specific row of SWE-bench.

It then evaluates the plans generated by Cline using our LLM-as-judge eval. We simply provide an LLM with the problem statement, the test patch, the ground-truth patch, and Cline's generated plan, and ask it whether the generated plan seems correct. We use this to compute a rough measure of Cline's accuracy.
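The judge input and the rough accuracy computation can be sketched as below. The template wording, field names, and labels are illustrative assumptions; the notebook's actual eval template may differ.

```python
# Hypothetical LLM-as-judge prompt template -- the real eval may be worded
# differently, but it receives the same four pieces of information.
JUDGE_TEMPLATE = """You are grading a coding agent's plan.

Problem statement:
{problem_statement}

Ground-truth patch:
{gold_patch}

Test patch:
{test_patch}

Agent's generated plan:
{plan}

Does this plan look like it would lead to the correct edits?
Answer exactly "correct" or "incorrect"."""

def build_judge_prompt(row):
    """Fill the judge template from one SWE-bench result row (a dict with
    problem_statement, gold_patch, test_patch, and plan keys)."""
    return JUDGE_TEMPLATE.format(**row)

def rough_accuracy(labels):
    """Fraction of plans the judge labeled 'correct'."""
    return sum(1 for label in labels if label == "correct") / len(labels)
```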

[ ]

Helper: Log experiments to Arize

We'll be logging Cline results at every iteration of optimization to Arize, so we can visualize and keep track of our results.

[ ]

Ruleset Optimization

This code optimizes our ruleset for Cline. Here are the steps:

Repeat the following OPTIMIZATION_LOOPS times:

  1. Run Cline, with the current ruleset, on the training set, and compute training accuracy.
  2. Run Cline, with the current ruleset, on the test set, and compute test accuracy.
  3. Use the results on the training set to optimize the ruleset, using `PromptLearningOptimizer`.
  4. Update the current ruleset to be the optimized ruleset for the next iteration.
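The steps above can be sketched as a loop. Here `run_fn` and `optimize_fn` are stand-ins for the notebook's actual helpers (running Cline on a split, and the `PromptLearningOptimizer` step); the function signatures are assumptions for illustration.

```python
def optimize_ruleset(rules, train_set, test_set, loops, run_fn, optimize_fn):
    """Run the Prompt Learning loop: evaluate the current ruleset on both
    splits, then optimize it from the training results for the next pass.

    run_fn(rules, dataset) -> (per-row results, accuracy)
    optimize_fn(rules, train_results) -> optimized rules
    """
    history = []
    for i in range(loops):
        # Steps 1-2: run Cline with the current ruleset on both splits.
        train_results, train_acc = run_fn(rules, train_set)
        _, test_acc = run_fn(rules, test_set)
        history.append({"loop": i, "train_acc": train_acc, "test_acc": test_acc})
        # Steps 3-4: optimize the ruleset from training results and carry
        # the optimized ruleset into the next iteration.
        rules = optimize_fn(rules, train_results)
    return rules, history
```

Passing the helpers in as arguments keeps the loop itself easy to test with stubs before spending money on real Cline and Claude calls.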
[ ]