GEPA Summarization Optimization with LLM Judge Evaluation

Introduction

This notebook demonstrates how to optimize summarization prompts using GEPA (Generate, Evaluate, Propose, Adapt) with the Together Evaluations API. We'll:

  1. Load the CNN/DailyMail dataset containing news articles
  2. Start with a baseline summarization prompt
  3. Use an optimizer LLM to iteratively improve the prompt
  4. Compare prompts head-to-head using a judge model
  5. Track improvement over multiple iterations

Concepts Covered:

  • GEPA Optimization: Iterative prompt refinement driven by LLM feedback (see the GEPA paper for more details)
  • LLM-as-a-Judge: Using a language model to evaluate and compare outputs
  • Batch Evaluation: Efficient comparison of multiple summaries
  • Prompt Engineering: Systematic improvement of instruction prompts

📦 Setup and Installation

[1]
[2]

⚙️ Configuration

Set up your API key and configure the models we'll use:

  • Summarizer Model: Generates the summaries
  • Judge Model: Evaluates which summary is better
  • Optimizer Model: Proposes improvements to the prompt
[3]
✓ API key loaded from Colab secrets
✓ Configuration complete
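
The configuration cell itself is not reproduced here. As a rough sketch of what such a cell might hold, the snippet below groups the three model roles and the API key in one object. The class name, the field names, and the placeholder model identifiers are all illustrative assumptions; reading the key from the `TOGETHER_API_KEY` environment variable is also an assumption about how the key is exposed.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GepaConfig:
    """Hypothetical settings container; field names are illustrative."""
    summarizer_model: str
    judge_model: str
    optimizer_model: str
    api_key: str


def load_config(env=os.environ) -> GepaConfig:
    # Assumption: the key is available as TOGETHER_API_KEY. In Colab it could
    # be copied from the secrets panel into the environment beforehand.
    api_key = env.get("TOGETHER_API_KEY", "")
    if not api_key:
        raise RuntimeError("Set TOGETHER_API_KEY before running the notebook.")
    return GepaConfig(
        summarizer_model="<summarizer-model-id>",  # placeholder, not a real model name
        judge_model="<judge-model-id>",            # placeholder
        optimizer_model="<optimizer-model-id>",    # placeholder
        api_key=api_key,
    )
```

Keeping the three roles as separate fields makes it easy to swap one model (for example, a stronger judge) without touching the rest of the pipeline.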

📝 Baseline and Judge Prompts

We start with a simple baseline prompt for summarization. The GEPA process will iteratively improve this prompt based on performance feedback.

[4]
Baseline Prompt:
Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context

Keep it to 3-5 sentences total.

Judge Prompt:
Compare these two summaries of the same news article.

Which summary better:
- Captures the main news story
- Includes important details
- Is clear and concise
- Avoids unnecessary information

Choose A or B and explain why briefly.

📂 Loading the CNN/DailyMail Dataset

The CNN/DailyMail dataset contains news articles paired with human-written highlights. We'll use the articles as our source text and split the data into train, validation, and test sets.

Dataset Structure:

  • article: The full news article text
  • highlights: Human-written bullet-point summary
  • We'll use the articles for summarization and evaluate our generated summaries
[5]

================================================================================
📂 LOADING DATA
================================================================================
Loading CNN/DailyMail dataset...
✓ Loaded 11490 examples
  Sample article: (CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Cour...
  Sample highlights: Membership gives the ICC jurisdiction over alleged crimes committed in Palestinian territories since...
✓ Converted to 11490 items
✓ Split: Train=150, Val=300, Test=300
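
The loading cell is not shown, so the following is only a sketch of the conversion and split steps under stated assumptions. The split sizes come from the log above; the helper names `to_items` and `split_items`, the fixed seed, and the shuffle-then-slice strategy are assumptions rather than the notebook's actual code.

```python
import random


def to_items(rows):
    """Keep only the two fields described above, under the same names."""
    return [{"article": r["article"], "highlights": r["highlights"]} for r in rows]


def split_items(items, n_train=150, n_val=300, n_test=300, seed=42):
    """Shuffle once with a fixed seed, then carve out disjoint train/val/test slices."""
    needed = n_train + n_val + n_test
    if len(items) < needed:
        raise ValueError(f"need {needed} items, got {len(items)}")
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:needed]
    return train, val, test
```

The `rows` argument could be any iterable of dictionaries with `article` and `highlights` keys. The 11,490 examples in the log are consistent with the test split of CNN/DailyMail version 3.0.0 as distributed through the Hugging Face `datasets` library, though the exact loading call used by the notebook is not visible here.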

🤖 Summarization Module

We create a DSPy module that wraps our summarization task. This module can be configured with different instruction prompts, which is key to the GEPA optimization process.

[6]
✓ Summarization module defined
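
To illustrate why a module with swappable instructions matters for the optimization, here is a library-free stand-in. It is not the notebook's DSPy module: the `Summarizer` class, its `with_instructions` method, and the prompt layout are assumptions, and `llm` is any hypothetical function that maps a prompt string to generated text.

```python
from typing import Callable


class Summarizer:
    """Library-free stand-in for the notebook's DSPy module (not the original code)."""

    def __init__(self, llm: Callable[[str], str], instructions: str):
        self.llm = llm                    # any function mapping a prompt string to text
        self.instructions = instructions  # the part that is rewritten between iterations

    def with_instructions(self, instructions: str) -> "Summarizer":
        # Return a new object so the current best prompt is never mutated in place.
        return Summarizer(self.llm, instructions)

    def __call__(self, article: str) -> str:
        prompt = f"{self.instructions}\n\nArticle:\n{article}\n\nSummary:"
        return self.llm(prompt).strip()
```

Because the instructions are the only thing that changes between candidates, two `Summarizer` objects built from the same `llm` differ solely in the prompt being tested.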

📊 Batch Summary Generation

This function generates summaries for a batch of articles using a given prompt. It includes error handling and progress tracking.

[7]
✓ Batch generation function defined
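
A minimal sketch of batch generation with per-item error handling follows. The function name, the `log_every` parameter, and the choice to record an empty string for a failed item are assumptions. The notebook's output shows a progress bar, whereas this sketch uses plain `print` calls to stay within the standard library.

```python
from typing import Callable, Iterable, List


def generate_batch(summarize: Callable[[str], str], articles: Iterable[str],
                   log_every: int = 50) -> List[str]:
    """Summarize each article; a failure yields "" instead of aborting the batch."""
    summaries = []
    for i, article in enumerate(articles, start=1):
        try:
            summaries.append(summarize(article))
        except Exception as exc:  # broad on purpose: API errors vary by client
            print(f"  item {i} failed: {exc}")
            summaries.append("")  # keep positions aligned with the input list
        if i % log_every == 0:
            print(f"  processed {i} articles")
    return summaries
```

Keeping the output list the same length as the input matters later: the comparison step pairs the summaries from two prompts by position.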

🧠 Optimizer LLM Wrapper

This wrapper allows us to use an LLM to propose improvements to our summarization prompt based on current performance.

[8]
✓ Optimizer LLM wrapper defined

🤔 Reflection and Prompt Improvement

This function uses the optimizer LLM to analyze the current prompt and performance, then propose an improved version.

Key Constraints:

  • Keep prompts under 150 words for clarity
  • Focus on simple, direct instructions
  • Target 4-6 sentence summaries
  • Avoid overly complex requirements
[9]
✓ Reflection function defined
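
The constraints above can be enforced in code rather than left to the optimizer model. The sketch below assumes a hypothetical `optimizer` callable that takes a message and returns text. The wording of the reflection request and the fallback to the current prompt are illustrative choices, not the notebook's implementation.

```python
from typing import Callable

MAX_PROMPT_WORDS = 150  # from the constraints listed above


def build_reflection_request(current_prompt: str, win_rate: float) -> str:
    """Assemble the message sent to the optimizer model (wording is illustrative)."""
    return (
        "You are improving a news summarization prompt.\n"
        f"Current prompt:\n{current_prompt}\n\n"
        f"Its most recent win rate was {win_rate:.2%}.\n"
        f"Rewrite it in under {MAX_PROMPT_WORDS} words, using simple, direct "
        "instructions that target 4-6 sentence summaries. "
        "Return only the new prompt."
    )


def reflect(optimizer: Callable[[str], str], current_prompt: str, win_rate: float) -> str:
    """Ask the optimizer for a candidate; fall back to the current prompt if unusable."""
    candidate = optimizer(build_reflection_request(current_prompt, win_rate)).strip()
    if not candidate or len(candidate.split()) >= MAX_PROMPT_WORDS:
        return current_prompt
    return candidate
```

Checking the word count after generation guards against an optimizer that ignores the length limit, which would otherwise let prompts grow with every iteration.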

🔄 Head-to-Head Prompt Comparison

This function compares two prompts by:

  1. Generating summaries with both prompts
  2. Creating a comparison dataset
  3. Using the Together AI evaluation API with a judge model
  4. Computing win rates

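The evaluation uses a two-pass approach to reduce position bias: each pair is judged twice with the summaries' positions swapped, and a pair whose two verdicts disagree is recorded as a tie.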

[10]
✓ Comparison function defined
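
In the notebook the two-pass judging is handled by the evaluation API. To make the idea concrete, here is a local sketch under one plausible reading of it: judge each pair in both orders and count a win only when both verdicts agree. The `judge` callable is hypothetical and is assumed to return "A" when it prefers the summary in the first slot and "B" when it prefers the second.

```python
from collections import Counter
from typing import Callable, List, Tuple


def two_pass_verdict(judge: Callable[[str, str, str], str],
                     article: str, summary_a: str, summary_b: str) -> str:
    """Return "A", "B", or "Tie" after judging the pair in both orders."""
    first = judge(article, summary_a, summary_b)   # A shown in the first slot
    second = judge(article, summary_b, summary_a)  # B shown in the first slot
    # Map the second answer back to the original labels.
    second_unswapped = {"A": "B", "B": "A"}.get(second, "Tie")
    if first in ("A", "B") and first == second_unswapped:
        return first
    return "Tie"


def tally(verdicts: List[str]) -> Tuple[int, int, int]:
    """Count wins for A, wins for B, and ties, in that order."""
    counts = Counter(verdicts)
    return counts["A"], counts["B"], counts["Tie"]
```

A judge that always favors the first slot produces contradictory verdicts across the two passes, so every pair it sees ends up as a tie instead of inflating one prompt's win count.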

🧬 GEPA Optimization Loop

This is the main optimization loop that implements the GEPA algorithm:

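  1. Generate: Create summaries with the current best prompt and with the candidate prompt
  2. Evaluate: Compare the candidate against the current best prompt (initially the baseline) using the judge model
  3. Propose: Use the optimizer LLM to suggest an improved prompt
  4. Adapt: Accept the candidate only if it wins more of the decided comparisons than the current best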

The process repeats for multiple iterations, tracking the best prompt found.

[11]
✓ GEPA optimization function defined
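
The loop can be sketched as a greedy search with the model calls injected as functions. The names `propose` and `compare` are hypothetical, as is the decision to pass the last win rate to `propose`. The acceptance rule (candidate win rate above 50%) and the treatment of the first iteration as baseline-only are inferred from the run log rather than taken from the notebook's code.

```python
from typing import Callable, List, Tuple


def optimize(baseline: str,
             propose: Callable[[str, float], str],
             compare: Callable[[str, str], float],
             iterations: int = 5) -> Tuple[str, List[float]]:
    """Greedy prompt search: keep a candidate only if it beats the current best.

    compare(best, candidate) must return the candidate's win rate in [0, 1].
    """
    best, last_rate, history = baseline, 0.5, []
    # Assumption: the first iteration only establishes the baseline,
    # so there are iterations - 1 actual comparisons.
    for _ in range(iterations - 1):
        candidate = propose(best, last_rate)
        rate = compare(best, candidate)
        history.append(rate)
        if rate > 0.5:  # candidate wins more decided pairs than the incumbent
            best = candidate
        last_rate = rate
    return best, history
```

Because every candidate is compared against the current best rather than the original baseline, a prompt accepted early has to be beaten directly by later proposals.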

🚀 Run the Optimization

Now we'll execute the full GEPA optimization process. This will:

  1. Set up the summarizer and optimizer models
  2. Run multiple iterations of prompt improvement
  3. Evaluate the final optimized prompt on the test set
  4. Display comprehensive results
[12]
================================================================================
🎯 GEPA SUMMARIZATION - TOGETHER AI BATCH EVAL
================================================================================

================================================================================
🧬 MANUAL GEPA OPTIMIZATION
================================================================================

================================================================================
ITERATION 1/5
================================================================================
Iteration 0: Establishing baseline (no comparison yet)

================================================================================
ITERATION 2/5
================================================================================

🤔 REFLECTION (Iteration 1)
✓ Generated new prompt (63 words)
✓ Generated candidate prompt (404 chars)

================================================================================
🔄 COMPARING PROMPTS: iter1_val
================================================================================
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...
Prompt A: 100%|██████████| 300/300 [14:30<00:00,  2.90s/it]
Generating summaries with Prompt B...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the f...
Prompt B: 100%|██████████| 300/300 [17:16<00:00,  3.46s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter1_val_20251222_170518.jsonl: 100%|██████████| 1.59M/1.59M [00:00<00:00, 2.82MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-94eb-1766423120)...
✓ Results: Prompt A wins=29, Prompt B wins=35, Ties=236
✓ Prompt A win rate: 45.31%

  Current best: 45.31%
  New candidate: 54.69%
  🎉 New best! (+4.69pp)

================================================================================
ITERATION 3/5
================================================================================

🤔 REFLECTION (Iteration 2)
✓ Generated new prompt (58 words)
✓ Generated candidate prompt (389 chars)

================================================================================
🔄 COMPARING PROMPTS: iter2_val
================================================================================
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:39<00:00,  7.68it/s]
Generating summaries with Prompt B...
  Using prompt: Write a 4-6 sentence summary of this news article, prioritizing clarity and accuracy. 

Clearly stat...
Prompt B: 100%|██████████| 300/300 [15:55<00:00,  3.18s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter2_val_20251222_173300.jsonl: 100%|██████████| 1.62M/1.62M [00:00<00:00, 3.48MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-6faf-1766424783)...
✓ Results: Prompt A wins=34, Prompt B wins=29, Ties=237
✓ Prompt A win rate: 53.97%

  Current best: 53.97%
  New candidate: 46.03%
  No improvement

================================================================================
ITERATION 4/5
================================================================================

🤔 REFLECTION (Iteration 3)
✓ Generated new prompt (87 words)
✓ Generated candidate prompt (578 chars)

================================================================================
🔄 COMPARING PROMPTS: iter3_val
================================================================================
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:37<00:00,  8.08it/s]
Generating summaries with Prompt B...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on the most important facts. Provide a clear ...
Prompt B: 100%|██████████| 300/300 [15:51<00:00,  3.17s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter3_val_20251222_181544.jsonl: 100%|██████████| 1.65M/1.65M [00:00<00:00, 2.48MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-1788-1766427347)...
✓ Results: Prompt A wins=44, Prompt B wins=22, Ties=234
✓ Prompt A win rate: 66.67%

  Current best: 66.67%
  New candidate: 33.33%
  No improvement

================================================================================
ITERATION 5/5
================================================================================

🤔 REFLECTION (Iteration 4)
✓ Generated new prompt (77 words)
✓ Generated candidate prompt (547 chars)

================================================================================
🔄 COMPARING PROMPTS: iter4_val
================================================================================
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the f...
Prompt A: 100%|██████████| 300/300 [00:40<00:00,  7.47it/s]
Generating summaries with Prompt B...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on accuracy, brevity, and clarity.

Clearly s...
Prompt B: 100%|██████████| 300/300 [16:34<00:00,  3.32s/it]
📤 Uploading for comparison...
Uploading file temp_compare_iter4_val_20251222_184909.jsonl: 100%|██████████| 1.62M/1.62M [00:00<00:00, 1.77MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-1e94-1766429353)...
✓ Results: Prompt A wins=45, Prompt B wins=33, Ties=222
✓ Prompt A win rate: 57.69%

  Current best: 57.69%
  New candidate: 42.31%
  No improvement

================================================================================
📊 FINAL TEST EVALUATION
================================================================================

⏱️  OPTIMIZATION TIME:
  Total: 2h 31m 48s

================================================================================
🔄 COMPARING PROMPTS: final_test
================================================================================
Generating summaries with Prompt A...
  Using prompt: Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news even...
Prompt A: 100%|██████████| 300/300 [16:05<00:00,  3.22s/it]
Generating summaries with Prompt B...
  Using prompt: Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the f...
Prompt B: 100%|██████████| 300/300 [18:27<00:00,  3.69s/it]
📤 Uploading for comparison...
Uploading file temp_compare_final_test_20251222_193951.jsonl: 100%|██████████| 1.57M/1.57M [00:00<00:00, 2.74MB/s]
🚀 Launching comparison...
⏳ Waiting (ID: eval-ff84-1766432395)...
✓ Results: Prompt A wins=25, Prompt B wins=41, Ties=234
✓ Prompt A win rate: 37.88%

================================================================================
🎉 FINAL RESULTS
================================================================================

TEST SET:
  Baseline prompt:  37.88%
  Optimized prompt: 62.12%
  Improvement:      +12.12pp from neutral

💾 Saved to: results/prompts_20251222_195058.txt

✅ Complete!

📊 Analyzing the Results

Let's examine the optimized prompt and compare it to the baseline.

[13]
================================================================================
📝 PROMPT COMPARISON
================================================================================

BASELINE PROMPT:
--------------------------------------------------------------------------------
Summarize this news article in 3-5 key points.

Write a brief summary covering:
- The main news event
- Key people or organizations involved
- Important details or outcomes
- Any significant context

Keep it to 3-5 sentences total.


OPTIMIZED PROMPT:
--------------------------------------------------------------------------------
Summarize this news article in 4-6 sentences, focusing on clarity and concision.

Please cover the following key aspects:
- What is the main news event being reported?
- Who are the key people or organizations involved?
- What are the most important details or outcomes of the event?

Provide relevant background information if necessary, but prioritize the essential facts and avoid unnecessary details.


PERFORMANCE COMPARISON:
--------------------------------------------------------------------------------
Baseline Win Rate:  37.88%
Optimized Win Rate: 62.12%
Improvement:        +12.12 percentage points from neutral
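
The reported percentages can be reproduced from the raw counts in the final test log (Prompt A wins=25, Prompt B wins=41, Ties=234). The sketch below assumes, as the numbers imply, that ties are excluded from the denominator; the function name is illustrative.

```python
def win_rates(wins_a: int, wins_b: int) -> tuple:
    """Win rates over decided pairs only; ties are left out of the denominator."""
    decided = wins_a + wins_b
    if decided == 0:
        return 0.5, 0.5  # nothing decided: treat as neutral
    return wins_a / decided, wins_b / decided


baseline_rate, optimized_rate = win_rates(wins_a=25, wins_b=41)
print(f"Baseline:  {baseline_rate:.2%}")   # 37.88%
print(f"Optimized: {optimized_rate:.2%}")  # 62.12%
print(f"Improvement: {(optimized_rate - 0.5) * 100:+.2f}pp from neutral")  # +12.12pp
```

Note that only 66 of the 300 test pairs were decided, so the win rates describe the 22% of articles where the judge expressed a consistent preference.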

🔑 Key Findings

GEPA Optimization Process:

  • Iteratively improves prompts through LLM-guided reflection
  • Uses head-to-head comparisons with a judge model
  • Tracks the best prompt so far and accepts a candidate only when it beats that prompt head-to-head

Benefits of This Approach:

  1. Automated: No manual prompt engineering is required beyond writing the baseline and judge prompts
  2. Data-driven: Decisions based on actual performance metrics
  3. Scalable: Can optimize for any task with appropriate data
  4. Transparent: Clear tracking of improvements across iterations

Next Steps:

  • Try with different datasets or domains
  • Experiment with different judge criteria
  • Adjust the optimizer's reflection prompt
  • Increase iterations for potentially better results