Notebooks
A
Anthropic
Generate Test Cases

Generate Test Cases

Generate Synthetic Test Data for Your Prompt Template

Imagine you have a prompt roughly along these lines:

"""Here's some things I want you to analyze:

{{thing1}} {{thing2}}

These things are [description of things]. Please read them carefully and [do some task]."""

Here we'd call thing1 and thing2 the "variables" -- and you want your prompt to behave well for many different possible values of thing1 and thing2.

How can you test this prompt template? Maybe you have some real-life values you can substitute in. But maybe you don't, or maybe you aren't allowed to test on the ones you do have for privacy reasons. No worries -- Claude can make them up! This cookbook demonstrates how to generate synthetic test data for your prompts using Claude & the Claude API. It includes functions for extracting variables from templates, constructing example blocks, generating test cases, and iteratively refining the results. The benefits of this are twofold:

  1. Prompt Evaluation You can use these test cases to see how Claude will perform on realistic examples.

  2. Prompt Improvement with Multishot Examples Giving Claude examples is perhaps the best way to improve its performance. This notebook can help you generate realistic inputs which is half the battle in getting ideal input/output pairs.

[ ]
[1]

Let's start by defining some helper functions that we'll use throughout this notebook.

[2]

Prompt Template for Generating the Data

The general idea of these prompt templates is to take a user-submitted prompt template with variables, and construct some values for the variables to fill the template.

There are actually two prompt templates below; one is formatted assuming that the user has already provided example variable values, and one does not assume that.

What they have in common is that both templates start by giving Claude context about the situation, and directing Claude to carefully think through the specs of each variable individually as well as the user-provided prompt template as a whole before outputting the test cases.

[3]

Next, another quick helper function for filling in the appropriate prompt template and calling Claude.

[4]
[5]

Now we can start to put the pieces together. To start, enter your prompt template here.

[6]
Identified variables:
- DOCUMENTS
- QUESTION

Next, if you have any "golden examples" of inputs and ideal outputs, you can enter those. The code is commented out for now.

[7]

Next, we can get the test case generation prompt template filled out with this information, and get a test case!

[8]

Now, let's take a look at both the test case and the planning that Claude used to generate it.

[10]
~~~~~~~~~~~
Generated test case:
~~~~~~~~~~~
DOCUMENTS:
Return Policy
- Items may be returned within 30 days of purchase with original receipt
- Items must be unused and in original packaging
- Shipping costs are non-refundable
- Gift cards are non-returnable

Shipping Information
- Standard shipping (5-7 business days): Free on orders over $50
- Express shipping (2-3 business days): $12.99
- Overnight shipping (next business day): $24.99
- We ship to continental US only
- Alaska and Hawaii orders incur additional $15 fee

Payment Methods
- We accept Visa, Mastercard, American Express, and PayPal
- Payment is processed at time of order
- Gift cards cannot be used for partial payment

QUESTION:
Hi, I ordered a sweater last week but it doesn't fit right. Can I return it? And will I get refunded for the shipping I paid? Thanks!

~~~~~~~~~~~
Planning:
~~~~~~~~~~~
<planning>
1. Prompt Template Summary:
This template creates a customer service chatbot for Acme Corporation that answers customer questions based on official company policies/FAQ documents. The goal is to ensure consistent, policy-compliant responses to customer inquiries.

2. Variable Analysis:

DOCUMENTS:
- Would likely be maintained by Acme's policy/legal team
- Stored in a knowledge base or content management system
- Formatted as structured FAQ entries or policy statements
- Professional, formal tone
- Multiple paragraphs covering different topics
- Clear headers and categories
- Length: Several paragraphs (300-500 words)

QUESTION:
- Written by end users/customers
- Informal, conversational tone
- Usually 1-2 sentences
- Often includes context about their specific situation
- May contain typos or casual language
- Length: 20-50 words
</planning>

From here, there are a few ways we can go. We could generate more test cases, or we could edit Claude's planning logic. Let's edit Claude's planning logic a little bit. Maybe we know that ACME's documentation uses numbered lines. Some other realistic changes could be:

  • Have Claude tell itself to make the documents longer and more detailed.
  • Have Claude tell itself to make the customer support query more or less formal.
[11]

Let's reset our examples, but use this planning text as a prefill. (This saves a little bit of sampling time.)

[12]

Now let's see the new results.

[13]
~~~~~~~~~~~
Generated test case:
~~~~~~~~~~~
DOCUMENTS:
Return Policy
- Items may be returned within 30 days of purchase with original receipt
- Items must be unused and in original packaging
- Shipping costs are non-refundable
- Store credit will be issued for items returned without receipt

Shipping Information
- Standard shipping (5-7 business days): $5.99
- Express shipping (2-3 business days): $12.99
- Free standard shipping on orders over $50
- We currently ship only within the continental United States
- Alaska and Hawaii orders subject to additional fees

Payment Methods
- We accept Visa, Mastercard, American Express, and PayPal
- Gift cards cannot be used for online purchases
- Payment is processed at time of order
- All prices are in USD

QUESTION:
Hi, I ordered a sweater last week but it doesn't fit right. Can I return it? I still have the tags on it but I threw away the receipt. Thanks!

~~~~~~~~~~~
Planning:
~~~~~~~~~~~
<planning>
1. Prompt Template Summary:
This template creates a customer service chatbot for Acme Corporation that answers customer questions based on official company policies/FAQ documents. The goal is to ensure consistent, policy-compliant responses to customer inquiries.

2. Variable Analysis:

DOCUMENTS:
- Would likely be maintained by Acme's policy/legal team
- Stored in a knowledge base or content management system
- Formatted as structured FAQ entries or policy statements
- Professional, formal tone
- Multiple paragraphs covering different topics
- Clear headers and categories
- Length: Several paragraphs (300-500 words)

QUESTION:
- Written by end users/customers
- Informal, conversational tone
- Usually 1-2 sentences
- Often includes context about their specific situation
- May contain typos or casual language
- Length: 20-50 words
</planning>

Great, it did the numbered Q and A!

Let's make another example. This one will use the example we already have, so hopefully it will be interestingly different.

[14]
[15]
~~~~~~~~~~~
Generated test case:
~~~~~~~~~~~
DOCUMENTS:
Product Warranty
- All electronics come with a 1-year limited manufacturer warranty
- Warranty covers defects in materials and workmanship
- Warranty does not cover accidental damage or misuse
- Extended warranty available for purchase within 30 days

Price Match Policy
- We match prices from authorized retailers
- Item must be identical model/color/specification
- Must be in stock at competitor's store
- Online retailers excluded from price matching
- Price match requests must be made at time of purchase

Order Cancellation
- Orders can be cancelled within 2 hours of placement
- Once order is shipped, cancellation not possible
- Cancelled orders refunded to original payment method
- Processing time for refunds: 3-5 business days
- Contact customer service for cancellation requests

QUESTION:
Hello, I bought a laptop from your store 3 weeks ago and it keeps shutting down randomly. It's still under warranty, right? What do I need to do to get it fixed? Thanks in advance!

~~~~~~~~~~~
Planning:
~~~~~~~~~~~
<planning>
1. Prompt Template Summary:
This template creates a customer service chatbot for Acme Corporation that answers customer questions based on official company policies/FAQ documents. The goal is to ensure consistent, policy-compliant responses to customer inquiries.

2. Variable Analysis:

DOCUMENTS:
- Would likely be maintained by Acme's policy/legal team
- Stored in a knowledge base or content management system
- Formatted as structured FAQ entries or policy statements
- Professional, formal tone
- Multiple paragraphs covering different topics
- Clear headers and categories
- Length: Several paragraphs (300-500 words)

QUESTION:
- Written by end users/customers
- Informal, conversational tone
- Usually 1-2 sentences
- Often includes context about their specific situation
- May contain typos or casual language
- Length: 20-50 words
</planning>

Still about ACME corporation, but the question is different and so is the knowledge base.

From here, the world is your oyster -- you can generate more test cases by running the code in a loop, edit the planning more, evaluate Claude on these test cases, and put the test cases you make along with golden answers into your prompt as multishot examples.

To get golden answers, you can either write them yourself from scratch, or have Claude write an answer and then edit it to taste. With the advent of prompt caching, there's never been a better time to add tons of examples to your prompt to improve performance.