Speculative Prompt Caching

This cookbook demonstrates "Speculative Prompt Caching" - a pattern that reduces time-to-first-token (TTFT) by warming up the cache while users are still formulating their queries.

Without Speculative Caching:

  1. User types their question (3 seconds)
  2. User submits question
  3. API loads context into cache AND generates response

With Speculative Caching:

  1. User starts typing (cache warming begins immediately)
  2. User continues typing (cache warming completes in background)
  3. User submits question
  4. API uses warm cache to generate response

Setup

First, let's install the required packages:

[82]
Note: you may need to restart the kernel to use updated packages.
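
The install cell [82] itself isn't reproduced above. A minimal sketch, assuming the anthropic SDK for the API client and requests for downloading the context files (the exact package list in the original cell is not shown):

    %pip install anthropic requests
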
[83]
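
Cell [83] most likely holds the imports and client setup. A minimal sketch under that assumption; the model alias and reading the API key from the ANTHROPIC_API_KEY environment variable are guesses rather than the notebook's exact code:

    import threading
    import time

    import requests
    from anthropic import Anthropic

    # The SDK picks up ANTHROPIC_API_KEY from the environment by default.
    client = Anthropic()

    # Any prompt-caching-capable Claude model works; this alias is an assumption.
    MODEL = "claude-3-5-sonnet-latest"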

Helper Functions

Let's set up the functions to download our large context and prepare messages:

[84]
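
The actual helper cell [84] isn't shown here. The sketch below captures the likely shape: the download URLs, the function names download_context and build_messages, and the framing text are assumptions; the essential detail is marking the large context block with cache_control so it becomes a cacheable prefix.

    import time

    import requests

    SQLITE_FILES = {
        # Assumed raw-source URLs; the notebook's actual download source isn't shown.
        "btree.h": "https://raw.githubusercontent.com/sqlite/sqlite/master/src/btree.h",
        "btree.c": "https://raw.githubusercontent.com/sqlite/sqlite/master/src/btree.c",
    }


    def download_context() -> str:
        """Download the SQLite source files and concatenate them into one large context."""
        print("Downloading SQLite source files...")
        start = time.time()
        parts = []
        for name, url in SQLITE_FILES.items():
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            parts.append(f"// ===== {name} =====\n{resp.text}")
            print(f"Successfully downloaded {name}")
        print(f"Downloaded {len(parts)} files in {time.time() - start:.2f} seconds")
        return "\n\n".join(parts)


    def build_messages(context: str, question: str) -> list:
        """Build a messages list with the large context marked as cacheable."""
        return [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Here is the SQLite btree source code:\n\n{context}",
                        # Cache breakpoint: everything up to here can be reused across requests.
                        "cache_control": {"type": "ephemeral"},
                    },
                    {"type": "text", "text": question},
                ],
            }
        ]

Any warming request must reuse exactly this context text, byte for byte, or the cached prefix will not match.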

Example 1: Standard Prompt Caching (Without Speculative Caching)

First, let's see how standard prompt caching works. The user types their question, then we send the entire context + question to the API:

[85]
[86]
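
The code behind cells [85]-[86] isn't included above. Here is a sketch of the flow they produce, reusing the assumed client and helpers from the setup cells; the 3-second sleep stands in for the user typing, and the streaming and timing details are illustrative rather than the notebook's exact code:

    import time

    context = download_context()

    print("User is typing their question...")
    time.sleep(3)  # simulate the user typing; nothing useful happens during this time
    question = "What is the purpose of the BtShared structure?"
    print(f"User submitted: {question}")

    print("\nSending request to API...")
    start = time.time()
    first_token_time = None

    # The cache is cold, so this single request pays for cache creation AND generation.
    with client.messages.stream(
        model=MODEL,
        max_tokens=1024,
        messages=build_messages(context, question),
    ) as stream:
        for _ in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.time() - start
        response = stream.get_final_message()

    print(f"\n🕐 Time to first token: {first_token_time:.2f} seconds")
    print(f"Total response time: {time.time() - start:.2f} seconds")

    usage = response.usage
    print("\nStandard Caching query statistics:")
    print(f"\tInput tokens: {usage.input_tokens}")
    print(f"\tOutput tokens: {usage.output_tokens}")
    print(f"\tCache read input tokens: {usage.cache_read_input_tokens}")
    print(f"\tCache creation input tokens: {usage.cache_creation_input_tokens}")
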
Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.30 seconds
User is typing their question...
User submitted: What is the purpose of the BtShared structure?

Sending request to API...

🕐 Time to first token: 20.87 seconds
Total response time: 28.32 seconds

Standard Caching query statistics:
	Input tokens: 22
	Output tokens: 362
	Cache read input tokens: 0
	Cache creation input tokens: 151629

Example 2: Speculative Prompt Caching

Now let's see how speculative prompt caching improves TTFT by warming the cache while the user is typing:

[87]
[88]
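
Cells [87]-[88] aren't shown either. A sketch of the speculative variant, under the same assumptions as above: as soon as the user starts typing, a background thread fires a warming request whose only content is the cacheable context block, capped at max_tokens=1, so cache creation overlaps with typing instead of sitting on the critical path.

    import threading
    import time

    context = download_context()


    def warm_cache(context: str) -> None:
        """Create the cache entry with a minimal 1-token request containing only the cached prefix."""
        print("🔄 Starting cache warming in background...")
        client.messages.create(
            model=MODEL,
            max_tokens=1,  # we only want the cache-creation side effect
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            # Must match the real request's cached block exactly.
                            "text": f"Here is the SQLite btree source code:\n\n{context}",
                            "cache_control": {"type": "ephemeral"},
                        }
                    ],
                }
            ],
        )
        print("✅ Cache warming completed!")


    print("User is typing their question...")
    warming_thread = threading.Thread(target=warm_cache, args=(context,))
    warming_thread.start()  # warming runs while the user types

    time.sleep(3)  # simulate the user typing
    question = "What is the purpose of the BtShared structure?"
    print(f"User submitted: {question}")
    warming_thread.join()  # make sure the cache entry exists before the real request

    print("\nSending request to API (with warm cache)...")
    start = time.time()
    first_token_time = None

    with client.messages.stream(
        model=MODEL,
        max_tokens=1024,
        messages=build_messages(context, question),
    ) as stream:
        for _ in stream.text_stream:
            if first_token_time is None:
                first_token_time = time.time() - start
        response = stream.get_final_message()

    print(f"\n🚀 Time to first token: {first_token_time:.2f} seconds")
    print(f"Total response time: {time.time() - start:.2f} seconds")

    usage = response.usage
    print("\nSpeculative Caching query statistics:")
    print(f"\tCache read input tokens: {usage.cache_read_input_tokens}")
    print(f"\tCache creation input tokens: {usage.cache_creation_input_tokens}")

In a real UI the warming call would be triggered by a focus or keypress event rather than a sleep, and the join is only a safeguard: if the user submits before warming finishes, the request still succeeds, it just gets less of the benefit.
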
Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.36 seconds
User is typing their question...
🔄 Starting cache warming in background...
User submitted: What is the purpose of the BtShared structure?
✅ Cache warming completed!

Sending request to API (with warm cache)...

🚀 Time to first token: 1.94 seconds
Total response time: 8.40 seconds

Speculative Caching query statistics:
	Input tokens: 22
	Output tokens: 330
	Cache read input tokens: 151629
	Cache creation input tokens: 0

Performance Comparison

Let's compare the results to see the benefit of speculative caching:

[89]
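
Cell [89] just turns the two sets of timings into relative improvements. A sketch, assuming the measurements from the two runs were kept in variables such as standard_ttft, standard_total, speculative_ttft, and speculative_total:

    def pct_faster(baseline: float, improved: float) -> float:
        """Percentage reduction relative to the baseline time."""
        return (baseline - improved) / baseline * 100

    print("🎯 IMPROVEMENTS:")
    print(f"  TTFT Improvement: {pct_faster(standard_ttft, speculative_ttft):.1f}% "
          f"({standard_ttft - speculative_ttft:.2f}s faster)")
    print(f"  Total Time Improvement: {pct_faster(standard_total, speculative_total):.1f}% "
          f"({standard_total - speculative_total:.2f}s faster)")
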
============================================================
PERFORMANCE COMPARISON
============================================================

Standard Prompt Caching:
  Time to First Token: 20.87 seconds
  Total Response Time: 28.32 seconds

Speculative Prompt Caching:
  Time to First Token: 1.94 seconds
  Total Response Time: 8.40 seconds

🎯 IMPROVEMENTS:
  TTFT Improvement: 90.7% (18.93s faster)
  Total Time Improvement: 70.4% (19.92s faster)

Key Takeaways

  1. Speculative caching dramatically reduces TTFT by warming the cache while users are typing
  2. The pattern is most effective with large contexts (>1000 tokens) that are reused across queries
  3. Implementation is simple - just send a 1-token request containing the same cacheable prefix while the user is typing (see the sketch after this list)
  4. Cache warming happens in parallel with user input, effectively "hiding" the cache creation time
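
To underline takeaway 3, here is the warming step in isolation (names as assumed in the earlier sketches): the same cacheable prefix as the real request, capped at a single output token.

    client.messages.create(
        model=MODEL,
        max_tokens=1,  # minimal generation; the cache entry is the only thing we want
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        # Identical to the real request's cached block, or the cache won't be hit.
                        "text": f"Here is the SQLite btree source code:\n\n{context}",
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
            }
        ],
    )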

Best Practices

  • Start cache warming as early as possible (e.g., when a user focuses an input field)
  • Use exactly the same context for warming and actual requests to ensure cache hits
  • Monitor cache_read_input_tokens to verify cache hits
  • Add timestamps (or another per-session marker) to the cached prefix to prevent unwanted cache sharing across sessions, as in the snippet below
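
A small sketch of the last two practices together (all names assumed from the earlier sketches): a per-session marker baked into the cached prefix, and a check of cache_read_input_tokens on the real request. The warming request must use the identical cached_prefix string.

    import datetime

    # Per-session marker: keeps this session's cache entry from being shared with others.
    session_marker = f"Session started: {datetime.datetime.now().isoformat()}"
    cached_prefix = f"{session_marker}\n\nHere is the SQLite btree source code:\n\n{context}"

    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": cached_prefix, "cache_control": {"type": "ephemeral"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    )

    # Verify the warm cache was actually used.
    if (response.usage.cache_read_input_tokens or 0) > 0:
        print(f"Cache hit: {response.usage.cache_read_input_tokens} tokens read from cache")
    else:
        print("Cache miss - the warming request's prefix did not match exactly")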