Speculative Prompt Caching
This cookbook demonstrates "Speculative Prompt Caching" - a pattern that reduces time-to-first-token (TTFT) by warming up the cache while users are still formulating their queries.
Without Speculative Caching:
- User types their question (3 seconds)
- User submits question
- API loads context into cache AND generates response
With Speculative Caching:
- User starts typing (cache warming begins immediately)
- User continues typing (cache warming completes in background)
- User submits question
- API uses warm cache to generate response
Setup
First, let's install the required packages:
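A minimal sketch of the install step; the exact package list is an assumption (anthropic for the API client, requests for fetching the context files):

```python
%pip install anthropic requests
```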
Note: you may need to restart the kernel to use updated packages.
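A sketch of the setup, assuming the client picks up ANTHROPIC_API_KEY from the environment; the model name below is an assumption, and any model that supports prompt caching works:

```python
import threading
import time

import anthropic
import requests

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
MODEL = "claude-3-5-sonnet-20241022"  # assumption: any prompt-caching-capable model
```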
Helper Functions
Let's set up the functions to download our large context and prepare messages:
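A sketch of these helpers, consistent with the output shown below; the GitHub raw URLs and the system-prompt wording are assumptions:

```python
SQLITE_FILES = {
    # Assumed source: the official SQLite mirror on GitHub.
    "btree.h": "https://raw.githubusercontent.com/sqlite/sqlite/master/src/btree.h",
    "btree.c": "https://raw.githubusercontent.com/sqlite/sqlite/master/src/btree.c",
}

def download_context() -> str:
    """Download the SQLite sources and concatenate them into one large context."""
    print("Downloading SQLite source files...")
    start = time.time()
    parts = []
    for name, url in SQLITE_FILES.items():
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        parts.append(f"/* {name} */\n{response.text}")
        print(f"Successfully downloaded {name}")
    print(f"Downloaded {len(parts)} files in {time.time() - start:.2f} seconds")
    return "\n\n".join(parts)

def build_system_prompt(context: str) -> list:
    """Wrap the context in a single system block marked as cacheable."""
    return [
        {
            "type": "text",
            "text": f"You are an expert on the SQLite source code.\n\n{context}",
            "cache_control": {"type": "ephemeral"},  # cache everything up to this block
        }
    ]
```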
Example 1: Standard Prompt Caching (Without Speculative Caching)
First, let's see how standard prompt caching works. The user types their question, then we send the entire context + question to the API:
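A sketch of the standard flow. simulate_typing and stream_and_report are hypothetical helpers standing in for the cookbook's own; streaming is used so TTFT can be measured, and the usage fields match those printed in the output below:

```python
def simulate_typing(seconds: float = 3.0) -> str:
    """Stand-in for the user composing their question."""
    print("User is typing their question...")
    time.sleep(seconds)
    question = "What is the purpose of the BtShared structure?"
    print(f"User submitted: {question}")
    return question

def stream_and_report(label: str, context: str, question: str) -> dict:
    """Stream a response, printing TTFT, total time, and cache statistics."""
    start = time.time()
    ttft = None
    with client.messages.stream(
        model=MODEL,
        max_tokens=1024,
        system=build_system_prompt(context),
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for _ in stream.text_stream:
            if ttft is None:
                ttft = time.time() - start
                print(f"Time to first token: {ttft:.2f} seconds")
        message = stream.get_final_message()
    total = time.time() - start
    print(f"Total response time: {total:.2f} seconds")
    usage = message.usage
    print(f"{label} query statistics:")
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Output tokens: {usage.output_tokens}")
    print(f"Cache read input tokens: {usage.cache_read_input_tokens}")
    print(f"Cache creation input tokens: {usage.cache_creation_input_tokens}")
    return {"ttft": ttft, "total": total}

context = download_context()
question = simulate_typing()
print("Sending request to API...")
standard_stats = stream_and_report("Standard Caching", context, question)
```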
Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.30 seconds

User is typing their question...
User submitted: What is the purpose of the BtShared structure?

Sending request to API...
Time to first token: 20.87 seconds
Total response time: 28.32 seconds

Standard Caching query statistics:
Input tokens: 22
Output tokens: 362
Cache read input tokens: 0
Cache creation input tokens: 151629
Example 2: Speculative Prompt Caching
Now let's see how speculative prompt caching improves TTFT by warming the cache while the user is typing:
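Again a sketch rather than the original cells. The warm-up request carries the identical cacheable system block and asks for a single token, and a background thread hides its latency behind the user's typing. The "ping" user turn is an assumption; only the system block needs to match for a cache hit:

```python
def warm_cache(context: str) -> None:
    """Fire a minimal request whose only job is to write the context into the cache."""
    print("Starting cache warming in background...")
    client.messages.create(
        model=MODEL,
        max_tokens=1,  # we only need the cache write, not a real answer
        system=build_system_prompt(context),  # must match the real request exactly
        messages=[{"role": "user", "content": "ping"}],  # assumption: any short turn works
    )
    print("Cache warming completed!")

context = download_context()
warmer = threading.Thread(target=warm_cache, args=(context,))
warmer.start()   # begin warming the moment the user starts typing
question = simulate_typing()
warmer.join()    # normally finishes while the user is still typing
print("Sending request to API (with warm cache)...")
speculative_stats = stream_and_report("Speculative Caching", context, question)
```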
Downloading SQLite source files...
Successfully downloaded btree.h
Successfully downloaded btree.c
Downloaded 2 files in 0.36 seconds

User is typing their question...
Starting cache warming in background...
User submitted: What is the purpose of the BtShared structure?
Cache warming completed!

Sending request to API (with warm cache)...
Time to first token: 1.94 seconds
Total response time: 8.40 seconds

Speculative Caching query statistics:
Input tokens: 22
Output tokens: 330
Cache read input tokens: 151629
Cache creation input tokens: 0
Performance Comparison
Let's compare the results to see the benefit of speculative caching:
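A sketch of the comparison, consuming the stats dicts returned by the hypothetical stream_and_report helper above:

```python
def print_comparison(standard: dict, speculative: dict) -> None:
    print("=" * 60)
    print("PERFORMANCE COMPARISON")
    print("=" * 60)
    print("Standard Prompt Caching:")
    print(f"Time to First Token: {standard['ttft']:.2f} seconds")
    print(f"Total Response Time: {standard['total']:.2f} seconds")
    print("Speculative Prompt Caching:")
    print(f"Time to First Token: {speculative['ttft']:.2f} seconds")
    print(f"Total Response Time: {speculative['total']:.2f} seconds")
    ttft_gain = standard["ttft"] - speculative["ttft"]
    total_gain = standard["total"] - speculative["total"]
    print("IMPROVEMENTS:")
    print(f"TTFT Improvement: {ttft_gain / standard['ttft']:.1%} ({ttft_gain:.2f}s faster)")
    print(f"Total Time Improvement: {total_gain / standard['total']:.1%} ({total_gain:.2f}s faster)")

print_comparison(standard_stats, speculative_stats)
```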
============================================================
PERFORMANCE COMPARISON
============================================================
Standard Prompt Caching:
Time to First Token: 20.87 seconds
Total Response Time: 28.32 seconds

Speculative Prompt Caching:
Time to First Token: 1.94 seconds
Total Response Time: 8.40 seconds

IMPROVEMENTS:
TTFT Improvement: 90.7% (18.93s faster)
Total Time Improvement: 70.4% (19.92s faster)
Key Takeaways
- Speculative caching dramatically reduces TTFT by warming the cache while users are typing
- The pattern is most effective with large contexts (roughly 1,024+ tokens, the minimum cacheable prompt length for most models) that are reused across queries
- Implementation is simple - just send a 1-token request while the user is typing
- Cache warming happens in parallel with user input, effectively "hiding" the cache creation time
Best Practices
- Start cache warming as early as possible (e.g., when a user focuses an input field)
- Use exactly the same context for warming and actual requests to ensure cache hits
- Monitor cache_read_input_tokens to verify cache hits
- Add timestamps to prevent unwanted cache sharing across sessions