A UK public sector organisation came to us after their LLM costs tripled overnight. They'd added a 12,000-token policy document to their system prompt—sensible for compliance—but every one of 8,000 daily queries was reprocessing that entire document. Within a week of implementing prompt caching, their bill dropped by 82%.
Prompt caching is one of the highest-impact, lowest-effort optimisations available for production LLM applications. This guide explains how it works, when to use it, and how to avoid the pitfalls we see in the field.
Prompt caching is an API-level optimisation that stores the processed state of your prompt prefix, allowing subsequent requests with the same prefix to skip redundant computation. When you send a prompt to an LLM, the model processes each token through multiple transformer layers—a computationally expensive operation. Prompt caching stores this intermediate state (called the KV cache) so it doesn't need to be recomputed.
Think of it like preparing ingredients for a recipe. Without caching, you chop vegetables from scratch every time you cook. With caching, you prepare them once and store them ready to use—you only need to add the final fresh ingredients each time.
Here's the technical flow of prompt caching:

1. The first request processes the full prompt and stores the intermediate KV state for the static prefix, paying a one-off cache-write premium.
2. Subsequent requests whose prefix matches exactly reuse that stored state at a heavily discounted rate.
3. Only the tokens after the cached prefix (typically the user's query) are processed from scratch.
4. If no matching request arrives before the cache's time-to-live (TTL) expires, the next request starts cold and rewrites the cache.

In short, prompt caching stores the processed system prompt, so subsequent requests only need to process new user queries.
Major LLM providers offer prompt caching with different specifications. Anthropic's Claude requires explicit cache_control markers and bills cached reads at roughly 10% of the normal input price (with a 25% premium on cache writes), while OpenAI caches eligible prefixes automatically at a 50% discount. Minimum cacheable prompt sizes and cache lifetimes also differ, so check the current documentation for the models you use.
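For the automatic path, here's a minimal sketch using the official `openai` Node SDK that checks how many prompt tokens OpenAI reports as served from the cache. The model name and prompt contents are placeholders, and the usage fields may vary by SDK version, so treat this as illustrative rather than definitive:

```typescript
// Sketch: confirming OpenAI's automatic prompt caching via the usage report.
// Assumes the official `openai` Node SDK with OPENAI_API_KEY set; the model
// name and prompt contents are placeholders.
import OpenAI from "openai";

const client = new OpenAI();

// In practice this would be your real static prefix of 1,024+ tokens.
const staticSystemPrompt = "You are a customer support agent for TechCorp. ...";

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: staticSystemPrompt },         // static prefix first
    { role: "user", content: "I want to return my laptop" }, // dynamic suffix last
  ],
});

// OpenAI reports how much of the prompt was served from cache.
const cached = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
console.log(`Cached ${cached} of ${response.usage?.prompt_tokens ?? 0} prompt tokens`);
```

If `cached_tokens` stays at zero despite a stable prefix, the prompt is probably below the minimum cacheable size or the prefix is varying between requests.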
Prompt caching delivers the most value when a large, static prefix is reprocessed on every request: long system prompts, policy or compliance documents, product catalogues, and stable tool definitions, combined with request volumes high and frequent enough to keep the cache warm.

Prompt caching isn't always the right solution: if the prefix changes from request to request, traffic is too sparse to land within the cache TTL, or the prompt falls below the provider's minimum cacheable size, you'll pay the cache-write premium without ever collecting the discount.
To maximise cache hits, structure your prompts with static content first:
[#ff7b72]">class=[#ff7b72]">class="text-[#a5d6ff]">"text-[#8b949e]">// Optimal structure [#ff7b72]">for prompt caching[#ff7b72]">const prompt = { [#79c0ff]">system: ` [STATIC - CACHEABLE] You are a customer support agent [#ff7b72]">for TechCorp. COMPANY [#79c0ff]">POLICIES: - Returns accepted within [#79c0ff]">30 days - Premium members get priority support - ... (thousands of tokens of policies) PRODUCT [#79c0ff]">CATALOGUE: - ... (product details) RESPONSE [#79c0ff]">GUIDELINES: - Be professional and empathetic - Always verify customer identity first `, [#ff7b72]">class=[#ff7b72]">class="text-[#a5d6ff]">"text-[#8b949e]">// Dynamic content comes last [#79c0ff]">user: ` [DYNAMIC - NOT CACHED] [#79c0ff]">Customer: [#ff7b72]">class="text-[#a5d6ff]">"I want to [#ff7b72]">return my laptop" `};The key principle: static content first, dynamic content last. The cache matches from the beginning of your prompt, so any variation in the prefix will invalidate the cache.
Let's calculate the savings for a typical enterprise scenario:
Scenario: a customer support chatbot
- System prompt: 10,000 tokens
- Average user query: 200 tokens
- Queries per day: 5,000
- Cache hit rate: 95%
Without caching (Claude Sonnet):
- Daily input tokens: 10,200 × 5,000 = 51,000,000
- Cost: ~$153/day (at $3 per million input tokens)

With caching:
- Cold cache (5%): 250 requests write the 10,000-token prefix at the 25% cache-write premium, with the 200-token queries billed at the normal rate (~$10/day)
- Warm cache (95%): 4,750 requests read the prefix at 10% of the input price, again with queries billed normally (~$17/day)
- Cost: ~$27/day
- Savings: ~$126/day (an 83% reduction)
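These figures are easy to sanity-check. A minimal sketch, assuming Claude Sonnet list pricing of $3 per million input tokens, with cache writes at a 25% premium and cache reads at 10% of the input rate (swap in current rates for your model):

```typescript
// Back-of-envelope savings calculator for the scenario above.
// Pricing constants are assumptions based on published Claude Sonnet rates;
// substitute current pricing for your model before relying on the output.
const PRICE_INPUT = 3.0 / 1_000_000;        // $ per normal input token
const PRICE_CACHE_WRITE = 3.75 / 1_000_000; // 25% premium when writing the cache
const PRICE_CACHE_READ = 0.3 / 1_000_000;   // 10% of the input rate on cache hits

const systemTokens = 10_000;
const queryTokens = 200;
const dailyQueries = 5_000;
const hitRate = 0.95;

// Without caching, every request pays full price for the entire prompt.
const withoutCaching = dailyQueries * (systemTokens + queryTokens) * PRICE_INPUT;

// With caching, misses write the prefix at a premium and hits read it at a
// discount; the user query is always billed at the normal input rate.
const misses = dailyQueries * (1 - hitRate);
const hits = dailyQueries * hitRate;
const withCaching =
  misses * (systemTokens * PRICE_CACHE_WRITE + queryTokens * PRICE_INPUT) +
  hits * (systemTokens * PRICE_CACHE_READ + queryTokens * PRICE_INPUT);

console.log(`Without caching: ~$${withoutCaching.toFixed(0)}/day`);                        // ~$153
console.log(`With caching:    ~$${withCaching.toFixed(0)}/day`);                           // ~$27
console.log(`Saving:          ${(100 * (1 - withCaching / withoutCaching)).toFixed(0)}%`); // ~83%
```

Adjusting the hit rate or prompt sizes in the sketch is a quick way to test whether caching will pay off for your own workload before touching production.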
Before rolling caching out to production, work through this checklist:

- Audit the prompt prefix for dynamic content: timestamps, session IDs, and user names anywhere in the prefix invalidate the cache.
- Verify the prefix exceeds the minimum threshold: Claude requires 1,024+ tokens, and OpenAI also only caches prompts of 1,024+ tokens (in 128-token increments).
- Set up cache hit rate monitoring: you can't optimise what you don't measure (see the sketch after this checklist).
- Check request frequency against the cache TTL: gaps longer than the typical 5-minute TTL mean a cold cache every time.
- Use explicit cache_control breakpoints for Claude: relying on automatic caching may not align with the boundaries you intend to cache.
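As a starting point for that monitoring, here's a minimal sketch that derives a hit rate from the usage counters Anthropic returns on each response (cache_read_input_tokens and cache_creation_input_tokens); other providers expose similar counters under different names, so adapt the field names accordingly:

```typescript
// Sketch: deriving a prompt-cache hit rate from Anthropic usage metadata.
// Field names follow the Anthropic Messages API response; other providers
// expose similar counters under different names.
interface UsageLike {
  cache_read_input_tokens?: number;     // prefix tokens served from the cache
  cache_creation_input_tokens?: number; // prefix tokens written to the cache (misses)
}

let cacheReadTokens = 0;
let cacheWriteTokens = 0;

// Call this with the `usage` object from every API response.
function recordUsage(usage: UsageLike): void {
  cacheReadTokens += usage.cache_read_input_tokens ?? 0;
  cacheWriteTokens += usage.cache_creation_input_tokens ?? 0;
}

function cacheHitRate(): number {
  const cacheable = cacheReadTokens + cacheWriteTokens;
  return cacheable === 0 ? 0 : cacheReadTokens / cacheable;
}

// Example: flag a problem if fewer than 80% of cacheable tokens come from the cache.
recordUsage({ cache_creation_input_tokens: 10_000 }); // first request: cold
recordUsage({ cache_read_input_tokens: 10_000 });     // second request: warm
if (cacheHitRate() < 0.8) {
  console.warn(`Prompt cache hit rate low: ${(cacheHitRate() * 100).toFixed(1)}%`);
}
```

A hit rate well below the expected 90-95% usually points back to the first two checklist items: dynamic content in the prefix or a prompt under the minimum cacheable size.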
It's important to distinguish prompt caching from response caching—they're complementary techniques:
| Aspect | Prompt Caching | Response Caching |
|---|---|---|
| Where | Provider's infrastructure | Your application layer |
| What's cached | Computed KV pairs from prompt processing | Final generated responses |
| When useful | Same prefix, different completions | Identical queries expecting same answer |
| Savings | 50-90% on input token costs | 100% (no API call needed) |
| Limitations | TTL-based, prefix matching only | Requires exact match, staleness risk |
For optimal cost efficiency, consider using both: prompt caching for the API layer and response caching (perhaps with semantic similarity matching) at the application layer for frequently repeated queries.
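To illustrate the application-layer side, here's a minimal sketch of an exact-match response cache sitting in front of an LLM call. The callLLM function is a placeholder for your real provider client; a production version would add TTL-based expiry and could swap the exact-match key for a semantic-similarity lookup as suggested above:

```typescript
// Sketch: exact-match response caching at the application layer.
// `callLLM` is a placeholder for your real provider client, which still
// benefits from prompt caching on the provider side when it is called.
const responseCache = new Map<string, string>();

async function callLLM(prompt: string): Promise<string> {
  return `response to: ${prompt}`; // placeholder
}

async function cachedCompletion(prompt: string): Promise<string> {
  const key = prompt.trim().toLowerCase(); // naive key; consider hashing or embeddings
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit; // 100% saving: no API call at all

  const answer = await callLLM(prompt);
  responseCache.set(key, answer);
  return answer;
}

// Example: the second call is served without touching the API.
console.log(await cachedCompletion("What is your returns policy?"));
console.log(await cachedCompletion("What is your returns policy?"));
```

Exact matching keeps the sketch simple; semantic matching widens coverage but introduces the staleness and false-positive risks noted in the table.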
Prompt caching is one of the easiest wins in LLM cost optimisation. For applications with consistent system prompts and high request volumes, it can reduce costs by 80%+ with minimal code changes.
The key is intentional prompt architecture: place static content at the beginning, keep dynamic elements at the end, and monitor your cache hit rates to continuously optimise.
Struggling with LLM costs at scale? We audit AI infrastructure and implement cost controls as part of our delivery work.
Learn about our method →