A UK public sector organisation came to us after their LLM costs tripled overnight. They'd added a 12,000-token policy document to their system prompt—sensible for compliance—but every one of 8,000 daily queries was reprocessing that entire document. Within a week of implementing prompt caching, their bill dropped by 82%.
Prompt caching is one of the highest-impact, lowest-effort optimisations available for production LLM applications. This guide explains how it works, when to use it, and how to avoid the pitfalls we see in the field.
Prompt caching is an API-level optimisation that stores the processed state of your prompt prefix, allowing subsequent requests with the same prefix to skip redundant computation. When you send a prompt to an LLM, the model processes each token through multiple transformer layers—a computationally expensive operation. Prompt caching stores this intermediate state (called the KV cache) so it doesn't need to be recomputed.
Think of it like preparing ingredients for a recipe. Without caching, you chop vegetables from scratch every time you cook. With caching, you prepare them once and store them ready to use—you only need to add the final fresh ingredients each time.
Here's the technical flow of prompt caching:

1. The first request processes the full prompt and stores the intermediate KV state for the static prefix, paying a one-off cache-write premium.
2. Subsequent requests whose prefix matches exactly reuse that stored state at a heavily discounted rate.
3. Only the tokens after the cached prefix (typically the user's query) are processed from scratch.
4. If no matching request arrives before the cache's time-to-live (TTL) expires, the next request starts cold and rewrites the cache.

In short, prompt caching stores the processed system prompt, so subsequent requests only need to process new user queries.
Major LLM providers offer prompt caching with different specifications. Anthropic's Claude requires explicit cache_control markers and bills cached reads at roughly 10% of the normal input price (with a 25% premium on cache writes), while OpenAI caches eligible prefixes automatically at a 50% discount. Minimum cacheable prompt sizes and cache lifetimes also differ, so check the current documentation for the models you use.
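For the automatic path, here's a minimal sketch using the official `openai` Node SDK that checks how many prompt tokens OpenAI reports as served from the cache. The model name and prompt contents are placeholders, and the usage fields may vary by SDK version, so treat this as illustrative rather than definitive:

```typescript
// Sketch: confirming OpenAI's automatic prompt caching via the usage report.
// Assumes the official `openai` Node SDK with OPENAI_API_KEY set; the model
// name and prompt contents are placeholders.
import OpenAI from "openai";

const client = new OpenAI();

// In practice this would be your real static prefix of 1,024+ tokens.
const staticSystemPrompt = "You are a customer support agent for TechCorp. ...";

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: staticSystemPrompt },         // static prefix first
    { role: "user", content: "I want to return my laptop" }, // dynamic suffix last
  ],
});

// OpenAI reports how much of the prompt was served from cache.
const cached = response.usage?.prompt_tokens_details?.cached_tokens ?? 0;
console.log(`Cached ${cached} of ${response.usage?.prompt_tokens ?? 0} prompt tokens`);
```

If `cached_tokens` stays at zero despite a stable prefix, the prompt is probably below the minimum cacheable size or the prefix is varying between requests.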
Prompt caching delivers the most value when a large, static prefix is reprocessed on every request: long system prompts, policy or compliance documents, product catalogues, and stable tool definitions, combined with request volumes high and frequent enough to keep the cache warm.

Prompt caching isn't always the right solution: if the prefix changes from request to request, traffic is too sparse to land within the cache TTL, or the prompt falls below the provider's minimum cacheable size, you'll pay the cache-write premium without ever collecting the discount.
To maximise cache hits, structure your prompts with static content first:
[#ff7b72]">class=[#ff7b72]">class="text-[#a5d6ff]">"text-[#8b949e]">// Optimal structure [#ff7b72]">for prompt caching[#ff7b72]">const prompt = { [#79c0ff]">system: ` [STATIC - CACHEABLE] You are a customer support agent [#ff7b72]">for TechCorp. COMPANY [#79c0ff]">POLICIES: - Returns accepted within [#79c0ff]">30 days - Premium members get priority support - ... (thousands of tokens of policies) PRODUCT [#79c0ff]">CATALOGUE: - ... (product details) RESPONSE [#79c0ff]">GUIDELINES: - Be professional and empathetic - Always verify customer identity first `, [#ff7b72]">class=[#ff7b72]">class="text-[#a5d6ff]">"text-[#8b949e]">// Dynamic content comes last [#79c0ff]">user: ` [DYNAMIC - NOT CACHED] [#79c0ff]">Customer: [#ff7b72]">class="text-[#a5d6ff]">"I want to [#ff7b72]">return my laptop" `};The key principle: static content first, dynamic content last. The cache matches from the beginning of your prompt, so any variation in the prefix will invalidate the cache.
Let's calculate the savings for a typical enterprise scenario:
Scenario: a customer support chatbot
- System prompt: 10,000 tokens
- Average user query: 200 tokens
- Queries per day: 5,000
- Cache hit rate: 95%
Without caching (Claude Sonnet):
- Daily input tokens: 10,200 × 5,000 = 51,000,000
- Cost: ~$153/day (at $3 per million input tokens)

With caching:
- Cold cache (5%): 250 requests write the 10,000-token prefix at the 25% cache-write premium, with the 200-token queries billed at the normal rate (~$10/day)
- Warm cache (95%): 4,750 requests read the prefix at 10% of the input price, again with queries billed normally (~$17/day)
- Cost: ~$27/day
- Savings: ~$126/day (an 83% reduction)
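These figures are easy to sanity-check. A minimal sketch, assuming Claude Sonnet list pricing of $3 per million input tokens, with cache writes at a 25% premium and cache reads at 10% of the input rate (swap in current rates for your model):

```typescript
// Back-of-envelope savings calculator for the scenario above.
// Pricing constants are assumptions based on published Claude Sonnet rates;
// substitute current pricing for your model before relying on the output.
const PRICE_INPUT = 3.0 / 1_000_000;        // $ per normal input token
const PRICE_CACHE_WRITE = 3.75 / 1_000_000; // 25% premium when writing the cache
const PRICE_CACHE_READ = 0.3 / 1_000_000;   // 10% of the input rate on cache hits

const systemTokens = 10_000;
const queryTokens = 200;
const dailyQueries = 5_000;
const hitRate = 0.95;

// Without caching, every request pays full price for the entire prompt.
const withoutCaching = dailyQueries * (systemTokens + queryTokens) * PRICE_INPUT;

// With caching, misses write the prefix at a premium and hits read it at a
// discount; the user query is always billed at the normal input rate.
const misses = dailyQueries * (1 - hitRate);
const hits = dailyQueries * hitRate;
const withCaching =
  misses * (systemTokens * PRICE_CACHE_WRITE + queryTokens * PRICE_INPUT) +
  hits * (systemTokens * PRICE_CACHE_READ + queryTokens * PRICE_INPUT);

console.log(`Without caching: ~$${withoutCaching.toFixed(0)}/day`);                        // ~$153
console.log(`With caching:    ~$${withCaching.toFixed(0)}/day`);                           // ~$27
console.log(`Saving:          ${(100 * (1 - withCaching / withoutCaching)).toFixed(0)}%`); // ~83%
```

Adjusting the hit rate or prompt sizes in the sketch is a quick way to test whether caching will pay off for your own workload before touching production.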
Before rolling caching out to production, work through this checklist:

- Audit the prompt prefix for dynamic content: timestamps, session IDs, and user names anywhere in the prefix invalidate the cache.
- Verify the prefix exceeds the minimum threshold: Claude requires 1,024+ tokens, and OpenAI also only caches prompts of 1,024+ tokens (in 128-token increments).
- Set up cache hit rate monitoring: you can't optimise what you don't measure (see the sketch after this checklist).
- Check request frequency against the cache TTL: gaps longer than the typical 5-minute TTL mean a cold cache every time.
- Use explicit cache_control breakpoints for Claude: relying on automatic caching may not align with the boundaries you intend to cache.
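As a starting point for that monitoring, here's a minimal sketch that derives a hit rate from the usage counters Anthropic returns on each response (cache_read_input_tokens and cache_creation_input_tokens); other providers expose similar counters under different names, so adapt the field names accordingly:

```typescript
// Sketch: deriving a prompt-cache hit rate from Anthropic usage metadata.
// Field names follow the Anthropic Messages API response; other providers
// expose similar counters under different names.
interface UsageLike {
  cache_read_input_tokens?: number;     // prefix tokens served from the cache
  cache_creation_input_tokens?: number; // prefix tokens written to the cache (misses)
}

let cacheReadTokens = 0;
let cacheWriteTokens = 0;

// Call this with the `usage` object from every API response.
function recordUsage(usage: UsageLike): void {
  cacheReadTokens += usage.cache_read_input_tokens ?? 0;
  cacheWriteTokens += usage.cache_creation_input_tokens ?? 0;
}

function cacheHitRate(): number {
  const cacheable = cacheReadTokens + cacheWriteTokens;
  return cacheable === 0 ? 0 : cacheReadTokens / cacheable;
}

// Example: flag a problem if fewer than 80% of cacheable tokens come from the cache.
recordUsage({ cache_creation_input_tokens: 10_000 }); // first request: cold
recordUsage({ cache_read_input_tokens: 10_000 });     // second request: warm
if (cacheHitRate() < 0.8) {
  console.warn(`Prompt cache hit rate low: ${(cacheHitRate() * 100).toFixed(1)}%`);
}
```

A hit rate well below the expected 90-95% usually points back to the first two checklist items: dynamic content in the prefix or a prompt under the minimum cacheable size.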
It's important to distinguish prompt caching from response caching—they're complementary techniques:
| Aspect | Prompt Caching | Response Caching |
|---|---|---|
| Where | Provider's infrastructure | Your application layer |
| What's cached | Computed KV pairs from prompt processing | Final generated responses |
| When useful | Same prefix, different completions | Identical queries expecting same answer |
| Savings | 50-90% on input token costs | 100% (no API call needed) |
| Limitations | TTL-based, prefix matching only | Requires exact match, staleness risk |
For optimal cost efficiency, consider using both: prompt caching for the API layer and response caching (perhaps with semantic similarity matching) at the application layer for frequently repeated queries.
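To illustrate the application-layer side, here's a minimal sketch of an exact-match response cache sitting in front of an LLM call. The callLLM function is a placeholder for your real provider client; a production version would add TTL-based expiry and could swap the exact-match key for a semantic-similarity lookup as suggested above:

```typescript
// Sketch: exact-match response caching at the application layer.
// `callLLM` is a placeholder for your real provider client, which still
// benefits from prompt caching on the provider side when it is called.
const responseCache = new Map<string, string>();

async function callLLM(prompt: string): Promise<string> {
  return `response to: ${prompt}`; // placeholder
}

async function cachedCompletion(prompt: string): Promise<string> {
  const key = prompt.trim().toLowerCase(); // naive key; consider hashing or embeddings
  const hit = responseCache.get(key);
  if (hit !== undefined) return hit; // 100% saving: no API call at all

  const answer = await callLLM(prompt);
  responseCache.set(key, answer);
  return answer;
}

// Example: the second call is served without touching the API.
console.log(await cachedCompletion("What is your returns policy?"));
console.log(await cachedCompletion("What is your returns policy?"));
```

Exact matching keeps the sketch simple; semantic matching widens coverage but introduces the staleness and false-positive risks noted in the table.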
Prompt caching is one of the easiest wins in LLM cost optimisation. For applications with consistent system prompts and high request volumes, it can reduce costs by 80%+ with minimal code changes.
The key is intentional prompt architecture: place static content at the beginning, keep dynamic elements at the end, and monitor your cache hit rates to continuously optimise.
Struggling with LLM costs at scale? We audit AI infrastructure and implement cost controls as part of our delivery work.
Learn about our method →