LLM Prompt Caching Best Practices 2026: Reduce API Costs by 90%
Prompt caching is the single most effective cost optimization technique for LLM API usage. This guide shows you exactly how to implement it for both OpenAI and Anthropic Claude APIs, with real code examples and benchmarks.
What is Prompt Caching?
Prompt caching allows LLM providers to reuse computations from previous API calls when prompts share common prefixes. Instead of reprocessing the same system instructions, documentation, or context on every request, the server caches the intermediate key-value (KV) states and reuses them.
This technique is especially valuable for:
- Multi-turn conversations — chat history stays cached
- Agent applications — system prompts and tool definitions are reused
- RAG systems — retrieved documents can be cached
- Code review tools — repository context remains stable
Cost Savings Benchmark
Here's what the major providers offer:
| Provider | Cost Reduction | Latency Reduction | Cache Hit Pricing |
|---|---|---|---|
| Anthropic Claude | Up to 90% | Up to 85% | ~10% of input token price |
| OpenAI | Up to 75% | Up to 80% | ~25% of input token price |
Real-world example: A customer service bot with a 5,000-token system prompt that handles 10,000 requests per day. Without caching: ~$50/day on input tokens. With 90% cache hit rate: ~$5/day. That's $1,350/month in savings.
How Prompt Caching Works
The caching mechanism is based on prefix matching:
- Your request includes a
cache_controlmarker at a specific position - The server computes a hash of everything before that marker
- If a matching cache entry exists, it reuses the KV states
- Only the new content after the cache breakpoint is processed
┌─────────────────────────────────────┐
│ System Prompt (cached) │
│ + Tool Definitions (cached) │
│ + Conversation History (cached) │
│ ───────── cache breakpoint ─────── │
│ + New User Message (processed) │
└─────────────────────────────────────┘
Anthropic Claude Implementation
Claude offers two caching modes: automatic and explicit breakpoints.
Automatic Caching (Recommended)
The simplest approach — add a single cache_control field at the top level:
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
# Automatic caching - system handles breakpoint placement
cache_control={"type": "ephemeral"},
system="""You are a technical documentation assistant.
Your role is to explain complex topics clearly.
Always include code examples when relevant.
Format responses in markdown.""",
messages=[
{"role": "user", "content": "Explain prompt caching"}
]
)
# Check cache performance
print(f"Cache read: {response.usage.cache_read_input_tokens}")
print(f"Cache write: {response.usage.cache_write_input_tokens}")
Explicit Cache Breakpoints
For fine-grained control, place cache_control on specific content blocks:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are a code review assistant...",
"cache_control": {"type": "ephemeral"} # Cache breakpoint here
}
],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Review this code:\n```python\n...",
"cache_control": {"type": "ephemeral"} # Another breakpoint
}
]
}
]
)
Claude Caching Thresholds
| Minimum Tokens | Cache Lifetime | Best For |
|---|---|---|
| 1,024 tokens | 5 minutes (ephemeral) | Multi-turn conversations |
| 1,024 tokens | 1 hour (extended) | Long-running agents |
OpenAI Implementation
OpenAI's prompt caching is automatic for supported models. No code changes required — just structure your prompts correctly.
Supported Models
- gpt-4o and gpt-4o-mini
- gpt-4o-realtime-preview
- gpt-4-turbo (limited support)
Best Practices for OpenAI Caching
from openai import OpenAI
client = OpenAI()
# Structure prompts with static content FIRST
SYSTEM_PROMPT = """You are a helpful assistant with the following rules:
1. Be concise
2. Use examples
3. Format in markdown
[Long documentation or context here...]"""
response = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": SYSTEM_PROMPT}, # Static - will be cached
{"role": "user", "content": "What is prompt caching?"} # Dynamic
]
)
# Check usage for cache indicators
print(response.usage)
Key Requirements for Cache Hits
- Prefix must match exactly — same text, same order
- Minimum 1,024 tokens before the dynamic portion
- Same model — cache is model-specific
- Same tools/images — these are part of the prefix hash
Claude vs OpenAI Prompt Caching
Anthropic Claude
- ✅ 90% cost reduction (higher savings)
- ✅ Explicit control via cache_control
- ✅ Automatic mode available
- ✅ Extended cache lifetime (1 hour)
- ❌ Requires code changes
OpenAI
- ✅ 75% cost reduction
- ✅ Fully automatic (no code change)
- ✅ Simple prefix-matching
- ❌ Less control over caching
- ❌ Shorter cache lifetime (~5 min)
Practical Implementation Patterns
Pattern 1: Multi-Turn Conversation
# Store conversation history and cache it
conversation_history = []
def chat(user_message: str):
conversation_history.append({
"role": "user",
"content": user_message
})
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
cache_control={"type": "ephemeral"},
system=SYSTEM_PROMPT,
messages=conversation_history
)
conversation_history.append({
"role": "assistant",
"content": response.content[0].text
})
return response
Pattern 2: RAG with Document Caching
def rag_query(query: str, documents: list[str]):
# Documents are cached, query is dynamic
doc_text = "\n\n".join(documents)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": f"Reference documents:\n{doc_text}",
"cache_control": {"type": "ephemeral"}
},
{
"type": "text",
"text": "Answer questions based only on the reference documents."
}
],
messages=[{"role": "user", "content": query}]
)
return response
Pattern 3: Agent with Tool Definitions
TOOLS = [
{
"name": "search",
"description": "Search for information",
"input_schema": {...} # Large schema
},
# More tools...
]
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
cache_control={"type": "ephemeral"},
system="You are a helpful agent.",
tools=TOOLS, # Tool definitions are cached
messages=[{"role": "user", "content": "..."}]
)
Common Mistakes to Avoid
- Placing dynamic content early — breaks cache matching. Put user-specific data at the END of prompts.
- Changing prompt formatting — even whitespace changes break the cache. Use templates.
- Not checking cache hit rate — monitor
cache_read_input_tokensto verify caching is working. - Cache write costs — the first request pays a "cache write" premium (~25% extra input tokens). This pays off after 2+ requests.
Monitoring Cache Performance
def analyze_cache_performance(responses: list):
total_input = 0
total_cache_read = 0
total_cache_write = 0
for r in responses:
usage = r.usage
total_input += usage.input_tokens
total_cache_read += getattr(usage, 'cache_read_input_tokens', 0)
total_cache_write += getattr(usage, 'cache_write_input_tokens', 0)
hit_rate = total_cache_read / total_input if total_input > 0 else 0
print(f"Total input tokens: {total_input:,}")
print(f"Cache read tokens: {total_cache_read:,}")
print(f"Cache write tokens: {total_cache_write:,}")
print(f"Cache hit rate: {hit_rate:.1%}")
print(f"Estimated savings: ${calculate_savings(hit_rate):.2f}")
Conclusion
Prompt caching is a must-have optimization for any production LLM application. Key takeaways:
- Start with automatic caching — Claude's one-line
cache_controlis the easiest path - Structure prompts correctly — static content first, dynamic content last
- Monitor cache hit rates — aim for 80%+ in multi-turn scenarios
- Calculate ROI — cache write costs pay off after 2+ similar requests
Bottom Line
For Claude: Add cache_control={"type": "ephemeral"} to your requests. For OpenAI: Structure prompts with static prefixes. Both can reduce your API costs by 75-90% with minimal code changes.