LLM Prompt Caching Best Practices 2026: Reduce API Costs by 90%

Prompt caching is one of the most effective cost optimizations for LLM API usage when requests share long, repeated prefixes. This guide shows how to implement it for both the OpenAI and Anthropic Claude APIs, with code examples and benchmarks.

What is Prompt Caching?

Prompt caching allows LLM providers to reuse computations from previous API calls when prompts share common prefixes. Instead of reprocessing the same system instructions, documentation, or context on every request, the server caches the intermediate key-value (KV) states and reuses them.

This technique is especially valuable for:

Multi-turn conversations — chat history stays cached
Agent applications — system prompts and tool definitions are reused
RAG systems — retrieved documents can be cached
Code review tools — repository context remains stable

Cost Savings Benchmark

Here's what the major providers offer:

Provider	Cost Reduction	Latency Reduction	Cache Hit Pricing
Anthropic Claude	Up to 90%	Up to 85%	~10% of input token price
OpenAI	Up to 75%	Up to 80%	~25% of input token price

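Real-world example: A customer service bot with a 5,000-token system prompt that handles 10,000 requests per day sends 50 million prompt tokens daily. At an assumed input price of $1 per million tokens, that is ~$50/day without caching. If every request hits the cache at 10% of the input price, the cost falls to ~$5/day, or about $1,350/month in savings. At a 90% cache hit rate the figure is closer to $9.50/day (about $1,215/month saved), before cache write premiums.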
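The arithmetic behind an estimate like this can be sketched in a few lines. This is a rough model, not a billing calculator: the function name and parameters are illustrative, the $1.00 per million tokens price is an assumption chosen so the uncached baseline lands at about $50/day, and cache write premiums and output tokens are ignored.

```python
def daily_input_cost(
    prompt_tokens: int,
    requests_per_day: int,
    price_per_mtok: float,
    hit_rate: float = 0.0,
    cached_price_ratio: float = 0.10,
) -> float:
    """Estimate daily spend (USD) on a shared prompt prefix.

    hit_rate is the fraction of requests served from cache;
    cached_price_ratio is the cached-read price relative to the base price.
    """
    total_tokens = prompt_tokens * requests_per_day
    hit_tokens = total_tokens * hit_rate
    miss_tokens = total_tokens - hit_tokens
    # Cache hits are billed at a fraction of the normal input price
    billable_tokens = miss_tokens + hit_tokens * cached_price_ratio
    return billable_tokens * price_per_mtok / 1_000_000


# Assumed price of $1.00 per million input tokens (hypothetical)
for rate in (0.0, 0.9, 1.0):
    cost = daily_input_cost(5_000, 10_000, price_per_mtok=1.00, hit_rate=rate)
    print(f"hit rate {rate:.0%}: ${cost:.2f}/day")
```

Swapping in your own token counts, traffic, and price shows how sensitive the savings are to the hit rate.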

How Prompt Caching Works

The caching mechanism is based on prefix matching:

Your request includes a cache_control marker at a specific position
The server computes a hash of everything up to and including the marked block
If a matching cache entry exists, it reuses the KV states
Only the new content after the cache breakpoint is processed

┌─────────────────────────────────────┐
│ System Prompt (cached)              │
│ + Tool Definitions (cached)         │
│ + Conversation History (cached)     │
│ ───────── cache breakpoint ──────── │
│ + New User Message (processed)      │
└─────────────────────────────────────┘
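The prefix-matching idea can be illustrated with a toy model. The sketch below is not how any provider implements caching (real systems store KV states, not just hashes), and `ToyPrefixCache` is a hypothetical class invented for this example. It only shows why an identical prefix produces a hit and any change produces a miss.

```python
import hashlib


class ToyPrefixCache:
    """Illustrative stand-in for a provider-side prefix cache (not a real API)."""

    def __init__(self) -> None:
        self._entries: set[str] = set()

    @staticmethod
    def _key(prefix_blocks: list[str]) -> str:
        # Hash the blocks in order, so any change to text or order changes the key
        digest = hashlib.sha256()
        for block in prefix_blocks:
            digest.update(block.encode("utf-8"))
            digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        return digest.hexdigest()

    def lookup_or_store(self, prefix_blocks: list[str]) -> bool:
        """Return True on a cache hit; on a miss, store the prefix and return False."""
        key = self._key(prefix_blocks)
        if key in self._entries:
            return True
        self._entries.add(key)
        return False


cache = ToyPrefixCache()
prefix = ["system prompt", "tool definitions", "conversation history"]
print(cache.lookup_or_store(prefix))  # False: first request writes the cache
print(cache.lookup_or_store(prefix))  # True: identical prefix is a hit
print(cache.lookup_or_store(["system prompt ", "tool definitions"]))  # False: prefix changed
```

Only the content after the breakpoint would be processed from scratch on a hit; the toy model leaves that part out.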

Anthropic Claude Implementation

Claude offers two caching modes: automatic and explicit breakpoints.

Automatic Caching (Recommended)

The simplest approach — add a single cache_control field at the top level:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # Automatic caching - system handles breakpoint placement
    cache_control={"type": "ephemeral"},
    system="""You are a technical documentation assistant.
    Your role is to explain complex topics clearly.
    Always include code examples when relevant.
    Format responses in markdown.""",
    messages=[
        {"role": "user", "content": "Explain prompt caching"}
    ]
)

# Check cache performance
print(f"Cache read: {response.usage.cache_read_input_tokens}")
print(f"Cache write: {response.usage.cache_creation_input_tokens}")

Explicit Cache Breakpoints

For fine-grained control, place cache_control on specific content blocks:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant...",
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint here
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Review this code:\n```python\n...",
                    "cache_control": {"type": "ephemeral"}  # Another breakpoint
                }
            ]
        }
    ]
)

Claude Caching Thresholds

Minimum Tokens (varies by model)	Cache Lifetime	Best For
1,024 tokens	5 minutes (ephemeral)	Multi-turn conversations
1,024 tokens	1 hour (extended)	Long-running agents
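To see how cache lifetime interacts with traffic patterns, here is a small simulation. It is a sketch under two assumptions: the first request writes the cache, and every later hit restarts the lifetime clock. The function name and the sample timestamps are made up for illustration.

```python
def count_cache_hits(request_times: list[float], ttl_seconds: float) -> int:
    """Count requests that arrive while the cached prefix is still alive.

    request_times are seconds since some starting point, in any order.
    """
    hits = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is not None and t <= expires_at:
            hits += 1
        # A hit refreshes the entry and a miss rewrites it, so the clock restarts either way
        expires_at = t + ttl_seconds
    return hits


# Hypothetical arrival times with a long gap in the middle and at the end
times = [0, 120, 600, 660, 4000]
print(count_cache_hits(times, ttl_seconds=300))   # 5-minute lifetime -> 2
print(count_cache_hits(times, ttl_seconds=3600))  # 1-hour lifetime -> 4
```

Bursty or slow traffic is where the longer lifetime helps most; steady traffic with gaps under five minutes gets the same hit count from either setting.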

OpenAI Implementation

OpenAI's prompt caching is automatic for supported models. No code changes required — just structure your prompts correctly.

Supported Models

gpt-4o and gpt-4o-mini
gpt-4o-realtime-preview
gpt-4-turbo (limited support)

Best Practices for OpenAI Caching

from openai import OpenAI

client = OpenAI()

# Structure prompts with static content FIRST
SYSTEM_PROMPT = """You are a helpful assistant with the following rules:
1. Be concise
2. Use examples
3. Format in markdown

[Long documentation or context here...]"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Static - will be cached
        {"role": "user", "content": "What is prompt caching?"}  # Dynamic
    ]
)

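# Check usage for cache indicators
print(response.usage)
# cached_tokens reports how many prompt tokens were served from the cache
print(response.usage.prompt_tokens_details.cached_tokens)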

Key Requirements for Cache Hits

Prefix must match exactly — same text, same order
Minimum 1,024 tokens before the dynamic portion
Same model — cache is model-specific
Same tools/images — these are part of the prefix hash
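A quick way to check the exact-prefix requirement locally is to measure how much two prompts have in common. The sketch below compares characters, which is only a proxy, since providers match on tokens. The helper name and sample prompts are hypothetical.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Return the number of leading characters two prompts have in common."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


STATIC_RULES = "You are a helpful assistant. Follow the style guide.\n"

# Dynamic value placed first: the prompts diverge almost immediately
bad_1 = "Request ID: 1001\n" + STATIC_RULES
bad_2 = "Request ID: 2002\n" + STATIC_RULES

# Static content first: the whole rules block is shared
good_1 = STATIC_RULES + "Request ID: 1001\n"
good_2 = STATIC_RULES + "Request ID: 2002\n"

print(shared_prefix_length(bad_1, bad_2))    # only the "Request ID: " label matches
print(shared_prefix_length(good_1, good_2))  # the full static block matches
```

If the shared prefix between two consecutive requests is much shorter than expected, something dynamic (a timestamp, a user name, a reordered tool) has probably crept into the static section.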

Claude vs OpenAI Prompt Caching

Anthropic Claude

✅ 90% cost reduction (higher savings)
✅ Explicit control via cache_control
✅ Automatic mode available
✅ Extended cache lifetime (1 hour)
❌ Requires code changes

OpenAI

✅ 75% cost reduction
✅ Fully automatic (no code change)
✅ Simple prefix-matching
❌ Less control over caching
❌ Shorter cache lifetime (~5-10 min of inactivity)

Practical Implementation Patterns

Pattern 1: Multi-Turn Conversation

# Store conversation history and cache it
conversation_history = []

def chat(user_message: str):
    conversation_history.append({
        "role": "user", 
        "content": user_message
    })
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},
        system=SYSTEM_PROMPT,
        messages=conversation_history
    )
    
    conversation_history.append({
        "role": "assistant",
        "content": response.content[0].text
    })
    
    return response

Pattern 2: RAG with Document Caching

def rag_query(query: str, documents: list[str]):
    # Documents are cached, query is dynamic
    doc_text = "\n\n".join(documents)
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": f"Reference documents:\n{doc_text}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "Answer questions based only on the reference documents."
            }
        ],
        messages=[{"role": "user", "content": query}]
    )
    return response
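Document caching in a RAG setup only produces hits when the same documents arrive as the same text in the same order. One possible way to keep the prefix stable is to build the reference block deterministically. The helper below is a hypothetical sketch; it assumes documents are keyed by an ID, which the pattern above does not require.

```python
def build_reference_block(documents: dict[str, str]) -> str:
    """Join documents in a stable order so the same set always yields the same text."""
    parts = []
    for doc_id in sorted(documents):
        # Strip trailing whitespace so cosmetic differences do not change the prefix
        parts.append(f"[{doc_id}]\n{documents[doc_id].rstrip()}")
    return "Reference documents:\n" + "\n\n".join(parts)


# The same two documents, retrieved in a different order and with stray whitespace
first = build_reference_block({"b": "Refund policy text\n", "a": "Shipping policy text"})
second = build_reference_block({"a": "Shipping policy text", "b": "Refund policy text"})
print(first == second)  # True: both requests would share an identical prefix
```

The resulting string can be used as the text of the cached system block. If the retrieved set changes on most queries, caching the documents will rarely hit, and the write premium may outweigh the benefit.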

Pattern 3: Agent with Tool Definitions

TOOLS = [
    {
        "name": "search",
        "description": "Search for information",
        "input_schema": {...}  # Large schema
    },
    # More tools...
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are a helpful agent.",
    tools=TOOLS,  # Tool definitions are cached
    messages=[{"role": "user", "content": "..."}]
)

Common Mistakes to Avoid

Placing dynamic content early — breaks cache matching. Put user-specific data at the END of prompts.
Changing prompt formatting — even whitespace changes break the cache. Use templates.
Not checking cache hit rate — monitor cache_read_input_tokens to verify caching is working.
Ignoring cache write costs — the first request pays a "cache write" premium (~25% on top of the normal input token price for the 5-minute cache). This pays off from the second request that hits the cache.
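The break-even point for the write premium can be checked with simple arithmetic. The sketch below assumes a 1.25x write multiplier and a 0.10x read multiplier, matching the approximate figures quoted in this guide, and assumes the cache never expires between requests. The function name is illustrative.

```python
def relative_cost(
    num_requests: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) in units of one uncached prefix."""
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    without_cache = float(num_requests)
    # The first request writes the cache; every later request reads from it
    with_cache = write_multiplier + read_multiplier * (num_requests - 1)
    return without_cache, with_cache


for n in (1, 2, 5):
    plain, cached = relative_cost(n)
    print(f"{n} request(s): {plain:.2f} uncached vs {cached:.2f} cached")
```

Under these assumptions a single request costs more with caching, while two or more requests against the same prefix cost less.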

Monitoring Cache Performance

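def analyze_cache_performance(responses: list, input_price_per_mtok: float = 3.00):
    # input_price_per_mtok is an assumed base input price (USD per million tokens);
    # replace it with your model's actual rate.
    total_uncached = 0
    total_cache_read = 0
    total_cache_write = 0

    for r in responses:
        usage = r.usage
        # input_tokens excludes tokens read from or written to the cache
        total_uncached += usage.input_tokens
        total_cache_read += getattr(usage, 'cache_read_input_tokens', 0) or 0
        total_cache_write += getattr(usage, 'cache_creation_input_tokens', 0) or 0

    total_input = total_uncached + total_cache_read + total_cache_write
    hit_rate = total_cache_read / total_input if total_input > 0 else 0

    # Reads bill at ~10% of the base price; 5-minute writes at ~125%
    net_saved_tokens = 0.90 * total_cache_read - 0.25 * total_cache_write
    estimated_savings = net_saved_tokens * input_price_per_mtok / 1_000_000

    print(f"Total input tokens: {total_input:,}")
    print(f"Cache read tokens: {total_cache_read:,}")
    print(f"Cache write tokens: {total_cache_write:,}")
    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"Estimated savings: ${estimated_savings:.2f}")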

Conclusion

Prompt caching is a must-have optimization for any production LLM application. Key takeaways:

Start with automatic caching — Claude's one-line cache_control is the easiest path
Structure prompts correctly — static content first, dynamic content last
Monitor cache hit rates — aim for 80%+ in multi-turn scenarios
Calculate ROI — cache write costs pay off after 2+ similar requests

Bottom Line

For Claude: Add cache_control={"type": "ephemeral"} to your requests. For OpenAI: Structure prompts with static prefixes. Both can reduce the cost of cached input tokens by up to 75-90% with minimal code changes.
