Tutorial Cost Optimization

LLM Prompt Caching Best Practices 2026: Reduce API Costs by 90%

Last updated: May 7, 2026 · 12 min read

Prompt caching is the single most effective cost optimization technique for LLM API usage. This guide shows you exactly how to implement it for both OpenAI and Anthropic Claude APIs, with real code examples and benchmarks.

What is Prompt Caching?

Prompt caching allows LLM providers to reuse computations from previous API calls when prompts share common prefixes. Instead of reprocessing the same system instructions, documentation, or context on every request, the server caches the intermediate key-value (KV) states and reuses them.

This technique is especially valuable for:

  • Multi-turn conversations — chat history stays cached
  • Agent applications — system prompts and tool definitions are reused
  • RAG systems — retrieved documents can be cached
  • Code review tools — repository context remains stable

Cost Savings Benchmark

Here's what the major providers offer:

Provider Cost Reduction Latency Reduction Cache Hit Pricing
Anthropic Claude Up to 90% Up to 85% ~10% of input token price
OpenAI Up to 75% Up to 80% ~25% of input token price
Real-world example: A customer service bot with a 5,000-token system prompt that handles 10,000 requests per day. Without caching: ~$50/day on input tokens. With 90% cache hit rate: ~$5/day. That's $1,350/month in savings.

How Prompt Caching Works

The caching mechanism is based on prefix matching:

  1. Your request includes a cache_control marker at a specific position
  2. The server computes a hash of everything before that marker
  3. If a matching cache entry exists, it reuses the KV states
  4. Only the new content after the cache breakpoint is processed

┌─────────────────────────────────────┐

│ System Prompt (cached) │

│ + Tool Definitions (cached) │

│ + Conversation History (cached) │

│ ───────── cache breakpoint ─────── │

│ + New User Message (processed) │

└─────────────────────────────────────┘

Anthropic Claude Implementation

Claude offers two caching modes: automatic and explicit breakpoints.

Automatic Caching (Recommended)

The simplest approach — add a single cache_control field at the top level:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    # Automatic caching - system handles breakpoint placement
    cache_control={"type": "ephemeral"},
    system="""You are a technical documentation assistant.
    Your role is to explain complex topics clearly.
    Always include code examples when relevant.
    Format responses in markdown.""",
    messages=[
        {"role": "user", "content": "Explain prompt caching"}
    ]
)

# Check cache performance
print(f"Cache read: {response.usage.cache_read_input_tokens}")
print(f"Cache write: {response.usage.cache_write_input_tokens}")

Explicit Cache Breakpoints

For fine-grained control, place cache_control on specific content blocks:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code review assistant...",
            "cache_control": {"type": "ephemeral"}  # Cache breakpoint here
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Review this code:\n```python\n...",
                    "cache_control": {"type": "ephemeral"}  # Another breakpoint
                }
            ]
        }
    ]
)

Claude Caching Thresholds

Minimum Tokens Cache Lifetime Best For
1,024 tokens 5 minutes (ephemeral) Multi-turn conversations
1,024 tokens 1 hour (extended) Long-running agents

OpenAI Implementation

OpenAI's prompt caching is automatic for supported models. No code changes required — just structure your prompts correctly.

Supported Models

  • gpt-4o and gpt-4o-mini
  • gpt-4o-realtime-preview
  • gpt-4-turbo (limited support)

Best Practices for OpenAI Caching

from openai import OpenAI

client = OpenAI()

# Structure prompts with static content FIRST
SYSTEM_PROMPT = """You are a helpful assistant with the following rules:
1. Be concise
2. Use examples
3. Format in markdown

[Long documentation or context here...]"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},  # Static - will be cached
        {"role": "user", "content": "What is prompt caching?"}  # Dynamic
    ]
)

# Check usage for cache indicators
print(response.usage)

Key Requirements for Cache Hits

  1. Prefix must match exactly — same text, same order
  2. Minimum 1,024 tokens before the dynamic portion
  3. Same model — cache is model-specific
  4. Same tools/images — these are part of the prefix hash

Claude vs OpenAI Prompt Caching

Anthropic Claude

  • ✅ 90% cost reduction (higher savings)
  • ✅ Explicit control via cache_control
  • ✅ Automatic mode available
  • ✅ Extended cache lifetime (1 hour)
  • ❌ Requires code changes

OpenAI

  • ✅ 75% cost reduction
  • ✅ Fully automatic (no code change)
  • ✅ Simple prefix-matching
  • ❌ Less control over caching
  • ❌ Shorter cache lifetime (~5 min)

Practical Implementation Patterns

Pattern 1: Multi-Turn Conversation

# Store conversation history and cache it
conversation_history = []

def chat(user_message: str):
    conversation_history.append({
        "role": "user", 
        "content": user_message
    })
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        cache_control={"type": "ephemeral"},
        system=SYSTEM_PROMPT,
        messages=conversation_history
    )
    
    conversation_history.append({
        "role": "assistant",
        "content": response.content[0].text
    })
    
    return response

Pattern 2: RAG with Document Caching

def rag_query(query: str, documents: list[str]):
    # Documents are cached, query is dynamic
    doc_text = "\n\n".join(documents)
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": f"Reference documents:\n{doc_text}",
                "cache_control": {"type": "ephemeral"}
            },
            {
                "type": "text",
                "text": "Answer questions based only on the reference documents."
            }
        ],
        messages=[{"role": "user", "content": query}]
    )
    return response

Pattern 3: Agent with Tool Definitions

TOOLS = [
    {
        "name": "search",
        "description": "Search for information",
        "input_schema": {...}  # Large schema
    },
    # More tools...
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    cache_control={"type": "ephemeral"},
    system="You are a helpful agent.",
    tools=TOOLS,  # Tool definitions are cached
    messages=[{"role": "user", "content": "..."}]
)

Common Mistakes to Avoid

  • Placing dynamic content early — breaks cache matching. Put user-specific data at the END of prompts.
  • Changing prompt formatting — even whitespace changes break the cache. Use templates.
  • Not checking cache hit rate — monitor cache_read_input_tokens to verify caching is working.
  • Cache write costs — the first request pays a "cache write" premium (~25% extra input tokens). This pays off after 2+ requests.

Monitoring Cache Performance

def analyze_cache_performance(responses: list):
    total_input = 0
    total_cache_read = 0
    total_cache_write = 0
    
    for r in responses:
        usage = r.usage
        total_input += usage.input_tokens
        total_cache_read += getattr(usage, 'cache_read_input_tokens', 0)
        total_cache_write += getattr(usage, 'cache_write_input_tokens', 0)
    
    hit_rate = total_cache_read / total_input if total_input > 0 else 0
    
    print(f"Total input tokens: {total_input:,}")
    print(f"Cache read tokens: {total_cache_read:,}")
    print(f"Cache write tokens: {total_cache_write:,}")
    print(f"Cache hit rate: {hit_rate:.1%}")
    print(f"Estimated savings: ${calculate_savings(hit_rate):.2f}")

Conclusion

Prompt caching is a must-have optimization for any production LLM application. Key takeaways:

  1. Start with automatic caching — Claude's one-line cache_control is the easiest path
  2. Structure prompts correctly — static content first, dynamic content last
  3. Monitor cache hit rates — aim for 80%+ in multi-turn scenarios
  4. Calculate ROI — cache write costs pay off after 2+ similar requests

Bottom Line

For Claude: Add cache_control={"type": "ephemeral"} to your requests. For OpenAI: Structure prompts with static prefixes. Both can reduce your API costs by 75-90% with minimal code changes.

Further Reading

Related Articles