TANAY.SHAH
// PUBLISHED 2026-05-10 · 7 MIN READ

What I Learned About Anthropic's Prompt Cache From Running an Agent Loop in Production

Prompt caching is sold as a 90% cost cut. In production agent loops it can quietly become a 30% cost increase, depending on five things the docs do not put in big letters. Here are the patterns that made the math actually work for me, including the on-demand tool loading trick that keeps the cache alive.

When you read the Anthropic prompt-caching docs you come away with a number: 90% cost reduction. When you ship caching into a production agent loop and look at the bill 30 days later, the number is often 0%, sometimes negative. The docs are not lying. The math is fragile in five specific ways that aren't obvious until you've already paid the lesson. This is the post I would have wanted six months ago.

The math, with numbers

Cache writes cost 1.25x the base input rate (5-minute TTL) or 2x (1-hour TTL). Cache reads cost 0.1x the base input rate. A write therefore carries a 0.25x premium over an uncached call (1x extra on the 1-hour tier), and every read saves 0.9x versus paying full input price for the same prefix. The breakeven is roughly 0.3 reads per write on the 5-minute tier (0.25 / 0.9) and roughly 1.1 reads per write on the 1-hour tier (1.0 / 0.9). Per request those sound trivially easy to clear; the trap is that across a whole workload most writes may never get read at all. If your aggregate call pattern doesn't hit those ratios, prompt caching makes you slightly worse off, not slightly better off.
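
A minimal sketch of that arithmetic, with the prefix size normalized to 1 (the multipliers are the published prices; the breakeven ratios fall out directly):

# Break-even arithmetic for prompt caching, prefix size normalized to 1 token.
# A write's "premium" is what it costs beyond an uncached call; each read
# then saves 0.9x versus paying full input price for the same prefix.
WRITE_5M, WRITE_1H, READ, UNCACHED = 1.25, 2.00, 0.10, 1.00

def breakeven_reads_per_write(write_multiplier: float) -> float:
    premium = write_multiplier - UNCACHED       # extra cost of the cache write
    saving_per_read = UNCACHED - READ           # 0.9 saved per cache hit
    return premium / saving_per_read

print(breakeven_reads_per_write(WRITE_5M))      # ~0.28 reads per write, 5-minute TTL
print(breakeven_reads_per_write(WRITE_1H))      # ~1.11 reads per write, 1-hour TTL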

The first thing I tell anyone shipping caching: instrument the hit rate before shipping the feature, not after. Anthropic returns usage.cache_read_input_tokens and usage.cache_creation_input_tokens on every response. Log them, divide. If the read-to-write ratio is below ~0.3 on the 5-minute tier or ~1.1 on the 1-hour tier, your cache is a net loss.
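
In the Python SDK that logging is a few lines. A sketch assuming the current anthropic client, where every response carries a usage object with cache_creation_input_tokens and cache_read_input_tokens fields:

import anthropic

# Running totals; in production these would feed your metrics pipeline instead.
cache_read_tokens = 0
cache_write_tokens = 0

def record_usage(response: "anthropic.types.Message") -> None:
    global cache_read_tokens, cache_write_tokens
    cache_read_tokens += response.usage.cache_read_input_tokens or 0
    cache_write_tokens += response.usage.cache_creation_input_tokens or 0

def read_write_ratio() -> float:
    return cache_read_tokens / cache_write_tokens if cache_write_tokens else 0.0

# Usage (request details elided):
#   client = anthropic.Anthropic()
#   response = client.messages.create(model=..., max_tokens=..., messages=[...])
#   record_usage(response)
#   Alert if read_write_ratio() falls below your tier's breakeven (~0.3 or ~1.1).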

The prefix hierarchy is unforgiving

Cache prefixes layer top-down: tools → system → messages. A change at one level invalidates everything after it. If you reorder a tool, everything downstream of that tool in the prefix is a cache miss. If you add a single character to the system prompt, the system prompt and every message after it is a cache miss. The cache is byte-for-byte exact match, not semantic match.

┌────────────────────────────────────────────────┐
│  TOOLS    [stable, rarely change]              │  ← cache_control here
├────────────────────────────────────────────────┤
│  SYSTEM   [stable, semi-rare changes]          │  ← or here
├────────────────────────────────────────────────┤
│  MESSAGES                                      │
│   ├── prior turn 1 (cached)                    │  ← or here
│   ├── prior turn 2 (cached)                    │
│   └── current turn (uncached, current input)   │
└────────────────────────────────────────────────┘
       ▲
       └── any change at level N invalidates N
           and every level below it
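
In request terms those layers map to cache_control breakpoints. A minimal sketch of the layout (the model name, tool bodies, and prompt text are placeholders):

# One breakpoint per layer. Everything above a breakpoint is cache-eligible;
# a change at any layer invalidates that layer's breakpoint and every one below it.
request = {
    "model": "claude-sonnet-4-5",   # placeholder model name
    "max_tokens": 1024,
    "tools": [
        # ... the stable tool catalog, in a fixed order ...
        {"name": "status", "description": "...", "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},          # breakpoint on the last tool
    ],
    "system": [
        {"type": "text", "text": "stable system prompt",
         "cache_control": {"type": "ephemeral"}},          # breakpoint on the system prompt
    ],
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "prior turn 1"}]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "prior turn 2",
             "cache_control": {"type": "ephemeral"}}]},    # breakpoint on the last prior turn
        {"role": "user", "content": [{"type": "text", "text": "current turn"}]},
    ],
}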

Five gotchas that silently kill ROI

  • Timestamps in the system prompt. Common bug: "You are an assistant. The current time is 2026-05-10T14:32:11Z" at the top of the system prompt. Every request changes the prefix. Every request is a cache miss. The fix: move time context into the user turn (which is already non-cached) or quantize the timestamp to the cache TTL granularity, so it changes once every 5 minutes rather than on every call; see the quantization sketch after this list.
  • Adding a tool mid-session. The naive way to add a new tool when the agent discovers it needs one is to extend the tool list. This invalidates the entire prefix. The non-naive way is to keep the static tool catalog the same and use the deferred-tool / tool_reference pattern (or MCP tool search), which appends the new tool definition into the message stream rather than the cached prefix. The cache survives.
  • Switching models mid-conversation. Different models have different tokenizers. A cache built for Sonnet 4.6 is not portable to Opus 4.7. If your agent fails over between models on rate-limit hits, every fallback is a fresh cache write at the new model's tier. For workloads with frequent fallback, pin the cache key to a model and treat the fallback model's calls as separate (cold) traffic.
  • Schema churn during iteration. While you're tuning the JSON output schema or the tool definitions, every edit invalidates every cache write you made before the edit. The cost of "just one more iteration" is real. Plan stabilization sweeps where you batch schema changes together and avoid touching the cached layers between them.
  • Workspace boundaries. Since February 2026, prompt caching uses workspace-level isolation, not org-level. If you have separate dev and prod workspaces sharing the same model and prefix, they don't share cache. This bites teams who debug in dev and assume the prod cache is warm: it isn't.
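
The timestamp fix from the first gotcha, as a sketch: round the clock down to the TTL so the prefix only changes when the cache would have expired anyway.

from datetime import datetime, timezone

def quantized_now(ttl_seconds: int = 300) -> str:
    """Round the current UTC time down to the cache TTL, so the rendered string
    changes once per TTL window instead of on every request."""
    epoch = int(datetime.now(timezone.utc).timestamp()) // ttl_seconds * ttl_seconds
    return datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()

# "The current time is 2026-05-10T14:30:00+00:00" stays stable for 5 minutes:
# at most one cache write per window instead of one per call. Better still,
# put this line in the user turn, which sits outside the cached prefix anyway.
time_context = f"The current time is {quantized_now()}"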

The on-demand tool loading pattern

The hardest tension to resolve: agent loops want a long tool catalog (more tools = more capability) and prompt caching wants a stable tool catalog (changes invalidate the prefix). The resolution I shipped is on-demand tool loading layered on top of a small core catalog.

The cached prefix has six tools in it: read, write, exec, list, search, status. Those six are stable across every conversation. When the agent discovers it needs a more specific tool (say, a domain-specific schema validator or a vendor-specific API client), it calls a find_tool tool from the core catalog, which returns a tool_reference block that gets appended to the message stream. The reference block is structured so the model can call the new tool without it being in the cached prefix. Cache stays warm; capability stays expansive.
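
A sketch of the shape of that loop. The find_tool definition, the registry, and the way the looked-up definition is fed back are illustrative (this is my pattern, not a documented Anthropic API); the point is that the lookup result lands in the message stream, below every cache breakpoint:

# Hypothetical registry of specialized tools, never included in the cached prefix.
SPECIALIZED_TOOLS = {
    "schema_validator": {
        "name": "schema_validator",
        "description": "Validate a JSON document against a named schema.",
        "input_schema": {"type": "object",
                         "properties": {"doc": {"type": "string"},
                                        "schema": {"type": "string"}}},
    },
    # ... dozens more ...
}

# The one extra tool that does live in the cached six-tool core catalog.
FIND_TOOL = {
    "name": "find_tool",
    "description": "Look up a specialized tool by capability and load it for this conversation.",
    "input_schema": {"type": "object",
                     "properties": {"query": {"type": "string"}},
                     "required": ["query"]},
}

def handle_find_tool(query: str) -> dict:
    """Build the tool_result content for a find_tool call.

    The matched definition rides back inside the message stream (uncached
    territory), so the cached prefix above it stays byte-identical and warm.
    The agent harness then honors calls to the loaded tool on later turns."""
    q = query.lower()
    match = next((tool for name, tool in SPECIALIZED_TOOLS.items()
                  if q in name or q in tool["description"].lower()), None)
    if match is None:
        return {"type": "text", "text": "No matching tool found."}
    return {"type": "text", "text": f"Loaded tool definition: {match}"}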

The cost trade-off: each tool reference adds ~150-300 tokens to the message stream the first time it's used. For an agent that uses 4-6 specialized tools per conversation, the marginal cost is small. The marginal saving (cache hit on the 6-tool prefix instead of cache miss on a 60-tool catalog) is large. The math comes out positive almost regardless of conversation length.
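
Back-of-envelope numbers for that trade-off, with illustrative sizes (the per-tool token counts and catalog sizes below are assumptions, not measurements):

# Assumed sizes: ~500 tokens per tool definition.
CORE_PREFIX  = 6 * 500     # six-tool core catalog, cached
FULL_CATALOG = 60 * 500    # sixty-tool catalog, re-written whenever the list changes
REFERENCE    = 250         # tokens per on-demand tool reference, paid once at full price

turns = 10
core = CORE_PREFIX * 1.25 + (turns - 1) * CORE_PREFIX * 0.10 + 5 * REFERENCE   # ~7,700
full_warm  = FULL_CATALOG * 1.25 + (turns - 1) * FULL_CATALOG * 0.10           # ~64,500
full_churn = turns * FULL_CATALOG * 1.25                                       # 375,000 if the catalog changes every turn

print(core, full_warm, full_churn)   # token-equivalents billed at the base input rate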

TTL choice: 5 minutes or 1 hour, and why it matters in 2026

Anthropic silently shifted the default cache TTL from 60 minutes to 5 minutes in March 2026. For most chat workloads the change is invisible. For agent workloads with multi-step tool calls and human-in-the-loop pauses, the change is a 30-60% cost increase that nobody warned you about.

The right default in 2026 is: explicit TTL on every cache_control block. Use 5 minutes for active conversations where the user is typing and the cache stays warm naturally. Use 1 hour for agent runs that pause for review, eval harnesses where the same prefix is hit repeatedly across 100+ tasks in a wave, and batch jobs that fan out the same system prompt across many customer records. The 1-hour write premium (2x) is recovered by the second read within the hour, which is a cheap threshold to clear in any batch context.

The constraint to watch: if you mix TTLs in a single request, the longer-TTL block must appear before the shorter-TTL block. So in practice the hierarchy is: tools (1h cache_control), system (1h cache_control), messages (5m cache_control on the most recent turn or two). That layout was not obvious to me from the docs.
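
In cache_control terms, with an explicit ttl field on each breakpoint (a sketch assuming the extended-TTL cache_control shape; confirm the field names against your SDK version):

# Longest TTL first, shortest last - same layering as the diagram above.
tools_breakpoint   = {"cache_control": {"type": "ephemeral", "ttl": "1h"}}  # on the last tool
system_breakpoint  = {"cache_control": {"type": "ephemeral", "ttl": "1h"}}  # on the system prompt
message_breakpoint = {"cache_control": {"type": "ephemeral", "ttl": "5m"}}  # on the most recent prior turn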

What I'd change if I rebuilt the caching layer from scratch

  • Treat hit rate as a product metric, not a vanity metric. Display it in the same dashboard as latency and error rate. Anything below 70% on the cache-read tier deserves an investigation, the same way a 70% test pass rate would.
  • Pin every cache breakpoint to a specific TTL explicitly. Auto mode is a smoke test. In production, you want every cache_control block to declare its intended residency.
  • Version-pin the tool catalog and the system prompt template. Treat any change as a deploy event with explicit rollback-on-regression criteria. Cache is sensitive enough that a one-character change in your prompt template can move 10K USD on the next bill.
  • Audit timestamps and dynamic context in the entire prompt tree on a recurring basis. The natural drift in any system is to add things to the system prompt for context. Each addition is a potential prefix-killer. Schedule a monthly review.

The bigger lesson

Prompt caching looks like a knob you turn on. It is closer to a discipline you adopt. The discipline is: make the prefix stable on purpose, instrument the hit rate, choose the TTL deliberately, and audit the prompt tree for drift. Teams that adopt the discipline see the 60-90% cost reduction the docs describe. Teams that just turn the feature on often pay slightly more than they did before, learn nothing about why, and quietly turn it off.

If you're an engineer interviewing AI-infrastructure candidates, this is one of the questions I think actually separates the people who have shipped production agent loops from the people who have shipped demos: ask how they monitor prompt-cache hit rate and what they do when it drops. The answer reveals more than any other single LLM-cost question I know.

References

  • Anthropic — "Prompt caching" official docs (platform.claude.com/docs)
  • Anthropic — "Tool use with prompt caching" (platform.claude.com/docs/agents-and-tools)
  • GitHub — Claude Code issue tracking the March 2026 TTL regression (anthropics/claude-code#46829)
  • AICheckerHub — "Anthropic Prompt Caching in 2026: Cost, TTL, and Latency Planning"
  • mager.co — "Claude: How prompt caching actually works" (April 2026)
  • DEV Community — production case studies on cache hit-rate optimization