Cache-aware agent architecture, or why your loop is paying for the same context fifteen times

The pricing pages of the major frontier vendors now distinguish between cached and uncached input tokens, often at roughly a ten-to-one ratio. Most of the agent code we audit ignores this. The agents work. They just cost five to ten times more and run two to four times slower than they should, because every loop iteration re-reads the same system prompt, the same tool definitions, and the same long context as if it were the first turn.

Cache-aware architecture is one of the highest-leverage changes available to a team running agents in production right now. Most teams have not made it because the cache is invisible until you instrument it.

Why this is a 2026 problem and not a 2025 problem

Two things shifted. Cache TTLs got long enough to matter for real workflows — the practical window is now in the five-minute to one-hour range across the major vendors, depending on the tier. And agentic workflows that genuinely need that window — long-horizon agents, multi-turn tool-using loops, background workers — moved from demo to production.

Most of the production agent code in circulation was written before either of those was true. The cost line on the invoice is the visible symptom. The latency tail is the operational symptom. Neither shows up until you look.

What cache-aware actually means

Three concrete design rules.

First, the static prefix is sacred. The system prompt, the tool definitions, and the durable instructions should be byte-stable across every call in a session. We have watched teams break their own cache by injecting a timestamp into the system prompt "for debugging." That timestamp invalidates the cache on every call, so the whole prefix is billed at the uncached rate every time and the bill quietly multiplies.
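One way to make the first rule enforceable is a client-side guard that fingerprints the static prefix and refuses to proceed if it drifts mid-session. The sketch below is a minimal illustration, not any vendor's API: build_static_prefix, prefix_fingerprint, and PrefixGuard are hypothetical names, and it assumes the prefix can be serialized as JSON. It only checks the bytes you send; how a given vendor actually keys its cache is something to verify separately.

```python
import hashlib
import json


def build_static_prefix(system_prompt: str, tools: list) -> bytes:
    """Serialize the stable part of a request deterministically (hypothetical helper)."""
    payload = {"system": system_prompt, "tools": tools}
    # sort_keys and fixed separators keep the bytes identical even if
    # dict key order differs between calls. List order is preserved as-is.
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def prefix_fingerprint(prefix: bytes) -> str:
    """SHA-256 hex digest of the serialized prefix."""
    return hashlib.sha256(prefix).hexdigest()


class PrefixGuard:
    """Remembers the first prefix seen in a session and rejects any change."""

    def __init__(self) -> None:
        self._expected = None

    def check(self, prefix: bytes) -> str:
        fingerprint = prefix_fingerprint(prefix)
        if self._expected is None:
            self._expected = fingerprint
        elif fingerprint != self._expected:
            raise ValueError("static prefix changed mid-session; expect cache misses")
        return fingerprint
```

Calling guard.check(prefix) before every request turns the timestamp mistake into a loud failure in development instead of a quiet line item on the invoice.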

Second, the conversation grows by append, not by edit. If the agent's history is rewritten mid-session — summarized, reordered, prefixed with new instructions — the cache for everything after the edit point is gone. Compaction has to happen at planned boundaries, not opportunistically.
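A history container can encode the second rule directly: the only mutations it allows are appending a message and compacting at an explicit boundary. This is a sketch under assumptions, with hypothetical names throughout (Message, AppendOnlyHistory, compact_every, epoch), and a message-count boundary chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Message:
    """Immutable message, so earlier turns cannot be edited in place."""
    role: str
    content: str


class AppendOnlyHistory:
    """Conversation history that grows by append and compacts only on request."""

    def __init__(self, compact_every: int = 50) -> None:
        self._messages = []
        self._compact_every = compact_every
        self.epoch = 0  # incremented each time the history is compacted

    def append(self, role: str, content: str) -> None:
        self._messages.append(Message(role, content))

    @property
    def messages(self) -> tuple:
        # Return a tuple so callers cannot reorder or rewrite the list.
        return tuple(self._messages)

    def at_boundary(self) -> bool:
        """True once the history is long enough that compaction is planned."""
        return len(self._messages) >= self._compact_every

    def compact(self, summarize: Callable[[Sequence[Message]], str]) -> None:
        """Replace the history with one summary message and start a new epoch."""
        summary = summarize(self._messages)
        self._messages = [Message("user", summary)]
        self.epoch += 1
```

The epoch counter makes each deliberate cache reset visible in logs, which helps when a drop in hit rate needs to be explained later.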

Third, the agent loop should be designed around the TTL, not against it. If the cache window is five minutes, a sleep-and-poll pattern that wakes up every six minutes pays the cache miss on every iteration. One that wakes up every four minutes does not, provided each hit refreshes the window, which is worth confirming for your vendor. The difference is one line of configuration and roughly half the bill.
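The "one line of configuration" can be derived from the TTL rather than hard-coded. The sketch below assumes the cache window is refreshed on every hit, which should be checked against the vendor's documentation. The function names and the 60-second margin are illustrative choices, not recommendations.

```python
def poll_interval_s(cache_ttl_s: int, margin_s: int = 60, floor_s: int = 1) -> int:
    """Pick a wake-up interval that lands inside the cache window.

    margin_s leaves headroom for request latency and scheduler jitter.
    """
    if cache_ttl_s <= 0:
        raise ValueError("cache_ttl_s must be positive")
    if margin_s < 0:
        raise ValueError("margin_s must not be negative")
    return max(floor_s, cache_ttl_s - margin_s)


def stays_warm(interval_s: int, cache_ttl_s: int) -> bool:
    """True if consecutive wake-ups fall inside the TTL (assumes hits refresh it)."""
    return interval_s < cache_ttl_s
```

With a 300-second window and the default margin, poll_interval_s returns 240, the four-minute case described above; a hard-coded 360 would fail the stays_warm check.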

Figure: two agent loops compared. The naive loop re-sends the full context on every one of fifteen turns, paying for the same tokens fifteen times. The cache-aware loop pays for the stable context once and then bills only the small per-turn delta, collapsing cost and latency.
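The comparison can be put into rough numbers with a small cost model. Everything here is an assumption for illustration: a 20,000-token static prefix, 500 new tokens per turn, cached reads billed at one tenth of the uncached rate, and cache writes billed at the normal rate (some vendors charge a premium for writes, which the write_ratio parameter can represent). Costs are in units of one uncached input token, output tokens are ignored, and unlike the simplified picture above, cached reads are still billed at the discounted rate.

```python
def naive_cost(prefix: int, delta: int, turns: int) -> float:
    """Every turn re-reads the prefix plus all accumulated deltas at full price."""
    return float(sum(prefix + t * delta for t in range(1, turns + 1)))


def cached_cost(prefix: int, delta: int, turns: int,
                read_ratio: float = 0.1, write_ratio: float = 1.0) -> float:
    """Turn 1 writes the prefix; later turns read it back at the cached rate."""
    total = 0.0
    for t in range(1, turns + 1):
        if t == 1:
            cached_tokens, fresh_tokens = 0, prefix + delta
        else:
            # Everything sent on earlier turns is a cache read; only the
            # newest delta is processed (and written) fresh.
            cached_tokens, fresh_tokens = prefix + (t - 1) * delta, delta
        total += cached_tokens * read_ratio + fresh_tokens * write_ratio
    return total


if __name__ == "__main__":
    naive = naive_cost(20_000, 500, 15)
    cached = cached_cost(20_000, 500, 15)
    print(f"naive={naive:.0f} cached={cached:.0f} ratio={naive / cached:.1f}x")
```

With these made-up inputs the naive loop costs a little under six times as much as the cached one, which sits inside the five-to-ten-times range quoted earlier. The ratio grows with a larger prefix or more turns.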

The eval angle

The eval harness has to know about the cache. A green eval that runs in a fresh, uncached state shows that the agent works from a cold start. It does not show that the agent behaves the same way in production, where a warm cache changes latency and, with it, timeouts, retries, and the order in which tool results arrive. Vendors generally document caching as output-neutral, so any difference in sampling between cached and fresh reads on long contexts is something to measure rather than assume.

We now run the eval twice on every change: once cold, once warm. A regression in the warm run that is invisible in the cold run is a real production regression. Most teams will not catch it because their evals are stateless by default.
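A cold-and-warm comparison can be sketched as a harness that runs the same cases both ways and reports the disagreements. The names here (run_suite, cold_warm_diff, make_agent, warmup_prompt) are hypothetical, and the agent is modeled as a plain callable from prompt to answer. Note the main simplification: creating a fresh agent object stands in for a cold state, but it does not guarantee a cold vendor-side cache. A real harness would also need to wait out the TTL or vary the prefix for the cold run.

```python
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]
Case = Tuple[str, str]  # (prompt, expected answer)


def run_suite(agent: Agent, cases: List[Case]) -> Dict[str, bool]:
    """Run each case through the agent and record pass/fail by prompt."""
    return {prompt: agent(prompt) == expected for prompt, expected in cases}


def cold_warm_diff(make_agent: Callable[[], Agent], cases: List[Case],
                   warmup_prompt: str) -> Dict[str, List[str]]:
    """Compare a cold run (fresh agent per case) with a warm run (one primed agent)."""
    cold: Dict[str, bool] = {}
    for case in cases:
        cold.update(run_suite(make_agent(), [case]))

    warm_agent = make_agent()
    warm_agent(warmup_prompt)  # prime the session before measuring
    warm = run_suite(warm_agent, cases)

    return {
        "warm_only_failures": [p for p, _ in cases if cold[p] and not warm[p]],
        "cold_only_failures": [p for p, _ in cases if warm[p] and not cold[p]],
    }
```

Anything that appears under warm_only_failures is the kind of regression a stateless eval would not surface.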

What this looks like in the runbook

When the on-call gets paged for cost or latency spikes on an agent, the first question is no longer "did the model change?" — it is "did the cache hit rate change?" The runbook now includes a dashboard for cache hit rate per agent route, alerts for sustained cache miss windows, and a rollback path for prompt edits that broke the static prefix.
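The dashboard and alert reduce to two small computations: a hit rate per route, and a check for a sustained run of low samples. The sketch below assumes each call produces a usage record with cached and uncached input token counts. Vendors report these under different field names, so Usage and its fields are hypothetical, as are the threshold and window values.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Usage:
    """Per-call usage record (hypothetical shape)."""
    route: str
    cached_tokens: int
    uncached_tokens: int


def hit_rate_by_route(records: Iterable[Usage]) -> Dict[str, float]:
    """Token-weighted cache hit rate for each agent route."""
    cached: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for r in records:
        cached[r.route] += r.cached_tokens
        total[r.route] += r.cached_tokens + r.uncached_tokens
    return {route: (cached[route] / total[route] if total[route] else 0.0)
            for route in total}


def sustained_miss(rates: List[float], threshold: float = 0.5, window: int = 3) -> bool:
    """True if the most recent `window` samples are all below `threshold`."""
    if len(rates) < window:
        return False
    return all(rate < threshold for rate in rates[-window:])
```

Requiring several consecutive low samples keeps a single cold start, such as the first call after a deploy, from paging anyone.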

The frontier model is still the model. The cache is the system around it that decides whether the agent is operable at the scale you actually plan to run it.