Prompt Caching: How It Cuts LLM Cost & Latency
Every time your application sends a long system prompt, a dense set of few-shot examples, or a multi-page document to an LLM, the model processes those tokens from scratch — unless your provider has already seen them. Prompt caching is the mechanism that lets providers skip that re-processing: the processed prefix of your prompt is stored inside the inference cluster, so repeated context arrives cheaper and faster. Getting prompt caching right is one of the highest-leverage cost moves available to developers building on LLMs today.
This article covers what prompt caching is, how it differs from response caching, how to structure your prompts to maximize cache hits, how each major provider implements it, and how to track cached token savings in your cost accounting.
What is prompt caching?
When a model processes your prompt, it converts each token into a set of internal representations (key-value pairs in the attention mechanism). Normally those representations are computed fresh on every call. Prompt caching stores those KV representations server-side so that when you send the same prefix again, the provider can skip the computation and inject the pre-built representations directly into the inference pipeline.
The result has two effects you can measure:
- Lower cost on cached tokens. Providers charge a discounted rate for cached input tokens because they skipped the compute. The exact discount varies — verify current rates on each provider's pricing page.
- Lower time to first token (TTFT). Prefilling fewer tokens takes less time, so the model starts generating sooner. On requests with large, stable prefixes this reduction can be dramatic — hundreds of milliseconds to a second or more on long contexts.
Prompt caching is a provider-side optimization. It operates inside the model's inference stack, not at your application or gateway layer. That's the key distinction from response caching.
Prompt caching vs. response caching: what's the difference?
These two mechanisms are often confused but solve different problems.
| Property | Prompt caching | Response caching |
|---|---|---|
| Where it runs | Inside the provider's inference cluster | At the gateway / proxy layer |
| What is reused | Processed prefix (KV cache) of the prompt | The complete response from a prior call |
| Works with variable requests? | Yes — only the stable prefix is cached; the dynamic suffix still runs | No — the full request must be identical to hit the cache |
| Still calls the model? | Yes — the model still generates a fresh response | No — the stored response is returned directly |
| Safe for non-deterministic outputs? | Yes | Only at temperature = 0 or for idempotent tasks |
Prompt caching is safe to use broadly — you still get a fresh generation every time, just with cheaper and faster prefix processing. Response caching is more powerful (zero token cost, near-zero latency) but requires that the same answer is acceptable every time. For a deeper look at response caching, see the LLM response caching guide. The two optimizations are complementary: use prompt caching to reduce prefix cost on live calls, and layer response caching on top for fully static request/response pairs.
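To make the contrast concrete, here is a minimal sketch of a gateway-side response cache. The names `ResponseCache` and `call_model` and the request shape are hypothetical, and a real gateway would add TTLs and eviction. The key covers the entire request, so any change to the request misses, whereas a prompt cache would still reuse the shared prefix.

```python
import hashlib
import json
from typing import Callable


class ResponseCache:
    """Exact-match response cache keyed on the full request (illustrative only)."""

    def __init__(self) -> None:
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(request: dict) -> str:
        # Canonical JSON so that dictionary key order does not change the hash.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_call(self, request: dict, call_model: Callable[[dict], str]) -> str:
        key = self._key(request)
        if key in self._store:
            self.hits += 1
            return self._store[key]  # the model is not called at all
        self.misses += 1
        response = call_model(request)  # hypothetical function that hits the provider
        self._store[key] = response
        return response
```

Changing a single character of the user message produces a different key and a full model call, which is why this approach only pays off for fully static request/response pairs.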
How to structure prompts for maximum cache hits
The core rule is simple: stable content first, dynamic content last.
Providers cache a prefix of the prompt: a contiguous block starting from the beginning of the context window. If your system prompt is 2,000 tokens and appears at the top of every call, those 2,000 tokens can be cached and reused across every request. If you insert a dynamic user ID or timestamp anywhere inside that prefix, the prompt stops matching the cached prefix at that point, and everything from the inserted value onward misses the cache.
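The prefix rule can be illustrated with a toy model. Whitespace-separated words stand in for tokens here, and the function names are hypothetical. Real providers match on model tokens, often in fixed-size chunks, so treat this as an intuition aid rather than any provider's actual algorithm.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


SYSTEM = "You are a support assistant . Follow the policy below ."


def prompt_dynamic_first(user_id: str, question: str) -> list[str]:
    # Anti-pattern: a per-user value sits before the static instructions.
    return f"user_id={user_id} {SYSTEM} {question}".split()


def prompt_static_first(user_id: str, question: str) -> list[str]:
    # Static instructions lead; per-user values trail.
    return f"{SYSTEM} user_id={user_id} {question}".split()


question = "Where is my order ?"
bad = shared_prefix_len(prompt_dynamic_first("u1", question),
                        prompt_dynamic_first("u2", question))
good = shared_prefix_len(prompt_static_first("u1", question),
                         prompt_static_first("u2", question))
print(f"shared prefix, dynamic first: {bad} tokens; static first: {good} tokens")
```

With the user ID first, the two prompts diverge at the very first token and share nothing. With it last, the whole static block is shared.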
A practical ordering
- System prompt — instructions, persona, constraints, output format rules. Completely static. Put it first.
- Few-shot examples — labeled input/output pairs. Static. Put them immediately after the system prompt.
- Retrieval context / documents — if you inject retrieved documents that don't change per user, they belong in the stable prefix. If they're query-specific, they can still go in a "semi-stable" block — the cached prefix extends as far as the stable portion runs.
- Conversation history — earlier turns are stable relative to the current turn; put them before the current user message.
- Current user message — the dynamic tail. Always last.
An anti-pattern to avoid: injecting session-specific metadata (user name, account ID, timestamp) into the system prompt. This blows the cache for every user. Move that information to a separate user message at the end of the context or into a small suffix after the static system prompt.
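A minimal sketch of this ordering, assuming a generic chat-style message list. `build_messages` and its parameters are hypothetical names, not part of any SDK, and role conventions differ between providers, so adapt the shape to the API you call.

```python
def build_messages(
    system_prompt: str,
    few_shot: list[tuple[str, str]],
    shared_docs: list[str],
    history: list[dict],
    session_meta: dict,
    user_message: str,
) -> list[dict]:
    """Assemble a request with stable content first and dynamic content last."""
    # 1. Static system prompt.
    messages = [{"role": "system", "content": system_prompt}]
    # 2. Static few-shot examples.
    for example_input, example_output in few_shot:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # 3. Documents that are the same for every user.
    if shared_docs:
        messages.append({"role": "user", "content": "\n\n".join(shared_docs)})
    # 4. Earlier turns, which only ever grow by appending.
    messages.extend(history)
    # 5. Dynamic tail: session metadata travels with the current message,
    #    never inside the system prompt.
    meta = ", ".join(f"{k}={v}" for k, v in sorted(session_meta.items()))
    messages.append({"role": "user", "content": f"[{meta}]\n{user_message}"})
    return messages
```

Two requests from different users then differ only in the final message, so everything before it remains eligible for a cache hit.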
Minimum cacheable length
Providers impose a minimum prefix length before caching activates, often around 1,024 tokens, though the exact threshold varies by provider and model. Short system prompts may not qualify. Padding prompts artificially to hit the minimum is counterproductive; instead, check whether your use case naturally produces prefixes long enough to benefit.
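A quick way to sanity-check a prefix before relying on caching is sketched below. The four-characters-per-token ratio is only a rough rule of thumb for English text, the 1,024 default is the typical figure mentioned above rather than a guaranteed threshold, and both function names are hypothetical. Use your provider's tokenizer or token-counting endpoint for real numbers.

```python
MIN_CACHEABLE_TOKENS = 1024  # assumption: varies by provider and model


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate based on character count."""
    return int(len(text) / chars_per_token)


def likely_cacheable(prefix: str, minimum: int = MIN_CACHEABLE_TOKENS) -> bool:
    """True if the estimated prefix length reaches the assumed minimum."""
    return estimate_tokens(prefix) >= minimum
```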
How each major provider implements prompt caching
The mechanism and economics differ meaningfully across providers. Always verify current rates and behavior in each provider's documentation — numbers shift as models evolve.
Anthropic (Claude)
Anthropic uses explicit cache markers. You tag specific content blocks in your request with "cache_control": {"type": "ephemeral"} to tell Claude where the cacheable prefix ends. This gives you precise control. The default cache TTL is around five minutes and is refreshed each time the cached prefix is read. Cached tokens are billed at a meaningfully lower rate than standard input tokens; cache write operations carry a small surcharge.
{
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant with deep expertise in...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Summarize this document: ..."}
  ]
}
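If you assemble the request body in code, one way to apply the marker is sketched here. `with_cache_breakpoint` is a hypothetical helper and the sketch only builds a dictionary mirroring the JSON above; it does not call any SDK, and a real request also needs fields such as the model name.

```python
import copy


def with_cache_breakpoint(system_blocks: list[dict]) -> list[dict]:
    """Return a copy of the blocks with the last one marked as the end of the cacheable prefix."""
    if not system_blocks:
        return []
    blocks = copy.deepcopy(system_blocks)  # leave the caller's data untouched
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return blocks


request_body = {
    "system": with_cache_breakpoint([
        {"type": "text", "text": "You are a helpful assistant with deep expertise in..."},
    ]),
    "messages": [
        {"role": "user", "content": "Summarize this document: ..."},
    ],
}
```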
OpenAI (GPT-4o, o-series)
OpenAI applies prompt caching automatically, with no special markers needed. The API silently reuses cached prefixes when it detects a repeated context. You can observe cache hits in the response's usage object under prompt_tokens_details.cached_tokens. Cached prefixes are typically evicted after a few minutes of inactivity and may persist for up to about an hour during off-peak periods, so do not rely on a fixed TTL. The discount on cached tokens versus standard input tokens varies by model; check the pricing page for current figures.
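A small sketch for checking whether automatic caching is actually kicking in, assuming the usage object has been converted to a plain dictionary. `cached_fraction` is a hypothetical helper and the sample numbers are made up for illustration.

```python
def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, from an OpenAI-style usage payload."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


# Illustrative payload, not a real API response.
sample_usage = {
    "prompt_tokens": 2048,
    "completion_tokens": 100,
    "prompt_tokens_details": {"cached_tokens": 1536},
}
print(f"cached share of prompt: {cached_fraction(sample_usage):.0%}")
```

Logging this value per call makes it easy to spot a prompt change that quietly broke the shared prefix.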
Google (Gemini)
Gemini offers explicit context caching via a separate API resource. You create a cache object with a TTL, attach content to it, and reference it in subsequent requests. This is closer to a named, persistent resource than an automatic prefix cache. Newer Gemini models also apply implicit caching automatically on repeated prefixes; check the documentation for which models support it. Minimum content sizes and TTL ranges differ from OpenAI and Anthropic; consult the Gemini documentation for current limits.
Other providers
Groq, Deepinfra, Fireworks, and similar inference providers vary — some pass through provider-level caching, some implement their own, some don't offer it yet. Check provider documentation and your usage response fields. The AI tokenomics guide covers how these pricing tiers stack against output token costs across the major providers.
Tracking cached tokens in your cost accounting
Prompt caching only helps your bill if you can see it working. Most providers surface cache information in the usage field of the API response:
- Anthropic: usage.cache_read_input_tokens and usage.cache_creation_input_tokens
- OpenAI: usage.prompt_tokens_details.cached_tokens
A naive cost calculation that multiplies total prompt tokens by the standard input price will over-count your spend when caching is active. Accurate per-call cost accounting needs to apply the cached rate to cache_read_input_tokens, the cache-write surcharge rate to cache_creation_input_tokens, and the standard input rate only to the remaining uncached input tokens.
This matters more than it sounds. At scale, a miscounting layer will make it look like your prompt caching optimizations had no effect — because the savings are hidden in a bucket that your cost model charges at the wrong rate. Getting this right requires your gateway or observability layer to be aware of all three token buckets, not just total prompt tokens and completion tokens.
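The three-bucket calculation can be sketched as follows. `InputRates`, `input_cost`, and `naive_input_cost` are hypothetical names, and the rates in the example are placeholders chosen for easy arithmetic, not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRates:
    """Prices in USD per million tokens (placeholder values only)."""
    standard: float
    cache_read: float
    cache_write: float


def input_cost(uncached: int, cache_read: int, cache_write: int, rates: InputRates) -> float:
    """Price each input bucket at its own rate."""
    return (
        uncached * rates.standard
        + cache_read * rates.cache_read
        + cache_write * rates.cache_write
    ) / 1_000_000


def naive_input_cost(uncached: int, cache_read: int, cache_write: int, rates: InputRates) -> float:
    """Price every input token at the standard rate, ignoring caching."""
    return (uncached + cache_read + cache_write) * rates.standard / 1_000_000


# Placeholder rates: cached reads at a tenth of standard, writes at a 25% surcharge.
rates = InputRates(standard=3.0, cache_read=0.3, cache_write=3.75)
accurate = input_cost(uncached=500, cache_read=10_000, cache_write=0, rates=rates)
naive = naive_input_cost(uncached=500, cache_read=10_000, cache_write=0, rates=rates)
print(f"accurate: ${accurate:.4f}  naive: ${naive:.4f}")
```

Under these placeholder rates the naive figure is roughly seven times the accurate one for a call dominated by cache reads, which is the kind of gap that hides real savings.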
This is one reason developer teams using multiple providers benefit from a unified gateway that normalizes usage fields. A gateway that understands cached input tokens specifically — and accounts for them at the correct rate in per-call cost records — gives you an accurate picture of where your spend actually goes across providers, models, and workloads.
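A normalization step of this kind might look like the sketch below. `TokenBuckets` and `normalize_usage` are hypothetical names. The sketch assumes that Anthropic's input_tokens excludes cached reads and writes while OpenAI's prompt_tokens includes cached_tokens; confirm both against current provider documentation before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBuckets:
    uncached_input: int
    cache_read: int
    cache_write: int
    output: int


def normalize_usage(provider: str, usage: dict) -> TokenBuckets:
    """Map a provider-specific usage payload onto common token buckets."""
    if provider == "anthropic":
        # Assumption: input_tokens does not include cached reads or writes.
        return TokenBuckets(
            uncached_input=usage.get("input_tokens", 0),
            cache_read=usage.get("cache_read_input_tokens") or 0,
            cache_write=usage.get("cache_creation_input_tokens") or 0,
            output=usage.get("output_tokens", 0),
        )
    if provider == "openai":
        # Assumption: prompt_tokens includes cached_tokens, so subtract them.
        cached = (usage.get("prompt_tokens_details") or {}).get("cached_tokens") or 0
        return TokenBuckets(
            uncached_input=usage.get("prompt_tokens", 0) - cached,
            cache_read=cached,
            cache_write=0,  # this payload shape reports no separate write bucket
            output=usage.get("completion_tokens", 0),
        )
    raise ValueError(f"unsupported provider: {provider}")
```

Once every call is reduced to the same buckets, a single pricing function can be applied regardless of which provider served the request.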
The realistic savings from prompt caching
The magnitude of savings depends on two factors: how long your stable prefix is relative to the total prompt, and how frequently that prefix repeats. A few scenarios that illustrate the range:
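- Short system prompt, mostly dynamic input: Minimal benefit. If your stable prefix is 200 tokens and the user message is 1,000 tokens, the prefix is only about 17% of the input (200 of 1,200 tokens), so even a perfect cache hit discounts just a small slice of the bill. A 200-token prefix is also likely below the minimum cacheable length, in which case nothing is cached at all.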
- Large document QA: High benefit. If you inject a 50,000-token document as context and ask multiple questions against it, the document forms a stable prefix across all calls. Cache hits on that prefix dramatically reduce cost on every follow-up query.
- Agent with large tool manifests: High benefit. Tool definitions that run to thousands of tokens can be cached, which reduces the per-call overhead of carrying the agent's capabilities on every turn.
- Few-shot classifiers at high volume: High benefit. Identical system prompt and examples cached across thousands of classification calls.
In high-hit-rate scenarios, prompt caching savings on input tokens can reduce your total API bill by 20–50% depending on the ratio of stable prefix to dynamic content and provider pricing — verify with your actual provider pricing and call mix.
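As a back-of-the-envelope check, the two factors above can be combined into a simple estimate of input-token savings. This is a simplification that ignores cache-write surcharges and minimum-length thresholds; `input_savings_fraction` is a hypothetical helper, and `cached_price_ratio` (cached price divided by standard price) must come from your provider's pricing page.

```python
def input_savings_fraction(
    prefix_tokens: int,
    dynamic_tokens: int,
    hit_rate: float,
    cached_price_ratio: float,
) -> float:
    """Estimated fraction of input-token cost saved by prompt caching.

    savings = (prefix share of input) * (cache hit rate) * (1 - cached/standard price)
    """
    total = prefix_tokens + dynamic_tokens
    if total == 0:
        return 0.0
    prefix_share = prefix_tokens / total
    return prefix_share * hit_rate * (1.0 - cached_price_ratio)


# Illustrative inputs only: a 50,000-token document, 500-token questions,
# a 90% hit rate, and cached tokens priced at half the standard rate.
doc_qa = input_savings_fraction(50_000, 500, hit_rate=0.9, cached_price_ratio=0.5)
print(f"estimated input-cost savings: {doc_qa:.1%}")
```

The result applies to input cost only. The effect on the total bill is smaller in proportion to how much of your spend goes to output tokens.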
Beyond cost, the latency improvement on long contexts is often the more immediately felt benefit. Shaving hundreds of milliseconds from TTFT on every agentic step or document-grounded response adds up quickly in user-perceived performance.
Getting per-call visibility without building the accounting yourself
Measuring prompt caching savings accurately is an underdiscussed source of complexity, because each provider uses different field names, different rate tiers, and different caching semantics. If you're routing across Anthropic, OpenAI, and Gemini, you need a normalization layer that knows how each provider reports cached tokens and applies the right price to each bucket on each call.
flo2 tracks cached input tokens natively in its per-call cost accounting, applying provider-correct rates to each token bucket automatically. Combined with zero markup on pass-through costs and bring-your-own provider keys, you get accurate cost data without building the accounting layer yourself — during beta, at no cost.
Related reading: LLM response caching — when to go further and skip the model call entirely. AI tokenomics — how cached input pricing fits into the full unit economics of an LLM-powered product.