LLM Response Caching: Cut Cost & Latency on Repeat Calls
Every time you send the same prompt to an LLM and wait for the same answer, you're paying twice, in money and in latency, for a result you already have. LLM response caching breaks that pattern: instead of forwarding an identical request to the model, your gateway returns a stored response immediately. A cache hit is served in milliseconds and consumes no provider tokens. For any workload with repeated or predictable requests, response caching is one of the most direct levers you have on AI tokenomics.
This article explains how response caching works at a technical level, when it's safe to use, how it differs from prompt caching, and how to enable it in practice.
What is LLM response caching?
Response caching operates at the gateway or proxy layer, not inside the model itself. When a request arrives, the gateway computes a cache key from the request parameters. If a stored response exists for that key and is still within its time-to-live (TTL), the stored response is returned immediately — no API call, no token spend, near-zero latency. On a cache miss, the gateway forwards the request to the model, receives the completion, stores it, and returns it to the caller.
The practical result: repeat requests are served from cache at a fraction of the cost and in a fraction of the time.
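The hit/miss flow described above can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular gateway implements it; `ResponseCache`, `cached_completion`, and the injected `call_model` callable are hypothetical names, and the cache key is assumed to be computed elsewhere.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ResponseCache:
    """In-memory response cache with a per-entry TTL (illustrative only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (expiry timestamp, stored response)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return response

    def put(self, key: str, response: str, ttl_seconds: float) -> None:
        self._entries[key] = (self._clock() + ttl_seconds, response)


def cached_completion(
    cache: ResponseCache,
    key: str,
    ttl_seconds: float,
    call_model: Callable[[], str],
) -> Tuple[str, bool]:
    """Return (response, was_hit); call_model runs only on a miss."""
    stored = cache.get(key)
    if stored is not None:
        return stored, True
    response = call_model()  # the only place tokens would be spent
    cache.put(key, response, ttl_seconds)
    return response, False
```

Injecting the clock keeps the expiry logic testable without sleeping; a production cache would more likely live in a shared store such as Redis than in process memory.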
When is it safe to cache LLM responses?
Response caching is appropriate when the output of a given request is expected to be the same on every invocation — or close enough that variation doesn't matter. The conditions that make caching safe:
- Low or zero temperature. At temperature=0, most models are close to deterministic: the same prompt usually produces the same completion. As temperature rises, outputs become stochastic, and caching an early response means callers never see that variation. If variability is a feature (creative writing, brainstorming), don't cache.
- Idempotent use cases. Classification, entity extraction, FAQ answering, code explanation, document summarization: these are all tasks where the "correct" answer to a fixed input doesn't change from call to call. They're natural candidates.
- Non-personalized prompts. If the prompt contains the user's name, account state, or anything specific to an individual session, cached responses from one user will be wrong for another. Keep personalized context out of cached prompts, or key the cache on the full request including that context and accept lower hit rates.
- Stable knowledge horizon. Don't cache responses that need to reflect real-time information — current prices, live data, today's news — unless the TTL is short enough that staleness is acceptable.
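One way to make these conditions explicit is a small per-route policy check. The sketch below is an assumption about how you might encode them; `RouteProfile` and `is_safe_to_cache` are hypothetical names, and the check is deliberately conservative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteProfile:
    """Hypothetical description of one application route."""
    temperature: float
    idempotent: bool        # fixed input should yield the same answer
    personalized: bool      # prompt contains user- or session-specific data
    needs_live_data: bool   # answer depends on real-time information


def is_safe_to_cache(route: RouteProfile, max_temperature: float = 0.0) -> bool:
    """Return True only when all four conditions from the list hold."""
    return (
        route.temperature <= max_temperature
        and route.idempotent
        and not route.personalized
        and not route.needs_live_data
    )
```

A route that fails only the personalization or live-data check can still be cached with the mitigations noted in the list (keying on the full request, or a short TTL); this boolean simply flags the routes that need no special handling.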
Exact-match caching vs. semantic caching
There are two fundamentally different ways to decide whether an incoming request matches a cached entry.
Exact-match caching
The cache key is a deterministic hash of the request: model name, the full messages array, and all generation parameters (temperature, max_tokens, top_p, etc.). Two requests hit the same cache entry only if they are byte-for-byte identical in every field that affects the response. This is predictable, safe, and fast — key lookup is O(1), there are no false positives, and the only failure mode is a miss on a request that's semantically equivalent but textually different.
Exact-match is the right default. Its weakness is that it doesn't catch near-duplicates: "What is the capital of France?" and "What's the capital of France?" are different cache keys even though the answer is the same.
Semantic caching
Semantic caching embeds incoming prompts into a vector space and retrieves cached responses for prompts that are close enough by cosine similarity or a similar metric. In theory, this catches rephrasings and minor variations. In practice it introduces a serious risk: false positives. Two prompts can be close in embedding space while requiring meaningfully different answers. "Summarize this contract with a focus on termination clauses" and "Summarize this contract with a focus on payment terms" might be similar enough to hit the same cache entry — and returning the wrong summary is worse than returning no summary at all.
Semantic caching also adds latency (embedding + vector search) and cost on every request, including misses. Use it only if you have measured that the hit-rate gain justifies the added complexity and you've validated your similarity threshold carefully against real traffic.
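To see why the similarity threshold matters, here is a minimal sketch of a semantic lookup. The vectors are hand-made toy values standing in for real embeddings, and `semantic_lookup` is a hypothetical helper; a real system would call an embedding model and a vector index instead of scanning a list.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float,
) -> Optional[str]:
    """Return the stored response of the nearest entry, if it clears the threshold."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


# Toy vectors: the "payment terms" query sits very close to the cached
# "termination clauses" entry, as two near-identical prompts might.
entries = [([1.0, 0.0], "summary focused on termination clauses")]
payment_query = [0.9, 0.1]

loose = semantic_lookup(payment_query, entries, threshold=0.95)
strict = semantic_lookup(payment_query, entries, threshold=0.999)
```

With the looser threshold the payment-terms query is answered with the termination-clauses summary, which is exactly the false positive described above; the stricter threshold turns it into a miss.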
Cache keys, TTL, and invalidation
A well-designed cache key captures everything that determines the response. At minimum:
cache_key = hash(
    model,        // "gpt-4o", "claude-sonnet-4-5", etc.
    messages[],   // full conversation array, including system prompt
    temperature,
    max_tokens,
    top_p,
    // any other sampling params that affect output
)
Omitting any parameter that affects generation creates the risk of returning a cached response generated under different conditions. The model name is especially important: a response from gpt-4o should never be returned for a request that specified claude-opus-4-5.
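A runnable version of the pseudocode above might look like the following. It is a sketch under assumptions: `make_cache_key` is a hypothetical helper, SHA-256 over canonical JSON is one reasonable choice of hash, and `extra_params` stands in for any other sampling parameters your requests carry.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def make_cache_key(
    model: str,
    messages: List[Dict[str, Any]],
    temperature: float,
    max_tokens: Optional[int] = None,
    top_p: Optional[float] = None,
    extra_params: Optional[Dict[str, Any]] = None,
) -> str:
    """Hash every field that can affect the completion into one hex key."""
    payload = {
        "model": model,
        "messages": messages,  # list order is preserved; it is significant
        "temperature": temperature,
        "max_tokens": max_tokens,
        "top_p": top_p,
        "extra": extra_params or {},
    }
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the hash covers the exact message text, "What is the capital of France?" and "What's the capital of France?" produce different keys, which is the exact-match behavior described earlier.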
TTL (time-to-live) controls how long a cached response is valid. Setting a TTL is an explicit acknowledgment that the cached answer is a good-enough approximation until the clock expires. Common choices:
- Short TTL (seconds to minutes): live-ish data, high-velocity prompts where model updates matter
- Medium TTL (hours): FAQ bots, documentation assistants, prompts with stable factual answers
- Long TTL (days or more): batch pipelines, offline classification, anything where the prompt and expected output are effectively frozen
Invalidation is the hard problem. Cache entries go stale when the underlying prompt changes, the model is updated, or the knowledge the response depends on changes. For most gateway-level caches, TTL-based expiration is the primary invalidation mechanism. If you need finer control — invalidate all entries for a given system prompt when you ship a new version — you'll need a cache key design that includes a version or namespace component you can purge deliberately.
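One common way to get a purgeable namespace is to embed a version counter in the key and bump it on deploy. The sketch below illustrates that idea only; `NamespacedCache` is a hypothetical class, TTL handling is omitted for brevity, and this is not a description of any specific gateway's mechanism.

```python
from typing import Dict, Optional


class NamespacedCache:
    """Exact-match cache whose keys carry a per-namespace version (illustrative)."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}
        self._store: Dict[str, str] = {}

    def _full_key(self, namespace: str, key: str) -> str:
        version = self._versions.get(namespace, 0)
        return f"{namespace}:v{version}:{key}"

    def get(self, namespace: str, key: str) -> Optional[str]:
        return self._store.get(self._full_key(namespace, key))

    def put(self, namespace: str, key: str, response: str) -> None:
        self._store[self._full_key(namespace, key)] = response

    def invalidate(self, namespace: str) -> None:
        # Bumping the version makes every old entry unreachable at once.
        # Orphaned entries still occupy memory here; a real store would
        # let TTL expiry or an explicit sweep reclaim them.
        self._versions[namespace] = self._versions.get(namespace, 0) + 1
```

Calling `invalidate("support-bot")` after shipping a new system prompt turns every existing entry in that namespace into a miss, while entries in other namespaces are untouched.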
The savings: what cache hits actually buy you
On a cache hit, the economics are straightforward:
- Token cost: zero. No tokens are consumed at the provider. You pay for the original fill; every subsequent hit is free.
- Latency: near-zero. A cache lookup from an in-memory or fast-path store is measured in single-digit milliseconds, compared to hundreds of milliseconds or more for a real model call. For latency-sensitive applications this can be the single biggest win — see reduce LLM latency for the full picture.
The savings scale linearly with hit rate, assuming hits and misses would cost about the same per request. At a 50% hit rate on a workload that would otherwise cost $1,000/month, you're spending $500. At 80%, $200. For applications where the same prompts recur frequently (customer support, FAQ systems, structured extraction over repeated documents), hit rates of 70–90% or more can be achievable with exact-match caching alone.
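The arithmetic can be written down directly. This sketch assumes every request costs roughly the same, so spend falls in proportion to the hit rate; `monthly_cost_with_cache` is a hypothetical helper name.

```python
def monthly_cost_with_cache(baseline_cost: float, hit_rate: float) -> float:
    """Provider spend after caching, given the uncached baseline and a hit rate in [0, 1]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Only misses reach the provider, so only misses are billed.
    return baseline_cost * (1.0 - hit_rate)


for rate in (0.5, 0.8, 0.9):
    cost = monthly_cost_with_cache(1000.0, rate)
    print(f"hit rate {rate:.0%}: ${cost:,.2f}/month")
```

If cached requests are systematically cheaper or more expensive than uncached ones, weight the hit rate by token volume rather than by request count.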
Response caching vs. prompt caching: an important distinction
These two terms are easy to confuse, and conflating them leads to incorrect expectations.
| Feature | Response caching | Prompt caching |
|---|---|---|
| Where it operates | Gateway / proxy layer | Inside the model provider |
| What is cached | The complete response to an identical request | The KV state of a repeated prompt prefix |
| Token spend on hit | Zero tokens consumed | Input tokens in the cached prefix are billed at a discount; output tokens are billed normally |
| Output generated on hit | No — stored response is returned as-is | Yes — the model still generates a new completion |
| Latency on hit | Near-zero (cache lookup only) | Reduced TTFT (prefill skipped for cached prefix) |
| Works with varied outputs | No — same stored output every time | Yes — each completion is unique |
Prompt caching (offered natively by Anthropic, OpenAI, and others) caches the KV representation of a static prompt prefix — your system prompt, few-shot examples, a large reference document. The model still generates a fresh completion on every call; you just skip the expensive prefill step for the shared portion. This is powerful for workloads with large, stable inputs but dynamic outputs.
Response caching is complementary: it returns the stored completion without generating anything new. The two can be active simultaneously: prompt caching reduces prefill cost on misses; response caching eliminates generation cost entirely on hits.
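A rough cost model shows how the two layers compose. Everything here is an assumption for illustration: the function name, the idea of a single `cached_prefix_multiplier`, and the simplification that the prompt-cache prefix is always warm with no cache-write surcharge. Real provider pricing differs and should be checked against their documentation.

```python
def expected_cost_per_request(
    hit_rate: float,
    prefix_tokens: int,
    dynamic_input_tokens: int,
    output_tokens: int,
    input_price_per_token: float,
    output_price_per_token: float,
    cached_prefix_multiplier: float,
) -> float:
    """Expected provider cost per request with both caches active (simplified)."""
    # A response-cache miss still benefits from the discounted prompt prefix.
    miss_cost = (
        prefix_tokens * input_price_per_token * cached_prefix_multiplier
        + dynamic_input_tokens * input_price_per_token
        + output_tokens * output_price_per_token
    )
    # Response-cache hits never reach the provider, so they contribute zero.
    return (1.0 - hit_rate) * miss_cost
```

Setting `cached_prefix_multiplier=1.0` models response caching alone, and `hit_rate=0.0` models prompt caching alone, which makes it easy to compare the contribution of each layer.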
Opt-in response caching with flo2
flo2 is an LLM gateway that lets you bring your own provider keys and routes calls across OpenAI, Anthropic, and other providers through a single compatible endpoint. Response caching is opt-in and controlled per-request via the cache_ttl parameter — no global setting that silently caches things you didn't intend to cache.
Enable it by adding cache_ttl to your request body:
POST https://api.flo2.com/v1/chat/completions
Authorization: Bearer <your-flo2-key>
Content-Type: application/json

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the difference between TCP and UDP?" }
  ],
  "temperature": 0,
  "cache_ttl": 3600
}
With cache_ttl: 3600, flo2 caches the response for one hour. The first request generates normally; subsequent identical requests within that window are returned from cache instantly, with no token spend. Because flo2 charges zero token markup on calls it does route to providers, and zero cost on cache hits, the per-request cost accounting is transparent: you see exactly what you saved.
The opt-in design matters: you decide which calls to cache and for how long, rather than trusting a system-wide policy to get it right for every route in your application.
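The same request can be built from Python using only the standard library. This is a sketch based on the endpoint and `cache_ttl` field shown in the example above; `build_cached_request` is a hypothetical helper, and the response schema is not described here, so the sketch stops at constructing the request.

```python
import json
import urllib.request
from typing import Dict, List

FLO2_URL = "https://api.flo2.com/v1/chat/completions"  # endpoint from the example above


def build_cached_request(
    api_key: str,
    model: str,
    messages: List[Dict[str, str]],
    cache_ttl: int,
    temperature: float = 0.0,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request with opt-in caching."""
    body = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "cache_ttl": cache_ttl,  # seconds the response may be served from cache
    }
    return urllib.request.Request(
        FLO2_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen(...)`. Keeping `cache_ttl` as an explicit argument mirrors the per-request opt-in: routes that should not be cached simply never call this helper.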
Putting it together
Response caching is one of the simplest high-leverage optimizations available for LLM-backed applications. The approach is:
- Identify routes in your application that are low-temperature, idempotent, and non-personalized
- Set a cache_ttl appropriate to how quickly you want the data to refresh
- Monitor hit rate and cost; adjust TTL or cache key scope as your traffic shapes up
- Reserve semantic caching for after you've exhausted exact-match, and only if you can validate false-positive rates on real data
Combined with latency optimizations like streaming, model routing, and racing, response caching makes the difference between an LLM backend that burns through budget on repeat work and one that scales economically. Start caching on flo2 — free during beta, no token markup, your keys.