2026-06-03 · flo2 blog

LLM Context Windows Explained: Tokens, Limits & Cost

Many confusing things about working with large language models — truncated replies, "context length exceeded" errors, a chatbot that forgets the start of a conversation, a RAG pipeline that quietly drops half its documents — trace back to one concept: the context window. It's one of the most important numbers on a model's spec sheet, and one of the most misunderstood. This guide explains what a context window is, how it's measured, why a bigger one isn't automatically better, and the patterns developers use to stay inside it without losing quality or overpaying.

What Is a Context Window?

A context window is the maximum amount of text a model can consider at once, measured in tokens. Crucially, it covers both sides of a request: the input you send (system prompt, conversation history, retrieved documents, the user's question) and the output the model generates — they share one budget. If a model advertises a 128K context window and your prompt already uses 120K tokens, only about 8K tokens are left for the response; ask for more and you'll get a truncated answer or an error.

This is why a few terms get used interchangeably. The context length or token limit usually refers to that total ceiling, while some providers quote separate input and output maximums. The mental model to keep is simple: input + output must fit inside the window. Everything below follows from that one constraint.

What is a token, exactly?

A token is the unit a model actually reads — not a character and not quite a word. Models break text into subword pieces with a tokenizer, so a token is typically a few characters. As a rough rule of thumb for English text:

The practical consequence: you can't reliably size a prompt by counting characters or words, because tokenization is non-linear. A page of dense JSON or source code consumes far more tokens than a page of plain English, and every provider's tokenizer splits text differently. When you need an exact number, count with the model's own tokenizer rather than estimating — the gap matters most at the edges, where overflow happens.

Why Bigger Context Isn't Always Better

Marketing pages advertise ever-larger windows — 200K, 1M, even more — so it's tempting to fix any context problem by reaching for the bigger model and pasting everything in. For many workloads that's the wrong instinct. Three forces push back.

Cost scales with input tokens

You pay per token, and the input side of a long context is billed on every single call. Stuff 100K tokens of background into a prompt and fire it a thousand times a day, and that's 100 million input tokens daily — most of which the model may not even need. A short prompt that retrieves only the relevant 2K tokens can be dozens of times cheaper.

Latency grows with prefill

Before a model can emit its first output token, it has to read and process the entire input — the "prefill" phase. That work grows with input length, so a giant prompt directly increases time-to-first-token: a 200K-token prompt can add noticeable delay before anything appears on screen. Long context isn't free in time, only in money.

Quality can drop in the middle

Large windows don't guarantee the model uses everything equally well. A well-documented failure mode — "lost in the middle" — is that models attend most reliably to the beginning and end of a long input and can overlook facts buried in between. The one sentence that matters, lost inside 150K tokens of noise, can lower answer quality versus a tight prompt that puts it front and center. More context can mean more distraction, not more accuracy.

How Input Size Affects Cost and Latency

The table below uses illustrative round numbers (not any vendor's pricing) to show how the same task behaves as the input grows. Assume a hypothetical model at $1.00 per million input tokens, a fixed ~500-token answer, and a single call:

ApproachInput tokensIllustrative input costRelative prefill / latency"Lost in the middle" risk
Tight prompt (RAG, top chunks)~2,000$0.002LowestLow — only relevant text is present
Moderate context~20,000$0.020~10x higherLow–moderate
Large context~100,000$0.100~50x higherModerate — key facts may be buried
Dump everything~500,000$0.500HighestHigher — signal diluted by noise

Read one row at a time: a 250x jump in input size is a 250x jump in input cost and a large jump in prefill latency, for an answer that is often no better — and sometimes worse. Context is a budget to spend deliberately, not a bucket to fill.

Practical Ways to Manage the Context Window

Staying inside the window — cheaply and without hurting quality — comes down to a handful of well-worn techniques. Most production systems combine several.

Trim and summarize conversation history

In a chat app, naively resending the full transcript on every turn makes each message more expensive than the last and eventually overflows the window. Instead, keep recent turns verbatim and summarize older history into a compact running summary. The model keeps the gist at a fraction of the cost, and you never hit the limit however long the session runs.

Use retrieval (RAG) instead of dumping everything

The biggest single win for knowledge-heavy apps: don't paste an entire knowledge base into the prompt. Index your documents, retrieve only the chunks relevant to the current question, and send just those. Retrieval-augmented generation keeps prompts small, cheap, and on-topic — and it sidesteps "lost in the middle" because almost everything in the window is relevant. A 3K-token RAG prompt routinely beats a 200K-token "here's all our docs" prompt on cost, latency, and accuracy.

Chunk inputs that genuinely are too big

When a single input truly exceeds the window — a long PDF, a large log file, a whole codebase — split it into chunks that each fit, process them separately, then combine the results (a map-reduce or iterative-refine pattern). Overlap chunks slightly so you don't sever an idea at a boundary. This handles inputs of any size with a fixed-window model.

Set max_tokens deliberately

Remember that output shares the window. Set max_tokens to what the task actually needs so the response can't run away, eat your remaining budget, and get truncated mid-sentence. Capping output also reduces cost (output is usually the pricier side) and tightens latency; asking for compact structured output — short JSON fields, not prose — pulls the same way.

Mind the cost angle with prompt caching

Sometimes you genuinely need a large, stable block of context on many calls — a long system prompt, fixed instructions, a big reference document. Prompt caching lets providers discount the static, repeated prefix of your prompt (often a large reduction on those cached tokens), so you pay full price once and a fraction thereafter. Keep the cached portion byte-stable and put the variable part last. Caching doesn't shrink the window, but it takes much of the sting out of long contexts.

Putting It Together

The context window is a hard ceiling and a soft budget at once. The ceiling — the token limit — throws errors when input plus output exceeds it. The budget is what shapes good engineering: even when everything fits, every extra token costs money, adds latency, and risks diluting the signal. The teams who get the most out of LLMs treat context as a scarce resource to spend.

Doing this across multiple models adds a wrinkle: each provider has a different window size, tokenizer, and pricing, so the "right" amount of context shifts with where a request is routed. That's the kind of cross-cutting concern an LLM gateway absorbs — one endpoint that routes each call to a model that fits the job, applies caching, and records true token counts and cost per call so you can see what your context is buying. For the economics behind those tokens, our guide to AI tokenomics breaks down input, output, and cached rates.

If you'd rather not hand-roll routing, fallback, caching, and per-call cost accounting, flo2 is a developer-first, bring-your-own-key LLM gateway: one OpenAI- and Anthropic-compatible key routes to the cheapest or fastest model, with smart routing, fallback, caching, and true per-call cost tracking at zero token markup. Free during beta.

One key, every model — zero markup.
Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.
Get your flo2 key →
© 2026 flo2.com — the zero-markup LLM gateway & router. flow → to