max_tokens Explained: Control LLM Output Length & Cost
If you've ever received a reply that cuts off mid-sentence, watched an API call cost twice what you expected, or stared at a finish_reason: "length" in your response payload, you've already collided with max_tokens. It's one of the most commonly misunderstood parameters in LLM APIs — confused with the context window, set too high by default, and rarely thought about until something goes wrong. This guide explains exactly what max_tokens controls, how it differs from the context window, how to set it sensibly for cost and latency, and how to detect truncation before it silently breaks your application.
What max_tokens Actually Controls
max_tokens is a hard upper bound on the number of tokens the model will generate in its response. It is purely an output constraint. When the model reaches this limit, generation stops immediately — regardless of whether the thought, code block, JSON object, or sentence is complete.
It does not control:
- How many tokens you can send in your prompt
- The total size of the context window
- How many tokens the model "sees" or "thinks with"
Think of it as a meter on the output spigot. You can turn it down to save money and time, or leave it high to allow long responses — but it has no effect on what goes into the request, only on what comes back out.
max_tokens vs. the Context Window: A Critical Distinction
The context window is the total token budget shared by both input and output. If a model has a 128K context window and you send a 120K-token prompt, you have at most 8K tokens available for the response — no matter what you set max_tokens to. See LLM context windows for a full breakdown of how this budget works.
The relationship looks like this:
input_tokens + output_tokens ≤ context_window
max_tokens caps output_tokens (and can only be ≤ context_window - input_tokens)
Setting max_tokens higher than the remaining context budget will either be silently clamped by the provider or return an error. It cannot extend the context window — it can only limit the output further.
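The budget arithmetic can be sketched as a small pre-flight check. This is a minimal sketch, not any provider's actual behavior: effectiveOutputCap is a hypothetical helper, and the input token count is assumed to come from whatever tokenizer or token-counting facility you already use.

```typescript
// Hypothetical pre-flight helper: how many output tokens can this request
// actually receive, given the window size and the prompt size?
function effectiveOutputCap(
  contextWindow: number,
  inputTokens: number,
  maxTokens: number,
): number {
  const remaining = contextWindow - inputTokens;
  if (remaining <= 0) {
    // The prompt alone fills the window; no max_tokens value can help.
    throw new RangeError(
      `Prompt uses ${inputTokens} tokens; context window is ${contextWindow}.`,
    );
  }
  // max_tokens can only restrict the output further, never extend it.
  return Math.min(maxTokens, remaining);
}

// 128,000-token window, 120,000-token prompt, 16,000 requested.
const cap = effectiveOutputCap(128_000, 120_000, 16_000);
console.log(cap); // 8000
```

Running a check like this before sending the request lets you decide up front whether to trim the prompt, rather than discovering the shortfall from a provider error or a clamped response.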
When the two limits collide
The most common error developers encounter looks something like: "This model's maximum context length is 128000 tokens. Your messages resulted in 127800 tokens, leaving only 200 tokens for completion, but you requested 4096." Lowering max_tokens to 200 would clear the error, but it would also leave you with a 200-token reply, so the real fix is almost always to shrink the prompt. max_tokens is a ceiling on what you want; the context window is a ceiling on what's physically possible.
The max_completion_tokens Naming Variant
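OpenAI renamed the parameter to max_completion_tokens in its newer API versions (starting with the o1 model family and carried into later releases). It still caps generated tokens, with one difference worth knowing: on reasoning models the cap covers hidden reasoning tokens as well as the visible reply, so a limit that looks generous can be consumed before much visible text appears. The rename makes the "completion-only" scope of the parameter explicit in the name itself, which is a welcome clarification.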
Anthropic's API uses max_tokens and requires you to set it explicitly on every request (there is no default). Most other providers follow one convention or the other. When routing requests through a gateway like flo2, the translation between naming conventions is handled automatically, so you can use a single call format regardless of which model or provider is behind it.
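If you call providers directly rather than through a gateway, the naming difference comes down to which field carries the limit. The sketch below is illustrative only: withOutputLimit and the "example-model" id are hypothetical, and deciding which field a given model expects is left to the caller.

```typescript
// The two field names come from the APIs discussed above;
// the helper itself is a hypothetical illustration.
type LimitField = "max_tokens" | "max_completion_tokens";

function withOutputLimit(
  body: Record<string, unknown>,
  field: LimitField,
  limit: number,
): Record<string, unknown> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("Output limit must be a positive integer.");
  }
  // Copy the body and attach the limit under the requested field name.
  return { ...body, [field]: limit };
}

const request = withOutputLimit(
  { model: "example-model", messages: [] },
  "max_completion_tokens",
  1024,
);
console.log(request);
```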
Detecting Truncation: finish_reason
When a model stops generating because it hit your max_tokens limit rather than because it finished naturally, it signals this in the response:
- finish_reason: "length" means the output was cut off at the token limit (OpenAI and most OpenAI-compatible APIs).
- stop_reason: "max_tokens" is Anthropic's equivalent signal.
- finish_reason: "stop" means the model completed naturally; this is what you want to see.
Always check finish_reason in production code. A truncated JSON payload, a half-written function, or a cut-off explanation looks fine in development when outputs happen to be short, then silently breaks in production when a user triggers a longer response. Here's a minimal example that checks for it:
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.flo2.com/v1", // drop-in via flo2
  apiKey: process.env.FLO2_API_KEY,
});

const response = await client.chat.completions.create({
  model: "anthropic/claude-sonnet-4-5",
  max_tokens: 1024,
  messages: [
    { role: "user", content: "Explain how RSA encryption works." },
  ],
});

const choice = response.choices[0];
if (choice.finish_reason === "length") {
  // Output was truncated. Handle it: retry with a higher limit,
  // ask the model to continue, or surface a warning to the user.
  console.warn("Response truncated at max_tokens limit.");
} else {
  // Any other value; "stop" means the model finished naturally.
  console.log(choice.message.content);
}
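Because the truncation signal has a different name and value depending on the API, it can help to normalize it in one place. The following is a minimal sketch: StopInfo and wasTruncated are hypothetical names, and the type is trimmed to the two fields this check reads.

```typescript
// Only the fields needed for the check; real response objects carry more.
type StopInfo = {
  finish_reason?: string | null; // OpenAI-style choice
  stop_reason?: string | null;   // Anthropic-style message
};

// True when generation ended because the output token limit was reached.
function wasTruncated(r: StopInfo): boolean {
  return r.finish_reason === "length" || r.stop_reason === "max_tokens";
}

console.log(wasTruncated({ finish_reason: "length" }));  // true
console.log(wasTruncated({ stop_reason: "max_tokens" })); // true
console.log(wasTruncated({ finish_reason: "stop" }));    // false
```

Any other stop value is treated as "not truncated" here; you may still want to handle those cases separately.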
Setting max_tokens Sensibly
The wrong instinct is to set max_tokens as high as possible "just to be safe." You are billed for tokens actually generated, not for the ceiling, so a high limit costs nothing extra when the model stops early. The problem is the calls where it doesn't: a verbose, looping, or off-track response that should have ended at 300 tokens can run all the way to 4096, and you pay for every one of those tokens in both money and latency. See reduce LLM latency for how output length directly drives total completion time.
Match the limit to the task
- Classification, routing, yes/no decisions: 10–50 tokens is usually sufficient. Cap it there.
- Short summaries or single-paragraph answers: 200–400 tokens covers most cases.
- Chat responses: 512–1024 tokens handles typical conversational replies without overspending on the tail.
- Long-form generation (reports, essays, code files): 2048–4096 tokens; size to the actual expected output, not the model maximum.
- Structured outputs (JSON, tool calls): Estimate the schema size and add 20% headroom. Truncated JSON is invalid JSON.
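One way to keep these ceilings consistent across a codebase is a small lookup keyed by task type. This is a sketch under assumptions: the values are simply the upper ends of the ranges listed above, and OUTPUT_LIMITS, limitFor, and structuredOutputLimit are hypothetical names. Tune the numbers against your own observed output lengths.

```typescript
// Starting ceilings taken from the upper end of each range above.
const OUTPUT_LIMITS = {
  classification: 50,
  shortSummary: 400,
  chat: 1024,
  longForm: 4096,
} as const;

type TaskKind = keyof typeof OUTPUT_LIMITS;

function limitFor(task: TaskKind): number {
  return OUTPUT_LIMITS[task];
}

// Structured outputs: estimated size plus percentage headroom (20% by default).
// Integer arithmetic avoids floating-point surprises before rounding up.
function structuredOutputLimit(
  estimatedTokens: number,
  headroomPercent = 20,
): number {
  return Math.ceil((estimatedTokens * (100 + headroomPercent)) / 100);
}

console.log(limitFor("chat"));           // 1024
console.log(structuredOutputLimit(350)); // 420
```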
Cost and latency are tightly coupled to output length
Output tokens are billed at a higher rate than input tokens on most providers, often 3–5× higher. A prompt that generates 2000 tokens per call, fired 10,000 times a day, produces 20 million output tokens daily. If the task realistically needs no more than 512 tokens and you cap it there, the worst case drops to about 5.1 million tokens a day, a reduction of roughly 75%. The latency benefit is just as direct: generation time grows roughly linearly with the number of tokens generated, so tighter limits mean faster responses at the p95 and p99, the tail that matters most for user experience.
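The arithmetic behind that comparison is easy to reproduce. In the sketch below the token counts come from the example above, while PRICE_PER_MILLION_USD is a made-up placeholder rather than any provider's real rate; the function names are hypothetical too.

```typescript
function dailyOutputTokens(tokensPerCall: number, callsPerDay: number): number {
  return tokensPerCall * callsPerDay;
}

function dailyOutputCost(tokens: number, pricePerMillionUsd: number): number {
  return (tokens / 1_000_000) * pricePerMillionUsd;
}

// Hypothetical output price, for illustration only.
const PRICE_PER_MILLION_USD = 10;

const uncappedTokens = dailyOutputTokens(2000, 10_000); // 20,000,000
const cappedTokens = dailyOutputTokens(512, 10_000);    // 5,120,000
const reduction = 1 - cappedTokens / uncappedTokens;

console.log(`Uncapped: ${uncappedTokens} tokens/day`);
console.log(`Capped:   ${cappedTokens} tokens/day`);
console.log(`Reduction: ${(reduction * 100).toFixed(1)}%`); // Reduction: 74.4%
console.log(
  `Daily cost at the placeholder rate: ` +
    `$${dailyOutputCost(uncappedTokens, PRICE_PER_MILLION_USD).toFixed(2)} vs ` +
    `$${dailyOutputCost(cappedTokens, PRICE_PER_MILLION_USD).toFixed(2)}`,
);
```

The saving only holds if the task genuinely fits under the lower cap; otherwise you are trading cost for truncated output.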
When you genuinely need long outputs
Some tasks — code generation, structured document creation, multi-step reasoning — do require high limits. In those cases, consider whether the task can be broken into smaller chained calls (each with a sensible limit) rather than a single unconstrained generation. Chaining is also more reliable: a 500-token intermediate step that produces valid JSON is far more useful than a 4096-token generation that truncates just before the closing brace.
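In a chain, each step can refuse to pass truncated or malformed JSON to the next one. The following is a minimal sketch of such a guard: parseModelJson and ParseResult are hypothetical names, and the "length" value assumes an OpenAI-style finish_reason.

```typescript
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; reason: "truncated" | "invalid_json" };

// Check the stop signal first, then parse. A truncated payload is reported
// as such instead of surfacing later as a confusing JSON syntax error.
function parseModelJson<T>(
  content: string,
  finishReason: string | null,
): ParseResult<T> {
  if (finishReason === "length") {
    return { ok: false, reason: "truncated" };
  }
  try {
    return { ok: true, value: JSON.parse(content) as T };
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
}

const complete = parseModelJson<{ label: string }>('{"label":"spam"}', "stop");
const cutOff = parseModelJson<{ label: string }>('{"label":"sp', "length");

console.log(complete); // { ok: true, value: { label: 'spam' } }
console.log(cutOff);   // { ok: false, reason: 'truncated' }
```

Distinguishing "truncated" from "invalid_json" matters for recovery: the first usually calls for a higher limit or a smaller step, the second for a retry or a prompt fix.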
Summary
- max_tokens caps output tokens only; it has no effect on your prompt or the context window.
- Input + output must fit inside the context window; max_tokens can only further restrict the output portion.
- Always check finish_reason ("length" / "max_tokens") to detect silent truncation.
- The same parameter is called max_completion_tokens on newer OpenAI model families.
- Set max_tokens to match the realistic output of your task, not the model maximum, to control cost and latency.
If you're routing across providers, flo2 normalizes parameter names, translates between API formats, and gives you per-request token accounting at zero markup — so you can see exactly how your max_tokens choices play out in real cost and latency data. Free during Beta.