2026-06-03 · flo2 blog

Fixing 'Context Length Exceeded' Errors in LLM APIs

A context length exceeded error stops your LLM request cold before the model ever runs. You'll see it as context_length_exceeded in OpenAI-compatible APIs, as a 400 with a message like "this model's maximum context length is 128,000 tokens," or as a similar token limit exceeded rejection from every major provider. The fix is almost always one of a small set of techniques — trimming your input, restructuring how you retrieve context, or routing the request to a model with a larger window. This guide walks through all of them.

What "Context Length Exceeded" Actually Means

Every model has a context window — a hard ceiling on how many tokens it can process in one call, covering both the input you send and the output you request. The relevant arithmetic is:

input_tokens + max_tokens (requested output) > model_context_limit
  → 400 context_length_exceeded

This catches developers off guard because the error is about the requested output budget, not what the model actually generated. If you send a 120,000-token prompt to a 128K-context model and set max_tokens=10000, the call fails immediately — even though 8,000 output tokens would fit. The API rejects the reservation, not the hypothetical result. Understanding this distinction is the first step toward a reliable fix. For a deeper look at how context windows work, see the LLM context windows guide.
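The reservation rule is easy to check by hand, and a few lines of Python make it concrete. This is only a sketch of the arithmetic above; the function names are made up for illustration and are not part of any provider's SDK.

```python
def request_fits(input_tokens: int, max_tokens: int, context_limit: int) -> bool:
    """True when the input plus the reserved output stays within the window."""
    return input_tokens + max_tokens <= context_limit


def max_allowed_output(input_tokens: int, context_limit: int) -> int:
    """Largest max_tokens value that can be reserved without overflowing."""
    return max(context_limit - input_tokens, 0)


# The example from the text: a 120,000-token prompt on a 128,000-token model.
print(request_fits(120_000, 10_000, 128_000))  # False: 130,000 > 128,000
print(request_fits(120_000, 8_000, 128_000))   # True: exactly 128,000
print(max_allowed_output(120_000, 128_000))    # 8000
```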

Why the Error Happens: The Four Common Causes

1. Long documents stuffed directly into the prompt

The most common trigger is pasting a large file — a PDF, a codebase, a long webpage — verbatim into the system or user message. A 50-page Word document can easily run to 25,000–40,000 tokens. Add a system prompt and a few-shot examples, and you're well past midrange model limits before the user even types their question.

2. Unbounded chat history

Chat applications that append every prior turn to the next request grow without limit. A user who chats for 30 minutes with a verbose assistant can accumulate tens of thousands of tokens of history. The request works fine for the first few exchanges, then fails unpredictably when the conversation crosses the model's ceiling — a frustrating regression that's hard to reproduce in testing.

3. Oversized few-shot examples

Few-shot prompting works well, but including many detailed examples multiplies quickly. Five examples of 500 tokens each add 2,500 tokens before any user input. Combine with a long system prompt and a substantive question, and you consume a large fraction of a standard 8K or 16K context budget on the examples alone.
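One way to keep examples from crowding out everything else is to give them an explicit token budget. The sketch below is illustrative only: count_tokens is a crude word-count stand-in for a real tokenizer, and select_examples is a hypothetical helper, not a library function.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_examples(examples: list[str], budget: int) -> list[str]:
    """Keep examples in their original order until the next one would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for example in examples:
        cost = count_tokens(example)
        if used + cost > budget:
            break  # stop at the first example that does not fit
        chosen.append(example)
        used += cost
    return chosen
```

Ordering the examples from most to least valuable before calling this means the budget cuts the weakest ones first.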

4. max_tokens set too high

If you hardcode a large max_tokens value — say, 4,096 as a conservative "just in case" buffer — that full amount is subtracted from the available input budget. On a smaller model, setting max_tokens=4096 when you only ever need a 200-token reply silently eats into your prompt headroom. See the max_tokens explained guide for the full breakdown of how this parameter interacts with context limits.

How to Fix a Maximum Context Length Error

Count tokens before you send

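The single most useful habit is measuring your token budget before the API call, not after it fails. Most major providers offer a way to count tokens ahead of time, either a tokenizer you can run locally or a token-counting endpoint. Here is a minimal Python token-budget check using OpenAI's tiktoken library, which provides the encoding used by GPT-4o. OpenAI-compatible endpoints that serve other model families generally use their own tokenizers, so treat tiktoken counts as an estimate there: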

import tiktoken

MODEL = "gpt-4o"
CONTEXT_LIMIT = 128_000   # tokens
DESIRED_OUTPUT = 1_024     # tokens you want for the reply

enc = tiktoken.encoding_for_model(MODEL)

def fits_in_context(messages: list[dict]) -> bool:
    """Return True if the request leaves room for the desired output."""
    input_tokens = sum(
        len(enc.encode(m["content"])) + 4   # ~4 tokens of per-message overhead (approximate; varies by model)
        for m in messages
    )
    # Reserve space: input + output must be <= context limit
    if input_tokens + DESIRED_OUTPUT > CONTEXT_LIMIT:
        print(f"Over budget: {input_tokens} input + {DESIRED_OUTPUT} output "
              f"= {input_tokens + DESIRED_OUTPUT} > {CONTEXT_LIMIT}")
        return False
    return True

# Usage
very_long_doc = "lorem ipsum " * 100_000   # stand-in for your real document text
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "Summarize this document: " + very_long_doc},
]

if not fits_in_context(messages):
    # trim, chunk, or route to a larger model before calling the API
    pass

Run this check at request-construction time, before hitting the API. A failed check is a signal to apply one of the strategies below rather than discovering the error in production logs.

Trim or summarize chat history

For multi-turn chat, implement a rolling window strategy: keep only the most recent N turns, or periodically summarize older turns into a compact paragraph. A common approach is to keep the system prompt and last 10 exchanges verbatim, and replace older history with a short "So far: [summary]" block generated by a cheap, fast model. This keeps conversations coherent without ballooning the context.
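A rolling window along those lines might look like the sketch below. It assumes messages are dicts with "role" and "content" keys and that one exchange is a user message followed by an assistant message. The summarize argument is a hypothetical callable (for example, a call to a cheap model) that you would supply yourself.

```python
from typing import Callable, Optional


def trim_history(
    messages: list[dict],
    keep_exchanges: int = 10,
    summarize: Optional[Callable[[list[dict]], str]] = None,
) -> list[dict]:
    """Keep system messages plus the most recent exchanges.

    Older turns are dropped, or folded into a single "So far: ..." message
    when a summarize callable is provided.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    # One exchange = user + assistant, so keep twice as many messages.
    cut = max(len(turns) - keep_exchanges * 2, 0)
    older, recent = turns[:cut], turns[cut:]

    trimmed = list(system)
    if older and summarize is not None:
        trimmed.append({"role": "system", "content": "So far: " + summarize(older)})
    return trimmed + recent
```

Without a summarizer this is a plain sliding window; with one, the dropped turns survive as a single compact message placed ahead of the recent exchanges.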

Use retrieval instead of stuffing

If you're including large documents because the model might need them, replace the "stuff everything in" approach with retrieval-augmented generation (RAG). Embed the document, store it in a vector database, and at query time retrieve only the two or three most relevant chunks. A 100,000-token document becomes a 1,500-token excerpt — one that's more useful to the model anyway, because it's the part that actually matters.
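The retrieval step itself is small. The sketch below uses a toy word-count "embedding" and cosine similarity so that it runs with the standard library alone; a real system would swap embed for an embedding model and the sorted call for a vector database query. All names here are illustrative.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Only the chunks returned by top_chunks go into the prompt, so the prompt size depends on k and the chunk size rather than on the length of the source document.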

Chunk and map-reduce for long-document tasks

For tasks like summarizing or extracting data from a very long document, split the document into chunks that fit comfortably in context, process each chunk independently, then combine the results. A 200-page report becomes 20 ten-page chunks, each summarized separately, with the summaries combined in a final pass. This keeps every individual call under the context limit and scales to very large inputs; if the combined summaries are themselves too long for one call, repeat the combine step on them.
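Here is a rough sketch of that map-reduce pattern. It splits on words rather than real tokens to stay dependency-free, and summarize is a hypothetical callable standing in for your model call. The sketch assumes summarize returns text shorter than its input; otherwise the recursion would not terminate.

```python
from typing import Callable


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_words: int = 2000,
) -> str:
    """Summarize text of any length by summarizing chunks, then the summaries."""
    words = text.split()
    if len(words) <= chunk_words:
        return summarize(text)  # small enough for a single call

    # Map: summarize each fixed-size chunk independently.
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    partials = [summarize(chunk) for chunk in chunks]

    # Reduce: recurse in case the combined summaries are still too long.
    return map_reduce_summarize("\n".join(partials), summarize, chunk_words)
```

Splitting on fixed word counts is the simplest choice; splitting on section or paragraph boundaries usually gives each chunk more coherent content.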

Lower max_tokens to match what you actually need

Audit what your application actually generates. If your completion is almost always under 300 tokens, don't reserve 2,048. Tighter max_tokens values free up input budget and reduce the chance of hitting the ceiling on edge-case prompts. They also give you a clearer contract: if a response genuinely needs more than your cap, that's a signal your task design needs rethinking, not a reason to inflate the buffer.
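If you log completion token counts, the audit can be a one-liner over that log. The sketch below picks a cap from a high percentile of observed lengths plus a fixed headroom; the function name, the 99th-percentile default, and the 50-token headroom are all assumptions to adjust for your own traffic.

```python
import math


def suggest_max_tokens(observed: list[int], percentile: float = 0.99, headroom: int = 50) -> int:
    """Suggest a max_tokens cap from observed completion lengths (in tokens)."""
    if not observed:
        raise ValueError("need at least one observed completion length")
    ordered = sorted(observed)
    # Nearest-rank percentile, clamped to a valid index.
    index = min(max(math.ceil(percentile * len(ordered)) - 1, 0), len(ordered) - 1)
    return ordered[index] + headroom


print(suggest_max_tokens([120, 180, 150, 290, 210]))  # 340
```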

Route to a larger-context model

Sometimes the right answer is a model with a bigger window. Gemini 1.5 Pro offers 1M tokens or more, Claude models handle 200K, and GPT-4o handles 128K; these limits change often, so check each provider's current documentation. The challenge with direct API switching is that each provider has a different endpoint, auth scheme, and parameter set, so routing logic quickly turns into a maintenance burden.

An LLM gateway handles this automatically. With flo2, you define routing rules that detect an oversized request — based on measured input tokens — and redirect it to a larger-context model without changing your application code. Your app calls one endpoint; the gateway inspects the payload, selects the model with sufficient headroom, and forwards the request using your own provider API key. There is no per-token markup: you pay exactly what the provider charges.
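The selection rule itself is simple to sketch. The code below is not flo2's configuration format or API; it only illustrates the idea of picking the smallest window with enough headroom. The model names and limits in MODELS are placeholders.

```python
# Hypothetical model table: (name, context limit in tokens). Placeholder values.
MODELS = [
    ("small-8k", 8_000),
    ("mid-128k", 128_000),
    ("large-1m", 1_000_000),
]


def pick_model(input_tokens: int, max_tokens: int, models=MODELS) -> str:
    """Return the smallest-window model that fits input plus reserved output."""
    needed = input_tokens + max_tokens
    for name, limit in sorted(models, key=lambda m: m[1]):
        if needed <= limit:
            return name
    raise ValueError(f"no model can fit {needed} tokens; trim or chunk the input")


print(pick_model(5_000, 1_000))      # small-8k
print(pick_model(120_000, 10_000))   # large-1m
```

Preferring the smallest sufficient window is a cost heuristic; a real router might also weigh price, latency, or output quality.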

Putting It Together: A Layered Defense

In practice, robust applications combine several of these strategies rather than relying on any one.

The context_length_exceeded error is one of the most predictable failure modes in LLM development — and one of the most solvable. Count your tokens, trim what you can, retrieve what you need, and let a gateway handle the routing. For more on tuning these parameters, see the guides on LLM context windows and max_tokens.

If you want automatic overflow routing with zero token markup and your own provider keys, flo2 is free during beta.

© 2026 flo2.com — the zero-markup LLM gateway & router. flow → to