2026-06-03 · flo2 blog

OpenRouter Rate Limits: Free vs Paid & How to Avoid 429s

You wired up a single OpenRouter key, shipped, and now your logs are filling with HTTP 429. OpenRouter rate limits are the most common wall developers hit on the platform, and they trip the hardest on exactly the path most people start with: free :free models and accounts carrying little or no credit balance. This guide explains how OpenRouter's rate limiting works conceptually, what a 429 actually means in this context, and the practical ways to avoid and handle one — from backoff and caching to upgrading your balance, going BYOK, or putting a gateway with automatic fallback in front so no single provider's ceiling can stop a request.

One ground rule before the details. OpenRouter's exact limits — the specific requests-per-minute numbers, the daily caps on free models, the balance thresholds that unlock more headroom — change over time and depend on your account. So this article deliberately does not publish hard figures. It explains the shape of the system and tells you where to confirm the current numbers: OpenRouter's own rate-limit documentation, which is the only version that stays correct.

How OpenRouter rate limits work, conceptually

OpenRouter is a hosted aggregator: one OpenAI-compatible (and Anthropic-compatible) key reaches hundreds of models behind a single endpoint. Sitting in that position, it enforces its own rate limiting on top of whatever the underlying providers do. A few principles describe how that limiting behaves, without needing any specific number:

None of this is a knock on OpenRouter — a free, best-effort tier on a hosted platform has real costs behind it, and rate limiting is how any aggregator keeps shared capacity fair. The point is simply that the boundaries are real and you should architect around them rather than assume they are not there. For the exact, current numbers, always defer to OpenRouter's rate-limit docs.

What an OpenRouter 429 actually means

An openrouter 429429 Too Many Requests — means OpenRouter accepted your request, looked at how fast you are sending relative to the applicable limit, and rejected it instead of running the model. It is not a bug in your code and it is rarely a billing failure; it is a throttle. Crucially, a 429 is a transient signal: it says "not right now," not "never." That distinction is what makes it handleable.

When you get an OpenRouter rate limit error, a couple of things are worth checking before you react:

For a deeper, provider-agnostic treatment of the status code itself — headers, retry semantics, and the client behaviors that resolve it — see fixing LLM 429 errors.

How to avoid and handle OpenRouter rate limits

There is no single switch that removes rate limits; there is a stack of techniques that, together, make them a non-issue. Work from the cheapest fix to the most structural.

1. Back off with jitter and respect the retry hint

The correct response to a transient 429 is to wait and retry — but how you wait matters. An immediate retry just hammers an already-saturated endpoint. Two principles fix this:

If the response includes a wait hint, prefer it over your own curve. Here is a clean, dependency-free pattern that honors Retry-After when present and falls back to exponential backoff with full jitter otherwise, with a hard cap on attempts so a sustained limit surfaces as a real error instead of an infinite stall:

import random
import time

import requests

RETRYABLE = {429, 500, 502, 503, 529}


def call_openrouter(url, headers, payload, max_retries=5, base=1.0, cap=30.0):
    """POST to OpenRouter with exponential backoff + full jitter; honors Retry-After."""
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)

        if resp.status_code == 200:
            return resp.json()

        # Non-retryable client errors (400 bad request, 401 bad key) never
        # succeed on retry — fail fast instead of looping.
        if resp.status_code not in RETRYABLE:
            resp.raise_for_status()

        if attempt == max_retries:
            resp.raise_for_status()  # out of retries

        retry_after = resp.headers.get("retry-after")
        if retry_after is not None:
            delay = float(retry_after) + random.uniform(0, 0.5)
        else:
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))

        time.sleep(delay)

    raise RuntimeError("unreachable")

2. Spread load and pace your requests

Backoff is reactive — it cleans up after you have already been throttled. The better move is to not cross the limit in the first place. Cap how many requests are in flight at once with a semaphore or worker pool, and queue the overflow rather than firing everything simultaneously. Even a small fixed gap between request starts turns a spike that trips a per-minute cap into a steady stream the limiter is happy with. This is especially effective against OpenRouter's tight free-model windows, where a short burst is enough to hit the wall.

3. Cache identical requests

The cheapest 429 fix is the request you never send. When the same prompt repeats — common with retries, idempotent jobs, and shared system prompts — a response cache serves it without touching OpenRouter at all, cutting request volume directly and leaving more of your limited budget for traffic that genuinely needs the model.

4. Upgrade your balance or go BYOK

If you are routinely hitting the wall, the two structural fixes change the ceiling itself:

Sidestep any single ceiling with automatic fallback

Every fix above optimizes one path. But here is the structural truth: a single key — OpenRouter's or any one provider's — has a single quota, and once real demand exceeds it, no amount of polite retrying creates more capacity. You are just queuing against a wall. The way past a single ceiling is to not depend on a single target.

A 429 from one provider says nothing about the others. If OpenRouter (or its upstream) is throttling you, Anthropic, Gemini, or Groq serving an equivalent model very likely is not. So the durable pattern is to define a fallback chain across keys and providers: when one returns a rate limit that will not clear, the request automatically reroutes to the next healthy target instead of failing. Stack several free tiers and provider keys and your effective ceiling becomes the sum of their limits, not any one of them.

AspectOne OpenRouter keyFallback chain across keys/providers
Effective ceilingThat account's per-model / per-account limitThe combined headroom of every key in the chain
When a 429 hitsYou back off and retry the same targetAuto-reroute to the next healthy provider or key
Free-tier strategyOne platform's rate-limited free poolSeveral providers' free tiers combined, each with its own limit
CostAggregator price for paid variantsProvider list price, zero markup, true per-call cost

The catch is orchestration. Built by hand, this means juggling several SDKs, catching provider-specific 429s, health-tracking which key is tapped out, and translating between API formats — re-tested in every service that calls a model. That coordination layer is exactly what an LLM gateway exists to own, as configuration rather than scattered application code.

Where flo2 fits

flo2 is a developer-first, bring-your-own-key LLM gateway built for precisely this. You register your own provider keys once — OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, and OpenRouter itself — and route every request through one endpoint that is drop-in compatible with both the OpenAI and Anthropic APIs. Define a fallback chain, and when one key or provider returns a rate limit, flo2 retries down the chain automatically, routing each request to the cheapest or fastest qualifying model. Because it is a BYOK gateway that never sits in the money path, it adds zero token markup — you pay each provider directly at their real price and see the true cost of every call. It is the zero-markup OpenRouter alternative, and it is free during Beta. If you are weighing the broader trade-offs, the full OpenRouter alternative breakdown compares pricing, control, and lock-in side by side.

One key, every model — zero markup.
Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.
Get your flo2 key →
© 2026 flo2.com — the zero-markup LLM gateway & router. flow → to