2026-06-03 · flo2 blog

OpenRouter Rate Limits: Free vs Paid & How to Avoid 429s

You wired up a single OpenRouter key, shipped, and now your logs are filling with HTTP 429s. Rate limits are the most common wall developers hit on OpenRouter, and they bite hardest on exactly the path most people start with: zero-cost :free model variants and accounts carrying little or no credit balance. This guide explains how OpenRouter's rate limiting works conceptually, what a 429 actually means in this context, and the practical ways to avoid and handle one: backoff, pacing, and caching first, then upgrading your balance, going BYOK, or putting a gateway with automatic fallback in front so no single provider's ceiling can stop a request.

One ground rule before the details. OpenRouter's exact limits — the specific requests-per-minute numbers, the daily caps on free models, the balance thresholds that unlock more headroom — change over time and depend on your account. So this article deliberately does not publish hard figures. It explains the shape of the system and tells you where to confirm the current numbers: OpenRouter's own rate-limit documentation, which is the only version that stays correct.

How OpenRouter rate limits work, conceptually

OpenRouter is a hosted aggregator: one OpenAI-compatible (and Anthropic-compatible) key reaches hundreds of models behind a single endpoint. Sitting in that position, it enforces its own rate limiting on top of whatever the underlying providers do. A few principles describe how that limiting behaves, without needing any specific number:

Free (:free) models are throttled the hardest. Zero-cost variants run on best-effort capacity, so they carry the tightest caps: typically a low ceiling on requests per minute and a separate ceiling on requests per day. A bursty or even moderately steady workload against a :free model will start collecting 429s fast.
Limits can scale with your credit balance and usage. OpenRouter has historically tied free-tier headroom to whether you hold a credit balance, and paid throughput generally grows as you add credits and build account history. A near-empty account gets the least room; funding it loosens things. The thresholds move, so confirm the current ones in the docs.
Limits exist at more than one level. There are per-model limits (a given model, especially a free or high-demand one, has its own cap) and per-account limits (your key has an overall ceiling regardless of which model you call). You can trip either one, and the model-level cap is often the one you meet first.
Upstream provider limits still apply underneath. Even when you are comfortably inside OpenRouter's own ceiling, the provider actually serving the model has its own capacity, and pressure there can surface as throttling too. You are subject to whichever limit you hit first.
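
To make "whichever limit you hit first" concrete, here is a minimal sketch that models stacked limits as simple fixed-window counters and reports which one a steady request rate would exhaust first. The Limit class, the binding_limit helper, and every number in LIMITS are illustrative assumptions, not OpenRouter's real limits or its actual limiter algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    name: str            # which layer enforces the limit
    max_requests: int    # requests allowed per window
    window_seconds: int  # window length in seconds


def binding_limit(limits, requests_per_second):
    """Return (limit, seconds_until_hit) for the limit a steady load trips first, or None."""
    hits = []
    for lim in limits:
        allowed_rate = lim.max_requests / lim.window_seconds
        if requests_per_second > allowed_rate:
            # Assumed fixed-window counter: it is full once max_requests were sent.
            hits.append((lim.max_requests / requests_per_second, lim))
    if not hits:
        return None  # the load fits under every layer
    seconds, lim = min(hits, key=lambda h: h[0])
    return lim, seconds


# Placeholder numbers for illustration only, NOT OpenRouter's real limits.
LIMITS = [
    Limit("per-model, per-minute", 20, 60),
    Limit("per-model, per-day", 200, 86_400),
    Limit("per-account", 600, 60),
]
```

With these placeholder values, one request per second trips the per-minute model cap after about 20 seconds, while a slow trickle of one request every ten seconds never touches the per-minute cap but eventually runs into the daily one. That is why the same code can meet different walls at different load levels.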

None of this is a knock on OpenRouter — a free, best-effort tier on a hosted platform has real costs behind it, and rate limiting is how any aggregator keeps shared capacity fair. The point is simply that the boundaries are real and you should architect around them rather than assume they are not there. For the exact, current numbers, always defer to OpenRouter's rate-limit docs.

What an OpenRouter 429 actually means

An OpenRouter 429 (429 Too Many Requests) means OpenRouter received your request, compared how fast you are sending against the applicable limit, and rejected it without running the model. It is not a bug in your code and it is rarely a billing failure; it is a throttle. Crucially, a 429 is a transient signal: it says "not right now," not "never." That distinction is what makes it handleable.

When you get an OpenRouter rate limit error, a couple of things are worth checking before you react:

Which ceiling did you hit? A 429 on a :free model with an empty balance almost always means the free per-minute or per-day cap. The same code calling a paid model on a funded account points more at a per-model or per-account throughput limit. The fix differs, so it helps to know which wall you met.
Is there a retry hint? Rate-limited responses often carry a Retry-After header (or equivalent) telling you how long to wait. When it is present, it is the authoritative answer — honor it rather than guessing. Headers and error-body shapes can vary, so read what OpenRouter actually returns rather than assuming a fixed format.
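
As a sketch of reading the retry hint defensively: the HTTP specification allows Retry-After to be either a number of seconds or an HTTP-date, so a parser should accept both and return nothing when the header is absent or unparseable. The retry_after_seconds helper below is a hypothetical name, and whether OpenRouter sends this header on a given response is something to verify against real responses.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, now=None):
    """Return the wait in seconds from a Retry-After header, or None if absent/unparseable."""
    value = None
    for key, val in headers.items():
        if key.lower() == "retry-after":  # header names are case-insensitive
            value = val.strip()
            break
    if not value:
        return None
    # Form 1: delay in seconds, e.g. "Retry-After: 12"
    try:
        return max(0.0, float(value))
    except ValueError:
        pass
    # Form 2: HTTP-date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A None result is the signal to fall back to your own backoff curve rather than guessing at a wait.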

For a deeper, provider-agnostic treatment of the status code itself — headers, retry semantics, and the client behaviors that resolve it — see fixing LLM 429 errors.

How to avoid and handle OpenRouter rate limits

There is no single switch that removes rate limits; there is a stack of techniques that, together, make them a non-issue. Work from the cheapest fix to the most structural.

1. Back off with jitter and respect the retry hint

The correct response to a transient 429 is to wait and retry — but how you wait matters. An immediate retry just hammers an already-saturated endpoint. Two principles fix this:

Exponential backoff. Double the wait after each failed attempt (1s, 2s, 4s, 8s…) so you back off quickly when OpenRouter is clearly throttling you.
Jitter. Add randomness to each delay. Without it, every client that hit the limit at the same instant retries at the same instant — a synchronized "thundering herd" that re-saturates capacity the moment it frees up. Jitter spreads those retries out.

If the response includes a wait hint, prefer it over your own curve. Here is a compact pattern, using only the requests library, that honors Retry-After when present and falls back to exponential backoff with full jitter otherwise, with a hard cap on attempts so a sustained limit surfaces as a real error instead of an infinite stall:

import random
import time

import requests

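# 429 = rate limited; 5xx = transient server-side trouble
# (529 is a non-standard "overloaded" code some providers use).
RETRYABLE = {429, 500, 502, 503, 529}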


def call_openrouter(url, headers, payload, max_retries=5, base=1.0, cap=30.0):
    """POST to OpenRouter with exponential backoff + full jitter; honors Retry-After."""
    for attempt in range(max_retries + 1):
        resp = requests.post(url, headers=headers, json=payload, timeout=60)

        if resp.status_code == 200:
            return resp.json()

        # Non-retryable client errors (400 bad request, 401 bad key) never
        # succeed on retry — fail fast instead of looping.
        if resp.status_code not in RETRYABLE:
            resp.raise_for_status()

        if attempt == max_retries:
            resp.raise_for_status()  # out of retries

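        # Retry-After is usually a number of seconds, but HTTP also allows a date;
        # if it is missing or not numeric, fall back to backoff with full jitter.
        retry_after = resp.headers.get("retry-after")
        try:
            delay = float(retry_after) + random.uniform(0, 0.5)
        except (TypeError, ValueError):
            delay = random.uniform(0, min(cap, base * (2 ** attempt)))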

        time.sleep(delay)

    raise RuntimeError("unreachable")

2. Spread load and pace your requests

Backoff is reactive — it cleans up after you have already been throttled. The better move is to not cross the limit in the first place. Cap how many requests are in flight at once with a semaphore or worker pool, and queue the overflow rather than firing everything simultaneously. Even a small fixed gap between request starts turns a spike that trips a per-minute cap into a steady stream the limiter is happy with. This is especially effective against OpenRouter's tight free-model windows, where a short burst is enough to hit the wall.
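
A minimal sketch of that pacing idea, assuming a threaded client: a semaphore caps how many calls are in flight, and a small lock-protected scheduler reserves start times so that no two requests begin closer together than a fixed gap. The Pacer class and its default values are hypothetical; pick the gap and concurrency from the limits your account actually has.

```python
import threading
import time


class Pacer:
    """Cap in-flight requests and enforce a minimum gap between request starts."""

    def __init__(self, max_in_flight=4, min_gap=0.25):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._gap = min_gap
        self._lock = threading.Lock()
        self._next_start = 0.0  # earliest monotonic time the next call may start

    def run(self, fn, *args, **kwargs):
        with self._slots:  # blocks while max_in_flight calls are already running
            with self._lock:
                now = time.monotonic()
                start = max(now, self._next_start)
                self._next_start = start + self._gap  # reserve the following slot
            time.sleep(start - now)  # zero when no pacing delay is needed
            return fn(*args, **kwargs)
```

Wrapping each outbound call as pacer.run(send_request, payload) from a worker pool turns a burst of submissions into a steady stream; the overflow simply waits its turn instead of being rejected upstream.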

3. Cache identical requests

The cheapest 429 fix is the request you never send. When the same prompt repeats — common with retries, idempotent jobs, and shared system prompts — a response cache serves it without touching OpenRouter at all, cutting request volume directly and leaving more of your limited budget for traffic that genuinely needs the model.
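
One way to sketch such a cache: key each entry by a hash of the canonicalized request payload and keep the response for a limited time. The ResponseCache class, its TTL default, and the send callable are illustrative assumptions; a production version would more likely sit in a shared store such as Redis than in process memory.

```python
import hashlib
import json
import time


class ResponseCache:
    """In-memory cache keyed by a hash of the request payload, with a TTL."""

    def __init__(self, ttl_seconds=300.0):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, response)

    @staticmethod
    def key_for(payload):
        # sort_keys makes logically equal payloads serialize identically.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get_or_call(self, payload, send):
        key = self.key_for(payload)
        hit = self._store.get(key)
        if hit is not None and hit[0] > time.monotonic():
            return hit[1]  # served locally, no request sent
        response = send(payload)  # cache miss: make the real call
        self._store[key] = (time.monotonic() + self._ttl, response)
        return response
```

Because the key covers the whole payload, a change to the model, messages, or sampling parameters is a cache miss. Reusing a cached completion is only appropriate where a repeated answer is acceptable, for example deterministic settings or idempotent jobs.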

4. Upgrade your balance or go BYOK

If you are routinely hitting the wall, the two structural fixes change the ceiling itself:

Add credits. Because OpenRouter ties headroom to balance, funding the account and moving off :free variants raises your limits. This is the simplest path if you want to stay entirely within OpenRouter.
Go BYOK to the underlying provider. When you bring your own provider key (OpenAI, Anthropic, Google, Groq, Mistral, and so on), you are no longer rate-limited as one tenant of a shared aggregator pool — you hit that provider's own limits directly, scoped to your account and usage tier. For steady production traffic, your own provider quota is usually far roomier than a shared free pool, and you pay the provider's real price with no aggregator markup.

Sidestep any single ceiling with automatic fallback

Every fix above optimizes one path. But here is the structural truth: a single key — OpenRouter's or any one provider's — has a single quota, and once real demand exceeds it, no amount of polite retrying creates more capacity. You are just queuing against a wall. The way past a single ceiling is to not depend on a single target.

A 429 from one provider says nothing about the others. If OpenRouter (or its upstream) is throttling you, Anthropic, Gemini, or Groq serving an equivalent model very likely is not. So the durable pattern is to define a fallback chain across keys and providers: when one returns a rate limit that will not clear, the request automatically reroutes to the next healthy target instead of failing. Stack several free tiers and provider keys and your effective ceiling approaches the sum of their limits rather than any one of them.

Aspect	One OpenRouter key	Fallback chain across keys/providers
Effective ceiling	That account's per-model / per-account limit	The combined headroom of every key in the chain
When a 429 hits	You back off and retry the same target	Auto-reroute to the next healthy provider or key
Free-tier strategy	One platform's rate-limited free pool	Several providers' free tiers combined, each with its own limit
Cost	Aggregator price for paid variants	Provider list price, zero markup, true per-call cost

The catch is orchestration. Built by hand, this means juggling several SDKs, catching provider-specific 429s, health-tracking which key is tapped out, and translating between API formats — re-tested in every service that calls a model. That coordination layer is exactly what an LLM gateway exists to own, as configuration rather than scattered application code.
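
To show the core of what that hand-built layer does, here is a deliberately small sketch of a fallback loop. It assumes each target is a (name, send) pair whose send function raises a RateLimited exception once its own retries are exhausted; both names are hypothetical, and the sketch leaves out the health tracking and API-format translation described above.

```python
class RateLimited(Exception):
    """Raised by a target when it is throttled and its own retries are exhausted."""


def call_with_fallback(targets, payload):
    """Try each (name, send) target in order; move on when one is rate limited."""
    throttled = []
    for name, send in targets:
        try:
            return name, send(payload)  # first healthy target wins
        except RateLimited as exc:
            throttled.append(f"{name}: {exc}")  # remember why we skipped it
    # Every target was throttled: surface one error that lists them all.
    raise RateLimited("all targets throttled: " + "; ".join(throttled))
```

Only rate-limit errors trigger the reroute here; anything else (a bad key, a malformed request) propagates immediately, since trying another provider would not fix it.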

Where flo2 fits

flo2 is a developer-first, bring-your-own-key LLM gateway built for precisely this. You register your own provider keys once — OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, and OpenRouter itself — and route every request through one endpoint that is drop-in compatible with both the OpenAI and Anthropic APIs. Define a fallback chain, and when one key or provider returns a rate limit, flo2 retries down the chain automatically, routing each request to the cheapest or fastest qualifying model. Because it is a BYOK gateway that never sits in the money path, it adds zero token markup — you pay each provider directly at their real price and see the true cost of every call. It is the zero-markup OpenRouter alternative, and it is free during Beta. If you are weighing the broader trade-offs, the full OpenRouter alternative breakdown compares pricing, control, and lock-in side by side.
