LLM Model Routing: Send Each Request to the Right Model
Most teams start by hardcoding one model — pick GPT-class or Claude, point every call at it, ship. It works until it doesn't: you're paying frontier prices to classify support tickets, eating frontier latency on a one-line autocomplete, and stuck the day that model has a bad hour. LLM model routing is the fix. Instead of pinning every request to a single model, you choose the model per request — by cost, quality, latency, or task type — so easy work lands on a cheap fast model and only the genuinely hard work reaches an expensive one. The result is lower spend, lower tail latency, and a system that bends instead of breaking when one provider goes dark.
What LLM model routing actually is
A router is a decision layer in front of your model calls. Your application says "answer this," and the router decides which model answers — possibly a different one for the next request a millisecond later. That's the whole idea: dynamic model selection replaces a constant with a function of the request.
The inputs to that function are usually some mix of:
- Cost. Models span a wide price range per million tokens. Routing the long tail of simple requests to a cheap model is the single biggest lever on your bill — see AI tokenomics for how input, output, and cached pricing compound.
- Quality. A frontier model reasons through multi-step problems a small one fumbles. Capability is a real axis, not marketing — but you only need it on the requests that actually demand it.
- Latency. Some inference providers answer in a fraction of the time of others. For an interactive path, the fastest adequate model beats the smartest slow one.
- Task type. Code, summarization, extraction, free-form chat, and structured JSON each have models that punch above their weight (and price) at that one job.
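To make "a function of the request" concrete, here is a minimal Python sketch that filters a model catalog by a quality floor and a latency ceiling, then returns the cheapest survivor. Everything in it is an assumption for illustration: the `Model` fields, the `select_model` name, and every model name, price, score, and latency figure are invented placeholders, not real provider data. Task type is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_per_mtok: float   # hypothetical price per million tokens
    quality: int          # hypothetical 1-10 score from your own evals
    p95_latency_ms: int   # hypothetical measured tail latency


# Placeholder catalog: every number here is invented for illustration.
CATALOG = [
    Model("small-fast", 0.20, 4, 300),
    Model("mid-tier", 1.50, 7, 900),
    Model("frontier", 12.00, 10, 2500),
]


def select_model(min_quality: int, max_latency_ms: int) -> Model:
    """Return the cheapest catalog model that meets both constraints."""
    candidates = [
        m for m in CATALOG
        if m.quality >= min_quality and m.p95_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise LookupError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m.usd_per_mtok)


print(select_model(min_quality=3, max_latency_ms=500).name)   # small-fast
print(select_model(min_quality=8, max_latency_ms=5000).name)  # frontier
```

The same request shape can land on different models purely because its constraints differ, which is the whole point of replacing a constant with a function.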
Crucially, routing is not the same as fallback or load balancing, though they compose. Routing decides the best target for a request up front. Load balancing spreads requests across equivalent targets to dodge rate limits. Fallback reacts after a target fails. A mature setup runs all three; this article is about the first.
LLM routing strategies, from simplest to smartest
"Route to the right model" hides a spectrum of approaches with very different effort-to-payoff ratios. Start at the top and only climb when you need to.
Static rules
The 80/20 starting point: route on a property you already know at call time, with no extra inference. Tag each request with its task and map tasks to models — classify → cheap-small, chat → mid-tier, hard-reasoning → frontier. It's deterministic, debuggable, and adds zero latency. The ceiling is that you have to know the right model for each tag, and a request's difficulty isn't always visible from its type. But for most apps, a handful of static rules captures the majority of the savings.
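As a sketch, the tag-to-model map described above is nothing more than a dictionary lookup. The tier names are the placeholders used in this section; `pick_model` and `DEFAULT_MODEL` are hypothetical names chosen for the example, and the choice of default is an assumption.

```python
from typing import Optional

# Placeholder tier names; substitute the model ids you actually use.
TASK_TO_MODEL = {
    "classify": "cheap-small",
    "chat": "mid-tier",
    "hard-reasoning": "frontier",
}
DEFAULT_MODEL = "mid-tier"  # assumed default for untagged or unknown tasks


def pick_model(task_tag: Optional[str]) -> str:
    """Pure dictionary lookup: no inference, no network call, no added latency."""
    return TASK_TO_MODEL.get(task_tag, DEFAULT_MODEL)


print(pick_model("classify"))  # cheap-small
print(pick_model("unknown"))   # mid-tier
```

Because the decision is a lookup, a misrouted request can be traced to exactly one line of the table.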
Cost-tiered escalation
Send everything to the cheapest model first; escalate to a stronger one only when the cheap answer fails a check. The "check" is the art: a confidence score, a JSON-schema validation, a self-grade, or a downstream test (does the generated code compile?). Done right, you pay frontier prices only on the fraction of requests the budget model genuinely can't handle. The cost is added latency and complexity on escalated requests — you've now made two calls — so the cheap model needs to clear the bar often enough to come out ahead.
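Whether escalation comes out ahead is simple arithmetic. Assume every request pays for the cheap call and only failed checks also pay for the strong one, and ignore the cost of the check itself. Then the average cost per request is cheap + (1 - pass_rate) × strong, which beats always calling the strong model once pass_rate exceeds cheap / strong. The function names and the relative prices below are illustrative assumptions, not measured figures.

```python
def expected_cost(cheap: float, strong: float, pass_rate: float) -> float:
    """Average cost per request when the cheap model is always tried first."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return cheap + (1.0 - pass_rate) * strong


def break_even_pass_rate(cheap: float, strong: float) -> float:
    """Pass rate at which escalation costs the same as always going strong."""
    return cheap / strong


# Relative prices, invented for illustration: the strong model costs 10x the cheap one.
CHEAP, STRONG = 1.0, 10.0

print(break_even_pass_rate(CHEAP, STRONG))  # 0.1
print(expected_cost(CHEAP, STRONG, 0.75))   # 3.5, versus 10.0 for strong-only
```

This covers spend only. Escalated requests still pay the latency of two sequential calls, so the latency budget has to be checked separately.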
Capability-based routing
Match the request to a model's known strengths rather than to a generic price tier: route code to a code-tuned model, long-document summarization to a long-context model, fast structured extraction to a quick-and-cheap one. This is static rules with a sharper map — it leans on benchmark and eval knowledge of which model wins which job, which is exactly where A/B testing earns its keep (more below).
A classifier or router model
When difficulty isn't obvious from metadata, use a tiny, cheap model (or a fast heuristic) to read the request and emit a routing label — easy / hard, or a target model id. The classifier call is small and fast; the payoff is keeping expensive models off requests that don't need them. The tradeoff is honest: you've added a hop to every request, so the classifier must be cheap and fast enough that the savings on routed-down requests dwarf its overhead. Reserve this for high-volume paths where the routing decision genuinely can't be made statically.
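The "fast heuristic" variant can be sketched without any model call at all. The keyword list, the 150-word threshold, and the label-to-model map below are assumptions made up for this example; real signals and thresholds should come from eval data on your own traffic.

```python
import re

# Assumed signals of difficulty; tune these against your own eval data.
HARD_HINTS = re.compile(
    r"\b(prove|derive|debug|refactor|step[- ]by[- ]step|trade-?offs?)\b",
    re.IGNORECASE,
)
MAX_EASY_WORDS = 150  # assumed cutoff: longer prompts are treated as hard

# Placeholder tier names, not real model ids.
LABEL_TO_MODEL = {"easy": "cheap-small", "hard": "frontier"}


def difficulty_label(prompt: str) -> str:
    """Return 'hard' or 'easy' using cheap text features only."""
    if len(prompt.split()) > MAX_EASY_WORDS:
        return "hard"
    if HARD_HINTS.search(prompt):
        return "hard"
    return "easy"


def route_by_difficulty(prompt: str) -> str:
    """Map the routing label to a target model."""
    return LABEL_TO_MODEL[difficulty_label(prompt)]


print(route_by_difficulty("Is this ticket about billing or shipping?"))  # cheap-small
print(route_by_difficulty("Prove that the algorithm terminates."))       # frontier
```

A heuristic like this misses plenty of hard requests that look easy, which is when swapping `difficulty_label` for a small classifier model starts to pay for its extra hop.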
Fallback on failure
Not a routing strategy on its own, but the safety net every router needs: when the chosen model returns a 429, a 5xx, or times out, reroute to a different model or provider instead of failing the user. Routing picks the best target; fallback guarantees a target. The two are complementary — for the full failover taxonomy, see LLM fallback and racing.
Here's how the strategies stack up:
| Strategy | Decision signal | Extra latency | Best for |
|---|---|---|---|
| Static rules | Known task tag / metadata | None | Most apps; the default starting point |
| Cost-tiered escalation | Cheap-model result quality check | Only on escalation | Tasks with a cheap, reliable success check |
| Capability-based | Task type → model strength | None | Mixed workloads (code, summarize, extract) |
| Classifier / router model | A small model reads the request | One small call, every request | High-volume paths where difficulty is opaque |
| Fallback on failure | Error status from chosen model | Only on failure | Always — a safety net under any router |
Why routing is worth the effort
The payoff is concentrated in one observation: request difficulty is wildly uneven, but a hardcoded model treats every request as the hardest one it might ever see. You provision for the worst case on every call.
Route instead, and three wins follow. Cost drops because the bulk of real traffic — classifications, short replies, extractions, retries — is easy, and easy requests on a cheap model can be an order of magnitude cheaper than on a frontier one. Latency drops because fast models serve the interactive path while you reserve slow, heavy reasoning for the requests that justify the wait. Resilience improves because once your code can target more than one model, no single provider's outage is fatal. You don't lose quality where it matters — hard requests still go to strong models — you just stop overpaying everywhere else.
A concrete routing example
Here's a compact router that combines a static task map, cost-tiered escalation with a validity check, and fallback on provider failure. These are the three layers most apps actually want, shown as a Python sketch:
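    import json

    # task -> ordered model tiers (cheapest first within a task)
    ROUTES = {
        "classify": ["groq/llama-fast"],                        # trivial: one cheap model
        "extract":  ["gemini/flash", "openai/gpt-mini"],        # cheap, escalate if invalid
        "chat":     ["openai/gpt-mini", "anthropic/sonnet"],
        "reason":   ["anthropic/opus", "openai/gpt-frontier"],  # hard: start strong
    }
    RETRY_ON = {429, 500, 502, 503, 529}  # retryable statuses; timeouts are caught below

    class AllTiersExhausted(Exception):
        """Every model in the tier list failed or missed the quality bar."""

    class NonRetryableError(Exception):
        """The request itself is bad (400/422 etc.); no model will fix it."""

    def route(request, call, classify_task):
        # call(model, request) and classify_task(request) are supplied by your app;
        # the response is assumed to expose .status, .ok, and .text
        task = classify_task(request)         # cheap: regex/metadata, or a tiny model
        tiers = ROUTES.get(task, ROUTES["chat"])
        for model in tiers:                   # cost-tiered: try cheaper models first
            try:
                resp = call(model, request)
            except TimeoutError:              # timeout -> fall back to next model
                continue
            if resp.status in RETRY_ON:       # provider failure -> fall back to next model
                continue
            if not resp.ok:                   # 400/422 etc: bug, no model will fix it
                raise NonRetryableError(f"{model} returned {resp.status}")
            if not passes_check(task, resp):  # cheap answer failed quality bar -> escalate
                continue
            return resp                       # good answer, cheapest model that cleared it
        raise AllTiersExhausted(task)

    def passes_check(task, resp):
        # task-specific, cheap validation: schema-valid JSON? non-empty? code compiles?
        if task == "extract":
            return is_valid_json(resp.text)
        return len(resp.text) > 0

    def is_valid_json(text):
        try:
            json.loads(text)
            return True
        except ValueError:                    # json.JSONDecodeError subclasses ValueError
            return False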
Notice how routing, escalation, and fallback share one loop: the ordered tier list is the cost-tiered policy, a quality check failure escalates within it, and a retryable error falls through to the next model. The policy fits on one screen, but in production it has to be tested, instrumented, and re-tuned every time a price or model changes.
Pitfalls to design around
- Misrouting. Send a hard request to a weak model and you don't save money — you ship a wrong answer, or trigger an escalation that costs more than going straight to the strong model. Routing rules need real eval data behind them, not vibes, and they need monitoring so a drifting classifier doesn't quietly degrade quality.
- Added latency. Every routing hop costs something. A classifier model on the hot path, or a cost-tiered escalation that makes two calls, can erase the latency win you were chasing. Keep the routing decision cheaper than the savings it unlocks, and measure end-to-end, not just the model call.
- Inconsistent outputs. Different models format differently. If anything downstream parses the response, pin output contracts (JSON mode, tool schemas) so a routed-to or fallback model stays machine-readable.
- Stale routes. Model prices and quality move constantly. A route that was optimal last quarter may be wrong today. Keep the routing table in configuration you can change without a deploy, and revisit it as the landscape shifts.
- No per-call accounting. If you can't see which model served each request and what it cost, you're flying blind — you can't tell whether routing is actually saving money or whether one tier quietly carries all the traffic. Per-call cost and latency data is what makes routing tunable instead of guesswork.
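Per-call accounting can start very small: one record per request and a roll-up by model. The sketch below assumes you already compute `cost_usd` from your own price table. `CallRecord` and `summarize` are hypothetical names for this example, not part of any library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float  # assumed to be computed by the caller from its price table


def summarize(records):
    """Aggregate call count, total cost, and mean latency per model."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "latency_ms": 0.0})
    for r in records:
        t = totals[r.model]
        t["calls"] += 1
        t["cost_usd"] += r.cost_usd
        t["latency_ms"] += r.latency_ms
    return {
        model: {
            "calls": t["calls"],
            "cost_usd": t["cost_usd"],
            "mean_latency_ms": t["latency_ms"] / t["calls"],
        }
        for model, t in totals.items()
    }
```

Feeding it one record per served request is enough to see whether a single tier is quietly carrying all the traffic, and what each tier actually costs.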
How a gateway does the routing for you
Every piece above — the task map, the escalation check, the fallback loop, the output contracts, the per-call accounting — is infrastructure, not your product. Rebuilding it inside every service that calls a model means reinventing a routing problem over and over and re-testing it on each provider change. The natural home for it is one layer down: an LLM gateway that sits between your app and the providers, owns the routing policy as configuration, and exposes one stable endpoint. Your code makes a single call; the gateway decides which model on which provider serves it. For the broader picture, see what is an LLM gateway.
flo2 is a developer-first LLM gateway built for exactly this. Bring your own provider keys (OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, OpenRouter) and route every request through one OpenAI- and Anthropic-compatible key. Smart routing sends each request to the cheapest or fastest model that fits, fallback chains reroute automatically on a 429 or 5xx, and built-in A/B testing with an LLM judge measures real model–task fit so your routes rest on data instead of guesses — all with true per-call cost accounting and zero token markup, since you pay the providers directly. It's model routing without a router to build, and it's free during Beta.