2026-06-03 · flo2 blog

Cloudflare AI Gateway Explained: Caching, Analytics & Limits

If you want a single, observable layer in front of every model provider you call, Cloudflare AI Gateway is one of the first tools you will meet. It is a proxy that sits between your application and the LLM APIs you already use—OpenAI, Anthropic, Google, Groq, and others—and adds caching, rate limiting, retries, and analytics at Cloudflare's edge. This guide explains what it is, how it works, where it is strong, what to confirm before you commit, and how it compares to a router-first approach. For a broader primer, see what is an LLM gateway.

What is Cloudflare AI Gateway?

Cloudflare AI Gateway is an edge proxy for LLM and other AI API traffic. Instead of calling a provider's endpoint directly, you send the request to a Cloudflare gateway URL, and Cloudflare forwards it to the upstream provider while observing and optionally modifying the call. Because it runs on Cloudflare's global network, the proxying happens close to your users and adds a control plane—dashboards, logs, and policies—on top of providers that otherwise give you very little visibility on their own.

The mental model is straightforward: it is an observability-and-control layer, not a model vendor. You still bring your own provider keys and accounts—Cloudflare does not sell you tokens; it sits in the request path so you can see, cache, and govern what flows through. That framing matters for everything below, because it sets expectations about what the product optimizes for.

How Cloudflare AI Gateway works

Adoption is mostly a URL change. At a high level, the flow looks like this:

Create a gateway in the Cloudflare dashboard, which gives you an endpoint scoped to your account.
Rewrite your base URL so requests go to the Cloudflare gateway instead of the provider's host. The provider and model you target are encoded in the path or headers.
Keep your provider key—you still authenticate to the upstream provider with your own credentials (BYOK). Cloudflare passes the call through.
Get observability and controls automatically: every request is logged, and you can layer on caching, rate limits, and retry behavior.

Because the change is at the transport layer, most existing SDK code keeps working—you are not rewriting prompts or response handling, just where the request goes. Setup steps, supported providers, and header conventions evolve, so follow Cloudflare's current documentation rather than any single tutorial.

Cloudflare AI Gateway caching

Caching is one of the headline features. When enabled, Cloudflare can store a response and serve it again for matching requests, cutting both latency and the spend you would otherwise send to the provider. Served from the edge, a cache hit can be dramatically faster than a fresh model call and costs nothing in provider tokens—genuinely useful for repeated prompts, deterministic lookups, and high-traffic endpoints where the same questions recur. As always, you decide where it is safe: identical-input/identical-output workloads benefit most, while personalized or time-sensitive responses usually should not be cached. Check the docs for how cache keys are computed and what time-to-live controls exist.

Analytics, logging, and rate limiting

The other major draw is observability. The gateway records requests and surfaces dashboards covering volume, latency, errors, token counts, and cache hit rates, with per-request logs you can inspect when something looks off. It adds governance too—rate limiting to cap traffic, plus retries and fallback so a transient provider error or rate-limit response can be retried or routed to an alternative. For teams with almost no insight into their raw provider calls today, this control plane is a real upgrade.

Strengths of Cloudflare AI Gateway

It is a solid product, and it is easy to see why teams reach for it:

Edge response caching. On Cloudflare's network, cache hits and proxying happen close to users with low overhead.
Strong observability. Dashboards and per-request logs turn opaque provider traffic into something you can measure and debug.
Part of the Cloudflare platform. If you already use Workers, R2, or Cloudflare's CDN and security stack, it slots in with one less vendor and a familiar dashboard.
BYOK, no token resale. You keep paying providers directly with your own keys—Cloudflare is not a token reseller.
Generous to start. Approachable for early projects and easy to bolt onto an existing app without re-architecting.

Cloudflare AI Gateway pricing and limits: what to confirm

Pricing and limits are where you should slow down and read the source rather than trust a blog. AI tooling pricing changes frequently, so do not hard-code assumptions. Verify these directly on Cloudflare's documentation and pricing pages before you build on it:

What is free vs. paid. The current boundary between the free entry point and paid usage, and what specifically meters (logs stored, requests, cached entries, retention).
Log retention and storage. How long request logs are kept, and whether longer retention or higher volume moves you into a paid tier.
Rate-limit and cache limits. Any ceilings on cached entries, throughput, or configurable policies on lower tiers.
Provider coverage. Which upstream providers and request shapes are supported today, since the list grows over time.

The honest summary: treat any number you read elsewhere as potentially stale, and price your own expected traffic against Cloudflare's current published terms.

The main consideration: it is a caching/observability proxy, not a full router

This is the most important nuance for choosing a tool, and it is not a knock on Cloudflare—it is a question of job-to-be-done. Cloudflare AI Gateway is primarily an observability-and-caching proxy with retries and fallback, which is a coherent and valuable scope. What it is not centrally designed around is opinionated routing: dynamically choosing, per request, the cheapest or fastest model for a task, racing several models and taking the first good answer, or A/B-testing models with a judge to measure model–task fit.

If your goal is mostly "see and cache my existing calls, and survive provider hiccups," a caching/observability proxy fits well. If it is "let an intelligent layer decide which model to use and prove it was the right call," that is a different shape of tool. Confirm the current routing capabilities yourself, since feature scope shifts—but evaluate based on which job you actually need done.

Where flo2 fits as an alternative

flo2 is a developer-first LLM gateway built around the routing-and-economics side of the problem. Like Cloudflare, it is BYOK—you bring your own keys for OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, and OpenRouter, and pay each provider directly. The defining difference is zero token markup: flo2 does not resell tokens or add a per-token margin. In exchange for one OpenAI- and Anthropic-compatible key, you get the routing layer most teams otherwise assemble by hand:

Smart routing that sends each request to the cheapest or fastest model that fits the task.
Fallback chains so an outage or rate limit transparently fails over to the next option.
Racing to fire several models in parallel and return the fastest acceptable response.
A/B testing with a judge that scores model–task fit, so you choose models on evidence rather than vibes.
Opt-in response caching to cut latency and spend where it is safe.
True per-call cost accounting—real dollars per request, per model, not just aggregate token tallies.

The two approaches differ in emphasis. Cloudflare AI Gateway leans toward edge caching and observability inside the Cloudflare platform; flo2 leans toward intelligent routing, racing, A/B evaluation, and honest per-call economics through a drop-in OpenAI/Anthropic endpoint. flo2 is free during its Beta, so you can point an existing SDK at it and compare against your current setup directly.

How to decide

There is no universally correct answer—only fit. A few shortcuts:

If you live on the Cloudflare platform and your main pain is visibility plus edge caching of repeated calls, Cloudflare AI Gateway is a natural, low-friction choice—just confirm current pricing, limits, and retention.
If your main pain is choosing the right model per request—routing by cost and latency, racing, A/B with a judge, and seeing true dollar cost per call—a router-first gateway like flo2 targets exactly that gap.
If you want to weigh the whole field across cloud-vendor gateways, observability proxies, open-source self-host, and resellers, the best LLM gateway comparison walks through the categories and trade-offs.

Whatever you shortlist, test it against real traffic: measure latency with caching on and off, verify cost numbers against your provider invoices, and confirm the data path and retention meet your requirements. If zero-markup BYOK plus smart routing, racing, A/B testing, and true cost accounting match your priorities, flo2 is free to try during Beta—and Cloudflare AI Gateway remains a strong pick when edge caching and observability are the job to be done.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →