2026-06-03 · flo2 blog

Kong AI Gateway Explained: Plugins, Use Cases & Alternatives

If you are already running Kong as your API gateway and want to extend it to LLM traffic, Kong AI Gateway is the natural first stop. Kong AI Gateway is a suite of AI-specific plugins on top of the Kong API gateway—covering LLM routing, request and response transformation, token-based rate limiting, semantic caching, prompt guards, and observability. This article explains what it is, how it works, where it earns its place, and where a lightweight hosted alternative better fits teams that just want routing without running gateway infrastructure. For the broader category, see what is an AI gateway.

What is Kong AI Gateway?

Kong AI Gateway is not a standalone product—it is a set of capabilities layered on top of Kong Gateway, the widely-used open-source API gateway. Kong already handles HTTP traffic, authentication, rate limiting, and plugin-based transformation for traditional APIs. The AI Gateway layer extends those primitives specifically for LLM workloads: it understands that a request body might be an OpenAI-format chat completion, that a response contains token counts, and that routing decisions might depend on the model field rather than a URL path.

The mental model is AI traffic as a first-class API concern. Rather than treating an LLM call as just another HTTP endpoint, Kong AI Gateway gives you plugins that speak the language of models—prompt injection, token-based rate limiting, semantic deduplication of cached responses, and observability that surfaces token spend alongside latency and error rates.

How Kong AI Gateway works

Kong works as a reverse proxy: your application sends requests to a Kong route, and Kong forwards them upstream after running any plugins attached to that route. The AI Gateway plugins slot into that pipeline:

AI Proxy plugin routes requests to a configured LLM provider (OpenAI, Anthropic, Cohere, Azure, Llama deployments, and others) in an OpenAI-compatible format, translating as needed.
AI Request / Response Transformer plugins inject system prompts, rewrite messages, or modify the response without touching application code.
AI Rate Limiting Advanced plugin enforces limits on token consumption rather than just request counts—which matches how LLM costs actually accrue.
AI Semantic Cache plugin serves cached responses for semantically similar subsequent prompts, saving provider tokens on repeated queries.
AI Prompt Guard plugin inspects prompts against allow/deny patterns to enforce content policies before calls reach the model.
Observability plugins surface per-request token counts, latency, model name, and cost estimates into existing monitoring stacks—Prometheus, Datadog, and others.

Because all of this runs through Kong's plugin chain, the configuration lives alongside your existing API policies. For organizations that have already centralized API governance in Kong, there is no second control plane to operate.

Kong AI Gateway strengths

Enterprise API-management heritage. Kong has years of production use for traditional APIs. The plugin ecosystem, RBAC, certificate management, and deployment tooling are mature—and that maturity extends to LLM traffic.
Self-host control. Kong runs in your own infrastructure—on-prem, Kubernetes, or private cloud. For strict data-residency or security requirements that prohibit third-party proxying, this matters a great deal.
Kubernetes-native operations. Kong's Ingress Controller fits naturally into Helm- and GitOps-managed clusters, using tooling platform teams already know.
Unified control plane for mixed API estates. Traditional REST APIs and LLM APIs behind the same gateway, with the same policies, logging, and alerting.
Token-aware rate limiting and semantic caching. At scale, both can produce meaningful savings: token budgets per user or team, and semantic deduplication of repeated prompts.

Considerations before committing

Operational complexity. Kong's power comes with real infra-ops cost: managing the cluster, upgrading it, tuning the control-plane datastore, and maintaining plugin configuration across environments. For a platform team that already operates Kong this is existing work. For a product team that just wants routing and fallback, standing up Kong to get there is a significant upfront investment.

Routing logic is yours to design. Kong AI Gateway gives you primitives—a proxy plugin, a transformer, a rate limiter. Assembling those into an opinionated routing strategy ("send cheap tasks to Gemini Flash, fail over to GPT-4o, race on high-priority requests") requires you to design and wire that logic yourself. Kong executes the policies you define; it does not opine on which model to use for which task.

Licensing and feature distribution. Kong Gateway is open-source (Apache 2.0 for the core). Kong Konnect and Kong Enterprise add managed-control-plane features and support under commercial licensing. Because pricing and which AI plugins sit in which tier change over time, verify the current state on Kong's site before planning your architecture around specific capabilities.

Kong AI Gateway vs. flo2: side by side

Dimension	Kong AI Gateway	flo2 (zero-markup BYOK)
Primary audience	Enterprise platform teams already running Kong	Product and backend developers wanting routing without infra ops
Deployment	Self-hosted (Kubernetes, cloud, on-prem) or Kong Konnect	Hosted; drop-in endpoint replacement
Setup effort	Significant: deploy Kong, configure plugins, manage upgrades	Low: swap base URL and API key in existing SDK code
Token markup	None; Kong is not a token reseller	Zero markup; pay providers directly
API compatibility	OpenAI-compatible via AI Proxy plugin	OpenAI- and Anthropic-compatible out of the box
Routing strategy	Policy-driven via plugin config; you define the logic	Built-in: route by cost/latency, fallback, racing, A/B + judge
Semantic caching	AI Semantic Cache plugin (self-managed)	Opt-in response caching
Data residency	Full control; traffic stays in your infra	Hosted; requests proxied through flo2
Prompt guards	AI Prompt Guard plugin	Not the focus
Pricing	Open-source core free; Konnect/Enterprise: see Kong's site	Free during Beta

Where flo2 fits for teams that want routing without the ops

flo2 is a developer-first LLM gateway built around the routing-and-economics job that most product teams actually need solved. Like Kong, it is BYOK—you bring your own keys for OpenAI, Anthropic, Gemini, Groq, Cerebras, DeepInfra, Mistral, xAI, and OpenRouter, paying each provider directly with zero markup. Unlike Kong, there is nothing to deploy or operate: swap a base URL, get one key compatible with both the OpenAI and Anthropic SDKs, and the routing layer is live.

Out of the box you get smart routing (cheapest or fastest model per task), fallback chains (transparent failover on outages or rate limits), racing (fire several models in parallel, take the fastest good response), A/B testing with a model judge that scores task fit on evidence rather than intuition, opt-in response caching, and true per-call cost accounting in real dollars—not aggregate token tallies. flo2 is free during its Beta, so you can point an existing SDK at it and compare against your current setup in minutes.

How to choose

Kong AI Gateway is the right call if you already run Kong in production, have data-residency requirements that make self-hosting mandatory, or need token-aware rate limiting, semantic caching, and prompt guards inside a unified API-management control plane. Weigh the infra-ops cost honestly against the governance value you get.
flo2 is the right call if you are a product or backend team that wants routing, fallback, racing, A/B evaluation, and per-call cost accounting with no setup overhead and no token markup. The absence of infra to run is intentional, not a compromise.
Before you commit either way, the best LLM gateway comparison walks through open-source proxies, cloud-native gateways, and BYOK routers so you can see the full field.

The right gateway is the one that matches your operational constraints and the specific job you need done. If intelligent per-request routing with zero markup and no infra overhead is that job, try flo2—it is free during Beta and takes minutes to wire up.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.

Get your flo2 key →