2026-06-03 · flo2 blog

Use Gemini with the OpenAI SDK: The Compatibility Endpoint

Google ships a Gemini OpenAI-compatible endpoint that lets you call Gemini models with the exact same openai SDK you already use for GPT — same client, same wire format, same streaming interface. You change three things: the base URL (pointing at Google's generativelanguage OpenAI path), the API key (your Gemini key), and the model name (a Gemini model ID). That's it. If you have an existing OpenAI-based app, switching to Gemini is a two-minute environment-variable edit. This guide covers the endpoint, working curl and Python examples, what the compatibility layer supports, migration gotchas, and how to route Gemini behind a gateway alongside your other providers.

The Gemini OpenAI endpoint: where it lives

Google exposes its Gemini OpenAI endpoint under a path on the generativelanguage.googleapis.com host. The base URL you hand to any OpenAI client is:

https://generativelanguage.googleapis.com/v1beta/openai/

That trailing /openai/ segment is the compatibility shim — it maps the OpenAI wire format onto Gemini's underlying API. From there you get the familiar routes, including /chat/completions and /embeddings.

Always verify the exact URL in Google's official OpenAI-compatibility docs before you ship. Google has iterated on this path and may update it; the URL above is correct as of mid-2026, but a 30-second check against the current docs beats debugging a 404 at 2 a.m.

Authentication works with a standard Gemini API key — the same one from Google AI Studio you'd use with the native SDK. You pass it as a Bearer token in the Authorization header, just as you would an OpenAI key. Your OpenAI key will not work here; swap the credential when you swap the endpoint.
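
To make the authentication step concrete, here is a minimal sketch that builds, but does not send, a chat-completions request using only Python's standard library. The helper name build_chat_request is hypothetical, the key is a placeholder, and the base URL is simply the one quoted above, so re-check it against Google's docs.

```python
import json
import urllib.request

# Base URL as quoted in this post; verify against Google's current docs.
GEMINI_OPENAI_BASE = "https://generativelanguage.googleapis.com/v1beta/openai/"


def build_chat_request(api_key: str, model: str, user_text: str) -> urllib.request.Request:
    """Build, without sending, a POST to the /chat/completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        GEMINI_OPENAI_BASE + "chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # Gemini key, not an OpenAI key
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("example-key", "gemini-2.0-flash", "ping")
print(req.full_url)
print(req.get_header("Authorization"))
```

Passing req to urllib.request.urlopen would actually send it; the sketch stops short of that so it runs without a key or network access.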

curl example: Gemini via the OpenAI endpoint

The fastest way to confirm the endpoint is live for your key is a direct curl call. Set GEMINI_API_KEY in your environment first:

curl https://generativelanguage.googleapis.com/v1beta/openai/chat/completions \
  -H "Authorization: Bearer $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.0-flash",
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user",   "content": "Explain what a context window is in one sentence."}
    ]
  }'

The response is the standard OpenAI shape: a choices array, message.content, finish_reason, and a usage object with prompt_tokens, completion_tokens, and total_tokens. The model ID above (gemini-2.0-flash) is an example — verify the current list of model IDs in Google's model docs rather than hard-coding a name from a blog post; Google revises version strings on their own cadence.
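
As a rough illustration of that shape, the sketch below parses a hand-written payload and pulls out the fields most callers read. The payload values are made up for the example and were not captured from a real call, and summarize_response is a hypothetical helper name.

```python
import json

# Hand-written illustrative payload in the standard chat-completions shape.
# The values are invented for the example, not real API output.
raw = """
{
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Example reply."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
"""


def summarize_response(payload: dict) -> dict:
    """Pull out the fields most callers need from a chat-completions response."""
    choice = payload["choices"][0]
    usage = payload.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_response(json.loads(raw))
print(summary)
```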

Use Gemini with the OpenAI Python library

This is the point of the compatibility layer. You keep using openai and just override base_url and api_key in the constructor:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key=os.environ["GEMINI_API_KEY"],
)

resp = client.chat.completions.create(
    model="gemini-2.0-flash",       # confirm current model IDs in Google's docs
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": "What is the difference between Gemini Flash and Pro?"},
    ],
)

print(resp.choices[0].message.content)
print(resp.usage)  # prompt_tokens / completion_tokens / total_tokens

Every framework that accepts a configurable OpenAI base URL — LangChain, LlamaIndex, the Vercel AI SDK, instructor, your own wrappers — can point at Gemini the same way. The request they build is identical; only the destination changes.

Streaming with the OpenAI client

Streaming works the same as against OpenAI: pass stream=True and iterate the chunks. Google's endpoint follows the same server-sent-events protocol (a series of data: lines terminated by a final data: [DONE]), so existing streaming consumers need no changes.

stream = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "List five uses for the Gemini API."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
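
The SDK handles the server-sent-events framing for you. If you ever consume the raw HTTP stream instead, the framing described above could be parsed roughly as follows. This is a sketch under assumptions: iter_sse_payloads is a hypothetical helper, and the sample lines are hand-written rather than captured from a real response.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line until the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end of stream
        yield json.loads(data)


# Hand-written sample lines, not captured from a real response.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(
    payload["choices"][0]["delta"].get("content", "")
    for payload in iter_sse_payloads(sample)
)
print(text)
```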

If you need token accounting on streamed responses, pass stream_options={"include_usage": True} and read the usage field from the final chunk. Verify in Google's docs that the Gemini endpoint currently supports this option, since streaming usage reporting can lag behind the non-streaming implementation.
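
One detail to plan for: on OpenAI's API, the chunk that carries usage arrives with an empty choices list, so a loop that indexes chunk.choices[0] unconditionally would raise on it. Assuming the Gemini endpoint mirrors that behavior, a more defensive consumer might look like this sketch. The consume_stream helper and the fake chunks are hypothetical stand-ins so the example runs without an API key.

```python
from types import SimpleNamespace


def consume_stream(chunks):
    """Collect streamed text and the usage object, if one arrives."""
    parts, usage = [], None
    for chunk in chunks:
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue  # usage-only chunk: there is no delta to read
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


def _chunk(text=None, usage=None):
    """Build a fake chunk shaped like the SDK's streaming objects."""
    choices = [] if text is None else [SimpleNamespace(delta=SimpleNamespace(content=text))]
    return SimpleNamespace(choices=choices, usage=usage)


fake_stream = [
    _chunk("Hi"),
    _chunk(" there"),
    _chunk(usage=SimpleNamespace(total_tokens=7)),  # made-up token count
]
text, usage = consume_stream(fake_stream)
print(text, usage.total_tokens)
```

With a real client you would pass the object returned by client.chat.completions.create(..., stream=True) in place of fake_stream.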

Using the OpenAI Node / TypeScript SDK

The JavaScript SDK follows exactly the same pattern — set baseURL and apiKey:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://generativelanguage.googleapis.com/v1beta/openai/",
  apiKey: process.env.GEMINI_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "gemini-2.0-flash",
  messages: [{ role: "user", content: "Summarize the OpenAI compatibility endpoint for Gemini." }],
});

console.log(resp.choices[0].message.content);

What the Gemini OpenAI compatibility layer supports

Google's compatibility shim covers the core surface well, but it is not a pixel-perfect clone of the OpenAI API. Here is a practical working map — verify each point against Google's current documentation before you rely on it, since coverage expands over time:

Chat completions — well supported, including system, user, and assistant roles and multi-turn conversation history.
Streaming — supported via stream=True/stream: true.
Function calling / tools — available using the standard tools and tool_choice parameters; model-dependent, so confirm support for the specific Gemini model you choose.
Embeddings — accessible at the /embeddings route; confirm which models support it and the current vector dimension.
Common parameters — temperature, top_p, max_tokens, and stop are generally handled.
JSON / structured output — response_format for JSON is supported on capable models; strict JSON-schema modes — verify current support status.

What you will not find here: OpenAI's Assistants API, Responses API, image generation (dall-e-*), or text-to-speech endpoints. Those are OpenAI products. On the Gemini side, native multimodal inputs (images, audio, video in Gemini's own format), context caching, fine-grained safety settings, and the latest Gemini-specific features are accessible only through the native google-genai SDK — not the OpenAI shim. For an explanation of why the OpenAI wire format became the industry default in the first place, see OpenAI-compatible API.
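
To illustrate the function calling entry in the map above, here is a minimal sketch of a tools definition and the code that answers one tool call. The get_weather tool is hypothetical, its handler is a stub, and the fake call object stands in for an entry of resp.choices[0].message.tool_calls so the example runs offline.

```python
import json
from types import SimpleNamespace

# Hypothetical tool, described in the standard `tools` format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]


def get_weather(city: str) -> str:
    return f"(stub) weather for {city}"  # placeholder, no real lookup


HANDLERS = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    args = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
    result = HANDLERS[tool_call.function.name](**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": result}


# Stand-in for an entry of resp.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Oslo"}'),
)
print(run_tool_call(fake_call))
```

In a real exchange you would pass tools=tools to client.chat.completions.create, append the assistant message and each returned tool message to the conversation, and call the model again. Whether a given Gemini model accepts this is something to confirm in Google's docs.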

Switching an existing OpenAI app to Gemini: the three-value migration

If your application already uses the openai SDK, the migration to Gemini is genuinely three environment variables. Your message construction, streaming loop, response parsing, and error handling code stays untouched:

LLM_BASE_URL → https://generativelanguage.googleapis.com/v1beta/openai/
LLM_API_KEY → your Gemini API key from Google AI Studio
LLM_MODEL → a current Gemini model ID (e.g. gemini-2.0-flash)

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ["LLM_API_KEY"],
)

resp = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],
    messages=[{"role": "user", "content": "Summarize the latest release notes."}],
)

With this pattern you can switch between OpenAI, Gemini, or any other OpenAI-compatible provider by flipping three environment variables — no code changes, no redeploy of application logic. For the broader picture on providers that support this pattern, including Groq, Mistral, and others, see our Gemini API guide.
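
Because the whole switch rides on three variables, it can help to validate them once at startup rather than discover a missing one on the first request. The sketch below is one way to do that; LLMConfig and load_llm_config are hypothetical names, and the demo passes an explicit mapping with a placeholder key instead of reading the real environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    api_key: str
    model: str


def load_llm_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read the three settings and fail early, naming anything that is missing."""
    names = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return LLMConfig(env["LLM_BASE_URL"], env["LLM_API_KEY"], env["LLM_MODEL"])


# Demo with an explicit mapping instead of the real environment.
cfg = load_llm_config({
    "LLM_BASE_URL": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "LLM_API_KEY": "example-key",
    "LLM_MODEL": "gemini-2.0-flash",
})
print(cfg.model)
```

The resulting values slot straight into OpenAI(base_url=cfg.base_url, api_key=cfg.api_key) and model=cfg.model.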

Gotchas to watch for after the switch

Model names are provider-specific. There is no gpt-4o on Gemini. Any routing logic, prompt templates, or feature flags keyed on OpenAI model strings need updating to Gemini IDs.
Rate limits differ. Google imposes its own requests-per-minute and tokens-per-minute quotas, and the free tier caps are fairly low. A workload that ran fine on OpenAI's paid tier can hit 429s quickly on Gemini's free tier. Read the Retry-After header and back off, or build fallback routing.
Unsupported parameters get dropped or rejected. Fields like logprobs, n > 1, and certain penalty parameters may not map onto Gemini. If you pass fields your application relies on, test them explicitly rather than assuming they silently round-trip.
Free-tier data handling. Google's free tier has historically had different data-use terms than the paid tier. If you are sending sensitive data, confirm the current terms before you route real production traffic through a free-tier key.
Context window and output limits vary by model. Gemini Pro and Flash have different context windows and max-output-token limits. Check the model spec, especially if you pass long prompts or expect long responses.
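
For the rate-limit gotcha, the "read Retry-After and back off" advice can be sketched as a small delay calculator. The function name retry_delay and the default base and cap values are assumptions chosen for illustration, not numbers recommended by Google.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based) after a 429."""
    if retry_after is not None:
        try:
            # Honor the server's hint when it is given in seconds.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # header was an HTTP date or malformed; fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)  # random jitter spreads out retries


print(retry_delay(0, retry_after="5"))
print(retry_delay(3))
```

The random jitter keeps many clients that were throttled at the same moment from retrying in lockstep. A production version would also cap the number of attempts.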

Routing Gemini behind a gateway alongside other providers

Pointing the OpenAI SDK straight at Google works well for a single-provider setup. The limitation appears when you want more: automatic fallback when Gemini rate-limits, cost-based routing that sends easy tasks to cheap Flash and hard tasks to a frontier model, or the ability to A/B-test Gemini against OpenAI on real traffic. Hard-coding one endpoint trades one lock-in for another.

A gateway solves this by giving you a single URL with many providers behind it. Because Gemini speaks the same chat/completions dialect as OpenAI, Groq, Mistral, and others, a gateway can treat them as interchangeable backends and select one per request — cheapest, fastest, or by capability. A common pattern for Gemini-based systems looks like:

Route high-volume, straightforward tasks (classification, extraction, summarization) to Gemini Flash, where the per-token price advantage compounds across millions of requests.
Escalate complex reasoning or long-context work to Gemini Pro or a frontier model from another provider.
On a 429 from Google, automatically retry the same request against a fallback provider — OpenAI, Groq, or any other — instead of surfacing an error to the user.
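
The fallback step in that pattern can be sketched in a few lines. This is not how any particular gateway implements it: RateLimited is a hypothetical stand-in for a provider's 429 error, and the two provider functions are fakes so the example runs without keys.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 error."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in order; move on when one is rate-limited."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            last_error = exc  # remember why this provider was skipped
    raise RuntimeError("all providers were rate-limited") from last_error


def gemini_flash(prompt: str) -> str:
    raise RateLimited("429 from primary")  # simulate a rate limit


def fallback_provider(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


used, answer = complete_with_fallback(
    "Classify this ticket.",
    [("gemini-flash", gemini_flash), ("fallback", fallback_provider)],
)
print(used, "->", answer)
```

With the openai SDK, each callable would wrap a client pointed at a different base URL, and the exception to catch would be openai.RateLimitError.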

flo2 is a developer-first LLM gateway built for exactly this. It has zero token markup: you bring your own Gemini key (plus OpenAI, Anthropic, Groq, Cerebras, Mistral, and more) and pay each provider directly at their real prices — flo2 does not resell tokens or add a margin. One key, one endpoint, usable through both the OpenAI-compatible and Anthropic-compatible wire formats:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://flo2.com/v1",
    api_key=os.environ["FLO2_API_KEY"],
)

resp = client.chat.completions.create(
    model="gemini-2.0-flash",      # flo2 routes to your Gemini key
    messages=[{"role": "user", "content": "Summarize this support ticket."}],
)

Switch to model="auto" and flo2 routes each request to the cheapest or fastest model that fits, with automatic fallback if the preferred provider rate-limits or errors. You get routing, fallback, AI racing, A/B testing with an LLM judge, opt-in response caching, and true per-call cost accounting across every provider, all through the same OpenAI-compatible interface covered above, so the migration is once again a base_url and api_key swap. flo2 is free during Beta.

Start experimenting with flo2, and if you want to understand the broader ecosystem of providers using this interface, see our guides on the OpenAI-compatible API and the full Gemini API guide.
