Google Gemini API Guide: Keys, Free Tier, Pricing & Setup
If you want fast, capable models without burning through a budget, the Gemini API is one of the most practical starting points for developers in 2026. Google ships a genuinely useful free tier, a cheap-and-fast Flash model family for high-volume work, a more capable Pro family for hard reasoning, a native SDK, and an OpenAI-compatible endpoint, so you can call Gemini with the same client code you already use for your other providers. This guide walks through getting a key, choosing a model, understanding the free tier and pricing, and wiring Gemini up next to your other providers.
Prices, rate limits, and model version names change frequently, so this guide stays conceptual and points you at Google's official pages for current numbers — treat anything specific as "go check the docs," not a quote.
Getting a Gemini API key from Google AI Studio
The quickest path to a working request is Google AI Studio. For most developers it's the lightweight, self-serve route to a key, while Vertex AI on Google Cloud is the heavier enterprise option with IAM, billing projects, and regional controls baked in. If you're prototyping, start with AI Studio.
- Sign in at aistudio.google.com with a Google account.
- Open the API key section and create a new key (it gets attached to a Google Cloud project — AI Studio can create one for you).
- Copy the key once and store it as an environment variable, not in source.
Running export GEMINI_API_KEY=... in your shell is enough to get going.
That single Gemini API key works for both the native Google SDK and the OpenAI-compatible endpoint, so you don't need separate credentials for the two calling styles. Always read the current setup instructions in Google's Gemini API docs, since the console layout and project-linking flow shift over time.
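As a minimal sketch of the environment-variable approach, the helper below reads the key at startup and fails with a clear message if it is missing. The function name load_gemini_key is hypothetical; only the GEMINI_API_KEY variable name comes from the setup step above.

```python
import os


def load_gemini_key(env=os.environ):
    """Return the Gemini API key from the environment, or raise if it is unset."""
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it in your shell before running."
        )
    return key
```

Failing early like this tends to be easier to debug than an authentication error surfacing later from inside an SDK call.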
Gemini model families: Pro vs Flash
Google organizes Gemini into a small number of families rather than one monolithic model, and picking the right one is the single biggest lever on both cost and latency. The two you'll reach for most are Pro and Flash.
- Gemini Pro — the higher-capability tier, aimed at complex reasoning, long-context analysis, harder coding, and multi-step tasks where quality matters more than price. It's the more expensive per-token option.
- Gemini Flash — the cost-and-speed-optimized tier. It's dramatically cheaper and faster than Pro and is the default choice for bulk work: classification, extraction, summarization, routing, simple drafting, and anything you run at high volume. Google has also shipped even smaller/cheaper Flash variants (Flash-Lite-class) for the most price-sensitive workloads.
The mental model: reach for Flash first, and escalate to Pro only when Flash visibly falls short. A large share of real-world LLM calls are easy enough that Flash handles them at a fraction of the cost. Because Google revises version numbers regularly, don't rely on a release name from memory; check the live model list for the exact identifiers (and which are stable vs preview) before you wire them in.
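One way to express "Flash first, escalate only when needed" in code is sketched below. Everything here is an assumption for illustration: the tier labels are placeholders rather than real model identifiers, and call_model and is_good_enough are hypothetical callables you would supply (for example, a wrapper around your client and a validation check on the reply).

```python
from typing import Callable, Sequence, Tuple

# Placeholder tier labels, cheapest first. Replace them with real identifiers
# from Google's live model list; these are not actual model names.
MODEL_TIERS = ("flash-tier-model", "pro-tier-model")


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
    tiers: Sequence[str] = MODEL_TIERS,
) -> Tuple[str, str]:
    """Try each tier in order and return (model, reply) for the first acceptable reply.

    If no reply passes the check, the last tier's reply is returned as-is.
    """
    if not tiers:
        raise ValueError("tiers must contain at least one model")
    model, reply = tiers[0], ""
    for model in tiers:
        reply = call_model(model, prompt)
        if is_good_enough(reply):
            break
    return model, reply
```

The quality check is the hard part in practice. A schema validation or a length or format check is cheap, while anything subjective usually needs a human or a second model in the loop.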
Multimodal and long context
Gemini models are natively multimodal with large context windows, so beyond plain chat you can pass images, audio, and long documents in one request. That makes Flash especially attractive for document processing and bulk media tagging, where its low per-token price multiplied across thousands of items is what decides your bill.
The Gemini API free tier
One of the strongest reasons to start with Gemini is its free tier. Through an AI Studio API key, Google offers a standing (not trial) free allowance on Flash-class models, generous enough for prototypes, internal tools, and low-traffic apps. Two things matter more than the headline, though:
- It's rate-limited, not unlimited. The free tier is bounded by requests per minute, tokens per minute, and a requests-per-day ceiling. A bursty or growing workload hits these caps quickly, at which point you either wait, upgrade to a paid (billed) tier, or route the overflow elsewhere.
- Free-tier data may be handled differently. Google's terms have historically allowed free-tier prompts and responses to be used to improve their products, whereas paid usage generally carries stricter commitments. For anything sensitive or regulated, read the current terms before sending real data — don't assume free and paid have identical privacy guarantees.
Because both the exact limits and the data terms change, confirm them on Google's rate limits and pricing pages rather than trusting any number you read in a blog post (including this one). For a broader look at how Gemini's free tier stacks up against other no-cost options, see our guide to free LLM APIs.
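If you would rather stay under a requests-per-minute cap than discover it through errors, a small client-side limiter can help. The sketch below is a generic sliding-window limiter, not anything Google provides, and the class name and the limit you pass in are assumptions; take the real ceiling from Google's rate limits page.

```python
import time
from collections import deque


class RequestsPerMinuteLimiter:
    """Client-side sliding-window limiter for a requests-per-window cap."""

    def __init__(self, max_requests, window_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests inside the current window

    def seconds_until_allowed(self):
        """Return 0.0 if a request may go now, else how long to wait."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self._sent[0])

    def record_request(self):
        """Call this right after sending a request."""
        self._sent.append(self.clock())
```

In use, you would sleep for seconds_until_allowed() before each call and then call record_request(). This only covers the per-minute request cap; tokens per minute and the daily ceiling would need their own counters.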
Gemini API pricing, conceptually
Once you're past the free tier, Gemini API pricing follows the same per-token model as the rest of the industry: you pay separately for input tokens (your prompt) and output tokens (the reply), priced per million tokens, at a rate that depends on the model. The key takeaways for budgeting:
- Flash is much cheaper than Pro — often by a large multiple — which is why model choice dominates your cost far more than prompt micro-optimizations.
- Output tokens usually cost more than input tokens, so capping response length is a direct cost lever.
- Long context and multimodal inputs add up as large input-token counts, and context caching can cut the cost of repeatedly sending the same large prefix (useful for RAG and long system prompts).
For the actual dollars-per-million-tokens by model, go straight to Google's official pricing page. Those numbers move, and pinning a stale figure in your cost model will mislead you.
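The per-token arithmetic itself is simple enough to sketch. The prices below are made-up round numbers chosen only to show the calculation, not Google's rates, and estimate_cost_usd is a hypothetical helper name.

```python
def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_million, output_price_per_million):
    """Estimate the cost of one request under per-million-token pricing."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Made-up prices purely to show the arithmetic; not Google's rates.
EXAMPLE_INPUT_PRICE = 1.00   # USD per million input tokens (hypothetical)
EXAMPLE_OUTPUT_PRICE = 4.00  # USD per million output tokens (hypothetical)

cost = estimate_cost_usd(2_000_000, 500_000, EXAMPLE_INPUT_PRICE, EXAMPLE_OUTPUT_PRICE)
print(f"${cost:.2f}")  # prints $4.00
```

With these placeholder prices, a quarter as many output tokens cost as much as all the input, which is why capping response length shows up so directly on the bill.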
Calling Gemini with the OpenAI-compatible endpoint
Here's the part that makes Gemini easy to adopt: Google provides an OpenAI-compatible layer. You can keep using the official OpenAI SDK and simply point base_url at Google's OpenAI-compatible endpoint, passing your Gemini key as the API key and a Gemini model name as the model. No new client library required.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

resp = client.chat.completions.create(
    model="gemini-flash-latest",  # check the current model list for exact names
    messages=[
        {"role": "user", "content": "Summarize the Gemini API in one sentence."}
    ],
)

print(resp.choices[0].message.content)
That's the same chat.completions shape you'd use against OpenAI — only the base_url, key, and model string change, and it works from the JS SDK or plain curl the same way. The compatibility layer covers the common surface (chat, streaming, often function calling and embeddings), but it isn't a 100% mirror of the full API. For Gemini-only features, use the native SDK instead.
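To make the "plain curl" point concrete, here is a rough sketch of the same call as a raw HTTP request built with Python's standard library. It assumes the compatibility layer accepts the key as a Bearer token on the chat/completions path under the base_url shown above, which is how OpenAI-style clients send it. The helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"


def build_chat_request(api_key, model, user_message):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        BASE_URL + "chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the reply text (needs network and a valid key)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Seeing the wire format spelled out is mostly useful for debugging or for environments where you cannot install an SDK. For everyday use the OpenAI client above is less code.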
The native Google SDK
When you want first-class access to Gemini-specific capabilities — fine-grained safety settings, the full multimodal request shape, context caching, and the latest features the day they ship — reach for Google's own google-genai SDK:
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

resp = client.models.generate_content(
    model="gemini-flash-latest",
    contents="Summarize the Gemini API in one sentence.",
)

print(resp.text)
A reasonable rule of thumb: use the OpenAI-compatible endpoint when you want Gemini to slot into existing multi-provider code with minimal change, and the native SDK when you're building Gemini-first and want everything it offers.
Routing Gemini alongside other providers
Gemini rarely lives alone in a real system. The free tier runs out, Flash is perfect for some tasks but not others, and you'll want a fallback when Google returns a rate-limit error. This is where putting Gemini behind a gateway pays off. If the term is new, read "What is an LLM gateway?" first.
The OpenAI-compatibility layer is what makes this clean: because Gemini, OpenAI, and many others all speak the same chat.completions dialect, a router can treat them as interchangeable backends and pick per request. A practical Gemini-centric setup looks like:
- Cheap Flash for bulk tasks. Send your high-volume, easy traffic (classification, extraction, tagging, routing) to Gemini Flash, where its low price dominates the math.
- Escalate to Pro or another frontier model for hard tasks. Route the genuinely difficult requests to Gemini Pro — or to a different provider's flagship — only when the job warrants it.
- Fall back on rate limits. When the Gemini free tier (or a paid quota) returns a 429, automatically retry the same request against another provider so the call still succeeds instead of failing.
- Stack Gemini's free tier with others. Combine Gemini's free allowance with other free tiers and spill into cheap paid tokens only when every free pool is exhausted.
Wiring all of that by hand means juggling SDKs, catching provider-specific errors, and tracking which key is throttled. A gateway collapses it into one endpoint and one policy — and a bring-your-own-key gateway does it without taxing your tokens.
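For a sense of what the hand-rolled version involves, here is a minimal sketch of the fall-back-on-429 step alone. It is not how any particular gateway works internally. RateLimited is a hypothetical exception that your own provider wrappers would raise when they see an HTTP 429, and each provider is represented by a plain callable.

```python
from typing import Callable, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical error a provider wrapper raises when it receives HTTP 429."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try providers in order; move to the next one only on a rate limit.

    Returns (provider_name, reply). Other exceptions propagate unchanged.
    """
    if not providers:
        raise ValueError("providers must contain at least one entry")
    throttled = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            throttled.append(name)
    raise RateLimited(f"all providers rate-limited: {', '.join(throttled)}")
```

Ordering the list cheapest-first gives you the "free tier, then cheap paid tokens" behavior described above. A fuller version would also track cooldowns per key so a throttled provider is skipped rather than retried on every request.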
Doing it with flo2
flo2 is a developer-first, bring-your-own-key LLM gateway with zero token markup: you register your own Gemini key (plus OpenAI, Anthropic, Groq, Cerebras, and more) and pay Google directly at their real prices — flo2 doesn't resell tokens. You get a single endpoint that's both OpenAI- and Anthropic-compatible, so the same code that calls Gemini through the OpenAI SDK can route to any model, with each request sent to the cheapest or fastest option and automatic fallback when one provider rate-limits. It's the zero-markup OpenRouter alternative, free during Beta — so you can run cheap Flash for the bulk of your traffic, fall back elsewhere on a 429, and see the true cost of every call.
Bottom line
The Gemini API is an easy, high-leverage place to build: grab a key from Google AI Studio, default to cheap-and-fast Flash and escalate to Pro only when needed, lean on the generous (but rate-limited, and data-policy-distinct) free tier while you confirm the current limits and terms, and call it either through the OpenAI-compatible endpoint or the native SDK. Then put it behind a router so Flash handles the bulk, Pro handles the hard cases, and a 429 quietly fails over to the next provider instead of breaking your app.