2026-06-03 · flo2 blog

OpenAI API Guide: Keys, Chat Completions & First Request

The OpenAI API is one of the most widely adopted interfaces in production AI today, and for good reason. It gives developers a clean HTTP interface to GPT models, well-documented SDKs for Python and Node, and an ecosystem of compatible tooling that stretches far beyond OpenAI itself. This guide walks through everything you need to go from zero to a working first request: getting an OpenAI API key, authenticating correctly, understanding the Chat Completions API, and knowing the concepts that matter most in real applications (streaming, tool calling, JSON mode). We'll also cover security basics and touch on pricing and rate limits at a conceptual level (always verify the current numbers on OpenAI's own pages).

Getting an OpenAI API Key

Your entry point is platform.openai.com. Create an account or sign in, navigate to the API keys section, and click Create new secret key. Give it a descriptive name so you can identify it if you ever need to rotate or revoke it. OpenAI only shows you the full key once — copy it immediately and store it somewhere safe, like a password manager or a secrets vault.

A few practical points before you move on:

Authentication: Authorization Bearer

Every OpenAI API request is authenticated with an Authorization header in the format:

Authorization: Bearer YOUR_API_KEY

That's it. No cookies, no session tokens, no OAuth dance for basic API use. Every HTTP client — curl, the Python SDK, the Node SDK, raw fetch — just attaches this header, and you're in.

The critical security corollary: never put your key in client-side code. A key embedded in a browser bundle or a mobile app is a key that anyone with a network inspector owns. Proxy requests through your own backend instead, where the key lives in server-side environment variables.
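One way to keep the key on the server is to read it from an environment variable at startup and fail fast when it is missing. The sketch below uses only the standard library. The helper names load_api_key and auth_headers are hypothetical, chosen for illustration; OPENAI_API_KEY is the variable name used throughout this guide.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the key from the server's environment; raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; refusing to start without it")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the two headers every Chat Completions request needs."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Your backend would call load_api_key() once at startup and attach auth_headers(...) to each outgoing request, so the key never reaches the browser or mobile client.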

The Chat Completions Endpoint

Almost everything you'll build against OpenAI routes through a single endpoint:

POST https://api.openai.com/v1/chat/completions

This is the Chat Completions API — the core of GPT-4, GPT-4o, and the broader model family. It takes a conversation as input (a list of messages with roles) and returns the model's next message.

Your first request with curl

curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a concise technical assistant."},
      {"role": "user", "content": "What is the Chat Completions API?"}
    ]
  }'

Replace gpt-4o-mini with whatever current model ID you're targeting — check the OpenAI models page for current IDs, since the lineup changes.

The same request in Python

Install the SDK once: pip install openai. The OpenAI API Python library handles authentication, serialization, retries, and streaming for you.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",   # verify current model IDs in OpenAI docs
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is the Chat Completions API?"},
    ],
)

print(response.choices[0].message.content)

The SDK reads OPENAI_API_KEY from the environment automatically if you don't pass api_key explicitly. That is convenient, but it also means the variable must be set before you run anything.

Key Concepts in the Chat Completions API

Messages and roles

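The messages array is a list of objects, each with a role and content. The three roles you'll use constantly are system (instructions that set the model's behavior), user (input from the person or program calling the API), and assistant (the model's own earlier replies).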

The model reads all the messages and generates the next assistant turn. For stateless deployments, you reconstruct the full message list on every call — the API itself holds no session state.
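Because the API holds no session state, your application owns the conversation history. A minimal sketch of that bookkeeping follows; the function names start_conversation and add_turn are hypothetical, and the code only builds the list you would pass as messages.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def start_conversation(system_prompt: str) -> list:
    """Begin a history with a single system message."""
    return [{"role": "system", "content": system_prompt}]


def add_turn(history: list, role: str, content: str) -> list:
    """Return a new history with one more message appended."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


# Each request sends the whole list again, including earlier assistant replies.
history = start_conversation("You are a concise technical assistant.")
history = add_turn(history, "user", "What is the Chat Completions API?")
history = add_turn(history, "assistant", "An endpoint that returns the next message.")
history = add_turn(history, "user", "Does it remember earlier calls?")
```

The final history list is what goes into the messages field of the next request. Returning a new list on each turn, rather than mutating in place, makes it easier to store or replay a conversation at any point.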

The model parameter

The model field selects which model handles the request. OpenAI maintains a range of models across the capability-cost spectrum — verify current model IDs and capabilities in the OpenAI models documentation. Don't rely on blog posts (including this one) for model IDs; they go stale quickly as new versions ship.

temperature

temperature controls output randomness, on a scale from 0 to 2. Lower values (0–0.3) make outputs more consistent and close to deterministic, which suits code generation, extraction, and structured tasks, though even a temperature of 0 does not guarantee identical output on every run. Higher values introduce more variety, which is useful for brainstorming or creative writing. Start at 0.2 for most production tasks; add heat if you need diversity.

max_tokens

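max_tokens caps how many output tokens the model generates. This is not the context window; it's a ceiling on completion length. Setting it prevents runaway long responses and controls cost, since output tokens are billed per token. If a response hits the limit, the finish_reason in the response will be length rather than stop, and the text may end mid-sentence. OpenAI's API reference now lists max_completion_tokens as the preferred name for this cap, and some newer models accept only that name, so check the reference for the model you target. See the OpenAI-compatible API guide for more on response structure.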
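To act on a truncated reply, check finish_reason before using the text. The sketch below works on a response body that has already been parsed into a dict; the helper name extract_text is hypothetical, and the sample response is trimmed to the fields this code reads.

```python
def extract_text(response: dict) -> tuple:
    """Return (text, truncated) for the first choice in a response body."""
    choice = response["choices"][0]
    text = choice["message"]["content"] or ""
    truncated = choice["finish_reason"] == "length"
    return text, truncated


# Trimmed, illustrative response body (not captured from a real call).
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "The Chat Completions API is"},
            "finish_reason": "length",
        }
    ]
}

text, truncated = extract_text(sample)
if truncated:
    # A caller might raise the cap, ask the model to continue, or flag the reply.
    text += " [truncated]"
```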

Streaming

Pass "stream": true and the API returns tokens as server-sent events (SSE) rather than waiting for the full completion. This dramatically reduces time-to-first-token for the user — they see text appear immediately instead of waiting for the entire response to complete.

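# Reuses the `client` object created in the Python example above.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in three sentences."}],
    stream=True,
)

for chunk in stream:
    # Some chunks can arrive without any choices, so guard before indexing.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)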
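If you call the endpoint without the SDK, you read the server-sent events yourself. Each event is a line that starts with data: and carries a JSON chunk, and the stream ends with data: [DONE]. The sketch below parses such lines with the standard library; the function name iter_sse_text is hypothetical, and the sample lines are hand-written and trimmed to the fields the parser reads.

```python
import json


def iter_sse_text(lines):
    """Yield text fragments from an iterable of raw SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample stream for illustration.
sample_lines = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    "data: [DONE]",
]

full_text = "".join(iter_sse_text(sample_lines))
```

In a real client the lines would come from the HTTP response as it arrives, and you would render each fragment immediately rather than joining them at the end.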

Tool calling

The API supports tool calling (also called function calling) — you define a set of tools as JSON schemas, and the model can respond with a structured call to one of them instead of plain text. Your code executes the tool, returns the result as a tool role message, and calls the API again. This is how you build agents that can look up data, run calculations, or take actions while keeping the model in control of the flow.
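The execute-and-return step can be sketched without any network calls. Below, get_weather is a hypothetical tool with a canned answer, TOOLS is the schema you would send in the request's tools field, and run_tool_call turns one tool call from the assistant message into the tool role message you send back. The sample tool call is hand-written to match the shape described above.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# JSON schema describing the tool to the model.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Maps tool names the model may request to local Python callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the matching tool-role message."""
    name = tool_call["function"]["name"]
    if name not in REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    # Arguments arrive as a JSON-encoded string, not as a dict.
    args = json.loads(tool_call["function"]["arguments"])
    result = REGISTRY[name](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": str(result)}


# Hand-written example of what the model's tool call might look like.
sample_call = {
    "id": "call_123",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}

tool_message = run_tool_call(sample_call)
```

You would append the assistant message that contained the tool call, then tool_message, to the message list and call the API again so the model can use the result. Looking names up in an explicit registry means the model can only trigger functions you have chosen to expose.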

JSON mode

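Set "response_format": {"type": "json_object"} to ask the model to emit syntactically valid JSON. Useful for extraction, classification, and any use case where downstream code needs to parse the response. Pair it with a system prompt that mentions JSON and describes the expected shape: OpenAI's documentation requires the word "JSON" to appear somewhere in the messages, and the mode constrains syntax only, not which keys appear. The output can still be cut off mid-object if the response hits the token limit, so check finish_reason before parsing.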
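A sketch of both halves of this pattern: building a request body with JSON mode turned on, and validating the reply before downstream code trusts it. The function names, the sentiment task, and the key names are all hypothetical choices for illustration.

```python
import json


def build_json_request(model: str, user_text: str) -> dict:
    """Request body with JSON mode enabled and the shape described in the prompt."""
    return {
        "model": model,
        "response_format": {"type": "json_object"},
        "messages": [
            {
                "role": "system",
                "content": 'Reply in JSON with the keys "sentiment" and "confidence".',
            },
            {"role": "user", "content": user_text},
        ],
    }


def parse_json_reply(content: str, required_keys: set) -> dict:
    """Parse the model's reply and check that the expected keys are present."""
    data = json.loads(content)  # raises json.JSONDecodeError if the text was cut off
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


body = build_json_request("gpt-4o-mini", "I love this product.")
parsed = parse_json_reply('{"sentiment": "positive", "confidence": 0.9}', {"sentiment", "confidence"})
```

json.JSONDecodeError is a subclass of ValueError, so a single except ValueError in the caller covers both truncated output and a reply with the wrong shape.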

Pricing and Rate Limits

OpenAI charges per token — separately for input (prompt) and output (completion) tokens, with output tokens priced higher than input. Some models also support cached input pricing for repeated prompt prefixes, which can cut costs significantly on workloads with a shared system prompt. The exact prices vary by model and change over time: always check OpenAI's official pricing page before estimating your bill.
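The arithmetic behind a per-request estimate is simple enough to sketch. The function below assumes prices are quoted per million tokens; the numbers in the example are made-up placeholders, not OpenAI's rates, and the function name estimate_cost is hypothetical.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimated cost of one request, in the same currency as the prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices for illustration only: 1.00 per million input tokens
# and 4.00 per million output tokens. Look up the real figures before budgeting.
cost = estimate_cost(2000, 500, 1.00, 4.00)
```

With those placeholder prices, 2,000 input tokens and 500 output tokens each contribute 0.002, for a total of 0.004 per request. Multiply by your expected request volume to get a rough monthly figure.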

Rate limits are expressed as requests per minute (RPM) and tokens per minute (TPM). They vary by account tier and model. When you exceed them, the API returns HTTP 429. The response may include a Retry-After header that tells you how long to wait; when it does, respect it rather than hammering the endpoint. When it doesn't, exponential backoff with jitter is the standard pattern.
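A minimal sketch of that retry pattern follows. It assumes a hypothetical send callable that returns a (status, headers, body) tuple, with headers as a plain dict keyed by "Retry-After"; a real HTTP client will expose these differently. The function names and default values are illustrative choices.

```python
import random
import time
from typing import Optional


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before the next try; `attempt` is 0-based."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server-provided wait wins
        except ValueError:
            pass  # not a plain number of seconds; fall back to backoff
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)  # jitter: anywhere up to the ceiling


def call_with_retries(send, max_attempts: int = 5, sleep=time.sleep):
    """Call `send` until it stops returning 429 or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return status, body
```

The jitter spreads retries from many clients across the wait window so they do not all hit the endpoint at the same instant. Passing sleep as a parameter keeps the function easy to test without real delays.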

Security Best Practices

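An exposed API key is an instant liability: anyone who finds it can run API calls charged to your account. The rules from the earlier sections apply without exception: store the key in a password manager or secrets vault, load it from server-side environment variables, never ship it in client-side code, and rotate or revoke it as soon as you suspect exposure.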

Your OpenAI-Compatible Code Works Everywhere

Here's the thing worth understanding once your first OpenAI call is working: the request format you just wrote, POST /v1/chat/completions with a messages array, has become a de facto industry standard. Dozens of providers implement the same endpoint shape. That means the same Python code, pointed at a different base_url, can reach Anthropic Claude, Google Gemini, Groq, Mistral, DeepSeek, and many others, often with little or no rewriting, although support for features such as tool calling and JSON mode varies by provider. Read more about how this works in our guide to the OpenAI-compatible API.
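To make the portability concrete, the sketch below builds the same request against any base URL using only the standard library. The function name build_chat_request is hypothetical, and https://gateway.example.com/v1 is a placeholder address, not a real service. Nothing is sent here; the code only constructs the request object.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, messages: list):
    """Build a POST to <base_url>/chat/completions without sending it."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


messages = [{"role": "user", "content": "What is the Chat Completions API?"}]

# Same function, two targets; only the base URL and key change.
openai_req = build_chat_request("https://api.openai.com/v1", "YOUR_API_KEY", "gpt-4o-mini", messages)
other_req = build_chat_request("https://gateway.example.com/v1/", "OTHER_KEY", "gpt-4o-mini", messages)
```

Passing either object to urllib.request.urlopen would send it. With the Python SDK, the equivalent switch is the base_url argument on the client, and the model ID usually has to change to one the target provider recognizes.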

In practice, this portability is what makes a routing layer worthwhile. Instead of hardcoding a single OpenAI endpoint, you point your app at a gateway. The gateway holds your provider keys and routes each request to the best available option — cheapest, fastest, or most available — with automatic fallback if a provider returns a 429 or times out. You get OpenAI's models on the fast path and every other provider as a fallback, all behind one stable endpoint. For a deeper look at why this matters and how it works, see our explanation of what is an LLM gateway.

flo2 is a developer-first LLM gateway built for exactly this pattern. Bring your own provider keys — OpenAI, Anthropic, Gemini, Groq, Mistral, and more — and pay each provider directly at their listed rates, with zero token markup. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically on errors or rate limits. It's free during Beta — so you can wire your existing OpenAI code through it today and get multi-provider resilience without changing your request format.

© 2026 flo2.com — the zero-markup LLM gateway & router. flow → to