OpenAI API Guide: Keys, Chat Completions & First Request
The OpenAI API is one of the most widely adopted interfaces in production AI today, and for good reason. It gives developers a clean HTTP interface to GPT models, well-documented SDKs for Python and Node, and an ecosystem of compatible tooling that stretches far beyond OpenAI itself. This guide walks through everything you need to go from zero to a working first request: getting an OpenAI API key, authenticating correctly, understanding the Chat Completions API, and knowing the concepts that matter most in real applications (streaming, tool calling, JSON mode). We'll also cover security basics and touch on pricing and rate limits at a conceptual level (always verify the current numbers on OpenAI's own pages).
Getting an OpenAI API Key
Your entry point is platform.openai.com. Create an account or sign in, navigate to the API keys section, and click Create new secret key. Give it a descriptive name so you can identify it if you ever need to rotate or revoke it. OpenAI only shows you the full key once — copy it immediately and store it somewhere safe, like a password manager or a secrets vault.
A few practical points before you move on:
- New accounts may start with a small amount of free credit, but this is not guaranteed and the terms change. Usage beyond any free allowance requires adding a payment method under Billing.
- Accounts are subject to usage limits that expand as your track record grows — check the Limits section on the platform for your current tier.
- You can create multiple keys (per project, per environment) and revoke them individually — use this, don't share a single key everywhere.
Authentication: Authorization Bearer
Every OpenAI API request is authenticated with an Authorization header in the format:
Authorization: Bearer YOUR_API_KEY
That's it. No cookies, no session tokens, no OAuth dance for basic API use. Every HTTP client — curl, the Python SDK, the Node SDK, raw fetch — just attaches this header, and you're in.
The critical security corollary: never put your key in client-side code. A key embedded in a browser bundle or a mobile app is a key that anyone with a network inspector owns. Proxy requests through your own backend instead, where the key lives in server-side environment variables.
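To make the header concrete, here is a minimal sketch that builds (but does not send) an authenticated request using only Python's standard library. The helper name build_request is an illustrative assumption, not part of any SDK. Sending the request would need network access and a real key, and code like this belongs on your server, never in a browser bundle.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"


def build_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated Chat Completions request without sending it."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The only credential the API needs for basic use.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would perform the call. The SDKs shown later attach the same header for you.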
The Chat Completions Endpoint
Almost everything you'll build against OpenAI routes through a single endpoint:
POST https://api.openai.com/v1/chat/completions
This is the Chat Completions API, the core interface to GPT-4, GPT-4o, and the broader model family. It takes a conversation as input (a list of messages with roles) and returns the model's next message.
Your first request with curl
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a concise technical assistant."},
      {"role": "user", "content": "What is the Chat Completions API?"}
    ]
  }'
Replace gpt-4o-mini with whatever current model ID you're targeting — check the OpenAI models page for current IDs, since the lineup changes.
The same request in Python
Install the SDK once: pip install openai. The OpenAI API Python library handles authentication, serialization, retries, and streaming for you.
import os

from openai import OpenAI

# Read the key from the environment rather than hardcoding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # verify current model IDs in OpenAI docs
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "What is the Chat Completions API?"},
    ],
)

# The reply is the first choice's assistant message.
print(response.choices[0].message.content)
The SDK reads OPENAI_API_KEY from the environment automatically if you don't pass api_key explicitly. That is convenient, but it also means the variable must be set before you run anything.
Key Concepts in the Chat Completions API
Messages and roles
The messages array is a list of objects, each with a role and content. The three roles you'll use constantly:
- system — sets persistent instructions that shape the model's behavior for the entire conversation. Put your persona, constraints, and output format rules here.
- user — the human turn. What the caller is actually asking.
- assistant — a previous model response. Include past turns to give the model conversation history.
The model reads all the messages and generates the next assistant turn. The Chat Completions API itself holds no session state, so your application stores the history and resends the full message list on every call.
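One way to picture this is a small set of helpers that keep the history on your side and rebuild the list for each request. This is a sketch; the function names are illustrative assumptions, not SDK calls.

```python
def new_conversation(system_prompt: str) -> list[dict]:
    """Start a message list with a single system message."""
    return [{"role": "system", "content": system_prompt}]


def with_turn(messages: list[dict], user_text: str, assistant_text: str) -> list[dict]:
    """Return a new list with one completed user/assistant exchange appended."""
    return messages + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


def next_request_messages(history: list[dict], user_text: str) -> list[dict]:
    """Build the list to send: every prior turn plus the new user question."""
    return history + [{"role": "user", "content": user_text}]
```

The value returned by next_request_messages is what you would pass as messages. After the model replies, with_turn records the exchange so the next call includes it.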
The model parameter
The model field selects which model handles the request. OpenAI maintains a range of models across the capability-cost spectrum — verify current model IDs and capabilities in the OpenAI models documentation. Don't rely on blog posts (including this one) for model IDs; they go stale quickly as new versions ship.
temperature
temperature controls output randomness, on a scale from 0 to 2. Lower values (0–0.3) make outputs more focused and repeatable, though not strictly deterministic, which suits code generation, extraction, and structured tasks. Higher values introduce more variety, which is useful for brainstorming or creative writing. Start at 0.2 for most production tasks; add heat if you need diversity.
max_tokens
max_tokens caps how many output tokens the model generates. This is not the context window; it's a ceiling on completion length. Setting it prevents runaway long responses and controls cost, since output tokens are billed per token. If a response hits the limit, finish_reason in the response will be length rather than stop. Some newer models expect max_completion_tokens in place of max_tokens, so check the API reference for the model you target. See the OpenAI-compatible API guide for more on response structure.
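A truncation check can be as small as the sketch below. It works on the JSON response parsed into a dict; the helper name and the trimmed sample response are illustrative assumptions. With the Python SDK the same field is response.choices[0].finish_reason.

```python
def completion_text(response: dict) -> tuple[str, bool]:
    """Return (text, truncated) for the first choice of a parsed response."""
    choice = response["choices"][0]
    truncated = choice["finish_reason"] == "length"
    return choice["message"]["content"], truncated


# Trimmed, made-up example of a response that hit the output cap.
sample = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "The API takes a list of"},
            "finish_reason": "length",
        }
    ]
}

text, truncated = completion_text(sample)
if truncated:
    print("Response was cut off; consider raising the output cap.")
```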
Streaming
Pass "stream": true and the API returns tokens as server-sent events (SSE) rather than waiting for the full completion. This dramatically reduces time-to-first-token for the user — they see text appear immediately instead of waiting for the entire response to complete.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in three sentences."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental piece of the reply (possibly empty).
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
Tool calling
The API supports tool calling (also called function calling) — you define a set of tools as JSON schemas, and the model can respond with a structured call to one of them instead of plain text. Your code executes the tool, returns the result as a tool role message, and calls the API again. This is how you build agents that can look up data, run calculations, or take actions while keeping the model in control of the flow.
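The round trip can be sketched without any network calls. The get_weather tool, its canned result, and the sample tool call below are hypothetical; the schema layout and the tool-role reply follow the Chat Completions tool-calling format, which is worth confirming against the current API reference.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool: a real one would call a weather service."""
    return {"city": city, "forecast": "sunny"}


# Passed as `tools` in the request so the model knows what it may call.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the forecast for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

TOOL_FUNCTIONS = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call from the model and build the reply message."""
    name = tool_call["function"]["name"]
    # Arguments arrive as a JSON-encoded string, not a dict.
    args = json.loads(tool_call["function"]["arguments"])
    result = TOOL_FUNCTIONS[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }
```

In a full loop you would append the assistant message containing the tool calls, then each message returned by run_tool_call, and call the API again so the model can use the results. Validate the arguments before executing anything with side effects, since the model produces them.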
JSON mode
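Set "response_format": {"type": "json_object"} to constrain the model to emit syntactically valid JSON. Useful for extraction, classification, and any use case where downstream code needs to parse the response. Your messages must also instruct the model to produce JSON (the API rejects JSON-mode requests whose messages never mention JSON), so pair it with a system prompt that describes the expected shape. The mode enforces valid syntax, not your schema, and a response cut off by the token limit (finish_reason of length) can still be incomplete JSON.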
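A sketch of both halves, building the request body and parsing the reply defensively, might look like this. The sentiment task, key names, and helper names are illustrative assumptions; the functions operate on plain dicts so they work with any HTTP client.

```python
import json


def build_json_request(model: str, text: str) -> dict:
    """Build a Chat Completions body that asks for a JSON object."""
    return {
        "model": model,
        "response_format": {"type": "json_object"},
        "messages": [
            {
                "role": "system",
                # Describe the shape explicitly; JSON mode does not enforce it.
                "content": 'Reply with a JSON object with keys "sentiment" and "confidence".',
            },
            {"role": "user", "content": text},
        ],
    }


def parse_json_reply(response: dict) -> dict:
    """Parse the first choice, refusing replies that were cut off."""
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        raise ValueError("reply was truncated, so the JSON may be incomplete")
    return json.loads(choice["message"]["content"])
```

Downstream code should still check that the expected keys are present before using them.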
Pricing and Rate Limits
OpenAI charges per token, separately for input (prompt) and output (completion) tokens, with output tokens typically priced higher than input. Some models also support cached input pricing for repeated prompt prefixes, which can cut costs significantly on workloads with a shared system prompt. The exact prices vary by model and change over time: always check OpenAI's official pricing page before estimating your bill.
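The arithmetic is simple enough to sketch. The prices below are placeholders chosen for easy math, not real OpenAI rates; substitute the per-million-token figures from the official pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request from token counts and per-million prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices (NOT real rates): 1.00 per million input, 4.00 per million output.
cost = estimate_cost(2_000, 500, 1.00, 4.00)
print(f"Estimated cost: {cost:.4f}")
```

Token counts for a completed call are reported in the usage field of the response, which makes this easy to log per request.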
Rate limits are expressed as requests per minute (RPM) and tokens per minute (TPM). They vary by account tier and model. When you exceed them, the API returns HTTP 429. The response may include a Retry-After header (alongside rate-limit headers) that tells you how long to wait; when it is present, your code should respect it rather than hammering the endpoint. When it is absent, exponential backoff with jitter is the standard pattern.
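Here is a minimal sketch of that retry pattern, independent of any HTTP client. RateLimited is a stand-in exception defined for this example (it is not the SDK's own error class), and the base delay, cap, and attempt count are arbitrary assumptions to tune.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429; carries the server's wait hint if any."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def backoff_delay(attempt: int, retry_after=None, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait: the server's hint if given, else jittered exponential backoff."""
    if retry_after is not None:
        return retry_after
    # "Full jitter": a random wait up to the exponentially growing ceiling.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(fn, max_attempts: int = 5, sleep=time.sleep):
    """Call fn, retrying on RateLimited until max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, err.retry_after))
```

The jitter spreads retries from many clients so they don't all hit the endpoint at the same instant. The official SDKs already retry some failures on their own, so check their configuration before layering another loop on top.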
Security Best Practices
An exposed API key is an instant liability — anyone who finds it can run API calls charged to your account. Follow these rules without exception:
- Never commit your key to source control. Add .env to your .gitignore before you create the file. Check your git history if you've already pushed; keys in history are still compromised.
- Load keys from environment variables. os.environ["OPENAI_API_KEY"] in Python, process.env.OPENAI_API_KEY in Node. Use a .env file locally (with python-dotenv or dotenv); use your platform's secret management (AWS Secrets Manager, Vercel env vars, Fly.io secrets) in production.
- Use one key per environment. Separate dev, staging, and production keys so a leak in dev doesn't touch production quotas.
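- Set spending limits. The billing settings on platform.openai.com let you configure a monthly budget and usage alerts. Set them before you go to production so a runaway loop doesn't generate a surprise bill, and check the dashboard for which limits are currently enforced as hard caps.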
- Rotate compromised keys immediately. Revoke the old key in the dashboard the moment you suspect exposure, issue a new one, and deploy it before the old one can be exploited further.
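As a small illustration of the environment-variable rule, the sketch below fails fast with a clear message when the key is missing and shows a redacted form that is safer to log. Both helper names are assumptions made for this example.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment, failing fast if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it or load it from your secret manager."
        )
    return key


def redact(key: str) -> str:
    """Return a form safe for logs: first three and last four characters only."""
    if len(key) <= 8:
        return "***"
    return f"{key[:3]}...{key[-4:]}"
```

Calling load_api_key once at startup surfaces a missing variable immediately instead of on the first request.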
Your OpenAI-Compatible Code Works Everywhere
Here's the thing worth understanding once your first OpenAI call is working: the request format you just wrote (POST /v1/chat/completions with a messages array) has become a de facto industry standard. Dozens of providers implement the same endpoint shape. That means the same Python code, pointed at a different base_url, can reach Anthropic Claude, Google Gemini, Groq, Mistral, DeepSeek, and many others without a rewrite, although coverage of individual features varies by provider. Read more about how this works in our guide to the OpenAI-compatible API.
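In code, switching providers mostly means changing two constructor arguments. The sketch below keeps that choice in a small table. The "other" entry, its URL, and its environment variable name are hypothetical placeholders, so take the real base URL from the provider's documentation.

```python
import os

PROVIDERS = {
    "openai": {
        "base_url": "https://api.openai.com/v1",
        "key_env": "OPENAI_API_KEY",
    },
    # Hypothetical OpenAI-compatible provider; replace with real values.
    "other": {
        "base_url": "https://llm.example.com/v1",
        "key_env": "OTHER_PROVIDER_API_KEY",
    },
}


def client_kwargs(provider: str, env=os.environ) -> dict:
    """Return the base_url and api_key for the chosen provider."""
    config = PROVIDERS[provider]
    return {"base_url": config["base_url"], "api_key": env[config["key_env"]]}
```

With the Python SDK installed, OpenAI(**client_kwargs("other")) would create a client for the second provider, and the rest of your request code stays the same apart from the model ID.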
In practice, this portability is what makes a routing layer worthwhile. Instead of hardcoding a single OpenAI endpoint, you point your app at a gateway. The gateway holds your provider keys and routes each request to the best option by cost, latency, or availability, with automatic fallback if a provider returns a 429 or times out. You get OpenAI's models on the fast path and every other provider as a fallback, all behind one stable endpoint. For a deeper look at why this matters and how it works, see our explainer on what an LLM gateway is.
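The fallback idea can be sketched in a few lines. This is a simplified illustration of the pattern, not how any particular gateway is implemented; ProviderError and the provider callables are assumptions standing in for real client calls.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Stand-in for a provider failure such as a 429 or a timeout."""


def complete_with_fallback(
    providers: Sequence[tuple[str, Callable[[list], str]]],
    messages: list,
) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, reply_text)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(messages)
        except ProviderError as err:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {err}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Each callable would wrap one provider's client, and the order of the list encodes your preference, for example cheapest first.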
flo2 is a developer-first LLM gateway built for exactly this pattern. Bring your own provider keys — OpenAI, Anthropic, Gemini, Groq, Mistral, and more — and pay each provider directly at their listed rates, with zero token markup. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically on errors or rate limits. It's free during Beta — so you can wire your existing OpenAI code through it today and get multi-provider resilience without changing your request format.