Fireworks AI API Guide: Fast Open-Model Inference & Setup
If you want fast, affordable inference on open-weight models without running your own GPU cluster, the Fireworks AI API is one of the most developer-friendly options available. Fireworks.ai is a hosted inference platform that specializes in serving popular open-weight models — Llama, Qwen, Mixtral, and others — with low latency, an OpenAI-compatible API, function calling, and a fine-tuning layer for teams that need custom model behavior. This guide covers how to get a Fireworks API key, how the OpenAI-compatible endpoint works, what Fireworks is genuinely good at, how to think about pricing, and how to route Fireworks behind a gateway for fallback and cost control.
One important note before we get into specifics: model names, pricing tiers, and the exact catalog change regularly. Treat Fireworks's own documentation and pricing page as the authoritative source — the figures here describe the shape of the offering, not hard numbers you should bake into a budget or your code.
What Is the Fireworks AI API?
Fireworks.ai is a managed inference platform. It doesn't train models — it takes open-weight models that the research community releases (Meta's Llama family, Alibaba's Qwen, Mistral AI's Mixtral, and others) and hosts them at production scale on optimized GPU infrastructure. You call a single API endpoint, pass a model ID, and get fast completions back. No instance management, no CUDA debugging, no cold-start math.
The platform's headline feature is speed. Fireworks invests heavily in inference optimization — custom kernels, batching strategies, speculative decoding — so that tokens come back fast even under load. Alongside raw inference, the platform offers:
- Function calling and structured output — the same tool-use patterns familiar from OpenAI's API, available on supported open models.
- Fine-tuning — upload a dataset, kick off a training run, and deploy the resulting adapter against Fireworks's infrastructure without managing any hardware.
- A growing model catalog — the exact list of available Fireworks AI models, their context windows, and their capabilities lives at the Fireworks docs; verify there before committing to a specific model ID.
For teams evaluating the best open-source LLM APIs, Fireworks sits alongside providers like Together AI, DeepInfra, Groq, and Cerebras, each optimized differently. Fireworks tends to stand out for breadth of model selection combined with solid throughput.
Getting a Fireworks AI API Key
Three steps stand between you and your first call:
- Sign up at fireworks.ai and create an account.
- Navigate to the API keys section of your dashboard and generate a key.
- Store it in an environment variable — never commit it directly to source.
export FIREWORKS_API_KEY="fw_your_key_here"
Fireworks typically offers a free credit allocation to new accounts, so you can evaluate the platform without adding a credit card up front. Check the current signup page for what's included, since free credit amounts change over time.
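To make the "store it in an environment variable" step concrete, here is a minimal sketch of a startup check that reads the key and fails with a readable message when it is missing. The helper name load_fireworks_key is our own invention, not part of any Fireworks SDK.

```python
import os


def load_fireworks_key(env=os.environ):
    """Return the Fireworks API key from the environment, or raise a clear error.

    `env` is injectable so the function can be exercised without touching
    the real process environment.
    """
    key = env.get("FIREWORKS_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "FIREWORKS_API_KEY is not set; export it before running this script."
        )
    return key
```

Calling this once at startup surfaces a missing key immediately, rather than as an authentication error on the first request.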
The Fireworks OpenAI-Compatible Endpoint
The most developer-friendly part of the Fireworks API is that it speaks OpenAI's Chat Completions protocol. The base URL is:
https://api.fireworks.ai/inference/v1
Append /chat/completions to hit the standard completions endpoint. If your code already talks to OpenAI, swapping in Fireworks usually means changing three things: the base URL, the API key, and the model ID. Model IDs follow Fireworks's own naming convention (typically accounts/fireworks/models/<model-name>), while the request and response shapes follow OpenAI's Chat Completions format.
curl Example
curl https://api.fireworks.ai/inference/v1/chat/completions \
  -H "Authorization: Bearer $FIREWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Explain the difference between RAG and fine-tuning in two sentences."}
    ],
    "max_tokens": 256
  }'
Replace the model ID with a current model from Fireworks's models page — model slugs change as new versions launch and older ones retire.
Python Example (openai SDK)
Because the endpoint is OpenAI-compatible, you can use the official openai Python package directly — just override base_url and api_key:
import os

from openai import OpenAI

# Point the standard OpenAI client at Fireworks instead of api.openai.com.
client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key=os.environ["FIREWORKS_API_KEY"],
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",  # verify current ID
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that retries an HTTP call with exponential backoff."},
    ],
    max_tokens=512,
    temperature=0.2,
)

print(response.choices[0].message.content)
Streaming works the same way — set stream=True and iterate the event chunks:
stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-8b-instruct",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    stream=True,
)

# Each chunk carries an incremental piece of the reply; content may be None.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
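If you want the full reply as a single string rather than printing deltas as they arrive, the loop above can be wrapped in a small helper. This is a sketch: collect_stream and fake_chunk are hypothetical names, and the fake chunks only mimic the attribute shape used in the streaming loop so the example runs offline. It also skips chunks with an empty choices list, which some OpenAI-compatible servers send (for example, a usage-only final chunk).

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in stream:
        if not chunk.choices:      # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:                  # skip None and empty-string deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for an SDK chunk, used only for this offline demo."""
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo = [
    fake_chunk("CAP "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("theorem"),
]
print(collect_stream(demo))
```

With a real client, you would pass the object returned by client.chat.completions.create(..., stream=True) in place of demo.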
Function Calling
On models that support it, Fireworks implements the standard OpenAI tools/function-calling interface. You pass a tools array and optionally set tool_choice:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",  # verify support
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

# With tool_choice="auto" the model may answer in plain text,
# in which case tool_calls is None and indexing it would fail.
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(tool_call.function.name, tool_call.function.arguments)
else:
    print(message.content)
Check the Fireworks docs for which specific model IDs have function-calling enabled — not every model in the catalog supports tool use.
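The example above stops once the model has asked for a tool. To finish the round trip, your code has to run the tool and send the result back as a tool message. A minimal sketch of that dispatch step follows; get_weather is a stub, and TOOL_REGISTRY and run_tool_call are illustrative names of our own rather than part of any SDK.

```python
import json


def get_weather(city):
    """Hypothetical local implementation; a real one would call a weather service."""
    return {"city": city, "forecast": "unknown (stub)"}


# Map tool names advertised in `tools` to the local callables that implement them.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Execute one tool call and return the `tool` message for the next request."""
    try:
        args = json.loads(arguments_json)
        result = TOOL_REGISTRY[name](**args)
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Models can emit malformed JSON, unknown tool names, or wrong arguments.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


msg = run_tool_call("call_1", "get_weather", '{"city": "Tokyo"}')
print(msg["content"])
```

In a real exchange you would pass tool_call.id, tool_call.function.name, and tool_call.function.arguments, append the assistant message and the returned tool message to messages, and call the completions endpoint again so the model can write its final answer.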
What Fireworks Is Good At
Fireworks is a strong fit for several workload categories:
- Low-latency open-model serving. If you need a fast Llama, Qwen, or Mixtral endpoint without managing infrastructure, Fireworks is purpose-built for it. The inference optimization work the team has done shows up most clearly in time-to-first-token on larger models.
- Production workloads on open weights. Fireworks offers SLA-backed hosting, which makes it more suitable than hobbyist-tier options when you need reliability guarantees for a deployed product.
- Function calling with open models. Teams that want structured tool-use behavior from open-weight models without paying closed-model prices find Fireworks a practical middle ground.
- Fine-tuned model deployment. If you've trained a LoRA or full fine-tune on an open model, Fireworks's platform lets you deploy it without standing up your own serving stack.
- Low switching cost from OpenAI. Because the API surface is OpenAI-compatible, Fireworks is one of the lowest-friction ways to move existing OpenAI code to an open-model host.
Fireworks AI Pricing
Fireworks charges on a standard per-token model: separate rates for input tokens and output tokens, with larger and more capable models costing more per token than smaller ones. There's also typically a distinction between "serverless" inference (pay-per-call, no reserved capacity) and "on-demand" or reserved deployments for teams with predictable high-volume needs.
For current Fireworks AI pricing, always check the official pricing page — rates get updated as Fireworks adjusts its GPU costs and model catalog, and any figure in a blog post (including this one) can drift out of date quickly. What's stable: smaller open-weight models cost less than larger ones, and Fireworks is generally priced competitively versus closed-model APIs for equivalent capability.
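Per-token pricing with separate input and output rates reduces to simple arithmetic. The sketch below assumes rates are quoted in USD per million tokens, a common convention that you should confirm on the pricing page. The two rates are made-up placeholders, not Fireworks prices.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Estimate one request's cost from per-million-token input and output rates."""
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000


# Placeholder rates for illustration only; they are NOT Fireworks prices.
EXAMPLE_INPUT_RATE = 0.20   # USD per 1M input tokens (made up)
EXAMPLE_OUTPUT_RATE = 0.80  # USD per 1M output tokens (made up)

cost = estimate_cost_usd(1_500, 500, EXAMPLE_INPUT_RATE, EXAMPLE_OUTPUT_RATE)
print(f"${cost:.6f} per request")
```

Multiplying the per-request figure by expected daily volume gives a rough budget line. Keep the rates in configuration rather than in code so they are easy to update when the pricing page changes.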
Routing Fireworks Behind a Gateway for Fallback and Cost Control
Calling Fireworks directly works fine for a prototype. In production, coupling your application to a single provider creates two problems. First, if Fireworks has an outage or returns 429 rate-limit errors during a traffic spike, your app fails. Second, if a cheaper or faster provider launches a model that fits your task better, you need to change code to switch.
Putting Fireworks behind an LLM gateway solves both:
- Automatic fallback. Route requests to Fireworks first; if it's slow or rate-limited, the gateway transparently retries on another provider (DeepInfra, Together AI, Groq — wherever you have a key) with no change to your application code.
- Cost routing. A gateway that knows model pricing can route classification tasks to the cheapest model that fits and reserve Fireworks's faster (often pricier) capacity for latency-sensitive paths.
- One API key for everything. You keep one OpenAI-compatible key in your code and let the gateway manage the Fireworks key, the fallback keys, and the routing logic separately from your application.
This pattern is especially useful for Fireworks because its model catalog and pricing evolve quickly — keeping that complexity in a gateway config rather than scattered across services makes it easier to update.
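The fallback behavior described above comes down to trying providers in priority order. Here is a minimal sketch of that core loop, with stub functions standing in for real provider calls. ProviderError, complete_with_fallback, and both stubs are hypothetical names; a real gateway would also handle timeouts, backoff, and per-provider model mapping.

```python
class ProviderError(Exception):
    """Hypothetical error type standing in for timeouts, 429s, and 5xx responses."""


def complete_with_fallback(providers, prompt):
    """Try each (name, call) pair in order; return the first successful result."""
    failures = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            failures.append(f"{name}: {exc}")  # record the reason, try the next one
    raise ProviderError("all providers failed: " + "; ".join(failures))


def fireworks_stub(prompt):
    """Simulates the primary provider rejecting a request during a traffic spike."""
    raise ProviderError("429 rate limited")


def backup_stub(prompt):
    """Simulates a healthy secondary provider."""
    return f"echo: {prompt}"


used, text = complete_with_fallback(
    [("fireworks", fireworks_stub), ("backup", backup_stub)], "ping"
)
print(used, text)
```

Keeping the provider list as data means the order and membership can change in configuration without touching application code.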
flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this pattern. You add your own Fireworks API key alongside OpenAI, Anthropic, Gemini, Groq, Mistral, DeepInfra, xAI, and others, and pay each provider directly with zero markup on tokens — flo2 doesn't take a per-token cut. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest option and falls back automatically when a provider is rate-limited or unavailable. It's free during Beta, so you can wire Fireworks in behind a fallback and start routing in minutes.