2026-06-03 · flo2 blog

DeepInfra API Guide: Cheap Open-Model Inference & Setup

If your bill is being eaten alive by open-model inference, the DeepInfra API is worth a serious look. DeepInfra is a hosted inference platform: it runs popular open-weight models — the Llama family, Qwen, DeepSeek, and a long tail of others — on its own GPUs and exposes them through a simple, OpenAI-compatible API priced per token at the low end of the market. You don't provision or babysit a GPU; you send a request and pay for the tokens you use. For developers running classification, extraction, summarization, embeddings, or chat at volume, that combination — open models, no infra, very low per-token cost — is the whole reason to reach for it. This guide covers what DeepInfra is, how to get a key, the OpenAI-compatible setup with working curl and Python examples, what it's best at, how to think about pricing, and where a gateway fits to keep cheap models reliable.

One ground rule up front: DeepInfra's exact model catalog and per-token prices move quickly, and a number that's right today can be stale next quarter. Treat DeepInfra's official models and pricing pages as the source of truth, and verify anything below against them before you ship.

What is DeepInfra?

DeepInfra is an inference provider, not a model lab. It takes open-weight models that other organizations release and makes them callable as a hosted, pay-as-you-go API — so you get capable open models without buying GPUs, managing a serving stack like vLLM, or eating idle-capacity costs between bursts of traffic.

The DeepInfra models catalog spans the open ecosystem — large general-purpose chat models from the Llama and Qwen families, reasoning-focused models from DeepSeek, smaller fast models for cheap high-volume work, plus embedding and (depending on the day) image and audio models. That lineup changes constantly as new weights land and older ones retire, so don't hard-code a model assumption from a blog post — check DeepInfra's models page for what's live, its context window, and its exact ID first.

The other thing that makes DeepInfra easy to adopt: it exposes an OpenAI-compatible surface. If your code already speaks to /v1/chat/completions, moving a call to DeepInfra is mostly a base URL and an API key swap — no new SDK, no rewrite.

Getting a DeepInfra API key

Three things stand between you and your first token: get a key, point your client at the DeepInfra endpoint, and name a model. Start with the key:

Sign up for DeepInfra and open the API keys (sometimes labeled "tokens") section of the dashboard.
Create a new DeepInfra API key and copy it immediately. Like most providers, DeepInfra shows the full secret only once.
Set up billing. DeepInfra is usage-billed, so you'll add a payment method or credit before sustained use; check the dashboard for any current free trial allowance.

Treat the key like any other secret: load it from an environment variable, never commit it to source control, and rotate it if it leaks. The examples below read it from DEEPINFRA_API_KEY.

export DEEPINFRA_API_KEY="your_key_here"
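On the Python side, it helps to read that variable once at startup and fail fast if it's missing, rather than discovering the problem on the first request. Here's a minimal sketch; the helper name load_deepinfra_key is ours for illustration and isn't part of any SDK.

```python
import os


def load_deepinfra_key(env_var: str = "DEEPINFRA_API_KEY") -> str:
    """Return the API key from the environment, failing fast if it is missing.

    Illustrative helper only; not part of DeepInfra's or OpenAI's libraries.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running this script."
        )
    return key
```

Call load_deepinfra_key() where you construct your client, so a missing key surfaces as one clear error message.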

The DeepInfra OpenAI-compatible endpoint

The reason DeepInfra is so painless to drop in: its API is OpenAI-compatible, served under a dedicated base URL. Point any OpenAI-style client at it, pass your DeepInfra key as the bearer token, and set model to a current ID. The base URL is:

https://api.deepinfra.com/v1/openai

Note that DeepInfra model IDs are typically the full namespaced name from the model's origin — for example meta-llama/Meta-Llama-3.1-8B-Instruct rather than a short alias. Copy the exact string from DeepInfra's model page; a near-miss won't resolve. Here's a minimal curl call against chat completions:

curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain hosted open-model inference in one sentence."}
    ]
  }'

Replace the model value with a verified ID from DeepInfra's catalog. Request and response shapes mirror the OpenAI Chat Completions API, so fields like temperature, max_tokens, and stream behave as you'd expect.
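To make the request and response shapes concrete, here's a small Python sketch that builds the same JSON body with a couple of optional fields and pulls the reply text out of a response. It assumes the response follows the OpenAI Chat Completions layout (choices[0].message.content plus a usage block). The function names are illustrative, and the sample response is hand-written to show the shape, not real output from DeepInfra.

```python
import json


def build_chat_payload(model: str, prompt: str,
                       temperature: float = 0.2,
                       max_tokens: int = 256) -> dict:
    """Build a Chat Completions request body with common optional fields."""
    return {
        "model": model,  # verify the exact ID on DeepInfra's models page
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat response."""
    return response["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "Explain hosted open-model inference in one sentence.",
)
body = json.dumps(payload)  # the string you would pass to curl with -d

# Hand-written stand-in for a response, only to illustrate the expected shape.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Example reply."}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}
print(extract_reply(sample_response))
```

The default temperature and max_tokens values here are arbitrary; pick ones that suit your task.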

DeepInfra with the Python openai client

Because the API is OpenAI-compatible, the official openai Python package talks to DeepInfra with two overrides — base_url and api_key. No DeepInfra-specific library is required.

import os

from openai import OpenAI

# Point the standard OpenAI client at DeepInfra's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],  # raises KeyError if the variable is unset
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # verify current model ID
    messages=[
        {"role": "user", "content": "Give me three uses for cheap, hosted open-model inference."}
    ],
)

# The reply text lives on the first choice, as in the OpenAI API.
print(resp.choices[0].message.content)

Streaming works identically — pass stream=True and iterate the chunks. The same /v1/openai base URL also serves an OpenAI-compatible embeddings endpoint via client.embeddings.create(...). DeepInfra offers native routes for some model types too, but the OpenAI-compatible path is usually the fastest way into an existing codebase.
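Here's what the streaming loop might look like in practice. This is a sketch under the assumption that chunks follow the OpenAI streaming format, where incremental text arrives on chunk.choices[0].delta.content; the function name stream_reply is illustrative, and it expects a client configured for DeepInfra like the one built above.

```python
def stream_reply(client, model: str, prompt: str) -> str:
    """Stream a chat completion, printing text as it arrives.

    `client` is an openai.OpenAI instance configured with DeepInfra's
    base_url and API key. Returns the full reply once the stream ends.
    """
    stream = client.chat.completions.create(
        model=model,  # verify the current model ID
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    parts = []
    for chunk in stream:
        # Some chunks carry no choices or no text, so skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

Call it as stream_reply(client, "meta-llama/Meta-Llama-3.1-8B-Instruct", "your prompt"), swapping in a verified model ID.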

What the DeepInfra API is best for

DeepInfra's sweet spot is cheap open-model inference at scale. The advantage isn't a proprietary frontier model or record-setting speed — it's serving the open models you'd otherwise self-host, at a per-token rate low enough that volume jobs that were uneconomical elsewhere suddenly pencil out. Lean into it where that trade matters.

High-volume bulk work. Classification, extraction, summarization, synthetic-data generation, and evaluation over large datasets are where a low per-token rate compounds — a job that's marginal on a premium API can be comfortably affordable here.
Open-weight models without the ops. Want Llama-, Qwen-, or DeepSeek-class output without provisioning GPUs or managing a serving stack? DeepInfra hands you the model as an API.
Embeddings at scale. Vectorizing a large corpus is a classic case where a cheap rate on a hosted embedding model beats standing up your own.
Easy migration and A/B testing. Because the API is OpenAI-compatible, you can pit a DeepInfra-hosted model against your current provider by changing a base URL and a model string.
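For the embeddings case, a batching loop is usually all the client-side code you need. The sketch below assumes an OpenAI-style client whose embeddings response exposes data items with index and embedding fields. The function name embed_corpus and the batch size of 64 are illustrative choices, not DeepInfra recommendations, so check the model's input limits before settling on a batch size.

```python
def embed_corpus(client, model: str, texts: list[str],
                 batch_size: int = 64) -> list[list[float]]:
    """Embed texts in batches through an OpenAI-compatible embeddings endpoint."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        # Sort by index so each vector lines up with its input text.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

Pass the embedding model's exact ID from DeepInfra's catalog as model; the returned list has one vector per input text, in input order.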

One honest caveat: hosted open models can differ from a first-party endpoint in throughput, context window, and exact behavior, and performance varies by model and load. Benchmark from your own environment with your own prompts rather than trusting a headline number.
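A benchmark doesn't need to be elaborate to be useful. Here's a minimal timing harness, assuming you wrap one request in a zero-argument function; the name measure_latency and the default of 10 runs are illustrative.

```python
import statistics
import time
from typing import Callable


def measure_latency(call: Callable[[], object], runs: int = 10) -> dict:
    """Time `call` several times and summarize wall-clock latency in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Wrap a real request in a lambda, for example measure_latency(lambda: client.chat.completions.create(model=..., messages=...)), and run it with your own prompts at the times of day your traffic actually peaks.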

DeepInfra pricing, conceptually

DeepInfra bills chat models the usual way: per token, with input and output priced separately and larger models costing more per token than smaller ones. Some model types (image or audio) may be billed differently, such as per generation or per second of compute. DeepInfra consistently lands in "cheapest inference" conversations because its per-token rates sit at the low end of the market for open-weight models, which is the entire point of choosing it over a pricier first-party API for the same model family.

For exact numbers, defer to DeepInfra's official pricing page rather than any figure quoted in a third-party article (including this one). Two things worth understanding when you model cost:

Price tracks model size. The lever you control most is which model you pick — dropping from a large model to a smaller one for tasks that don't need the extra capability is often the single biggest cost win.
Token accounting still matters. Long prompts and verbose outputs add up across millions of calls, so trimming prompts and capping max_tokens compounds at volume even when the rate is already low.
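Both levers are easy to model before you commit to a job. The sketch below estimates spend from call volume, tokens per call, and per-million-token prices. The prices in it are made-up placeholders for illustration, not DeepInfra's rates, so plug in the numbers from the official pricing page.

```python
def estimate_cost_usd(calls: int,
                      input_tokens_per_call: int,
                      output_tokens_per_call: int,
                      input_price_per_million: float,
                      output_price_per_million: float) -> float:
    """Estimate total spend when input and output tokens are priced separately."""
    input_cost = calls * input_tokens_per_call / 1_000_000 * input_price_per_million
    output_cost = calls * output_tokens_per_call / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Placeholder prices in USD per million tokens. NOT real DeepInfra rates.
IN_PRICE, OUT_PRICE = 0.05, 0.10

baseline = estimate_cost_usd(1_000_000, 500, 100, IN_PRICE, OUT_PRICE)
trimmed = estimate_cost_usd(1_000_000, 300, 60, IN_PRICE, OUT_PRICE)
print(f"baseline: ${baseline:.2f}, trimmed: ${trimmed:.2f}")
```

With these placeholder numbers, trimming prompts from 500 to 300 tokens and capping output at 60 instead of 100 takes a million-call job from about $35 to about $21. The absolute figures mean nothing until you use real prices, but the proportional saving is the point.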

For where DeepInfra sits among the budget options and how to compare true per-token cost across providers, see our roundup of the cheapest LLM API.

Keeping cheap models reliable behind a gateway

Here's the tension you eventually hit with any single low-cost provider: the economics are great, but you've now made that provider's availability your availability. Open-model hosts can have outages, a model can be deprecated or capacity-constrained, and a traffic spike can run you into a rate limit and start returning HTTP 429s exactly when you can least afford it. Hard-wiring your app to a single DeepInfra endpoint turns its bad day into your bad day.

The clean pattern is to put DeepInfra behind an LLM gateway with automatic fallback. Your code calls one stable endpoint; the gateway routes the bulk of traffic to a DeepInfra model for the cost win and transparently retries elsewhere the moment DeepInfra is unavailable, rate-limited, or doesn't host the model a request needs. The fallback can be another host of the same open-weight model or a different model entirely. You get DeepInfra's economics as the cheap default plus a safety net for the tail, without scattering provider-specific retry logic across your services. And "which model" stays decided independently from "whose endpoint," which is what stops a cheap-by-default stack from becoming a fragile one.
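To see the shape of that fallback logic, here's a client-side sketch of the idea. It is not how any particular gateway is implemented; a gateway does this server-side so your application doesn't have to. ProviderError and call_with_fallback are hypothetical names, and each provider is modeled as a plain function that takes a prompt and returns text.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call on an outage, rate limit, or missing model."""


def call_with_fallback(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the list.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

List the DeepInfra-backed function first so it takes the bulk of traffic, and have each provider function translate its own failures (an HTTP 429, a timeout, an unknown model) into ProviderError so the loop can move on.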

flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this. You add your own DeepInfra key (plus OpenAI, Anthropic, Gemini, Groq, Cerebras, Mistral, xAI, OpenRouter, and more) and pay each provider directly with zero markup on tokens — flo2 takes no per-token cut, which makes it a genuine zero-markup OpenRouter alternative. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically when a provider is down or rate-limited, with true per-call cost accounting so you see exactly what each request costs. That lets a DeepInfra-hosted model be your low-cost default without becoming a single point of failure. It's free during Beta, so you can wire DeepInfra in behind a fallback and start measuring today.
