2026-06-03 · flo2 blog

Use DeepInfra with the OpenAI SDK: Compatible API & Base URL

DeepInfra exposes an OpenAI-compatible endpoint, so you can drop it into any application that already uses the OpenAI Python or JavaScript SDK by changing three values: the base_url, the API key, and the model name. No new dependency, no custom client, no HTTP wiring. The base URL is https://api.deepinfra.com/v1/openai. Point any OpenAI-style client there, pass your DeepInfra key as the bearer token, and you can run open-weight models from the Llama, Qwen, and DeepSeek families, among others, at per-token prices that are typically well below those of frontier models. This guide covers the endpoint, working curl and Python examples, streaming, migration gotchas, and why routing DeepInfra through a gateway adds resilience without extra application code.

One ground rule: DeepInfra's model catalog and per-token prices move quickly. The exact model IDs and pricing numbers in any article — including this one — can be stale by the time you read it. Always verify the current model list, exact identifiers, and prices in the DeepInfra API guide and on DeepInfra's official models and pricing pages before you ship anything.

The DeepInfra OpenAI-compatible endpoint

The full base URL for DeepInfra's OpenAI-compatible endpoint is:

https://api.deepinfra.com/v1/openai

DeepInfra model IDs are namespaced, following the pattern the model's source organization uses — for example meta-llama/Meta-Llama-3.1-8B-Instruct or Qwen/Qwen2.5-72B-Instruct. These are not short aliases like gpt-4o. Copy the exact string from DeepInfra's models page; a near-miss will return a 404 or an unknown model error, not a helpful message.
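If your application already refers to models by short names, a small lookup table can translate them into namespaced IDs before a request goes out. The sketch below is only an illustration: the alias names and the resolve_model helper are hypothetical, and the IDs on the right are the examples from this article, which you should re-check against DeepInfra's models page.

```python
# Hypothetical alias table: the short names on the left are whatever your app
# already uses; the IDs on the right must be copied from DeepInfra's models page.
MODEL_ALIASES = {
    "llama-small": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "qwen-large": "Qwen/Qwen2.5-72B-Instruct",
}


def resolve_model(name: str) -> str:
    """Return a namespaced model ID, failing fast on unknown short names."""
    if "/" in name:
        # Already looks namespaced (org/model); pass it through unchanged.
        return name
    try:
        return MODEL_ALIASES[name]
    except KeyError:
        raise ValueError(f"no DeepInfra model ID configured for {name!r}") from None
```

Failing locally on an unmapped name such as gpt-4o is usually easier to debug than an unknown-model error returned by the API.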

Authentication is a standard bearer token. Your DeepInfra API key goes in the Authorization: Bearer header, the same format the OpenAI SDK sends by default. That is why the compatibility works: the request shape, auth header, and response envelope follow the same conventions as OpenAI's Chat Completions API.

Calling the endpoint with curl

A raw curl call is the fastest way to confirm your key and model ID are correct before you wire anything into application code:

export DEEPINFRA_API_KEY="your_deepinfra_key_here"

curl https://api.deepinfra.com/v1/openai/chat/completions \
  -H "Authorization: Bearer $DEEPINFRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a concise assistant."},
      {"role": "user",   "content": "What is a transformer attention head?"}
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'

The response is the standard OpenAI envelope: a choices array, each entry with a message.content string and a finish_reason, plus a top-level usage object reporting prompt_tokens, completion_tokens, and total_tokens. Any code that already parses OpenAI responses will parse this identically.
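To make the envelope concrete, here is a minimal sketch that parses a response body with nothing but the standard library. The sample JSON is hand-written to match the fields described above, with made-up values; it is not captured DeepInfra output, and the summarize helper is a hypothetical name.

```python
import json

# Hand-written sample shaped like the envelope described above; the values
# are invented for illustration and are not real DeepInfra output.
SAMPLE = """
{
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "An attention head is ..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 24, "completion_tokens": 9, "total_tokens": 33}
}
"""


def summarize(raw: str) -> dict:
    """Pull the reply text, finish reason, and token count out of a response."""
    body = json.loads(raw)
    first = body["choices"][0]
    usage = body.get("usage") or {}
    return {
        "text": first["message"]["content"],
        "finish_reason": first["finish_reason"],
        # In the OpenAI format, "length" means the reply was cut off by max_tokens.
        "truncated": first["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }


result = summarize(SAMPLE)
print(result["text"], result["total_tokens"])
```

Checking finish_reason is worth the extra line: a reply cut off by max_tokens still comes back as a successful response.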

Use DeepInfra with the OpenAI Python SDK

Migrating OpenAI SDK code to DeepInfra comes down to two constructor arguments plus the model name. Everything downstream (message construction, streaming loops, response parsing, instructor structured output, LangChain or LlamaIndex integrations) stays the same:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key=os.environ["DEEPINFRA_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # verify current ID in DeepInfra docs
    messages=[
        {"role": "system", "content": "Reply in plain English, no fluff."},
        {"role": "user",   "content": "Explain KV cache in two sentences."},
    ],
    temperature=0.2,
    max_tokens=200,
)

print(resp.choices[0].message.content)
print(resp.usage)  # prompt_tokens, completion_tokens, total_tokens

Streaming responses

DeepInfra supports streaming via server-sent events — the same data: wire format OpenAI uses, terminated with data: [DONE]. Set stream=True and iterate normally:

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain gradient checkpointing step by step."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

To capture token usage from a streaming response, check whether DeepInfra supports the stream_options={"include_usage": True} parameter — when it is supported, the final chunk includes a usage block. Verify the current behavior in DeepInfra's documentation, since support for this parameter varies by provider and evolves over time.
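One detail to plan for: on OpenAI's API, the usage-bearing final chunk arrives with an empty choices list, so a loop that indexes chunk.choices[0] unconditionally would raise on it. The sketch below guards against that. It assumes DeepInfra mirrors this behavior when the option is accepted, which you should confirm. collect_stream is a hypothetical helper, and the chunks are stand-in objects so the example runs offline.

```python
from types import SimpleNamespace as NS


def collect_stream(chunks):
    """Join streamed text deltas and return (text, usage_or_None)."""
    parts, usage = [], None
    for chunk in chunks:
        # Keep the most recent usage block, if any chunk carries one.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        # A usage-only chunk may have no choices; skip it instead of indexing.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts), usage


# Stand-in chunks that mimic the attributes the loop reads from SDK chunks.
fake_chunks = [
    NS(choices=[NS(delta=NS(content="Hel"))], usage=None),
    NS(choices=[NS(delta=NS(content="lo"))], usage=None),
    NS(choices=[NS(delta=NS(content=None))], usage=None),
    NS(choices=[], usage=NS(prompt_tokens=5, completion_tokens=2, total_tokens=7)),
]

text, usage = collect_stream(fake_chunks)
print(text, usage.total_tokens if usage else None)
```

With a real client, you would pass the iterator returned by client.chat.completions.create(..., stream=True, stream_options={"include_usage": True}) in place of fake_chunks, provided DeepInfra accepts the option for your model.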

JavaScript / TypeScript

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.deepinfra.com/v1/openai",
  apiKey:  process.env.DEEPINFRA_API_KEY,
});

const resp = await client.chat.completions.create({
  model: "meta-llama/Meta-Llama-3.1-8B-Instruct",  // verify in DeepInfra docs
  messages: [{ role: "user", content: "What is LoRA fine-tuning?" }],
  temperature: 0.3,
});

console.log(resp.choices[0].message.content);

Any framework that accepts a baseURL override — the Vercel AI SDK, LangChain.js, LlamaIndex.TS — works the same way. They all construct the same Chat Completions HTTP request under the hood.

What DeepInfra's compatibility layer covers

DeepInfra's endpoint targets the Chat Completions surface. Here is what to expect and what to verify before relying on it in production:

Chat completions — supported. Multi-turn conversations with system, user, and assistant roles work across the hosted chat models.
Streaming — supported via stream=True. Verify per-model availability in DeepInfra's docs since availability can differ across model families.
Common sampling parameters — temperature, top_p, max_tokens, and stop are generally supported. Verify model-level limits and supported parameters in DeepInfra's docs.
Tool / function calling — support is model-dependent. Some models in DeepInfra's catalog support the tools parameter; others do not. Check the model's detail page before relying on it.
Structured output / JSON mode — verify in DeepInfra docs. response_format: {"type": "json_object"} behavior varies by model and is not guaranteed.
Embeddings — DeepInfra hosts embedding models, but they are accessed separately; confirm the endpoint path in DeepInfra's docs.
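Not covered here — the Assistants API and Responses API are OpenAI-specific products. Image generation, audio/TTS, and fine-tuning are outside the Chat Completions surface this guide describes; treat them as unavailable through this endpoint unless DeepInfra's docs say otherwise.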
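Because JSON mode is not guaranteed on every model, it can help to parse replies defensively rather than trust that the body is clean JSON. The following is a minimal sketch of that idea. parse_json_reply is a hypothetical helper, and the "outermost braces" fallback is a simple heuristic, not anything DeepInfra or the OpenAI SDK provides.

```python
import json


def parse_json_reply(text: str):
    """Best-effort parse of a reply that is supposed to be one JSON object.

    Returns the decoded dict, or None if no JSON object can be recovered.
    """
    try:
        value = json.loads(text)
        return value if isinstance(value, dict) else None
    except json.JSONDecodeError:
        pass
    # Fallback heuristic: some models wrap the object in prose or code fences,
    # so retry on the span between the first "{" and the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        value = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None
```

A None result is a signal to retry, fall back to another model, or surface an error, depending on how much your workload depends on structured output.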

Migrating an existing app to DeepInfra

If you have an existing application calling OpenAI, migrating to DeepInfra for open-model workloads is straightforward. The recommended pattern is to externalize the three moving parts (base URL, API key, and model name) into environment variables so no code changes are required to switch providers:

# .env
LLM_BASE_URL=https://api.deepinfra.com/v1/openai
LLM_API_KEY=your_deepinfra_key
LLM_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct

import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ["LLM_API_KEY"],
)

resp = client.chat.completions.create(
    model=os.environ["LLM_MODEL"],
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
)

With this pattern, switching from OpenAI to DeepInfra — or between DeepInfra models, or back to OpenAI for a specific workload — is an environment variable change, not a code deployment.
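Since the whole configuration now lives in the environment, a missing variable is the most likely failure at startup. Here is a small sketch that validates all three up front and reports every missing name at once. LLMSettings and load_settings are hypothetical names; the variable names are the ones from the .env example above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    api_key: str
    model: str


REQUIRED = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")


def load_settings(env=os.environ) -> LLMSettings:
    """Read provider settings, naming every missing variable in one error."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return LLMSettings(
        base_url=env["LLM_BASE_URL"],
        api_key=env["LLM_API_KEY"],
        model=env["LLM_MODEL"],
    )
```

Accepting the mapping as a parameter keeps the loader easy to unit-test with a plain dict instead of the real process environment.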

Gotchas when using DeepInfra with the OpenAI SDK

Namespaced model IDs. DeepInfra uses the full namespaced model string (e.g. meta-llama/Meta-Llama-3.1-8B-Instruct), not short aliases. Any routing logic keyed on gpt-4o or a short name needs to map to the correct DeepInfra ID. Keep model names in config, not in application code.
Model availability is a moving target. DeepInfra adds and retires models. A model ID that works today may 404 next month. Pin the ID in config and monitor for deprecation notices.
Unsupported parameters may be silently ignored. Fields like logprobs, n > 1, frequency and presence penalties, or specific sampling options may not be supported. Audit every parameter your code sends and test explicitly against DeepInfra rather than assuming OpenAI parity.
Rate limits are separate from OpenAI. DeepInfra enforces its own RPM and TPM limits per API key. A burst that is well within your OpenAI quota can hit HTTP 429 on DeepInfra. Honor the Retry-After header when the response includes one and implement backoff, or add a fallback provider.
Pricing varies by model and changes frequently. Verify DeepInfra's current per-token rates on their pricing page — the cost difference between a small and large model on DeepInfra can be 10x or more, so model selection has real budget impact.
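For the rate-limit gotcha, the waiting logic can be kept as a pure function that is easy to test. This sketch prefers a numeric Retry-After value and otherwise falls back to capped exponential backoff with jitter. retry_delay, base, and cap are hypothetical names and defaults, not values taken from DeepInfra's documentation.

```python
import random
from typing import Optional


def retry_delay(attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds: honor it, within the cap.
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Could be an HTTP-date; fall through to computed backoff.
    # Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)].
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The OpenAI Python SDK also retries some failures on its own through the client's max_retries setting, so check how the two layers interact before adding a loop around every call.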

Why pair DeepInfra's cheap inference with a gateway

Pointing the OpenAI SDK directly at DeepInfra's chat completions endpoint is the right first step. But it hard-codes a single provider: when DeepInfra rate-limits you, a specific model is unavailable, or you want to route expensive tasks to a frontier model while routing cheap tasks to an open-weight model, you are back to editing application code. A gateway separates provider selection from application logic.
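To see what that separation buys you, here is a deliberately simplified sketch of the fallback logic you would otherwise maintain yourself. It is a generic illustration, not how any particular gateway is implemented. The provider callables are stand-ins so the example runs offline, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each (name, call) pair in order; return (provider_name, reply)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: this is only a sketch
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-in providers: the first simulates a rate limit, the second succeeds.
def rate_limited(prompt: str) -> str:
    raise RuntimeError("429 rate limited")


def steady(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = complete_with_fallback(
    "hi", [("deepinfra", rate_limited), ("backup", steady)]
)
print(used, answer)
```

Even this toy version leaves out per-provider model mapping, timeouts, and cost tracking, which is the kind of logic that tends to accumulate in application code without a gateway.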

That is what flo2 is built for. flo2 is a developer-first LLM gateway with zero token markup — you bring your own DeepInfra key (plus keys for OpenAI, Anthropic, Groq, Gemini, Mistral, xAI, and others) and pay each provider directly at their published rate. A single flo2 key, usable through an OpenAI-compatible endpoint, routes each request to the cheapest or fastest provider for the task, with fallback chains so a DeepInfra 429 or quota wall rolls over to another provider instead of surfacing as an error.

import os
from openai import OpenAI

# One flo2 key routes to DeepInfra — or wherever is cheapest/fastest
client = OpenAI(
    base_url="https://flo2.com/v1",
    api_key=os.environ["FLO2_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the dates from this contract: ..."}],
)

print(resp.choices[0].message.content)

Your application code does not change. flo2 holds your provider keys, applies your routing policy, and exposes one OpenAI-compatible surface. If DeepInfra is the cheapest inference for a given model, flo2 routes there automatically. If it hits a limit, flo2 falls back without your application seeing an error. You pay providers directly — no markup, no middleman margin on every token. flo2 is free during Beta.

For a deeper look at DeepInfra's full API, keys, and model selection, see the DeepInfra API guide. For more on how the OpenAI-compatible wire format works and why so many providers adopt it, see OpenAI-compatible API.
