2026-06-03 · flo2 blog

Together AI API Guide: Open Models, Setup & Pricing

If you want to run open-weight models through a clean, hosted API without managing GPU infrastructure, the Together AI API is one of the broadest on-ramps available. Together AI is not a model lab — it's an inference and fine-tuning platform that hosts a large catalog of community-popular open-weight models (Llama, Qwen, DeepSeek, Mistral, Mixtral, and others) and serves them through an OpenAI-compatible endpoint. For developers who want open-model flexibility — swap providers without rewriting client code, fine-tune on your own data, or run cheaper alternatives to frontier closed models — it's worth understanding how the Together AI API works, what makes it distinctive, and how to fit it into a resilient multi-provider stack.

One standing caveat throughout: exact model names, current prices, and fine-tuning options change regularly. Together AI adds and deprecates models often, and per-token rates shift. Treat the official Together AI models page and pricing page as your source of truth — this guide explains the shape, not hard-coded figures that could be stale by next quarter.

What is Together AI?

Together AI is a hosted inference platform purpose-built for open-weight models. Where proprietary API providers (OpenAI, Anthropic, Google) give you one or a handful of their own models, Together's catalog spans dozens of open models across multiple families and sizes. At any given time you'll find variants of:

Llama (Meta's open-weight flagship series, various sizes and instruction-tuned variants)
Qwen (Alibaba's multilingual and coding-strong series)
DeepSeek (high-capability coding and reasoning models from DeepSeek AI)
Mistral / Mixtral (efficient dense and MoE models from Mistral AI)
Various other community-popular checkpoints

Verify the current catalog on Together's models page — the list above reflects families present at time of writing, but specific model IDs, versions, and availability change. Together doesn't train these models; it hosts, maintains, and serves them at scale so you don't have to.

Beyond inference, Together is notable for supporting fine-tuning: you can upload your own dataset and train a custom adapter on top of supported base models, then serve that fine-tuned model through the same API. That full loop — evaluate, fine-tune, deploy — in one platform is a meaningful differentiator if you need task-specific customization.

Getting a Together AI API key

Sign up at together.ai and navigate to the API keys section of your dashboard. Generate a key, copy it, and immediately load it into your environment as a secret — never commit it to source control. The examples below read it from TOGETHER_API_KEY.

export TOGETHER_API_KEY="your_together_api_key_here"
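
On the Python side, it can help to fail fast when the variable is missing instead of discovering it on the first request. A minimal sketch: the helper name load_together_key is our own and not part of any SDK.

```python
import os
from typing import Mapping


def load_together_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Together API key from the environment, or raise a clear error."""
    # Strip whitespace so a stray newline from a copy-paste does not break auth.
    key = env.get("TOGETHER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "TOGETHER_API_KEY is not set; export it before running this script."
        )
    return key
```

Call load_together_key() once at startup and pass the result to your client, so a missing key surfaces as a readable error rather than a 401 later on.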

Together offers a free credit allocation for new accounts so you can experiment before committing to paid usage. Check the current sign-up promotion on their site, as the amount changes.

The Together AI OpenAI-compatible endpoint

The Together AI API is OpenAI-compatible, which is the most important practical detail for most developers. If you already have code that calls /v1/chat/completions with the OpenAI SDK or any other OpenAI-format client, switching to Together requires three small changes: the base URL, the API key, and the model ID.

Together's OpenAI-compatible base URL is:

https://api.together.xyz/v1

Point any OpenAI-format client there and you get Together's model catalog with request and response shapes that follow OpenAI's Chat Completions API: the same messages array, the same stream flag, the same temperature and max_tokens parameters. Support for less common parameters can vary by model, so test the ones you rely on.

curl example

A minimal chat completions call with curl:

curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<together-model-id>",
    "messages": [
      {"role": "user", "content": "Explain the difference between MoE and dense transformer models."}
    ]
  }'

Replace <together-model-id> with a real model ID from Together's models page — for example a current Llama or Qwen variant in its fully qualified form (e.g. meta-llama/Llama-3.3-70B-Instruct-Turbo, though verify the exact ID and availability before using it). The response shape is identical to OpenAI's, so your parsing logic carries over unchanged.
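
To make the "parsing logic carries over" point concrete, here is a small sketch that parses a response body in the Chat Completions shape. The sample body is hypothetical and trimmed for illustration; real responses carry more fields, and the helper name extract_reply is our own.

```python
import json

# Hypothetical, trimmed response body in the OpenAI Chat Completions shape.
SAMPLE_BODY = """
{
  "model": "<together-model-id>",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "MoE models route each token to a few experts."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 18, "completion_tokens": 11, "total_tokens": 29}
}
"""


def extract_reply(body: str) -> tuple[str, int]:
    """Return the assistant text and total token count from a raw JSON body."""
    data = json.loads(body)
    text = data["choices"][0]["message"]["content"]
    # usage may be absent in some responses, so fall back to 0.
    total = data.get("usage", {}).get("total_tokens", 0)
    return text, total


text, total = extract_reply(SAMPLE_BODY)
print(text)
print("tokens used:", total)
```

The same function should work against any provider that returns this shape, which is the practical payoff of the compatibility layer.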

Python with the OpenAI SDK

Because Together's API speaks OpenAI's format, you can use the official openai Python package with two overrides. No Together-specific SDK required:

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

resp = client.chat.completions.create(
    model="<together-model-id>",   # from Together's models page
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key strengths of open-weight LLMs."},
    ],
)

print(resp.choices[0].message.content)

Streaming works the same way — pass stream=True and iterate chunks:

stream = client.chat.completions.create(
    model="<together-model-id>",
    messages=[{"role": "user", "content": "Write a short Python function to chunk a list."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
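
If you need the full text after streaming (for logging or caching), you can accumulate the deltas as they arrive. The sketch below uses stand-in chunk objects so it runs offline; collect_stream and _fake_chunk are our own hypothetical helpers, and with a real client you would pass the stream object itself.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas from a chat completion stream into one string."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices (for example a trailing usage chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None and empty deltas
            parts.append(delta)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in for an SDK chunk object (illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("def "),
    _fake_chunk(None),
    _fake_chunk("chunk(...)"),
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))
```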

Together also publishes a native together Python SDK with additional features for fine-tuning and file management. For pure inference in an existing OpenAI-based codebase, the OpenAI SDK approach is the fastest drop-in. For the fine-tuning workflow, the native SDK adds the relevant endpoints. To understand why OpenAI compatibility matters across the ecosystem, see our primer on the OpenAI-compatible API.

What Together AI is good at

Together AI's strengths are breadth and flexibility rather than raw speed. Here's where it tends to win:

Access to many open-weight models in one place. Instead of provisioning separate accounts with every open-model host, you get a large portion of the popular open-weight catalog under one API key. This matters for evaluation: you can A/B test Llama vs. Qwen vs. DeepSeek through identical code paths with just a model ID swap.
Fine-tuning pipeline. If you need a task-specific model — something tuned on your domain data — Together's fine-tuning API lets you train, host, and serve that model without setting up separate infrastructure. The loop from base model to deployed fine-tune stays within one platform.
Cost-sensitive workloads. Open-weight models on hosted inference are often cheaper per token than frontier closed models for tasks where a strong but smaller model suffices. Verification pipelines, structured extraction, classification, and summarization are common cases where a Llama or Qwen variant delivers sufficient quality at a fraction of GPT-4-class pricing. Verify Together's current rates on their pricing page for the specific models you're evaluating.
OpenAI-format migration. Teams moving from OpenAI who want to explore open models can swap the base URL, API key, and model ID without rewriting their request or response-handling code, which dramatically lowers the switching cost for evaluation.
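
The "identical code paths with just a model ID swap" idea can be sketched as a loop over model IDs. This is an illustration under assumptions: compare_models is our own helper, the model IDs are placeholders, and FakeClient is an offline stand-in so the example runs without a key. With a real setup you would pass the OpenAI client configured for Together.

```python
from types import SimpleNamespace


def compare_models(client, model_ids, prompt):
    """Send the same prompt to each model ID and return {model_id: reply text}."""
    results = {}
    for model_id in model_ids:
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
        )
        results[model_id] = resp.choices[0].message.content
    return results


class FakeClient:
    """Offline stand-in that mimics client.chat.completions.create(...)."""

    def __init__(self):
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages):
        text = f"[{model}] echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


replies = compare_models(
    FakeClient(), ["<model-id-a>", "<model-id-b>"], "Classify this ticket."
)
for model_id, reply in replies.items():
    print(model_id, "->", reply)
```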

For a broader comparison of where Together sits among open-model hosting options, see our rundown of best open-source LLM APIs.

Together AI pricing: what to know

Together AI uses per-token pricing, with input and output tokens metered separately. Rates vary by model: smaller, efficient models cost less per million tokens than large or MoE models. The gap between the cheapest and most expensive models on Together can exceed an order of magnitude, so picking the right-sized model for your task has a direct budget impact.

There is no authoritative per-model rate in this article by design: pricing changes often enough that any table here would drift out of date. Check Together's official pricing page for current figures before making any cost projections. What you can count on structurally: per-token billing applies across the language-model catalog, input and output tokens are counted separately (some models charge more for output, others list the same rate for both), and any credits for new accounts depend on the current sign-up promotion.
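
The per-token arithmetic is simple enough to sketch. The rates below are made-up placeholders and not Together's actual prices; substitute the figures from the pricing page for the model you are evaluating. The function name estimate_cost is our own.

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the estimated cost in dollars for one workload.

    Prices are dollars per million tokens, as a pricing page would list them.
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Placeholder rates for illustration only; NOT Together's actual prices.
PLACEHOLDER_INPUT_PRICE = 0.20   # dollars per 1M input tokens (made up)
PLACEHOLDER_OUTPUT_PRICE = 0.60  # dollars per 1M output tokens (made up)

monthly = estimate_cost(
    input_tokens=2_000_000,
    output_tokens=500_000,
    input_price_per_m=PLACEHOLDER_INPUT_PRICE,
    output_price_per_m=PLACEHOLDER_OUTPUT_PRICE,
)
print(f"estimated cost: ${monthly:.2f}")
```

Running the same function with rates for two candidate models gives a quick read on how much the model choice moves your bill at your expected volume.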

Routing Together AI behind a gateway for fallback and cost control

Together AI, like every hosted inference provider, has its own rate limits and availability characteristics. If you hard-wire your application to call Together directly, you inherit those limits as your application's limits — a 429 from Together becomes a 500 to your users.

The pattern that solves this is putting Together (and every other provider you use) behind an LLM gateway with automatic fallback. Your application calls one stable endpoint; the gateway routes each request to the preferred provider and, when that provider returns an error or slows past your latency threshold, transparently retries on the next provider in the fallback chain — a different Together model, a different host serving the same open-weight model, or a different provider entirely. The retry logic, provider credentials, and routing rules live in one place instead of being scattered across services.
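
The fallback idea can be sketched in a few lines. This is a simplified client-side illustration, not how any particular gateway implements it: the provider names are hypothetical, each provider is a plain callable, and every exception is treated as retryable, which a production system would narrow to rate limits, timeouts, and server errors.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(providers, prompt):
    """Try each (name, call) pair in order; return (name, reply) on first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # simplification: treat every error as retryable
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def rate_limited(prompt):
    """Stand-in for a provider that is currently returning 429s."""
    raise RuntimeError("429 rate limit exceeded")


def healthy(prompt):
    """Stand-in for a provider that answers normally."""
    return f"reply to: {prompt}"


chain = [("together-primary", rate_limited), ("backup-host", healthy)]
used, reply = complete_with_fallback(chain, "ping")
print(used, "->", reply)
```

Keeping the chain as data (an ordered list) is what lets the routing rules live in one place instead of being scattered across services.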

Beyond reliability, a gateway with cost routing lets you set rules like "use the cheapest model that can handle this task class" and have the gateway enforce them across requests — routing cheaper open models through Together for high-volume work and reserving frontier models for cases that actually need them.
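
A rule like "use the cheapest model that can handle this task class" reduces to a filter plus a minimum. The sketch below assumes a hand-maintained catalog; the model IDs, prices, and task classes are all placeholders, and ModelOption and pick_cheapest are our own names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    model_id: str
    price_per_m: float       # blended dollars per 1M tokens (placeholder values)
    task_classes: frozenset  # task classes this model is approved for


def pick_cheapest(options, task_class):
    """Return the lowest-priced option approved for task_class."""
    qualifying = [o for o in options if task_class in o.task_classes]
    if not qualifying:
        raise ValueError(f"no model approved for task class {task_class!r}")
    return min(qualifying, key=lambda o: o.price_per_m)


# Placeholder catalog for illustration only; not real models or prices.
CATALOG = [
    ModelOption("<small-open-model>", 0.20, frozenset({"classification", "extraction"})),
    ModelOption("<large-open-model>", 0.90, frozenset({"classification", "extraction", "summarization"})),
    ModelOption("<frontier-model>", 10.00, frozenset({"classification", "summarization", "reasoning"})),
]

print(pick_cheapest(CATALOG, "classification").model_id)
print(pick_cheapest(CATALOG, "reasoning").model_id)
```

Which models are "approved" for a task class is the part that needs real evaluation data; the routing itself stays trivial once that table exists.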

flo2 is a developer-first, bring-your-own-key LLM gateway built for exactly this workflow. You bring your own Together AI key (plus OpenAI, Anthropic, Gemini, Groq, Mistral, xAI, and more), pay each provider directly at their listed rate, and flo2 charges zero markup on tokens — no per-token cut on top of what Together charges you. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest qualifying model and falls back automatically on errors or rate limits. It's free during Beta, so you can add Together alongside your other providers and start routing today.
