2026-06-03 · flo2 blog

Qwen API Guide: Access Qwen Models (Open + Hosted)

If you've been sleeping on Alibaba's open models, the Qwen API is worth a serious look. Qwen is a family of open-weight models — general-purpose, code-focused, and reasoning variants — that competes credibly with much pricier closed models on a range of tasks, including multilingual work and structured-output generation. You can reach Qwen two ways: through Alibaba's own hosted API (DashScope / Alibaba Cloud Model Studio) or through third-party inference providers that serve the open weights. This guide covers what Qwen is, both access paths with working code examples, what it's actually good for, and how to route Qwen behind a gateway for reliability and cost control.

One ground rule up front: Qwen model names, versions, and per-token prices move fast. A model ID that's canonical today may be superseded next quarter. Treat the official Alibaba and provider documentation as the source of truth for current model IDs and pricing, and verify anything below before shipping to production.

What is Qwen?

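Qwen is Alibaba's open-weight model family. Most checkpoints are released under the Apache 2.0 license, but terms have varied between releases (some larger models have shipped under a separate Qwen license), so check the license for each release you use. The family spans several distinct lines: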

Qwen (general). The core chat and instruction-following line. Multiple parameter sizes, from compact models suited for high-volume cheap inference up to large models competitive with mid-tier closed APIs on reasoning benchmarks. Verify current generation names — the family releases frequently.
Qwen Coder. Variants fine-tuned specifically for code generation, completion, and debugging, released under names such as Qwen2.5-Coder and Qwen3-Coder. These have attracted attention for outperforming same-size general models on coding benchmarks and for supporting long context windows, which help when a software task spans several files.
Qwen reasoning variants. Some releases include chain-of-thought or reasoning-optimized checkpoints (QwQ, for example, and the switchable thinking mode in Qwen3), in the vein of what DeepSeek-R1 did for DeepSeek's models. Verify current availability on your chosen host.

Most Qwen models are widely available on third-party inference hosts alongside Alibaba's own API, which gives you real flexibility in where you run them.

Two ways to access the Qwen API

Option 1: Alibaba DashScope / Model Studio API

Alibaba offers a first-party hosted API for Qwen through its DashScope platform (also reachable via Alibaba Cloud Model Studio). This is the canonical path if you want the full lineup, including hosted-only models that are not released as open weights (such as the Qwen-Max tier), plus direct access to Alibaba's infrastructure.

DashScope exposes an OpenAI-compatible endpoint, so if your code already targets /v1/chat/completions, the migration is mostly a base URL and API key swap. The base URL and exact model IDs differ from OpenAI — check the DashScope documentation for the current endpoint and model name strings before wiring in. To get a Qwen API key:

Create an Alibaba Cloud account and enable the DashScope or Model Studio service.
Navigate to the API key section of the dashboard and create a new key.
Copy it immediately — full secrets are typically shown once.
Review DashScope's billing and any free-tier quota (the documentation covers current credit allowances).

Option 2: Third-party inference hosts

Because most Qwen checkpoints are open weights, a range of inference hosts serve them as hosted APIs, often at very competitive per-token rates. Hosts that have carried Qwen models include DeepInfra, Together AI, and Fireworks AI, and OpenRouter, an aggregator that routes to such hosts behind a single key, lists them as well. Each exposes an OpenAI-compatible endpoint with its own base URL and model ID format. Verify current availability and exact model IDs on each provider's models page, since open-weight model catalogs turn over frequently.

Third-party hosts are attractive when you want to compare pricing across providers, already have a key at one of them, or want to route across multiple hosts with a gateway (more on that below).
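Since every host differs mainly in base URL, API key, and model ID, it can help to keep those three values in one place. The sketch below assumes the OpenAI-compatible base URLs each host documented at the time of writing (verify them before use), uses hypothetical environment variable names, and leaves model IDs as placeholders. `Provider` and `client_kwargs` are illustrative names, not part of any SDK.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    base_url: str  # OpenAI-compatible base URL; verify against the host's docs
    key_env: str   # hypothetical env var name holding this host's API key
    model_id: str  # placeholder: copy the real ID from the host's catalog


# Base URLs as documented at the time of writing; model IDs are placeholders.
PROVIDERS = {
    "dashscope": Provider(
        "dashscope",
        "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # international region
        "DASHSCOPE_API_KEY",
        "YOUR_VERIFIED_MODEL_ID",
    ),
    "deepinfra": Provider(
        "deepinfra", "https://api.deepinfra.com/v1/openai",
        "DEEPINFRA_API_KEY", "YOUR_VERIFIED_MODEL_ID",
    ),
    "together": Provider(
        "together", "https://api.together.xyz/v1",
        "TOGETHER_API_KEY", "YOUR_VERIFIED_MODEL_ID",
    ),
    "fireworks": Provider(
        "fireworks", "https://api.fireworks.ai/inference/v1",
        "FIREWORKS_API_KEY", "YOUR_VERIFIED_MODEL_ID",
    ),
    "openrouter": Provider(
        "openrouter", "https://openrouter.ai/api/v1",
        "OPENROUTER_API_KEY", "YOUR_VERIFIED_MODEL_ID",
    ),
}


def client_kwargs(provider_name: str, env=os.environ) -> dict:
    """Return keyword arguments for openai.OpenAI(...) for the named provider."""
    p = PROVIDERS[provider_name]
    key = env.get(p.key_env)
    if not key:
        # Fail early with a clear message instead of a 401 from the host.
        raise RuntimeError(f"Set {p.key_env} to use {p.name}")
    return {"base_url": p.base_url, "api_key": key}
```

With the openai package installed, `OpenAI(**client_kwargs("deepinfra"))` gives you a client for that host, so switching hosts becomes a one-word change.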

Qwen API: curl and Python examples

Both DashScope and third-party hosts expose an OpenAI-compatible surface. The two things that differ per host are the base URL and the model ID string. The examples below use environment variables — substitute the correct endpoint and model ID from your chosen provider's documentation.

curl

curl "$QWEN_BASE_URL/chat/completions" \
  -H "Authorization: Bearer $QWEN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "YOUR_VERIFIED_MODEL_ID",
    "messages": [
      {"role": "user", "content": "Write a Python function that parses ISO 8601 timestamps."}
    ]
  }'

Replace $QWEN_BASE_URL with your host's base URL (for example https://api.deepinfra.com/v1/openai for DeepInfra, or the DashScope endpoint from Alibaba's docs) and YOUR_VERIFIED_MODEL_ID with a current model ID from that host's catalog. The response shape mirrors the OpenAI Chat Completions format.
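The fields you usually need from that response are `choices[0].message.content` and the `usage` token counts. Here is a minimal standard-library sketch for reading a response body saved from the curl call (for example with `curl ... -o reply.json`). `extract_reply` is a hypothetical helper, and whether `usage` is populated can vary by host.

```python
import json


def extract_reply(raw: str) -> tuple[str, dict]:
    """Pull the assistant text and token usage out of a Chat Completions body."""
    body = json.loads(raw)
    message = body["choices"][0]["message"]
    # content can be null (e.g. when the model returns tool calls); normalise to "".
    text = message.get("content") or ""
    # Typically prompt_tokens / completion_tokens / total_tokens; may be absent.
    usage = body.get("usage") or {}
    return text, usage
```

Call it as `extract_reply(open("reply.json").read())` and log the usage numbers alongside each request if you want to track cost per call.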

Python (openai SDK)

from openai import OpenAI
import os

client = OpenAI(
    base_url=os.environ["QWEN_BASE_URL"],  # verify per host
    api_key=os.environ["QWEN_API_KEY"],
)

resp = client.chat.completions.create(
    model="YOUR_VERIFIED_MODEL_ID",  # check host's model catalog
    messages=[
        {
            "role": "system",
            "content": "You are a careful, concise coding assistant.",
        },
        {
            "role": "user",
            "content": "Explain the difference between __repr__ and __str__ in Python.",
        },
    ],
    temperature=0.2,
)

print(resp.choices[0].message.content)

No Qwen-specific library is required — the standard openai package works with any OpenAI-compatible endpoint via the base_url override. Streaming works the same way: pass stream=True and iterate the chunks. For the Qwen3 API or Qwen Coder API specifically, verify the model ID format on your chosen host — the naming convention varies (e.g. Qwen/Qwen2.5-Coder-32B-Instruct on some hosts, a short alias on others).
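A streamed response arrives as a sequence of chunks whose `choices[0].delta.content` carries the next slice of text. The sketch below, built around a hypothetical `collect_stream` helper, prints tokens as they arrive and returns the full reply. It assumes OpenAI-style chunk objects, skips empty deltas, and tolerates chunks with no `choices`, which can appear at the end of a stream when usage reporting is enabled.

```python
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate text deltas from an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices (e.g. a trailing usage-only chunk).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks often have None or "" here
            parts.append(delta)
            print(delta, end="", flush=True)  # show tokens as they arrive
    print()
    return "".join(parts)


# Usage with the client from the example above:
#   stream = client.chat.completions.create(
#       model="YOUR_VERIFIED_MODEL_ID", messages=[...], stream=True)
#   full_text = collect_stream(stream)
```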

What Qwen is actually good for

Qwen's practical strengths make it worth benchmarking for a specific set of use cases:

Strong quality at low cost. Qwen models consistently rank near the top of open-weight benchmarks for their parameter size. The smaller variants are cheap enough for high-volume inference jobs where frontier API pricing would be prohibitive.
Coding tasks. The Qwen Coder line is purpose-built for software work — code generation, completion, refactoring, and debugging. It's worth head-to-head evaluation against DeepSeek-Coder and other coding-specialist models for your specific stack. Results vary by language and task type, so measure rather than assume.
Multilingual workloads. Alibaba trained Qwen with heavy multilingual data coverage, and it's frequently cited as one of the stronger open-weight options for non-English text — Chinese in particular, but also other Asian languages and a wide set of European ones. If you're building for a multilingual user base, it's among the first open models to evaluate.
Long context. Several Qwen variants support longer context windows than many comparable open models — useful for document analysis, large codebase ingestion, or retrieval-augmented generation where you want to fit more context in one call. Verify the context length for the specific checkpoint you're using.
Structured output and JSON mode. Instruction-tuned Qwen variants respond well to structured-output prompts, which matters for extraction pipelines and agents that parse model output programmatically.
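For extraction, OpenAI-compatible endpoints often accept `response_format={"type": "json_object"}` (JSON mode), but support varies by host and model, so confirm it before relying on it. Either way, validate the output yourself. A minimal sketch with a hypothetical two-field contact schema; `json_request` and `parse_contact` are illustrative names.

```python
import json

REQUIRED_KEYS = {"name", "email"}  # hypothetical extraction schema


def json_request(model: str, text: str) -> dict:
    """Build Chat Completions kwargs that ask for a JSON object (JSON mode)."""
    return {
        "model": model,
        # Not every host supports response_format; check its docs first.
        # OpenAI's JSON mode requires the prompt to mention JSON, as this one does.
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system",
             "content": "Extract the contact as JSON with keys name and email."},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
    }


def parse_contact(raw: str) -> dict:
    """Parse model output and fail loudly if it is not the expected object."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Call it as `client.chat.completions.create(**json_request(model_id, text))` and pass `choices[0].message.content` to `parse_contact`. A `ValueError` (which includes `json.JSONDecodeError`) is your signal to retry or fall back.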

Honest caveat: model capability is task-specific. Benchmark headlines may not carry over to your prompts. Run your own eval with representative inputs before committing Qwen to a production path.
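An eval does not need a framework to be useful. A minimal sketch, assuming you wrap your model call in a function `generate(prompt) -> str`; the two cases and the `pass_rate` helper are hypothetical placeholders to replace with prompts and checks drawn from your real workload.

```python
from typing import Callable

# Each case pairs a representative prompt with a check on the reply.
# These are placeholders; use prompts taken from your own traffic.
CASES = [
    ("Return only the number: 17 + 25", lambda out: out.strip() == "42"),
    ("Name the capital of France in one word.", lambda out: "paris" in out.lower()),
]


def pass_rate(generate: Callable[[str], str], cases=CASES) -> float:
    """Run every case through `generate` and return the fraction that pass."""
    if not cases:
        return 0.0
    passed = 0
    for prompt, check in cases:
        try:
            ok = check(generate(prompt))
        except Exception:
            ok = False  # count errors as failures instead of aborting the run
        passed += bool(ok)
    return passed / len(cases)
```

Run the same cases against Qwen and your current model, for example with `generate = lambda p: client.chat.completions.create(model=MODEL_ID, messages=[{"role": "user", "content": p}]).choices[0].message.content`, and compare pass rates before switching.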

Routing Qwen behind a gateway

Using a single Qwen endpoint directly is fine for prototyping. In production you hit a familiar set of problems: one provider's outage becomes your outage, rate limits return HTTP 429s at the worst moments, and model IDs get deprecated without warning. If you're routing across DashScope and third-party hosts, you also end up managing multiple keys and base URLs scattered across your services.
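Before reaching for a gateway, the usual first defence against 429s is retrying with exponential backoff. The openai SDK already retries a small number of times by default (configurable via `max_retries` on the client); the sketch below is a hypothetical `with_backoff` wrapper for when you want explicit control. It assumes errors expose an HTTP `status_code` attribute, as the openai SDK's `APIStatusError` does, and retries only 429 and 5xx responses.

```python
import random
import time


def with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on HTTP 429 or 5xx errors with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Errors without a status_code (e.g. bad arguments) are not retried.
            status = getattr(exc, "status_code", None)
            retryable = status == 429 or (status is not None and status >= 500)
            if not retryable or attempt == max_attempts - 1:
                raise
            # 1s, 2s, 4s, ... plus up to 50% random jitter so clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay * (1 + random.random() * 0.5))
```

Wrap a request as `with_backoff(lambda: client.chat.completions.create(...))`. Backoff smooths over brief rate limiting, but it cannot help during a prolonged outage at the provider.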

The cleaner pattern is to put Qwen — and your other model choices — behind an LLM gateway with automatic fallback. Your code calls one stable endpoint; the gateway routes to the cheap Qwen model by default, and transparently retries on another provider or model when the primary is unavailable or rate-limited. You get cost efficiency as the default without hard single-provider dependency.
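The fallback idea itself is simple. A rough client-side approximation tries providers in priority order and returns the first success. This is an illustration under stated assumptions, not how flo2 or any particular gateway implements routing; `first_success` is a hypothetical helper, and it catches every exception for brevity, where production code would catch the SDK's API errors specifically.

```python
def first_success(attempts):
    """Try (name, call) pairs in order; return (name, result) from the first success."""
    errors = {}
    for name, call in attempts:
        try:
            return name, call()
        except Exception as exc:  # outage, 429, deprecated model ID, ...
            errors[name] = exc  # remember why this provider failed
    raise RuntimeError(f"all providers failed: {errors}")
```

Pass a list like `[("deepinfra", call_a), ("dashscope", call_b)]`, where each callable wraps a `chat.completions.create` call on that host's client. Copying this logic, plus keys and base URLs, into every service is the duplication a gateway removes.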

This is directly relevant to how flo2 is designed. flo2 is a developer-first, bring-your-own-key LLM gateway with zero markup on tokens — you add your own provider keys (DashScope, DeepInfra, Together, OpenAI, Anthropic, and others), pay each provider directly, and flo2 takes no per-token cut. One OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model and falls back automatically when a provider is down or rate-limited. That lets a Qwen model be your low-cost default without becoming a single point of failure — and per-call cost accounting shows exactly what each request actually costs. It's free during Beta, so you can route Qwen in behind a fallback and start measuring today.

For more on the open-model landscape and how to compare providers, see our guide to the best open-source LLM APIs and our DeepInfra API guide, which covers another popular host for Qwen and other open-weight models in detail.

One key, every model — zero markup.

Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.
