2026-06-03 · flo2 blog

Batch API Discounts: Cheaper LLM Calls for Async Workloads

The batch API discount is one of the most underused cost levers in LLM development. If your workload does not need a response in the next second — think evaluation pipelines, bulk classification, nightly summarization, or embedding generation — submitting requests asynchronously through a provider's batch endpoint can cut your per-token spend meaningfully compared to the standard synchronous API. This guide explains how batch APIs work across OpenAI and Anthropic, when to use them, the trade-offs you are accepting, and how to layer batch discounts with prompt caching for maximum savings.

What is a batch API discount?

Most LLM providers offer two modes for inference: a synchronous API that processes each request immediately and returns a response within seconds, and an asynchronous batch API that accepts a collection of requests, queues them, and returns results within a longer window, typically around 24 hours, though the exact SLA varies by provider and model. In exchange for the relaxed latency requirement, providers discount per-token rates, often by around half of standard pricing. Verify the exact discount and SLA on each provider's current pricing page, as these figures change.

The economic logic: when a provider does not have to guarantee low latency, it fills idle GPU capacity with your batch workload. You absorb the scheduling uncertainty; the provider passes part of the savings back.
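
To make the discount concrete, here is a minimal sketch of the arithmetic. The per-million-token rates, the workload size, and the 50% discount are placeholder assumptions for illustration, not any provider's actual pricing.

```python
def request_cost(input_tokens, output_tokens, input_rate, output_rate, discount=0.0):
    """Cost in dollars for one request.

    Rates are dollars per 1M tokens; discount is a fraction (0.5 = half off).
    """
    full = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return full * (1.0 - discount)


# Hypothetical workload: 100,000 requests, 1,500 input and 200 output tokens each.
N_REQUESTS = 100_000
INPUT_RATE, OUTPUT_RATE = 1.00, 4.00  # assumed $/1M tokens, not real prices
BATCH_DISCOUNT = 0.5                  # assumed; check the provider's pricing page

sync_total = N_REQUESTS * request_cost(1500, 200, INPUT_RATE, OUTPUT_RATE)
batch_total = N_REQUESTS * request_cost(1500, 200, INPUT_RATE, OUTPUT_RATE, BATCH_DISCOUNT)

print(f"sync: ${sync_total:.2f}, batch: ${batch_total:.2f}")
# sync: $230.00, batch: $115.00
```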

When batch processing fits your workload

Batch APIs are a good match for any task that can tolerate a multi-hour result window and does not depend on streaming output. Common use cases where developers consistently reach for batch APIs include evaluation pipelines, bulk classification, nightly summarization, and embedding generation.

Batch APIs are not appropriate for real-time user-facing features. Chat interfaces, code completion, live document editing, and anything that requires a streaming response or sub-second latency should go through the standard synchronous API.
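
The fit test above can be written down as a small routing rule. This is a sketch only: the Workload type, the choose_api helper, and the 24-hour window constant are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

BATCH_WINDOW_SECONDS = 24 * 60 * 60  # assumed worst-case result window; verify per provider


@dataclass
class Workload:
    name: str
    deadline_seconds: float  # how long downstream consumers can wait for results
    needs_streaming: bool    # True if output must be streamed to a user


def choose_api(w: Workload) -> str:
    """Return "batch" only when the workload can absorb the full batch window."""
    if w.needs_streaming:
        return "sync"
    if w.deadline_seconds < BATCH_WINDOW_SECONDS:
        return "sync"
    return "batch"


nightly = Workload("nightly-summaries", deadline_seconds=36 * 3600, needs_streaming=False)
chat = Workload("support-chat", deadline_seconds=2, needs_streaming=True)
print(choose_api(nightly), choose_api(chat))  # batch sync
```

Comparing the deadline against the full window is deliberately conservative: a job often finishes sooner, but nothing downstream should depend on that.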

How OpenAI's Batch API works

OpenAI's Batch API is built around JSONL files. The general flow, which you should verify against the current OpenAI docs:

  1. Build a .jsonl file — one JSON object per line, each with a custom ID, endpoint (/v1/chat/completions), and request body.
  2. Upload to the Files API to get a file ID, then create a batch job with a completion window (e.g. 24h).
  3. Poll the batch status endpoint until the job reaches a terminal state (completed, failed, expired, or cancelled).
  4. Download the JSONL output file (and error file if any), keyed by your custom IDs.

# Minimal request object, pretty-printed here for readability. In the real .jsonl file each request must be serialized onto a single line.
{
  "custom_id": "req-001",
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a classifier. Reply with one word."},
      {"role": "user", "content": "Classify this text: 'Great product, arrived fast!'"}
    ],
    "max_tokens": 10
  }
}
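
A file of such requests can be generated with the standard library alone. The sketch below mirrors the request shape shown above; the helper names build_batch_lines and write_jsonl and the req-NNN ID scheme are illustrative assumptions, not part of any SDK.

```python
import json


def build_batch_lines(texts, model="gpt-4o-mini"):
    """Return one compact JSON string per request, ready to write as JSONL."""
    lines = []
    for i, text in enumerate(texts, start=1):
        request = {
            "custom_id": f"req-{i:03d}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are a classifier. Reply with one word."},
                    {"role": "user", "content": f"Classify this text: {text!r}"},
                ],
                "max_tokens": 10,
            },
        }
        # json.dumps without indent escapes newlines, so each request stays on one line.
        lines.append(json.dumps(request))
    return lines


def write_jsonl(path, lines):
    """Write the request lines to a .jsonl file, one request per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


lines = build_batch_lines(["Great product, arrived fast!", "Broke\nafter a day"])
```

Unique custom IDs matter here because they are the only way to match results back to inputs once the job finishes.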

Each request in the batch is independent. Failures in individual items do not cancel the whole job — you retrieve partial results for items that succeeded and check the error file for items that did not.
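
Handling partial success might look like the following sketch. It assumes each line of the output and error files carries custom_id, a response object with status_code and body, and an error field; confirm those field names against the current OpenAI docs before relying on them. split_results is a hypothetical helper.

```python
import json


def split_results(output_jsonl: str, error_jsonl: str = ""):
    """Return (succeeded, failed) dicts keyed by custom_id."""
    succeeded, failed = {}, {}
    for raw in (output_jsonl + "\n" + error_jsonl).splitlines():
        if not raw.strip():
            continue  # skip blank lines between or after records
        item = json.loads(raw)
        response = item.get("response") or {}
        if item.get("error") is None and response.get("status_code") == 200:
            succeeded[item["custom_id"]] = response["body"]
        else:
            failed[item["custom_id"]] = item.get("error") or response.get("body")
    return succeeded, failed


# Made-up sample data in the assumed shape.
sample_output = json.dumps(
    {"custom_id": "req-001", "response": {"status_code": 200, "body": {"ok": True}}, "error": None}
)
sample_errors = json.dumps(
    {"custom_id": "req-002", "response": None, "error": {"code": "example_error"}}
)
succeeded, failed = split_results(sample_output, sample_errors)
```

The keys of failed are exactly the custom IDs to resubmit in a follow-up batch.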

How Anthropic's Message Batches API works

Anthropic's equivalent is the Message Batches API. The concept is the same — submit many requests, wait for async processing, retrieve results — but the API surface uses JSON rather than file uploads. The high-level flow, per the current Anthropic docs:

  1. POST a list of request objects to the batch endpoint — each with a custom_id and a params block in standard Messages API shape. Receive a batch ID.
  2. Poll the batch status endpoint until processing completes.
  3. Stream the JSONL results file via the URL returned in the status response, keyed by your custom IDs.
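
The steps above can be sketched without committing to a particular HTTP client. build_batch_payload and poll_until_ended are hypothetical helpers, fetch_status stands in for whatever function performs the status request, and the processing_status field with its "ended" value reflects my reading of the Anthropic docs, so verify both. The model ID is a placeholder.

```python
import time


def build_batch_payload(prompts, model, max_tokens=10):
    """Build the JSON body: a list of entries with custom_id and Messages-style params."""
    return {
        "requests": [
            {
                "custom_id": f"req-{i:03d}",
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            for i, prompt in enumerate(prompts, start=1)
        ]
    }


def poll_until_ended(fetch_status, interval_seconds=60.0, max_polls=1500, sleep=time.sleep):
    """Call fetch_status() until it reports completion; return the final status dict."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("processing_status") == "ended":  # assumed field name and value
            return status
        sleep(interval_seconds)
    raise TimeoutError("batch did not finish within the polling budget")


payload = build_batch_payload(["Classify: 'Great product!'"], model="your-model-id")
```

Injecting fetch_status and sleep keeps the loop testable offline; in production, fetch_status would wrap the authenticated status call.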

Like OpenAI's implementation, individual request failures within a batch do not abort the entire job. Each result object has a result field whose type is succeeded, errored, canceled, or expired, so you can retry only the items that did not succeed.
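
Picking out the items to retry is a small grouping step. The sketch below assumes each results line looks like {"custom_id": ..., "result": {"type": ...}} and treats every type other than succeeded as retryable; partition_results is a hypothetical helper, and the sample data is made up.

```python
import json


def partition_results(results_jsonl: str):
    """Group custom_ids by result type and list the ones worth retrying."""
    by_type = {}
    for raw in results_jsonl.splitlines():
        if not raw.strip():
            continue  # tolerate blank lines in the stream
        item = json.loads(raw)
        by_type.setdefault(item["result"]["type"], []).append(item["custom_id"])
    retry_ids = [cid for kind, ids in by_type.items() if kind != "succeeded" for cid in ids]
    return by_type, retry_ids


sample = "\n".join([
    json.dumps({"custom_id": "req-001", "result": {"type": "succeeded"}}),
    json.dumps({"custom_id": "req-002", "result": {"type": "errored"}}),
    json.dumps({"custom_id": "req-003", "result": {"type": "canceled"}}),
])
by_type, retry_ids = partition_results(sample)
```

Whether every errored item deserves a retry is a judgment call: a malformed request will fail again, so inspect the error detail before resubmitting.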

Trade-offs: what you give up with batch APIs

Property                  | Synchronous API                                | Batch API
--------------------------|------------------------------------------------|--------------------------------------------------
Latency                   | Seconds                                        | Up to ~24 hours (verify per provider)
Per-token cost            | Standard rate                                  | Discounted, often around half; verify
Streaming                 | Supported                                      | Not available
Suitable for real-time UX | Yes                                            | No
Result format             | Inline response object                         | JSONL file / streamed results
Error handling            | Per-request, in-band                           | Per-item in results file; partial success possible
Rate limit exposure       | Token-per-minute and request-per-minute limits | Separate batch-specific limits

The most common mistake is routing all traffic through batch APIs to save money. Batch is only appropriate when your pipeline can genuinely absorb a multi-hour window. If anything downstream is waiting, use the synchronous API.

Combining batch discounts with prompt caching

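Batch discounts and prompt caching can stack: where a provider applies both to the same request, the discounts compound, and the combination often produces the lowest per-request cost available without switching models. Check each provider's docs for whether, and how reliably, cache hits apply inside batch jobs.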

A rough mental model of how the combination works in practice: the batch discount reduces the cost of all tokens in the request, and prompt caching reduces the cost of the cached-prefix tokens further. On a request where a 2,000-token system prompt is fully cached, the effective input cost can be a fraction of the standard rate, and lower than with either optimization alone.
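
That mental model can be turned into a few lines of arithmetic. Everything numeric here is an assumption for illustration: a 50% batch discount, cached tokens billed at 10% of the input rate, a $1 per 1M token input rate, and a provider that applies both discounts multiplicatively to the same request. Cache-write surcharges are ignored.

```python
def effective_input_cost(total_input_tokens, cached_tokens, input_rate,
                         batch_discount=0.0, cached_read_multiplier=1.0):
    """Input cost in dollars for one request; input_rate is dollars per 1M tokens.

    cached_read_multiplier is the fraction of the input rate charged for
    cached tokens (1.0 = no caching benefit).
    """
    uncached = total_input_tokens - cached_tokens
    billable = uncached + cached_tokens * cached_read_multiplier
    return billable * input_rate / 1_000_000 * (1.0 - batch_discount)


RATE = 1.00  # assumed $/1M input tokens
# 2,500 input tokens, of which a 2,000-token system prompt is fully cached.
standard = effective_input_cost(2500, 2000, RATE)
batch_only = effective_input_cost(2500, 2000, RATE, batch_discount=0.5)
cache_only = effective_input_cost(2500, 2000, RATE, cached_read_multiplier=0.1)
both = effective_input_cost(2500, 2000, RATE, batch_discount=0.5, cached_read_multiplier=0.1)
```

Under these assumptions the combined cost is 14% of the standard input cost, versus 50% for batch alone and 28% for caching alone.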

For the mechanics of prompt caching in more detail, see the prompt caching savings guide. For the underlying token economics that make these discounts matter, see AI tokenomics.

Tracking batch savings in your cost accounting

Batch costs are not immediately visible — the bill shows up after results arrive. Developers who run large batch jobs without per-request tracking often discover the spend only at the end of the billing cycle.

A BYOK gateway that does cost accounting on your provider keys solves this. Because the gateway mediates each API call — including batch submission and result retrieval — it attributes token counts and costs to specific jobs, teams, or features as results come in. With zero token markup, the numbers you see are your actual spend.
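
Per-job attribution of this kind comes down to aggregating token counts by a job tag. The sketch below is not how any particular gateway implements it: the "<job>:<n>" custom ID convention, the result record fields, the rates, and the discount are all hypothetical.

```python
from collections import defaultdict


def attribute_costs(results, input_rate, output_rate, batch_discount=0.5):
    """Sum batch cost in dollars per job tag; rates are dollars per 1M tokens."""
    totals = defaultdict(float)
    for r in results:
        job = r["custom_id"].split(":", 1)[0]  # job tag is the part before the first colon
        cost = (r["input_tokens"] * input_rate + r["output_tokens"] * output_rate) / 1_000_000
        totals[job] += cost * (1.0 - batch_discount)
    return dict(totals)


# Made-up usage records, as they might be collected while results arrive.
results = [
    {"custom_id": "evals:1", "input_tokens": 1000, "output_tokens": 100},
    {"custom_id": "evals:2", "input_tokens": 3000, "output_tokens": 100},
    {"custom_id": "summaries:1", "input_tokens": 8000, "output_tokens": 500},
]
per_job = attribute_costs(results, input_rate=1.00, output_rate=4.00)
```

Encoding the job tag in the custom ID keeps attribution possible from the results file alone, with no separate lookup table.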

For a comparison of how batch pricing stacks up between OpenAI and Anthropic across model tiers, see the GPT vs Claude pricing breakdown.

Putting it into practice

Switching an offline workload from the synchronous API to batch is usually a small code change — the request bodies are identical; only the submission and retrieval wrappers differ. Confirm the current discount and SLA on each provider's pricing page before building around them, structure prompts for cache hits (stable prefix first), and add per-job cost tracking so savings are visible, not just theoretical.

If you want a single endpoint that routes across providers, handles batch and synchronous traffic, and gives you real cost accounting on your own keys, flo2 is built for that — free during beta, zero markup, bring your own provider keys.
