2026-06-03 · flo2 blog

Batch API Discounts: Cheaper LLM Calls for Async Workloads

The batch API discount is one of the most underused cost levers in LLM development. If your workload does not need a response in the next second — think evaluation pipelines, bulk classification, nightly summarization, or embedding generation — submitting requests asynchronously through a provider's batch endpoint can cut your per-token spend meaningfully compared to the standard synchronous API. This guide explains how batch APIs work across OpenAI and Anthropic, when to use them, the trade-offs you are accepting, and how to layer batch discounts with prompt caching for maximum savings.

What is a batch API discount?

Most LLM providers offer two modes for inference: a synchronous API that processes each request immediately and returns a response within seconds, and an asynchronous batch API that accepts a collection of requests, queues them, and returns results within a longer window, typically around 24 hours, though the exact SLA varies by provider and model. In exchange for the relaxed latency requirement, providers discount per-token rates, often by around half of standard pricing. Verify the exact discount and SLA on each provider's current pricing page, as these figures change.

The economic logic: when a provider does not have to guarantee low latency, it fills idle GPU capacity with your batch workload. You absorb the scheduling uncertainty; the provider passes part of the savings back.

When batch processing fits your workload

Batch APIs are a good match for any task that can tolerate a multi-hour result window and does not depend on streaming output. Common use cases where developers consistently reach for batch APIs:

Evaluation pipelines. Running an LLM as a judge over hundreds or thousands of model outputs is the canonical batch use case. You submit all your eval requests before you sleep; results are ready in the morning.
Bulk document classification or tagging. Classifying support tickets, routing customer emails, labeling a dataset — these are high-volume, non-real-time tasks where batch pricing makes a meaningful dent in your monthly bill.
Large-scale summarization. Summarizing a backlog of articles, reports, or meeting transcripts is exactly the kind of offline pipeline batch APIs were designed for.
Embedding generation. Generating embeddings for a corpus of documents before indexing into a vector store is a natural fit — you have all the inputs upfront and no latency requirement.
Nightly enrichment or reporting jobs. Any scheduled pipeline that runs on a cron and consumes LLM calls fits the batch model well.

Batch APIs are not appropriate for real-time user-facing features. Chat interfaces, code completion, live document editing, and anything that requires a streaming response or sub-second latency should go through the standard synchronous API.

How OpenAI's Batch API works

OpenAI's Batch API is built around JSONL files. The general flow, which you should verify against the current OpenAI docs:

Build a .jsonl file — one JSON object per line, each with a custom ID, endpoint (/v1/chat/completions), and request body.
Upload to the Files API to get a file ID, then create a batch job with a completion window (e.g. 24h).
Poll the batch status endpoint until the job reaches a terminal state (completed, failed, expired, or cancelled).
Download the JSONL output file (and error file if any), keyed by your custom IDs.

# Minimal JSONL request, pretty-printed here for readability. In your .jsonl file each request must sit on a single line, and comments like this one are not allowed.
{
  "custom_id": "req-001",
  "method": "POST",
  "url": "/v1/chat/completions",
  "body": {
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a classifier. Reply with one word."},
      {"role": "user", "content": "Classify this text: 'Great product, arrived fast!'"}
    ],
    "max_tokens": 10
  }
}
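
A minimal Python sketch of step 1, assuming the request shape shown above. The helper name `build_batch_lines` and the sample texts are illustrative and not part of any SDK. The point it demonstrates is that each request is serialized onto exactly one line of the file.

```python
import json

def build_batch_lines(texts, model="gpt-4o-mini"):
    """Return one compact JSON string per request, ready to write to a .jsonl file."""
    lines = []
    for i, text in enumerate(texts, start=1):
        request = {
            "custom_id": f"req-{i:03d}",  # your own key for matching results later
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": "You are a classifier. Reply with one word."},
                    {"role": "user", "content": f"Classify this text: {text!r}"},
                ],
                "max_tokens": 10,
            },
        }
        # json.dumps escapes any embedded newlines, so each request stays on one line.
        lines.append(json.dumps(request))
    return lines

# Hypothetical inputs for illustration.
sample_lines = build_batch_lines(["Great product, arrived fast!", "Broken on arrival."])
jsonl_payload = "\n".join(sample_lines) + "\n"  # this string is what you would write to disk
```

Generating `custom_id` from your own record keys (rather than a counter) makes it easier to join results back to source rows later.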

Each request in the batch is independent. Failures in individual items do not cancel the whole job — you retrieve partial results for items that succeeded and check the error file for items that did not.
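
Handling partial success can be sketched as follows. This assumes each output line carries your `custom_id`, a `response` object with `status_code` and `body`, and an `error` field; confirm the exact field names against the current OpenAI docs. The function name and sample data are hypothetical.

```python
import json

def split_batch_output(jsonl_text):
    """Partition batch output lines into successes and failures, keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        item = json.loads(line)
        response = item.get("response") or {}
        if item.get("error") is None and response.get("status_code") == 200:
            succeeded[item["custom_id"]] = response.get("body")
        else:
            # Keep whatever diagnostic is available for the failed item.
            failed[item["custom_id"]] = item.get("error") or response.get("body")
    return succeeded, failed

# Hypothetical output: one success and one per-item failure.
sample_output = "\n".join([
    json.dumps({"custom_id": "req-001",
                "response": {"status_code": 200, "body": {"ok": True}},
                "error": None}),
    json.dumps({"custom_id": "req-002",
                "response": None,
                "error": {"code": "invalid_request"}}),
])
ok, bad = split_batch_output(sample_output)
```

The keys of `bad` are the `custom_id` values you would resubmit in a follow-up batch.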

How Anthropic's Message Batches API works

Anthropic's equivalent is the Message Batches API. The concept is the same — submit many requests, wait for async processing, retrieve results — but the API surface uses JSON rather than file uploads. The high-level flow, per the current Anthropic docs:

POST a list of request objects to the batch endpoint — each with a custom_id and a params block in standard Messages API shape. Receive a batch ID.
Poll the batch status endpoint until processing completes.
Stream the JSONL results file via the URL returned in the status response, keyed by your custom IDs.
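
The first two steps can be sketched without committing to a particular SDK. The helper names, the placeholder model ID, and the `"ended"` status string are assumptions for illustration; `get_status` stands in for whatever call you use to fetch the batch's processing status.

```python
import time

def build_message_batch(texts, model, system_prompt):
    """Build the request list: one custom_id plus a Messages-style params block each."""
    return [
        {
            "custom_id": f"doc-{i:03d}",
            "params": {
                "model": model,
                "max_tokens": 256,
                "system": system_prompt,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for i, text in enumerate(texts, start=1)
    ]

def wait_until_done(get_status, done_states=("ended",), interval_s=60.0,
                    timeout_s=26 * 3600, sleep=time.sleep):
    """Poll get_status() until it returns a state in done_states or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in done_states:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"batch still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s

# Illustration with a fake status source and a no-op sleep.
batch_requests = build_message_batch(["first doc", "second doc"], "your-model-id", "Summarize.")
fake_statuses = iter(["in_progress", "in_progress", "ended"])
final_status = wait_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
```

A polling interval of a minute or more is usually plenty for a job with a multi-hour window; the timeout is set slightly above 24 hours so an expired batch surfaces as an error rather than an endless loop.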

Like OpenAI's implementation, individual request failures within a batch do not abort the entire job. Each result object has a result field whose type is succeeded, errored, canceled, or expired, so you can retry only the failed items.
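
Selecting the items to resubmit might look like this. It assumes each results line has a `custom_id` and a `result` object with a `type` field, and it treats an `expired` type (requests that ran out of time) as retryable alongside `errored`; check both assumptions against the current Anthropic docs. The sample data is made up.

```python
import json

# Assumption: canceled items were stopped on purpose, so they are not retried.
RETRYABLE_TYPES = {"errored", "expired"}

def ids_to_retry(results_jsonl):
    """Return the custom_ids whose result type suggests a resubmission."""
    retry = []
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue
        item = json.loads(line)
        if item["result"]["type"] in RETRYABLE_TYPES:
            retry.append(item["custom_id"])
    return retry

# Hypothetical results file with one item of each kind.
sample_results = "\n".join(json.dumps(row) for row in [
    {"custom_id": "doc-001", "result": {"type": "succeeded", "message": {"content": []}}},
    {"custom_id": "doc-002", "result": {"type": "errored", "error": {"type": "invalid_request"}}},
    {"custom_id": "doc-003", "result": {"type": "canceled"}},
    {"custom_id": "doc-004", "result": {"type": "expired"}},
])
retry_ids = ids_to_retry(sample_results)
```

Before retrying an errored item blindly, inspect its error payload: a malformed request will fail again, while a transient server error is worth resubmitting.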

Trade-offs: what you give up with batch APIs

Property	Synchronous API	Batch API
Latency	Seconds	Up to ~24 hours (verify per provider)
Per-token cost	Standard rate	Discounted — often around half; verify
Streaming	Supported	Not available
Suitable for real-time UX	Yes	No
Result format	Inline response object	JSONL file / streamed results
Error handling	Per-request, in-band	Per-item in results file; partial success possible
Rate limit exposure	Token-per-minute and request-per-minute limits	Separate batch-specific limits

The most common mistake is routing all traffic through batch APIs to save money. Batch is only appropriate when your pipeline can genuinely absorb a multi-hour window. If anything downstream is waiting, use the synchronous API.
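
That rule of thumb can be written down as a small routing check. This is a hypothetical helper, not a provider feature, and the 24-hour default is an assumption you should replace with your provider's actual completion window.

```python
def choose_api(needs_streaming, user_is_waiting, deadline_hours, batch_window_hours=24.0):
    """Return "batch" only when nothing downstream needs the result before the window closes."""
    if needs_streaming or user_is_waiting:
        return "sync"
    if deadline_hours < batch_window_hours:
        return "sync"  # the job could finish after the deadline
    return "batch"

# Illustrative workloads.
nightly_eval = choose_api(needs_streaming=False, user_is_waiting=False, deadline_hours=48)
chat_reply = choose_api(needs_streaming=True, user_is_waiting=True, deadline_hours=0)
hourly_report = choose_api(needs_streaming=False, user_is_waiting=False, deadline_hours=1)
```

The check is deliberately conservative: batches often finish well inside the window, but a pipeline should not depend on that.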

Combining batch discounts with prompt caching

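Batch discounts and prompt caching can be combined. Where a provider applies both to the same request, the discounts stack, and using them together often produces the lowest per-request cost available without switching models. Cache hits inside a batch are not guaranteed, so check each provider's docs for how the two interact.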

Here is how the combination works in practice:

Stable system prompt + batch workload. If all requests in your batch share a long system prompt (common in classification, summarization, or eval tasks), structure your prompts so the system prompt is at the top, because providers cache from the beginning of the prompt. Each request in the batch that hits the same cached prefix receives both the batch discount and the cached-input discount on those tokens.
Few-shot examples. If you embed a fixed set of few-shot examples in every request, those examples benefit from the prompt cache in the same way. Keep them before the variable user content.
Order matters for caching. For Anthropic, cache breakpoints must be explicitly marked with cache_control. For OpenAI, caching is implicit but activates after a threshold of matching prefix tokens. In both cases, the stable prefix should come first.
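
A sketch of the "stable prefix first" layout for an Anthropic-style request, with the cache breakpoint placed on the last stable block. The function name, placeholder model ID, and prompt text are illustrative; the `cache_control` shape follows Anthropic's documented format at the time of writing, so verify it before relying on it.

```python
def build_cached_params(model, system_prompt, few_shot_examples, user_text):
    """Put stable content first and mark the end of the stable prefix for caching."""
    return {
        "model": model,
        "max_tokens": 64,
        "system": [
            {"type": "text", "text": system_prompt},
            {
                "type": "text",
                "text": few_shot_examples,
                # Breakpoint on the last stable block: everything up to here is cacheable.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Variable content comes last so it never invalidates the shared prefix.
        "messages": [{"role": "user", "content": user_text}],
    }

# Two requests that differ only in the user text share an identical prefix.
params_a = build_cached_params("your-model-id", "You are a classifier.", "Example: ...", "Ticket A")
params_b = build_cached_params("your-model-id", "You are a classifier.", "Example: ...", "Ticket B")
```

Any per-request detail that leaks into the system blocks (a timestamp, a request ID) changes the prefix and defeats the cache, so keep those in the user message.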

A rough mental model: batch discount reduces the cost of all tokens in the request; prompt caching reduces the cached-prefix tokens further. On a request where a 2,000-token system prompt is fully cached, the effective input cost can be a fraction of the standard rate — and lower than either optimization alone.
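
To make the mental model concrete, here is the arithmetic with made-up numbers. The rate of 1.00 per million input tokens, the 0.5 batch multiplier, and the 0.1 cached-read multiplier are all hypothetical, the sketch assumes the two multipliers compound, and it ignores any cache-write surcharge. Substitute your provider's real figures.

```python
def input_cost(cached_tokens, fresh_tokens, rate_per_mtok, batch_mult=1.0, cache_mult=1.0):
    """Input cost for one request: cached tokens are discounted, then the batch rate applies to all."""
    billable = cached_tokens * cache_mult + fresh_tokens
    return billable * batch_mult * rate_per_mtok / 1_000_000

RATE = 1.00  # hypothetical price per million input tokens

# A 2,000-token cached system prompt plus 200 tokens of fresh user content.
standard = input_cost(2000, 200, RATE)
batch_only = input_cost(2000, 200, RATE, batch_mult=0.5)
cache_only = input_cost(2000, 200, RATE, cache_mult=0.1)
both = input_cost(2000, 200, RATE, batch_mult=0.5, cache_mult=0.1)
```

Under these illustrative multipliers the combined input cost works out to roughly one eleventh of the standard cost, lower than either optimization alone. Output tokens are not cached, so they only see the batch discount.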

For the mechanics of prompt caching in more detail, see the prompt caching savings guide. For the underlying token economics that make these discounts matter, see AI tokenomics.

Tracking batch savings in your cost accounting

Batch costs are not immediately visible — the bill shows up after results arrive. Developers who run large batch jobs without per-request tracking often discover the spend only at the end of the billing cycle.

A BYOK gateway that does cost accounting on your provider keys solves this. Because the gateway mediates each API call — including batch submission and result retrieval — it attributes token counts and costs to specific jobs, teams, or features as results come in. With zero token markup, the numbers you see are your actual spend.
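
Whether you use a gateway or roll your own, the core of per-job attribution is small. This sketch assumes you record a job label and token counts for each result as it arrives; the record layout, job names, and rates are hypothetical.

```python
from collections import defaultdict

def summarize_costs(records, input_rate_per_mtok, output_rate_per_mtok):
    """Sum cost per job from (job, input_tokens, output_tokens) records."""
    totals = defaultdict(float)
    for job, input_tokens, output_tokens in records:
        cost = (input_tokens * input_rate_per_mtok
                + output_tokens * output_rate_per_mtok) / 1_000_000
        totals[job] += cost
    return dict(totals)

# Hypothetical usage records collected while reading batch results.
usage_records = [
    ("nightly-eval", 1_200_000, 100_000),
    ("ticket-tagging", 400_000, 20_000),
    ("nightly-eval", 800_000, 50_000),
]
# Hypothetical batch-discounted rates per million tokens.
job_costs = summarize_costs(usage_records, input_rate_per_mtok=0.50, output_rate_per_mtok=2.00)
```

Pass the discounted batch rates, not the list prices, or the totals will overstate your spend.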

For a comparison of how batch pricing stacks up between OpenAI and Anthropic across model tiers, see the GPT vs Claude pricing breakdown.

Putting it into practice

Switching an offline workload from the synchronous API to batch is usually a small code change — the request bodies are identical; only the submission and retrieval wrappers differ. Confirm the current discount and SLA on each provider's pricing page before building around them, structure prompts for cache hits (stable prefix first), and add per-job cost tracking so savings are visible, not just theoretical.

If you want a single endpoint that routes across providers, handles batch and synchronous traffic, and gives you real cost accounting on your own keys, flo2 is built for that — free during beta, zero markup, bring your own provider keys.
