2026-06-03 · flo2 blog

LLM JSON Mode & Structured Outputs: Reliable Machine-Readable Replies

Free-text LLM replies are expressive, but expression is the enemy of reliability when your code needs to parse the answer. A model that sometimes wraps the data in a markdown code fence, sometimes adds a preamble sentence, and sometimes uses different key names on each run makes downstream logic fragile. LLM JSON mode and its stronger cousin, structured outputs, are the two standard mechanisms for getting a model to emit machine-readable JSON: JSON mode guarantees valid syntax, and structured outputs additionally enforce the shape you specify. This article covers how they work, how provider support differs, and best practices for production use.

Why free-text LLM replies break parsers

Models are trained to be helpful, and helpful often means conversational: "Sure! Here's the extracted data:" followed by a JSON blob, followed by an explanation of the fields. That's pleasant to read and a headache to parse. Your code has to strip the preamble, find the JSON boundary, handle markdown fences (```json ... ```), and pray the model didn't decide to add a trailing comment this time. Any one of those steps can fail, and not always loudly: the parse throws or, worse, returns wrong data without an error.
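To make the fragility concrete, here is a minimal sketch of the cleanup code that free-text replies force on you. The helper name extractJsonFromFreeText is hypothetical, and its heuristics (prefer a fenced block, then take the span from the first "{" to the last "}") are one plausible approach, not a recommended one.

```typescript
// Hypothetical helper: best-effort JSON extraction from a chatty reply.
// Every heuristic here is an assumption about how the model formats output.
const FENCE = "`".repeat(3); // a markdown code fence marker

function extractJsonFromFreeText(reply: string): unknown {
  // 1. Prefer the contents of a markdown fence if one is present.
  const fencePattern = new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE);
  const fenced = reply.match(fencePattern);
  const candidate = fenced?.[1] ?? reply;

  // 2. Take everything between the first "{" and the last "}".
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in reply");
  }

  // 3. Parse. This still throws on comments, single quotes, and similar noise.
  return JSON.parse(candidate.slice(start, end + 1));
}
```

Even this much code is easy to defeat: a preamble that happens to contain a brace, such as "Use {curly} braces", shifts the start of the slice and the parse throws.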

The fragility compounds as you add providers. Different models have different defaults for verbosity, quoting style, and structure. A prompt that reliably produces clean JSON on GPT-4o may produce a prose summary on a different model. If you're routing across providers (more on that below), a hardcoded parse assumption about output format becomes a latent multi-provider bug.

JSON mode: response_format = json_object

The first and most widely supported mechanism is JSON mode: you pass response_format: { type: "json_object" } in the API call, and the API constrains the model to return a syntactically valid JSON object, with no prose wrapper, no fences, and no trailing text.

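// OpenAI-compatible request with JSON mode
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment by default
const response = await openai.chat.completions.create({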
  model: "gpt-4o-mini",
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content:
        "Extract the following fields from the user's text and return as JSON: " +
        '{ "name": string, "email": string | null, "intent": "buy" | "support" | "other" }',
    },
    { role: "user", content: userMessage },
  ],
});

// content is typed as string | null, so fall back to an empty object if it is missing
const data = JSON.parse(response.choices[0].message.content ?? "{}");

A few caveats that catch developers off guard:

JSON mode guarantees valid JSON, not your schema. The model can return {"anything": "it_wants"}. You still have to validate the shape yourself.
Your prompt must mention JSON. OpenAI will raise an error if your messages contain no reference to JSON output — the mode alone is not enough to inform the model what you expect.
Refusals still happen. A model can refuse a request and return {"error": "I can't help with that"}. Valid JSON, wrong shape — validate before you use it.
Length limits apply. If the response is cut off by max_tokens, the JSON may be truncated mid-structure and fail to parse. Set max_tokens high enough, or check finish_reason.
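The caveats above translate into a small amount of defensive code. The sketch below assumes an OpenAI-style chat completion choice, reduced here to a local ChatChoice type, and the contact fields from the prompt in the example request. The names parseContact and isContact are hypothetical, and the hand-written check stands in for whatever validation library you already use.

```typescript
// Simplified stand-in for one entry of response.choices
// (assumption: only the fields used here are modeled).
interface ChatChoice {
  finish_reason: string;
  message: { content: string | null };
}

interface Contact {
  name: string;
  email: string | null;
  intent: "buy" | "support" | "other";
}

// Hand-written shape check; a schema validation library could replace it.
function isContact(value: unknown): value is Contact {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.name === "string" &&
    (typeof v.email === "string" || v.email === null) &&
    (v.intent === "buy" || v.intent === "support" || v.intent === "other")
  );
}

function parseContact(choice: ChatChoice): Contact {
  // A "length" finish means max_tokens cut the reply off mid-structure.
  if (choice.finish_reason !== "stop") {
    throw new Error(`Incomplete reply (finish_reason=${choice.finish_reason})`);
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(choice.message.content ?? "");
  } catch {
    throw new Error("Reply was not valid JSON");
  }
  // Valid JSON is not the same as the right shape (e.g. {"error": "..."}).
  if (!isContact(parsed)) {
    throw new Error("Reply JSON does not match the Contact shape");
  }
  return parsed;
}
```

Callers then deal with a typed Contact or a thrown error, never with a half-trusted object.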

Structured outputs: enforcing your schema

JSON mode is a step up from free text, but it still leaves schema enforcement to you. Structured outputs (sometimes called JSON schema mode, and typically implemented with constrained decoding) go further: you supply a JSON Schema, and the model's token generation is constrained so the output must match it. This is not a best-effort instruction but a hard guarantee backed by the inference engine.

// OpenAI structured outputs with a JSON Schema
const response = await openai.chat.completions.create({
  model: "gpt-4o-2024-08-06",   // structured outputs require a supported model version
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "contact_extraction",
      strict: true,
      schema: {
        type: "object",
        properties: {
          name:   { type: "string" },
          email:  { type: ["string", "null"] },
          intent: { type: "string", enum: ["buy", "support", "other"] },
        },
        required: ["name", "email", "intent"],
        additionalProperties: false,
      },
    },
  },
  messages: [
    { role: "system", content: "Extract contact info from the user message." },
    { role: "user",   content: userMessage },
  ],
});

With strict: true and additionalProperties: false, the model cannot add extra keys or omit required ones. The parse still needs a try/catch for length-truncation edge cases, but schema drift is eliminated at the source.

Tool / function calling as structured output

Before JSON schema mode existed, function calling (now called tool use) was the standard way to get structured data. You define a tool with a JSON Schema for its parameters, instruct the model to call it, and read tool_calls[0].function.arguments. Where the provider offers a strict option on the tool definition, the arguments are constrained to match the schema, which is effectively the same mechanism behind a different API surface; without it, the schema is a strong hint rather than a guarantee, so validate the arguments as you would any other output. Many teams still prefer tool calling for extraction tasks because it is widely supported and the intent ("call a function") maps naturally to the model's training.
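As a rough illustration of the tool-calling route, the sketch below builds an OpenAI-compatible request body that forces a tool call and then reads the arguments back. The tool name record_contact and the helpers buildToolRequest and readToolArguments are hypothetical, and ToolCallMessage is a simplified local type rather than the SDK's own.

```typescript
// Tool definition in the OpenAI-compatible chat completions format.
// The tool name "record_contact" is a hypothetical example.
const contactTool = {
  type: "function" as const,
  function: {
    name: "record_contact",
    description: "Record contact details extracted from the user message.",
    strict: true, // opt in to schema-constrained arguments where supported
    parameters: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: ["string", "null"] },
        intent: { type: "string", enum: ["buy", "support", "other"] },
      },
      required: ["name", "email", "intent"],
      additionalProperties: false,
    },
  },
};

// Request body that forces the model to call the tool instead of replying in prose.
function buildToolRequest(userMessage: string) {
  return {
    model: "gpt-4o-mini",
    tools: [contactTool],
    tool_choice: {
      type: "function" as const,
      function: { name: contactTool.function.name },
    },
    messages: [
      { role: "system" as const, content: "Extract contact info by calling record_contact." },
      { role: "user" as const, content: userMessage },
    ],
  };
}

// Simplified view of the assistant message (assumption: only fields used here).
interface ToolCallMessage {
  content: string | null;
  tool_calls?: { function: { name: string; arguments: string } }[];
}

// The arguments field arrives as a JSON string, so it still needs JSON.parse.
function readToolArguments(message: ToolCallMessage): unknown {
  const call = message.tool_calls?.[0];
  if (!call || call.function.name !== contactTool.function.name) {
    throw new Error("Model did not call record_contact");
  }
  return JSON.parse(call.function.arguments);
}
```

The body returned by buildToolRequest would be passed to the chat completions endpoint, and the parsed arguments should still go through your own shape validation.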

Provider differences — the honest picture

Each major provider implements these features differently. The landscape moves fast, so always verify current support in the provider's docs, but here is the general shape as of mid-2025:

OpenAI offers both json_object mode and full structured outputs with json_schema (the latter requires model versions that support it). Function/tool calling is also fully supported and widely used.
Anthropic (Claude) does not expose a response_format parameter in the same way. The standard pattern is to use tool use — define a tool with the exact schema you need and instruct Claude to call it. Claude's tool use is highly reliable for extraction and classification. For simpler cases, careful prompting with XML tags often yields clean, parseable output.
Google (Gemini) supports structured output via a response_mime_type: "application/json" flag and a separate response_schema parameter on the generation config. The surface is different from OpenAI but the capability is comparable.

The implication for multi-provider code: if you're targeting models across providers, you can't write one call site that uses response_format: json_schema and expect it to work everywhere. You need an abstraction layer — your own or a gateway's — that translates the schema request into each provider's native format.
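A translation layer of this kind can start out very small. The sketch below maps one schema request onto the request fragment each provider expects, following the patterns described above. The function toProviderParams and the SchemaRequest type are hypothetical. The Gemini field names use the camelCase spelling of the REST API, and passing the schema through unchanged is a simplification, since providers differ in which JSON Schema features they accept.

```typescript
type Provider = "openai" | "anthropic" | "gemini";

interface SchemaRequest {
  name: string; // identifier for the schema (or the tool, on Anthropic)
  schema: Record<string, unknown>; // JSON Schema for the desired output
}

// Returns only the provider-specific fragment to merge into a request body.
function toProviderParams(
  provider: Provider,
  req: SchemaRequest,
): Record<string, unknown> {
  if (provider === "openai") {
    // Structured outputs via response_format.
    return {
      response_format: {
        type: "json_schema",
        json_schema: { name: req.name, strict: true, schema: req.schema },
      },
    };
  }
  if (provider === "anthropic") {
    // Tool use: the schema becomes the tool input, and tool_choice forces the call.
    return {
      tools: [
        {
          name: req.name,
          description: "Return the structured result.",
          input_schema: req.schema,
        },
      ],
      tool_choice: { type: "tool", name: req.name },
    };
  }
  // Gemini: a MIME type flag plus a schema on the generation config.
  return {
    generationConfig: {
      responseMimeType: "application/json",
      responseSchema: req.schema,
    },
  };
}
```

The response side needs the same treatment in reverse: the payload sits in message content for OpenAI and Gemini, and in the tool call input for Anthropic.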

Best practices for production structured outputs

Validate the output even with structured outputs

Structured outputs eliminate schema drift under normal conditions, but length truncation, refusals, and network corruption can still produce unparseable or semantically wrong JSON. Always wrap the parse and the validation in a try/catch, check finish_reason before trusting the payload (a value of "length" means the reply was cut off), and run your own lightweight schema check (e.g., Zod, Ajv, Pydantic) as the last line of defense. This is defense-in-depth, not distrust of the model.

Keep schemas tight

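A schema with fifty optional fields invites the model to make up data for fields you didn't need. Use additionalProperties: false and include only the fields you actually need. Note that OpenAI's strict mode requires every property to be listed in required, so an optional field is expressed as a union with null, as email is in the schema example above. Keep enum values minimal and precise. A compact schema can also be cheaper to process, since the overhead of constrained decoding tends to grow with schema complexity.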

Prompt for the schema, not against it

Even with constrained decoding, your system prompt should describe the expected output. Don't fight the constraint; reinforce it: "Return a JSON object with exactly these fields." Models perform better when the prompt and the schema tell the same story.

Handle refusals explicitly

A model that refuses a request but must emit valid JSON may return something like {"error": "..."} or an empty-ish object. Check for this pattern before passing the result downstream. With OpenAI structured outputs, a refusal is reported in a separate refusal field on the message rather than inside the JSON, so check that field first. A finish_reason of "content_filter" indicates that a content filter withheld output; for tool-calling refusals, the model typically returns a normal message with no tool calls, which your code should detect.
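One way to keep these checks in one place is a small classifier that runs before any parsing. This is a sketch under stated assumptions: AssistantChoice is a simplified local type, the refusal field follows OpenAI's chat completion message format, and the single-key {"error": "..."} test is a heuristic of this example, not a documented convention.

```typescript
// Simplified stand-in for one chat completion choice (assumption).
interface AssistantChoice {
  finish_reason: string;
  message: {
    content: string | null;
    refusal?: string | null; // populated by OpenAI when the model refuses
    tool_calls?: unknown[];
  };
}

type Outcome =
  | { kind: "ok" }
  | { kind: "refused"; reason: string }
  | { kind: "filtered" }
  | { kind: "no_tool_call" };

// Heuristic: an object whose only key is a string "error" is treated as a refusal.
function extractErrorMessage(content: string | null): string | null {
  if (content === null) return null;
  try {
    const parsed: unknown = JSON.parse(content);
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return null;
    }
    const record = parsed as Record<string, unknown>;
    const keys = Object.keys(record);
    return keys.length === 1 && typeof record.error === "string"
      ? record.error
      : null;
  } catch {
    return null; // not JSON at all; leave that to the normal parse path
  }
}

function classifyReply(choice: AssistantChoice, expectsToolCall: boolean): Outcome {
  if (choice.finish_reason === "content_filter") return { kind: "filtered" };
  if (choice.message.refusal) {
    return { kind: "refused", reason: choice.message.refusal };
  }
  if (expectsToolCall && !choice.message.tool_calls?.length) {
    return { kind: "no_tool_call" };
  }
  const errorMessage = extractErrorMessage(choice.message.content);
  if (errorMessage !== null) return { kind: "refused", reason: errorMessage };
  return { kind: "ok" };
}
```

Only an "ok" outcome should proceed to parsing and schema validation; the other kinds can be logged or retried according to your own policy.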

Set max_tokens conservatively high

JSON is often more verbose than prose for the same information. A 200-token prose answer might be 400 tokens as structured JSON, depending on key names and nesting. Size your max_tokens budget to the schema, not to free-text experience. A truncated JSON object is useless, and without an explicit finish_reason check it is hard to tell truncation apart from any other parse failure.
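If sizing the budget up front is difficult, one possible complement is to retry with a larger budget when the reply was cut off. The sketch below is an assumption of this example rather than a provider feature: callModel is a hypothetical function you supply that performs the request with a given max_tokens, and the doubling policy is arbitrary.

```typescript
// Minimal result shape returned by the caller-supplied function (assumption).
interface Completion {
  finish_reason: string;
  content: string | null;
}

// Hypothetical callback: performs one request with the given max_tokens.
type CallModel = (maxTokens: number) => Promise<Completion>;

async function completeWithBudget(
  callModel: CallModel,
  initialMaxTokens: number,
  maxAttempts = 3,
): Promise<string> {
  let budget = initialMaxTokens;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await callModel(budget);
    if (result.finish_reason === "length") {
      budget *= 2; // truncated: double the budget and try again
      continue;
    }
    if (result.content === null) {
      throw new Error(`No content (finish_reason=${result.finish_reason})`);
    }
    return result.content; // complete reply; still validate it before use
  }
  throw new Error(`Still truncated after ${maxAttempts} attempts`);
}
```

Each retry is billed like any other request, so a cap on attempts matters as much as the growth factor.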

JSON mode through an LLM gateway

When you work with a single provider, wiring up response_format is a one-time task. When you route across providers, the problem multiplies: each has its own API surface for structured output, and the translation logic has to live somewhere. This is one of the jobs an LLM gateway handles — it accepts your request with a schema, translates it into the correct format for the target provider, and returns a uniform response. Your application code stays the same regardless of which model handles the call.

A gateway also lets you route the same structured-output request to whichever model is cheapest or fastest for the task on a given day, without rewriting the call site. For a deeper look at the gateway pattern, see what is an LLM gateway. For details on how a single key routes to many providers through an OpenAI-compatible surface, see the OpenAI-compatible API explainer.

flo2 is a developer-first LLM gateway that passes response_format, tool definitions, and schema parameters through to whichever provider model supports them — zero token markup, your own provider keys, one compatible endpoint. Free during Beta.
