2026-06-03 · flo2 blog

LLM Temperature & top_p Explained: Controlling Randomness

The LLM temperature parameter is the single most-reached-for knob when developers want to tune model behavior — yet it's also one of the most misunderstood. Set it too low and your chatbot sounds like a broken record; set it too high and your data extraction pipeline returns confetti. This guide explains exactly what temperature does under the hood, how it relates to top_p, when to tune one versus the other, and which values actually work for common tasks.

What Temperature Does (and How It Works)

At inference time, an LLM doesn't "choose" the next word — it produces a probability distribution over its entire vocabulary. The highest-probability token is the most likely next word given the context. Temperature scales that distribution before sampling from it.

Concretely, each token's raw score (called a logit) is divided by the temperature value before the softmax that converts scores to probabilities:

# Conceptual pseudocode — not the actual CUDA kernel
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    scaled = [l / temperature for l in logits]
    exps = [math.exp(l) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

What this means in practice:

The key insight: temperature doesn't change what the model knows. It changes how liberally it samples from what it knows.

What top_p (Nucleus Sampling) Does

top_p is a different approach to the same problem: reducing low-quality, low-probability token choices. Instead of scaling the entire distribution, it truncates it — keeping only the smallest set of top tokens whose cumulative probability mass reaches the top_p threshold, then renormalizes and samples from that nucleus.

For example, with top_p = 0.9, the model considers only the tokens that together account for 90% of the probability mass. Long-tail unlikely tokens are excluded entirely before sampling.

The distinction matters:

Why You Should Usually Tune One, Not Both

Both parameters reduce or expand the effective sampling pool, and they interact in non-obvious ways. Applying a low temperature (which already concentrates mass on top tokens) and a low top_p (which also cuts off the tail) stacks the effect and can make the model repetitively rigid. Applying high temperature and high top_p compounds unpredictability.

The practical rule most teams settle on:

OpenAI's own documentation makes this recommendation explicitly. Anthropic's guidance follows the same logic.

Practical Settings by Task

Task type Suggested temperature Reasoning
Data extraction / entity recognition 0.0 – 0.2 You want the highest-confidence answer, not a creative interpretation
Classification / routing 0.0 – 0.2 Categories should be stable; variance means instability
Code generation 0.1 – 0.4 Correct syntax and logic matter; a little variation helps avoid repetitive patterns
Factual Q&A / RAG retrieval synthesis 0.2 – 0.5 Grounded in retrieved context; low variance prevents hallucinated elaboration
Summarization 0.3 – 0.6 Some paraphrasing variation is fine; very low temp can sound mechanical
Conversational assistants 0.5 – 0.8 Natural variation keeps dialogue from feeling robotic
Creative writing / brainstorming 0.8 – 1.2 Diversity of output is the point; surprising word choices are valuable
Poetry / experimental generation 1.0 – 1.5 High temperature opens up rare token combinations; quality can be inconsistent

These are starting points, not laws. The right temperature for your use case depends on the model, the prompt, and what "good output" means for your application. Always evaluate against real examples.

Temperature and Reproducibility: Low Doesn't Mean Deterministic

A common misconception: setting temperature = 0 guarantees identical outputs. It does not — at least not across different runs on most provider infrastructure.

Even at temperature 0, floating-point arithmetic on GPU hardware is subject to non-determinism from parallel operations. Different batch sizes, different hardware generations, and provider-side infrastructure changes can all produce slightly different outputs. In practice, temperature 0 on a given model usually converges to the same answer for deterministic inputs — but you should not rely on it as a guarantee in critical systems.

For reproducibility, the right tool is the seed parameter, which most major APIs now support. Passing the same seed value (combined with identical inputs and sampling parameters) gives providers the ability to return consistent outputs — though most providers qualify this as "best effort" rather than a strict guarantee.

// Example: consistent extraction with seed + low temperature
const response = await client.chat.completions.create({
  model: "openai/gpt-4o",
  temperature: 0.1,
  seed: 42,
  messages: [
    {
      role: "user",
      content: "Extract the invoice number from: 'Invoice #INV-2024-0091 dated March 15'",
    },
  ],
});

Other Sampling Parameters Worth Knowing

Temperature and top_p handle the randomness of what tokens get picked. A few other parameters handle the shape of what's generated:

These parameters pass through to the underlying provider unchanged when you use a gateway like flo2 — meaning you can use the full sampling parameter set of any supported model without the gateway stripping or transforming them.

Sending Temperature Parameters Through an LLM Gateway

When you route requests through an LLM proxy or gateway, a common concern is whether custom sampling parameters get forwarded correctly. With some gateways, parameters that aren't on an explicit allowlist get silently dropped — which can mean your carefully tuned temperature, seed, or frequency_penalty never reaches the model.

flo2 is built with a pass-through philosophy: sampling parameters are forwarded as-is to the underlying provider. There's no token markup and no parameter manipulation. You use your own provider API keys, so the request that reaches the model is as close to a direct call as possible — just with routing, observability, and fallback logic added on top. See what is an LLM gateway for a fuller explanation of how the proxy layer works.

Concretely, switching to flo2 doesn't require changing your sampling parameters. The baseURL swap is the entire migration:

// Before: direct provider call
const client = new OpenAI({
  baseURL: "https://api.openai.com/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

// After: route through flo2 (zero markup, BYO key)
const client = new OpenAI({
  baseURL: "https://api.flo2.com/v1",
  apiKey: process.env.FLO2_API_KEY,
});

// temperature, seed, top_p, frequency_penalty — all pass through unchanged
const response = await client.chat.completions.create({
  model: "openai/gpt-4o-mini",
  temperature: 0.2,
  seed: 100,
  messages: [{ role: "user", content: "Classify the sentiment: 'Shipping was fast but the box was damaged.'" }],
});

For more on how the output token budget interacts with what the model can generate, see max_tokens explained.

If you're tuning sampling parameters across multiple models and providers, flo2 gives you a single endpoint during Beta — free, no token markup, your keys.

One key, every model — zero markup.
Bring your own provider keys. flo2 routes to the cheapest, fastest model with fallback, racing and true cost accounting — free during Beta.
Get your flo2 key →
© 2026 flo2.com — the zero-markup LLM gateway & router. flow → to