2026-06-03 · flo2 blog

LLM Observability: Logging, Tracing & Cost Tracking

The first time a model-backed feature misbehaves in production, you reach for your usual playbook (open the traces, find the slow span, read the error) and it doesn't fit. The request returned a clean 200, latency was fine, and the output was still wrong. LLM observability is the practice of instrumenting model calls so you can answer the questions traditional APM can't: which model actually answered, how many input, output, and cached tokens it consumed, what that attempt cost, whether a fallback fired, and whether the response was any good. It is LLM logging, LLM monitoring, and LLM tracing reframed around the three things that make a language model different from a normal microservice: non-determinism, per-token cost, and quality that drifts. It is also how you track LLM usage and cost per call.

Why LLM observability differs from normal APM

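APM assumes a request is correct if it returns the right status code quickly. That assumption breaks the moment a language model is in the loop, for three structural reasons: the output is non-deterministic, every call carries a per-token cost, and quality can drift while status codes and latency still look healthy.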

So LLM observability layers cost and quality on top of the usual performance and reliability. Latency and uptime still matter — they're just no longer sufficient.

What to capture on every call

The unit of LLM observability is the individual model attempt — and a solid per-attempt record includes:

| Signal | Why it matters |
| --- | --- |
| Request & response payload (opt-in) | The only way to reproduce a bad answer or build an eval set, but the most sensitive field, so it stays opt-in. |
| Model & provider that answered | With routing and fallbacks the model you requested isn't always the one that replied; you can't attribute cost or quality without the real responder. |
| Tokens in / out / cached | The raw drivers of cost and latency; logged separately they let you compute spend and spot prompt bloat or a broken cache. |
| Latency & time-to-first-token (TTFT) | Total latency tracks the full call; TTFT tracks streamed responsiveness. They diverge, and users feel TTFT first. |
| Computed cost | Tokens times the exact per-model rate, per attempt: a defensible number, not a guess. |
| Error type | A 429, a 503, a content-filter block, and a context overflow each demand a different response. The status class is the triage key. |
| Retries & which fallback fired | One logical request can be several physical attempts; the chain reveals hidden cost and silent reliance on a backup. |
| Quality / success signal | Did the output pass validation (valid JSON, correct extraction, a passing rubric)? Without it you measure spend but never cost per success. |
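
As a rough illustration, the signals in the table could be carried in one record per attempt and written as a JSON line. This is a minimal sketch, not a standard schema: every field name, the model and provider names, and the `to_log_line` helper are assumptions made up for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AttemptRecord:
    """One physical model attempt. All field names here are illustrative."""
    trace_id: str
    route: str
    requested_model: str
    responding_model: str          # may differ from requested_model after a fallback
    provider: str
    tokens_in: int
    tokens_out: int
    tokens_cached: int
    latency_ms: float
    ttft_ms: Optional[float]       # None for non-streamed calls
    cost_usd: float
    error_type: Optional[str]      # e.g. "rate_limit"; None on success
    attempt_index: int             # 0 = first try, 1+ = retry or fallback
    fallback_fired: bool
    passed_validation: Optional[bool]   # None if no quality check ran
    payload: Optional[dict] = None      # stays None unless the route opted in


def to_log_line(record: AttemptRecord) -> str:
    """Serialize to one JSON line, omitting the payload when it was not captured."""
    data = asdict(record)
    if data["payload"] is None:
        del data["payload"]
    return json.dumps(data, sort_keys=True)


record = AttemptRecord(
    trace_id="t-1", route="chat",
    requested_model="example-large", responding_model="example-small",
    provider="example-provider",
    tokens_in=1200, tokens_out=300, tokens_cached=800,
    latency_ms=950.0, ttft_ms=210.0, cost_usd=0.0011,
    error_type=None, attempt_index=1, fallback_fired=True,
    passed_validation=True,
)
print(to_log_line(record))
```

Keeping `requested_model` and `responding_model` as separate fields is what makes fallback-driven cost and quality shifts visible later.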

Reconcile computed cost against the bill

Computed cost per attempt is the signal teams most regret skipping — provider consoles report yesterday's aggregate spend, not the cost of this route on this call. But it's only trustworthy if it ties out, so reconcile your per-attempt totals against each provider's actual bill; if they don't roughly match, your accounting or rate table is wrong — fix that before you trust any savings claim.
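
The arithmetic behind that check is small enough to sketch. The example below assumes a hypothetical rate table in USD per million tokens, assumes `tokens_in` counts only uncached input tokens (providers differ on this), and picks a 2% tolerance for "roughly match"; none of those values come from a real price list.

```python
# Hypothetical rates in USD per 1M tokens; substitute your providers' published prices.
RATES = {
    "example-small": {"input": 0.50, "cached": 0.25, "output": 1.50},
}


def attempt_cost(model: str, tokens_in: int, tokens_out: int, tokens_cached: int = 0) -> float:
    """Cost of one attempt. Assumes tokens_in excludes cached tokens."""
    rate = RATES[model]
    micro_usd = (
        tokens_in * rate["input"]
        + tokens_cached * rate["cached"]
        + tokens_out * rate["output"]
    )
    return micro_usd / 1_000_000


def reconcile(computed_total: float, billed_total: float, tolerance: float = 0.02):
    """Return (relative_drift, within_tolerance) for computed vs. billed spend."""
    if billed_total == 0:
        return (0.0, True) if computed_total == 0 else (float("inf"), False)
    drift = abs(computed_total - billed_total) / billed_total
    return drift, drift <= tolerance


cost = attempt_cost("example-small", tokens_in=1000, tokens_out=500, tokens_cached=2000)
drift, ok = reconcile(computed_total=98.0, billed_total=100.0)
print(f"attempt cost: ${cost:.5f}, drift vs bill: {drift:.1%}, ok: {ok}")
```

A drift that stays outside the tolerance usually points at a stale rate table or at token counts that double-count cached input, which is why the cached bucket is priced separately here.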

Tracing multi-step, agent, and RAG chains

A single completion is the easy case. Real systems chain calls: a RAG pipeline retrieves, reranks, then generates; an agent plans, calls tools, observes, and loops. When the final answer is wrong, the per-call logs above don't tell you which step failed. This is where the distributed-tracing model you already know applies: a trace is the whole user-facing operation, and each model call, retrieval, or tool invocation is a span within it. Propagate one trace ID across every step and record each step as its own span.

Now "the agent gave a bad answer" becomes "step 3 retrieved irrelevant context, so step 4 hallucinated" — a fault you can fix. Aggregated, a trace also gives you the real, all-in cost of one agent run, retries included.
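
To make the trace-and-span idea concrete, here is a hand-rolled sketch of one trace ID shared across nested spans, with cost summed over the whole run. It is not the OpenTelemetry API; the `Trace` and `Span` classes, their fields, and the step names are all assumptions for illustration, and the stack-based nesting is not safe for concurrent or async code.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float
    end: float = 0.0
    cost_usd: float = 0.0
    attributes: dict = field(default_factory=dict)


class Trace:
    """One user-facing operation; every span shares its trace_id."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(self.trace_id, uuid.uuid4().hex[:16], parent_id, name,
                       start=time.perf_counter(), attributes=dict(attributes))
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)   # recorded on exit, so children come first

    def total_cost(self) -> float:
        """All-in cost of the run, retries included."""
        return sum(s.cost_usd for s in self.spans)


trace = Trace()
with trace.span("rag_request") as root:
    with trace.span("retrieve", documents=4):
        pass                              # retrieval would happen here
    with trace.span("generate", model="example-small") as first:
        first.cost_usd = 0.0021           # attempt that failed validation
    with trace.span("generate_retry", model="example-small") as retry:
        retry.cost_usd = 0.0019

print(f"{len(trace.spans)} spans, total ${trace.total_cost():.4f}")
```

In a real system a tracing library would handle context propagation across services; the point of the sketch is only that the parent link and the shared ID are what let you walk from a bad answer back to the step that caused it.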

Metrics and dashboards

Records and traces debug one request. To run the system you aggregate them into a few metrics, sliced by model, provider, and especially route or feature — "chat" and "background summarization" have nothing in common operationally.

Then alert on what users feel: a success-rate drop on a key route, a cost spike, p95 crossing budget, or a fallback rate that jumps after a provider deploys.
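
A sketch of how per-attempt records might roll up into route-level numbers such as success rate, p95 latency, fallback rate, and cost per success. The record keys and the nearest-rank percentile method are assumptions chosen for the example, not a prescribed schema.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_by_route(records):
    """Aggregate attempt records (dicts with hypothetical keys) per route."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["route"]].append(rec)

    summary = {}
    for route, rows in groups.items():
        successes = sum(1 for r in rows if r["success"])
        total_cost = sum(r["cost_usd"] for r in rows)
        summary[route] = {
            "calls": len(rows),
            "success_rate": successes / len(rows),
            "p95_latency_ms": percentile([r["latency_ms"] for r in rows], 95),
            "fallback_rate": sum(1 for r in rows if r["fallback_fired"]) / len(rows),
            "cost_usd": total_cost,
            # Undefined when nothing succeeded, so report None instead of dividing by zero.
            "cost_per_success": total_cost / successes if successes else None,
        }
    return summary


records = [
    {"route": "chat", "latency_ms": 100, "success": True, "fallback_fired": False, "cost_usd": 0.01},
    {"route": "chat", "latency_ms": 200, "success": True, "fallback_fired": False, "cost_usd": 0.01},
    {"route": "chat", "latency_ms": 300, "success": True, "fallback_fired": True, "cost_usd": 0.01},
    {"route": "chat", "latency_ms": 400, "success": False, "fallback_fired": False, "cost_usd": 0.01},
    {"route": "summarize", "latency_ms": 5000, "success": False, "fallback_fired": False, "cost_usd": 0.05},
]
summary = summarize_by_route(records)
for route, stats in summary.items():
    print(route, stats)
```

Grouping by route before computing anything is the important design choice: a blended p95 across "chat" and "summarize" would describe neither.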

Privacy and PII handling

Prompts and completions are your richest debugging data and frequently the most sensitive — they carry names, emails, credentials, source code, or health and financial details. Treat payload capture as a privilege, not a default.

The result is a both/and: full fidelity on the routes where you've accepted the tradeoff, metadata-only everywhere else, never an accidental archive of customer secrets.
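
One way to picture "metadata-only by default, payload by opt-in" is a logger that attaches prompts and completions only for routes on an allowlist, and redacts them first. The route name, the regular expressions, and the `build_log_entry` helper are hypothetical; pattern-based redaction like this catches only obvious cases and is not a complete PII solution.

```python
import re

# Hypothetical allowlist: only these routes have accepted payload capture.
PAYLOAD_OPT_IN_ROUTES = {"support-chat"}

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Illustrative pattern for API-key-like strings; real secrets vary widely.
SECRET = re.compile(r"\b(?:sk|pk)-[A-Za-z0-9_-]{16,}\b")


def redact(text: str) -> str:
    """Mask obvious emails and key-like tokens before anything is stored."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def build_log_entry(route: str, prompt: str, completion: str, metadata: dict) -> dict:
    """Always log metadata; attach redacted payloads only for opted-in routes."""
    entry = dict(metadata, route=route)
    if route in PAYLOAD_OPT_IN_ROUTES:
        entry["prompt"] = redact(prompt)
        entry["completion"] = redact(completion)
    return entry


meta = {"model": "example-small", "tokens_in": 42, "tokens_out": 7}
prompt = "Reset access for jane@example.com using sk-abcdefghijklmnop1234"

opted_in = build_log_entry("support-chat", prompt, "Done.", meta)
metadata_only = build_log_entry("billing", prompt, "Done.", meta)
print(opted_in["prompt"])
print(sorted(metadata_only))
```

Because the default branch never touches the payload, forgetting to configure a new route fails safe: it logs metadata only.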

How a gateway centralizes it automatically

You can bolt all of this onto every service by hand. But notice where the signals live: model, provider, tokens, latency, cost, error class, retries, and which fallback fired are exactly the fields a router already touches on every request. That makes the gateway — the layer between your app and the providers — the natural collection point. Instrument it once and every call your fleet makes is observed, with no per-service wiring and nothing to re-verify when a provider or price changes. For the broader picture see what is an LLM gateway; for the cost model behind per-call accounting see AI tokenomics.
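
The idea of a single collection point can be sketched as a wrapper that walks a fallback chain and logs every physical attempt in one place. This is an illustration of the pattern only, not how flo2 or any particular gateway is implemented; `ProviderError`, `call_with_fallback`, the stub providers, and the log fields are all hypothetical.

```python
import time


class ProviderError(Exception):
    """Stand-in for a provider failure; error_type is the triage key."""

    def __init__(self, error_type: str):
        super().__init__(error_type)
        self.error_type = error_type


def call_with_fallback(chain, prompt, log):
    """Try each (model_name, call) pair in order, logging every attempt."""
    for index, (model, call) in enumerate(chain):
        start = time.perf_counter()
        entry = {"model": model, "attempt_index": index, "fallback_fired": index > 0}
        try:
            result = call(prompt)
        except ProviderError as exc:
            entry.update(error_type=exc.error_type,
                         latency_ms=(time.perf_counter() - start) * 1000)
            log.append(entry)
            continue                      # move on to the next model in the chain
        entry.update(error_type=None,
                     latency_ms=(time.perf_counter() - start) * 1000,
                     tokens_in=result["tokens_in"],
                     tokens_out=result["tokens_out"])
        log.append(entry)
        return result
    raise RuntimeError("every model in the fallback chain failed")


# Stub providers standing in for real API clients.
def primary(prompt):
    raise ProviderError("rate_limit")


def backup(prompt):
    return {"text": "ok", "tokens_in": 12, "tokens_out": 3}


attempt_log = []
answer = call_with_fallback(
    [("primary-model", primary), ("backup-model", backup)], "hello", attempt_log
)
print(answer["text"], [(e["model"], e["error_type"]) for e in attempt_log])
```

Since the wrapper is the only code that sees both the failed and the successful attempt, it is also the only place that can record the full chain without each calling service cooperating.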

That's the gap flo2 fills. It's a developer-first, bring-your-own-key gateway: one OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model that meets your bar, with fallback chains, AI racing, and opt-in response caching. Because it sits on the request path, it gives you true per-call cost accounting for free — logging tokens in, out, and cached, throughput, and the computed cost of every attempt, including which fallback fired. Its A/B testing with an LLM judge even turns "model–task fit" into a signal you can watch over time. All at zero token markup, since you pay providers directly with your own keys — a zero-markup OpenRouter alternative, free during Beta.
