2026-06-03 · flo2 blog

LLM Observability: Logging, Tracing & Cost Tracking

The first time a model-backed feature misbehaves in production, you reach for your usual playbook (open the traces, find the slow span, read the error) and it doesn't fit. The request returned a clean 200, latency was fine, and the output was still wrong. LLM observability is the practice of instrumenting model calls so you can answer the questions traditional APM can't: which model actually answered, how many input, output, and cached tokens it burned, what that attempt cost, whether a fallback fired, and whether the response was any good. It is LLM logging, monitoring, and tracing reframed around the three things that make a language model different from a normal microservice: non-determinism, per-token cost, and quality that drifts. It is also the only reliable way to track LLM usage and cost per call.

Why LLM observability differs from normal APM

APM assumes a request is correct if it returns the right status code quickly. That breaks the moment a language model is in the loop, for three structural reasons.

Non-determinism. The same prompt can return different text on every call. A green status code tells you the API responded — not that it responded well — so success has to be measured at the content level, not the transport level.
Token cost. Every call carries a price that scales with how much text went in and out. A "successful" request that quietly consumed 8,000 output tokens is an incident your error rate will never show — cost is a first-class signal here.
Quality drift. Provider models change under the same name, prompts evolve, inputs shift. A route that worked last month can degrade with zero code changes and zero errors, and without a quality signal that decay is invisible until users complain.

So LLM observability layers cost and quality on top of the usual performance and reliability. Latency and uptime still matter — they're just no longer sufficient.

What to capture on every call

The unit of LLM observability is the individual model attempt — and a solid per-attempt record includes:

Signal	Why it matters
Request & response payload (opt-in)	The only way to reproduce a bad answer or build an eval set — but the most sensitive field, so it stays opt-in.
Model & provider that answered	With routing and fallbacks the model you requested isn't always the one that replied; you can't attribute cost or quality without the real responder.
Tokens in / out / cached	The raw drivers of cost and latency; logged separately they let you compute spend and spot prompt bloat or a broken cache.
Latency & time-to-first-token (TTFT)	Total latency tracks the full call; TTFT tracks streamed responsiveness. They diverge, and users feel TTFT first.
Computed cost	Input, output, and cached token counts, each multiplied by the exact per-model rate for that token type and summed per attempt. A defensible number, not a guess.
Error type	A `429`, a `503`, a content-filter block, and a context overflow each demand a different response. The error type, not just the HTTP status, is the triage key.
Retries & which fallback fired	One logical request can be several physical attempts; the chain reveals hidden cost and silent reliance on a backup.
Quality / success signal	Did the output pass validation — valid JSON, correct extraction, a passing rubric? Without it you measure spend but never cost per success.
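
As a rough illustration of the table above, a per-attempt record might look like the sketch below. The field names, the `example-model` entry, and the prices in `RATES_PER_MTOK` are all assumptions made up for this example, not real provider rates or any particular gateway's schema. It also assumes `input_tokens` excludes cached tokens, which not every provider's usage report does.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical USD prices per one million tokens. Real values come from
# each provider's published price sheet and change over time.
RATES_PER_MTOK = {
    "example-model": {"input": 0.50, "cached": 0.05, "output": 1.50},
}


@dataclass
class AttemptRecord:
    requested_model: str
    responding_model: str                  # may differ when routing or fallback kicks in
    provider: str
    input_tokens: int                      # assumed to exclude cached tokens
    cached_tokens: int
    output_tokens: int
    latency_ms: float
    ttft_ms: Optional[float] = None        # None for non-streamed calls
    error_type: Optional[str] = None       # e.g. "rate_limit", "content_filter"
    attempt_index: int = 0                 # 0 = first try; higher = retry or fallback
    passed_validation: Optional[bool] = None
    payload: Optional[dict] = None         # stays None unless capture is opted in

    @property
    def cost_usd(self) -> float:
        # Price each token type at its own rate, then sum.
        rates = RATES_PER_MTOK[self.responding_model]
        return (
            self.input_tokens * rates["input"]
            + self.cached_tokens * rates["cached"]
            + self.output_tokens * rates["output"]
        ) / 1_000_000
```

Keying the rate lookup on `responding_model` rather than `requested_model` is deliberate: when a fallback answers, the cost should follow the model that actually did the work.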

Reconcile computed cost against the bill

Computed cost per attempt is the signal teams most regret skipping: provider consoles report yesterday's aggregate spend, not the cost of this route on this call. It's only trustworthy if it ties out, though, so reconcile your per-attempt totals against each provider's actual bill. If the two don't roughly match, your accounting or rate table is wrong, and that needs fixing before you trust any savings claim.
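
A reconciliation check can be quite small. The sketch below sums computed cost per provider and compares it with the billed amount. The `reconcile` helper and the 2% default tolerance are assumptions for illustration; "roughly match" is a threshold you would choose for your own billing cycle.

```python
from collections import defaultdict


def reconcile(attempts, billed_by_provider, tolerance=0.02):
    """Compare summed per-attempt cost with each provider's billed total.

    attempts: iterable of (provider, cost_usd) pairs for the billing period.
    billed_by_provider: dict mapping provider -> billed USD for that period.
    tolerance: allowed relative gap (0.02 = 2%, an arbitrary default).
    """
    computed = defaultdict(float)
    for provider, cost_usd in attempts:
        computed[provider] += cost_usd

    report = {}
    for provider in sorted(set(computed) | set(billed_by_provider)):
        c = computed.get(provider, 0.0)
        b = billed_by_provider.get(provider, 0.0)
        if b:
            gap = abs(c - b) / b
        else:
            # Nothing billed: any computed spend is an unexplained gap.
            gap = 0.0 if c == 0 else float("inf")
        report[provider] = {"computed": c, "billed": b, "gap": gap, "ok": gap <= tolerance}
    return report
```

A provider that shows up in the bill but not in your logs (or the reverse) is reported rather than skipped, since untracked traffic is exactly what this check exists to catch.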

Tracing multi-step, agent, and RAG chains

A single completion is the easy case. Real systems chain calls: a RAG pipeline retrieves, reranks, then generates; an agent plans, calls tools, observes, and loops. When the final answer is wrong, the per-call logs above don't tell you which step failed. The fix is the distributed-tracing model you already know: a trace is the whole user-facing operation, and each model call, retrieval, or tool invocation is a span within it. Propagate one trace ID across every step and record, per span:

Step identity and order — name and sequence, so a 7-step agent loop reads top to bottom.
Inputs and outputs of each step (opt-in) — retrieved chunks, tool arguments and results, intermediate completions. A RAG answer is usually wrong because retrieval returned the wrong context, which you only see if those documents are on the span.
Per-step tokens, latency, and cost — to find the one expensive step in an otherwise cheap chain and roll up a true per-trace total.
Parent/child links — so nested tool calls and sub-agents stay attributable to the request that spawned them.

Now "the agent gave a bad answer" becomes "step 3 retrieved irrelevant context, so step 4 hallucinated" — a fault you can fix. Aggregated, a trace also gives you the real, all-in cost of one agent run, retries included.
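
To make the span model concrete, here is a minimal in-memory sketch. The `Span` class and its helpers are hypothetical and stand in for whatever tracing library you already use. The point is only that parent links plus per-span cost are enough to print a readable chain and roll up a per-trace total.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]   # None marks the root of the trace
    name: str
    order: int                 # sequence within the trace
    tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0


def trace_total_cost(spans):
    """All-in cost of one trace, retries and tool calls included."""
    return sum(s.cost_usd for s in spans)


def most_expensive_step(spans):
    return max(spans, key=lambda s: s.cost_usd)


def render_tree(spans):
    """Return indented lines, one per span, children nested under parents."""
    children = defaultdict(list)
    for s in sorted(spans, key=lambda s: s.order):
        children[s.parent_id].append(s)

    lines = []

    def walk(parent_id, depth):
        for s in children[parent_id]:
            lines.append("  " * depth + f"{s.order}. {s.name} (${s.cost_usd:.4f})")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Calling `render_tree` on the spans of one request gives the top-to-bottom reading described above, and `most_expensive_step` points at the span worth optimizing first.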

Metrics and dashboards

Records and traces debug one request. To run the system you aggregate them into a few metrics, sliced by model, provider, and especially route or feature — "chat" and "background summarization" have nothing in common operationally.

Cost per route — spend grouped by feature and model, ideally as cost per successful request, not raw dollars. Tells you where the money goes and what to optimize first.
Success rate — the share of calls that passed validation, per route and model. A dip is your earliest warning of quality drift, often before a user files a ticket.
p95 / p99 latency and TTFT — tail latency, not the average, which hides the slow requests that frustrate users. Track TTFT separately for streamed paths.
Error and fallback rate — 429/5xx frequency and how often a fallback fired; a climbing fallback rate means your primary provider is quietly degrading.
Token volume per route — input vs. output vs. cached over time, to catch prompt bloat and confirm caching is getting hits.

Then alert on what users feel: a success-rate drop on a key route, a cost spike, p95 crossing budget, or a fallback rate that jumps after a provider deploys.
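
As a sketch of how those metrics fall out of the per-attempt records, the function below groups plain dicts by route. The keys (`route`, `passed`, `cost_usd`, `latency_ms`, `fallback_fired`) are assumed names for this example, and the percentile uses the simple nearest-rank method. A real metrics backend would likely use histograms instead.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def route_metrics(records):
    """Aggregate per-attempt records (dicts) into per-route metrics."""
    by_route = defaultdict(list)
    for r in records:
        by_route[r["route"]].append(r)

    out = {}
    for route, rows in by_route.items():
        successes = sum(1 for r in rows if r["passed"])
        total_cost = sum(r["cost_usd"] for r in rows)
        out[route] = {
            "success_rate": successes / len(rows),
            # Failed attempts still cost money, so divide ALL spend by successes.
            "cost_per_success": total_cost / successes if successes else None,
            "p95_latency_ms": percentile([r["latency_ms"] for r in rows], 95),
            "fallback_rate": sum(1 for r in rows if r["fallback_fired"]) / len(rows),
        }
    return out
```

Dividing total spend by successful requests is what separates cost per success from raw cost per call: a route with cheap calls and a poor pass rate can turn out to be the expensive one.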

Privacy and PII handling

Prompts and completions are your richest debugging data and frequently the most sensitive — they carry names, emails, credentials, source code, or health and financial details. Treat payload capture as a privilege, not a default.

Make raw capture opt-in, per route. Log metadata — tokens, latency, cost, model, status — everywhere, but store full request/response bodies only where you've deliberately enabled it. Metadata alone powers nearly every metric above without holding user content.
Redact before storage. Run PII detection and strip or hash sensitive spans on the way in, so the durable record never holds the raw secret.
Set retention deliberately. Keep payloads only as long as you need them for debugging or evals — short, explicit TTLs shrink your exposure if a log store is ever breached.
Scope access. Logs containing user content deserve the same access controls and audit trail as your production database.

The result is a both/and: full fidelity on the routes where you've accepted the tradeoff, metadata-only everywhere else, never an accidental archive of customer secrets.
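
A minimal sketch of opt-in capture with redaction on the way in, under loud assumptions: the email regex is only a stand-in for a real PII detector, `CAPTURE_PAYLOAD_ROUTES` and the `support-chat` route are hypothetical, and a truncated unsalted hash is used purely to keep the example short.

```python
import hashlib
import re

# Stand-in for real PII detection: this only catches email-shaped strings.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical allowlist: routes where payload capture was deliberately enabled.
CAPTURE_PAYLOAD_ROUTES = {"support-chat"}


def redact(text):
    """Replace each email with a short hash so records stay joinable but not readable."""
    def _hash(match):
        digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()[:12]
        return f"<email:{digest}>"
    return EMAIL_RE.sub(_hash, text)


def build_log_entry(route, metadata, prompt, completion):
    """Always keep metadata; keep redacted bodies only for opted-in routes."""
    entry = dict(metadata, route=route)
    if route in CAPTURE_PAYLOAD_ROUTES:
        entry["prompt"] = redact(prompt)
        entry["completion"] = redact(completion)
    return entry
```

Redaction runs before the entry is built, so the raw value never reaches the durable record. Routes outside the allowlist get metadata only, with no code path that could store a body by accident.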

How a gateway centralizes it automatically

You can bolt all of this onto every service by hand. But notice where the signals live: model, provider, tokens, latency, cost, error class, retries, and which fallback fired are exactly the fields a router already touches on every request. That makes the gateway — the layer between your app and the providers — the natural collection point. Instrument it once and every call your fleet makes is observed, with no per-service wiring and nothing to re-verify when a provider or price changes. For the broader picture see what is an LLM gateway; for the cost model behind per-call accounting see AI tokenomics.
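
To see why the routing layer is a convenient place to collect these fields, consider a toy fallback wrapper. This is a sketch under stated assumptions, not how any particular gateway is implemented: `ProviderError`, `call_with_fallback`, and the shape of `chain` are all hypothetical.

```python
import time


class ProviderError(Exception):
    """Hypothetical error carrying a triage label such as 'rate_limit'."""

    def __init__(self, error_type):
        super().__init__(error_type)
        self.error_type = error_type


def call_with_fallback(chain, prompt, log):
    """Try each (model_name, callable) in order; append one record per attempt to log."""
    for index, (model, call) in enumerate(chain):
        start = time.perf_counter()
        record = {
            "model": model,
            "attempt": index,
            "fallback_fired": index > 0,
            "error_type": None,
        }
        try:
            return call(prompt)
        except ProviderError as exc:
            record["error_type"] = exc.error_type
        finally:
            # Runs on success and on failure, so every attempt is logged.
            record["latency_ms"] = (time.perf_counter() - start) * 1000
            log.append(record)
    raise RuntimeError("all models in the fallback chain failed")
```

The wrapper already knows which model answered, which attempt it was, and why earlier attempts failed, so logging those fields costs a few lines in one place instead of wiring in every service.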

That's the gap flo2 fills. It's a developer-first, bring-your-own-key gateway: one OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model that meets your bar, with fallback chains, AI racing, and opt-in response caching. Because it sits on the request path, it gives you true per-call cost accounting for free — logging tokens in, out, and cached, throughput, and the computed cost of every attempt, including which fallback fired. Its A/B testing with an LLM judge even turns "model–task fit" into a signal you can watch over time. All at zero token markup, since you pay providers directly with your own keys — a zero-markup OpenRouter alternative, free during Beta.
