LLM Observability: Logging, Tracing & Cost Tracking
The first time a model-backed feature misbehaves in production, you reach for your usual playbook (open the traces, find the slow span, read the error) and it doesn't fit. The request returned a clean 200, latency was fine, and the output was still wrong. LLM observability is the practice of instrumenting model calls so you can answer the questions traditional APM can't: which model actually answered, how many input, output, and cached tokens it burned, what that attempt cost, whether a fallback fired, and whether the response was any good. It is logging, monitoring, and tracing reframed around the three things that make a language model different from a normal microservice: non-determinism, per-token cost, and quality that drifts. Without per-call instrumentation, you cannot reliably attribute usage and cost to an individual call.
Why LLM observability differs from normal APM
APM assumes a request is correct if it returns the right status code quickly. That breaks the moment a language model is in the loop, for three structural reasons.
- Non-determinism. The same prompt can return different text on every call. A green status code tells you the API responded — not that it responded well — so success has to be measured at the content level, not the transport level.
- Token cost. Every call carries a price that scales with how much text went in and out. A "successful" request that quietly consumed 8,000 output tokens is an incident your error rate will never show — cost is a first-class signal here.
- Quality drift. Provider models change under the same name, prompts evolve, inputs shift. A route that worked last month can degrade with zero code changes and zero errors, and without a quality signal that decay is invisible until users complain.
So LLM observability layers cost and quality on top of the usual performance and reliability. Latency and uptime still matter — they're just no longer sufficient.
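To make "success at the content level" concrete, here is a minimal sketch of a validation check that could feed a quality signal. The function name `passed_validation` and the invoice fields are hypothetical, chosen only for illustration; a real route would validate whatever structure it expects.

```python
import json

def passed_validation(raw_output: str, required_keys: set[str]) -> bool:
    """Return True only if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)

# Both responses could arrive with HTTP 200; only the first is usable downstream.
good = '{"invoice_id": "A-17", "total": 42.5}'
bad = "Sure! Here is the invoice: A-17, total 42.5"

print(passed_validation(good, {"invoice_id", "total"}))
print(passed_validation(bad, {"invoice_id", "total"}))
```

Logging this boolean next to the transport status is what lets a dashboard separate "the API responded" from "the API responded well".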
What to capture on every call
The unit of LLM observability is the individual model attempt — and a solid per-attempt record includes:
| Signal | Why it matters |
|---|---|
| Request & response payload (opt-in) | The only way to reproduce a bad answer or build an eval set — but the most sensitive field, so it stays opt-in. |
| Model & provider that answered | With routing and fallbacks the model you requested isn't always the one that replied; you can't attribute cost or quality without the real responder. |
| Tokens in / out / cached | The raw drivers of cost and latency; logged separately they let you compute spend and spot prompt bloat or a broken cache. |
| Latency & time-to-first-token (TTFT) | Total latency tracks the full call; TTFT tracks streamed responsiveness. They diverge, and users feel TTFT first. |
| Computed cost | Input, output, and cached tokens, each multiplied by that model's rate for that token type, per attempt: a defensible number, not a guess. |
| Error type | A 429, a 503, a content-filter block, and a context overflow each demand a different response. The status class is the triage key. |
| Retries & which fallback fired | One logical request can be several physical attempts; the chain reveals hidden cost and silent reliance on a backup. |
| Quality / success signal | Did the output pass validation — valid JSON, correct extraction, a passing rubric? Without it you measure spend but never cost per success. |
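
One way to picture the per-attempt record in the table is as a small data structure filled in by a wrapper around the model call. The sketch below is an assumption-heavy illustration: `AttemptRecord`, `ProviderError`, `call_with_fallback`, and the stub providers are all hypothetical names, and TTFT, payload capture, and cost are left out to keep it short.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AttemptRecord:
    route: str
    requested_model: str
    answered_model: str
    attempt_index: int                     # 0 = primary, 1+ = fallbacks
    tokens_in: int = 0
    tokens_out: int = 0
    tokens_cached: int = 0
    latency_s: float = 0.0
    error_type: Optional[str] = None       # e.g. "rate_limit", "context_overflow"

class ProviderError(Exception):
    """Hypothetical error carrying a coarse class used for triage."""
    def __init__(self, error_type: str):
        super().__init__(error_type)
        self.error_type = error_type

# A model call returns (text, tokens_in, tokens_out, tokens_cached).
ModelCall = Callable[[str], tuple[str, int, int, int]]

def call_with_fallback(route: str, prompt: str, chain: list[tuple[str, ModelCall]]):
    """Try each (model_name, call) in order and log one record per physical attempt."""
    records: list[AttemptRecord] = []
    requested = chain[0][0]
    for index, (model, call) in enumerate(chain):
        record = AttemptRecord(route, requested, model, index)
        started = time.perf_counter()
        text = None
        try:
            text, record.tokens_in, record.tokens_out, record.tokens_cached = call(prompt)
        except ProviderError as exc:
            record.error_type = exc.error_type
        record.latency_s = time.perf_counter() - started
        records.append(record)
        if text is not None:
            return text, records
    return None, records

# Stub providers standing in for real SDK calls.
def primary(prompt: str):
    raise ProviderError("rate_limit")

def backup(prompt: str):
    return "ok", 120, 30, 0

text, records = call_with_fallback(
    "chat", "hello", [("model-a", primary), ("model-b", backup)]
)
for r in records:
    print(r.attempt_index, r.answered_model, r.error_type)
```

The logical request produced two records: a failed attempt on the requested model and a successful one on the backup. Keeping both is what exposes hidden retries and silent reliance on a fallback.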
Reconcile computed cost against the bill
Computed cost per attempt is the signal teams most regret skipping — provider consoles report yesterday's aggregate spend, not the cost of this route on this call. But it's only trustworthy if it ties out, so reconcile your per-attempt totals against each provider's actual bill; if they don't roughly match, your accounting or rate table is wrong — fix that before you trust any savings claim.
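As a rough sketch of that accounting, the snippet below computes cost per attempt from a rate table and checks the total against a billed amount. The prices, the model name, and the 2% tolerance are placeholders rather than real provider figures, and `tokens_in` is assumed to count uncached input tokens only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rate:
    input_per_mtok: float    # USD per million uncached input tokens
    output_per_mtok: float   # USD per million output tokens
    cached_per_mtok: float   # USD per million cached input tokens

# Hypothetical prices; a real table would come from each provider's price list.
RATES = {"model-a": Rate(3.00, 15.00, 0.30)}

def attempt_cost(model: str, tokens_in: int, tokens_out: int, tokens_cached: int) -> float:
    """Cost of one attempt in USD, pricing each token type at its own rate."""
    rate = RATES[model]
    return (
        tokens_in * rate.input_per_mtok
        + tokens_out * rate.output_per_mtok
        + tokens_cached * rate.cached_per_mtok
    ) / 1_000_000

def reconcile(computed_total: float, billed_total: float, tolerance: float = 0.02):
    """Return (ok, relative_gap); ok means the totals agree within the tolerance."""
    if billed_total == 0:
        return computed_total == 0, (0.0 if computed_total == 0 else float("inf"))
    gap = abs(computed_total - billed_total) / billed_total
    return gap <= tolerance, gap

cost = attempt_cost("model-a", tokens_in=1_000_000, tokens_out=100_000, tokens_cached=500_000)
ok, gap = reconcile(computed_total=101.0, billed_total=100.0)
print(round(cost, 4), ok, round(gap, 4))
```

A gap outside the tolerance usually points at a stale rate table or at attempts (retries, fallbacks) that were never logged, so it is worth investigating before any optimization work.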
Tracing multi-step, agent, and RAG chains
A single completion is the easy case. Real systems chain calls: a RAG pipeline retrieves, reranks, then generates; an agent plans, calls tools, observes, and loops. When the final answer is wrong, the per-call logs above don't tell you which step failed. The distributed-tracing model you already know closes that gap: a trace is the whole user-facing operation, and each model call, retrieval, or tool invocation is a span within it. Propagate one trace ID across every step and record, per span:
- Step identity and order — name and sequence, so a 7-step agent loop reads top to bottom.
- Inputs and outputs of each step (opt-in) — retrieved chunks, tool arguments and results, intermediate completions. A RAG answer is usually wrong because retrieval returned the wrong context, which you only see if those documents are on the span.
- Per-step tokens, latency, and cost — to find the one expensive step in an otherwise cheap chain and roll up a true per-trace total.
- Parent/child links — so nested tool calls and sub-agents stay attributable to the request that spawned them.
Now "the agent gave a bad answer" becomes "step 3 retrieved irrelevant context, so step 4 hallucinated" — a fault you can fix. Aggregated, a trace also gives you the real, all-in cost of one agent run, retries included.
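The span fields listed above can be sketched with a few lines of plain Python. This is an illustrative in-memory model, not a tracing library: `Span`, `Trace`, and the step names are hypothetical, and a production system would more likely emit spans through an existing tracing SDK.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    trace_id: str
    name: str
    sequence: int                          # order of the step within the trace
    parent_id: Optional[str] = None        # link back to the spawning span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    tokens_in: int = 0
    tokens_out: int = 0
    latency_s: float = 0.0
    cost_usd: float = 0.0

class Trace:
    """One user-facing operation; every step shares the same trace ID."""
    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []

    def add_span(self, name: str, parent: Optional[Span] = None, **measurements) -> Span:
        span = Span(
            self.trace_id, name, len(self.spans),
            parent_id=parent.span_id if parent else None, **measurements,
        )
        self.spans.append(span)
        return span

    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    def most_expensive(self) -> Span:
        return max(self.spans, key=lambda s: s.cost_usd)

    def children_of(self, parent: Span) -> list[Span]:
        return [s for s in self.spans if s.parent_id == parent.span_id]

# A small RAG chain: retrieve, rerank, generate, all under one root span.
trace = Trace()
root = trace.add_span("answer_question")
trace.add_span("retrieve", parent=root, latency_s=0.08)
trace.add_span("rerank", parent=root, latency_s=0.05, cost_usd=0.0004)
trace.add_span("generate", parent=root, tokens_in=3200, tokens_out=400, cost_usd=0.0156)

print(trace.most_expensive().name, round(trace.total_cost(), 4))
```

Because every span carries the trace ID and a parent link, the per-trace total and the single expensive step fall out of simple aggregation.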
Metrics and dashboards
Records and traces debug one request. To run the system you aggregate them into a few metrics, sliced by model, provider, and especially route or feature — "chat" and "background summarization" have nothing in common operationally.
- Cost per route: spend grouped by feature and model, ideally as cost per successful request, not raw dollars. Tells you where the money goes and what to optimize first.
- Success rate: the share of calls that passed validation, per route and model. A dip is your earliest warning of quality drift, often before a user files a ticket.
- p95 / p99 latency and TTFT: tail latency, not the average, which hides the slow requests that frustrate users. Track TTFT separately for streamed paths.
- Error and fallback rate: 429/5xx frequency and how often a fallback fired; a climbing fallback rate means your primary provider is quietly degrading.
- Token volume per route: input vs. output vs. cached over time, to catch prompt bloat and confirm caching is getting hits.
Then alert on what users feel: a success-rate drop on a key route, a cost spike, p95 crossing budget, or a fallback rate that jumps after a provider deploys.
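A minimal sketch of how per-attempt records might roll up into those metrics follows. The record shape (plain dicts with `route`, `cost_usd`, `latency_s`, `success`, `fallback`) and the sample numbers are assumptions made for this example, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import defaultdict

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def route_metrics(records: list[dict]) -> dict:
    """Aggregate per-attempt records into per-route dashboard metrics."""
    by_route = defaultdict(list)
    for record in records:
        by_route[record["route"]].append(record)

    metrics = {}
    for route, rows in by_route.items():
        successes = sum(1 for r in rows if r["success"])
        total_cost = sum(r["cost_usd"] for r in rows)
        metrics[route] = {
            "success_rate": successes / len(rows),
            # Failed calls still cost money, so divide all spend by successes only.
            "cost_per_success": total_cost / successes if successes else None,
            "p95_latency_s": percentile([r["latency_s"] for r in rows], 95),
            "fallback_rate": sum(1 for r in rows if r["fallback"]) / len(rows),
        }
    return metrics

sample = [
    {"route": "chat", "cost_usd": 0.01, "latency_s": 0.5, "success": True, "fallback": False},
    {"route": "chat", "cost_usd": 0.01, "latency_s": 0.7, "success": True, "fallback": False},
    {"route": "chat", "cost_usd": 0.02, "latency_s": 0.6, "success": True, "fallback": False},
    {"route": "chat", "cost_usd": 0.04, "latency_s": 2.0, "success": False, "fallback": True},
]
chat = route_metrics(sample)["chat"]
print(chat)
```

An alert rule then becomes a comparison against these values, for example a success rate falling below a per-route threshold or p95 latency crossing its budget.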
Privacy and PII handling
Prompts and completions are your richest debugging data and frequently the most sensitive — they carry names, emails, credentials, source code, or health and financial details. Treat payload capture as a privilege, not a default.
- Make raw capture opt-in, per route. Log metadata — tokens, latency, cost, model, status — everywhere, but store full request/response bodies only where you've deliberately enabled it. Metadata alone powers nearly every metric above without holding user content.
- Redact before storage. Run PII detection and strip or hash sensitive spans on the way in, so the durable record never holds the raw secret.
- Set retention deliberately. Keep payloads only as long as you need them for debugging or evals — short, explicit TTLs shrink your exposure if a log store is ever breached.
- Scope access. Logs containing user content deserve the same access controls and audit trail as your production database.
The result is a both/and: full fidelity on the routes where you've accepted the tradeoff, metadata-only everywhere else, never an accidental archive of customer secrets.
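The opt-in and redact-before-storage rules can be sketched as follows. Everything named here is hypothetical: the two regular expressions are deliberately simple stand-ins for a real PII detector, the `sk-` key shape is just an example pattern, and `HASH_KEY` would come from a secret store rather than source code.

```python
import hashlib
import hmac
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")   # example secret shape only

HASH_KEY = b"replace-with-a-managed-secret"        # placeholder, not a real key
CAPTURE_PAYLOADS = {"support_chat"}                # routes that opted in to raw capture

def _pseudonymize(match: re.Match) -> str:
    """Replace an email with a keyed hash so repeats stay correlatable."""
    digest = hmac.new(HASH_KEY, match.group(0).encode(), hashlib.sha256).hexdigest()
    return f"<email:{digest[:12]}>"

def redact(text: str) -> str:
    text = EMAIL.sub(_pseudonymize, text)
    return API_KEY.sub("<secret>", text)

def build_log_entry(route: str, prompt: str, completion: str, metadata: dict) -> dict:
    """Always log metadata; attach redacted payloads only for opted-in routes."""
    entry = {"route": route, **metadata}
    if route in CAPTURE_PAYLOADS:
        entry["prompt"] = redact(prompt)
        entry["completion"] = redact(completion)
    return entry

prompt = "My key is sk-ABCDEFGHIJKLMNOP1234 and my mail is jane@example.com"
meta = {"tokens_in": 18, "tokens_out": 5, "latency_s": 0.4}

captured = build_log_entry("support_chat", prompt, "Noted.", meta)
metadata_only = build_log_entry("batch_summary", prompt, "Noted.", meta)
print(captured["prompt"])
print(sorted(metadata_only))
```

Redaction runs before the entry is built, so the raw secret never reaches the durable record, and routes outside the opt-in set store no payload at all.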
How a gateway centralizes it automatically
You can bolt all of this onto every service by hand. But notice where the signals live: model, provider, tokens, latency, cost, error class, retries, and which fallback fired are exactly the fields a router already touches on every request. That makes the gateway, the layer between your app and the providers, the natural collection point. Instrument it once and every call your fleet makes is observed, with no per-service wiring and nothing to re-verify when a provider or price changes. For the broader picture, see "What is an LLM gateway"; for the cost model behind per-call accounting, see "AI tokenomics".
That's the gap flo2 fills. It's a developer-first, bring-your-own-key gateway: one OpenAI- and Anthropic-compatible key routes each request to the cheapest or fastest model that meets your bar, with fallback chains, AI racing, and opt-in response caching. Because it sits on the request path, it gives you true per-call cost accounting for free — logging tokens in, out, and cached, throughput, and the computed cost of every attempt, including which fallback fired. Its A/B testing with an LLM judge even turns "model–task fit" into a signal you can watch over time. All at zero token markup, since you pay providers directly with your own keys — a zero-markup OpenRouter alternative, free during Beta.