In Probabilistic Systems, You Watch the Shape of Success

All green. All wrong.

AI Architect. I design the boring control plane plumbing that keeps your impressive demos from quietly setting themselves on fire in production.

🗓️ Last updated June 2026. Field names and stability levels track the still evolving OpenTelemetry GenAI semantic conventions; check them against the official spec on GitHub before adoption.

There's a moment that arrives in most agentic AI projects once they hit real users.

Picture it: a user complains. The agent gave a wrong answer, or booked the wrong thing, or escalated something it shouldn't have. The team opens the logs.

The logs show what you'd expect. A long, branching session with many LLM calls, tool invocations, retries, and at least one compensating action. Every request returned 200. The final response looks plausible, but it's wrong.

Now figure out which step was the wrong one.

This is the observability gap that classical infrastructure thinking doesn't fully prepare you for. Classical systems have plenty of nondeterminism (caches, retries, eventual consistency, scheduler effects), and modern LLMs can be made more deterministic with temperature=0, fixed seeds, and constrained decoding. So this isn't a clean deterministic versus probabilistic split. But the center of mass shifts. In a classical service, observability usually tells you what failed. In an agent system, nothing fails. Everything succeeds, plausibly. The question is no longer "did it work?" It's "did it do the right thing, at acceptable cost, within acceptable latency, and would a sensible reviewer agree?"

That is a fundamentally different question, and it needs a fundamentally different observability stack.


What "Working" Means Has Changed

In a classical service, "working" is a contract you can assert against:

  • The endpoint returned 200
  • The JSON matched the schema
  • The latency was under SLA
  • The downstream services received the expected payload

When any of those fail, you have a green to red signal. You alert, you page, you fix.

In an agent system, the model returns plausible output every time. The endpoint returns 200. The schema validates. The latency is fine. The downstream APIs got valid calls. And the user is still unhappy because the agent picked the wrong tool, or hallucinated a constraint, or summarized the wrong document, or chose to ask a clarifying question when it should have acted, or vice versa.

There is no green light. There is only a distribution of outcomes, and the question of whether that distribution is acceptable.

"Working" stops being a boolean. It becomes a measured property of the system.

That single shift is why observability for agents is not a tooling upgrade. It's a category change.


The Three Axes of Agent Observability

It's easy to instrument only one axis, usually structural, because that's the muscle classical infrastructure already has. You need all three.

1. Behavioral observability: what did the model decide, and why?

This is the trace of the model's reasoning: the prompts, the retrieved context, the tool descriptions it saw, the generation parameters in force, the tool it picked, the arguments it filled in, and the response it produced. Without this, you cannot debug why the agent made a choice, only that it did.

2. Structural observability: what calls actually ran?

This is the classical distributed systems trace: which services were invoked, which retried, which timed out, which compensated, which produced which side effects. This is the layer your infrastructure team already knows how to instrument.

3. Economic observability: what did it cost?

This is the axis most teams discover the month finance forwards the bill with a one word subject: "questions?" Token counts per turn, dollars per session, cost per resolved task, drift over time. Agents that work but cost too much are a different kind of failure, one classical systems rarely had.

A complete observability stack ties all three axes to a single trace ID, so when something goes wrong you can answer all three questions simultaneously.


Why Traditional Logging Falls Apart

Three concrete problems show up the first time a team tries to instrument an agent with classical logging.

1. Volume. LLM inputs and outputs are large. A single multiturn session can carry tens of thousands of tokens of context, retrieved chunks, tool descriptions, and model responses. Logging all of it raw is affordable right up until your traffic is meaningful, at which point it very much is not. You need a tiered approach: keep full payloads for a short retention window in a separately governed store, keep structured metadata long term, and tail sample the traces themselves (see below).

2. Causality is not obvious. When the agent produces a wrong answer, the cause could be in the retrieval step, the prompt template, the tool description, the model's reasoning, a retry that ran on stale state, or a tool that returned partial data. Flat logs do not surface this. You need linked spans across the whole session.

3. Silent failure. This is the killer. The system can return success on every individual call, and yet produce a wrong outcome. There's no exception to log, no error to alert on. The only signal is the gap between what happened and what should have happened, and "what should have happened" lives outside the runtime, in your evaluation set.

Logging alone cannot close that gap. Observability for agents needs a runtime layer and an offline evaluation layer.


Tracing: The Core Runtime Primitive

OpenTelemetry has become the dominant open standard for distributed tracing across languages and platforms. It's a CNCF project with broad vendor support, and it's the right starting point for agent observability for the same reasons it's the right starting point for microservices: the standard is vendor neutral, and the ecosystem around it is large.

The OpenTelemetry community has a Generative AI semantic conventions working group that defines a gen_ai.* attribute namespace for LLM calls, covering model, token usage, finish reasons, request parameters, and more. As of June 2026 the conventions are still partly experimental, so expect some breaking changes as the spec matures. Adopting them early is still the right call, because the alternative is a proprietary schema you'll have to migrate later anyway, but plan for at least one rename pass.

The shape that works in practice:

  • Each user request produces one trace (for short lived requests), or each user session produces one trace (for long running conversations, async work, or human in the loop workflows). Pick one and be consistent per surface.
  • Each LLM call, tool call, retrieval, retry, and compensation produces one span.
  • Spans link to parents to form a DAG, not strictly a tree. Use OTel span links to model concurrent execution where parallel results merge back together (for example, a planner that triggers parallel tool calls whose results merge).
  • Behavioral, structural, and economic data hang off the spans as attributes and span events.

This sounds straightforward. The discipline is keeping it consistent across every codepath, not letting any agent internal call slip through uninstrumented.
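
To make the shape concrete, here is a minimal sketch of one trace in which a planner fans out to two parallel tool calls whose results merge. The Span class is a hypothetical stand-in written for illustration, not the OpenTelemetry SDK type, and the span names are invented; in a real system the SDK would create the spans and you would attach links through its API.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid


@dataclass
class Span:
    """Illustrative span record (an assumption, not the OpenTelemetry SDK type)."""
    trace_id: str
    name: str
    parent_span_id: Optional[str] = None
    # span_ids whose results this span merges; this is what span links model
    links: list[str] = field(default_factory=list)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


trace_id = uuid.uuid4().hex
root = Span(trace_id, "user_request")
planner = Span(trace_id, "llm.plan", parent_span_id=root.span_id)

# Two tool calls fan out from the planner and run concurrently.
flights = Span(trace_id, "tool.search_flights", parent_span_id=planner.span_id)
hotels = Span(trace_id, "tool.search_hotels", parent_span_id=planner.span_id)

# The merge step has a single parent but links to both tool spans.
# Those extra edges are what turn the tree into a DAG.
merge = Span(
    trace_id,
    "llm.summarize",
    parent_span_id=planner.span_id,
    links=[flights.span_id, hotels.span_id],
)

spans = [root, planner, flights, hotels, merge]
```

Every span shares one trace_id, so a query for that ID returns the whole session, and the links on the merge span record which parallel results it actually consumed.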


What to Capture per Span

What follows is a practical, minimum viable span schema for an agent system. Where the OTel GenAI semconv defines a field, the convention name is shown in parentheses.

Common fields (every span):

  • trace_id, span_id, parent_span_id
  • session_id, user_id (hashed, and kept as a span attribute rather than a metric label, to avoid cardinality blow up)
  • turn_index
  • start_time, end_time, latency_ms
  • outcome: one of success, parse_fail, tool_fail, timeout, compensated, retry_N

LLM call spans:

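  • model_name (gen_ai.request.model), served_model (gen_ai.response.model), provider (gen_ai.system; newer semconv releases rename this to gen_ai.provider.name, so check which version your instrumentation targets)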
  • operation (gen_ai.operation.name): chat, text_completion, embeddings, and so on
  • Generation parameters: temperature (gen_ai.request.temperature), top_p, max_tokens, seed, stop_sequences. Without these you cannot reproduce a decision.
  • input_tokens (gen_ai.usage.input_tokens), output_tokens (gen_ai.usage.output_tokens), cached_tokens
  • cost_usd, computed and stored once at log time. For direct provider calls, use the price effective when the call ran. For aggregators and routers (LiteLLM, Bedrock, Vertex) where the effective price may only be known via a delayed usage report, log a cost_usd_estimated and reconcile against billing later.
  • prompt_template_id plus prompt_template_version. These are far more analytically useful than a hash of the fully rendered prompt, because rendered prompts are nearly unique per call and can't be grouped.
  • rendered_prompt_hash and rendered_prompt_preview (first N chars; the full prompt lives in a separately governed cold store)
  • tool_choice (which tool the model picked, if any) and tool_schema_version (so a shift in selection rate can be correlated with a change in tool description)
  • finish_reason (gen_ai.response.finish_reasons)
  • Prompts and completions: emit as span events (gen_ai.content.prompt, gen_ai.content.completion) rather than attributes, per the semconv. This keeps cardinality manageable and makes opt in or opt out of message content a clean toggle.

Tool call spans:

  • tool_name, tool_version, tool_schema_version
  • arguments_redacted (PII and secrets removed at the observability boundary)
  • idempotency_key: the same key used by the tool runtime to dedupe retries (see part 3). It is deliberately stable across traces and retries; that is the whole point of the pattern. Log it on every span that touches the operation so you can group across trace boundaries.
  • outcome
  • external_call_trace_id (so you can link to the downstream service's trace)
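
As a sketch of how the LLM call span fields listed above might be flattened into attributes: the gen_ai.* keys are the ones cited in this section, while the "app." prefix for fields without a convention name is an assumption made for this example, as are the model and template names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMCallSpan:
    """Illustrative record for one LLM call span (hypothetical helper)."""
    model_name: str
    served_model: str
    operation: str
    temperature: float
    input_tokens: int
    output_tokens: int
    finish_reason: str
    prompt_template_id: str
    prompt_template_version: str
    cost_usd: float
    tool_choice: Optional[str] = None
    outcome: str = "success"

    def to_attributes(self) -> dict:
        """Flatten into span attributes, using gen_ai.* names where the text cites one."""
        attrs = {
            "gen_ai.request.model": self.model_name,
            "gen_ai.response.model": self.served_model,
            "gen_ai.operation.name": self.operation,
            "gen_ai.request.temperature": self.temperature,
            "gen_ai.usage.input_tokens": self.input_tokens,
            "gen_ai.usage.output_tokens": self.output_tokens,
            "gen_ai.response.finish_reasons": [self.finish_reason],
            # No convention name is given for these; the "app." prefix is an assumption.
            "app.prompt_template_id": self.prompt_template_id,
            "app.prompt_template_version": self.prompt_template_version,
            "app.cost_usd": self.cost_usd,
            "app.outcome": self.outcome,
        }
        if self.tool_choice is not None:
            attrs["app.tool_choice"] = self.tool_choice
        return attrs


span = LLMCallSpan(
    model_name="example-model",
    served_model="example-model-2026-01",
    operation="chat",
    temperature=0.0,
    input_tokens=1200,
    output_tokens=85,
    finish_reason="tool_calls",
    prompt_template_id="booking_planner",
    prompt_template_version="v7",
    cost_usd=0.0049,
    tool_choice="search_flights",
)
attributes = span.to_attributes()
```

Keeping the mapping in one place is what makes the eventual rename pass a one file change rather than a hunt through every callsite.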

Two principles to keep in mind:

  • Hash before you log. Store full prompts in a separate, retention bounded store with stricter access controls; keep only template IDs, hashes, and previews in your hot observability layer. This keeps query latency low and storage cost predictable.
  • Compute cost once. Store the dollar figure at log time and never recompute historical cost from today's pricing; the conversion rate moves the moment a provider changes a price.
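
Both principles fit in a few lines. The sketch below assumes a hypothetical price table in USD per million tokens (the model name and prices are invented) and uses Decimal so stored figures do not pick up float rounding.

```python
import hashlib
from decimal import Decimal

# Hypothetical price snapshot, USD per million tokens. In practice this would be
# the price effective when the call ran, not a table you edit in place.
PRICES_PER_MTOK = {
    "example-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int,
             prices: dict = PRICES_PER_MTOK) -> Decimal:
    """Compute the dollar cost once, at log time, from the price in force now."""
    p = prices[model]
    total = (p["input"] * input_tokens + p["output"] * output_tokens) / Decimal(1_000_000)
    return total.quantize(Decimal("0.000001"))


def prompt_fields(rendered_prompt: str, preview_chars: int = 40) -> dict:
    """Hot-store fields only: a hash for exact-match lookup and a short preview."""
    return {
        "rendered_prompt_hash": hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest(),
        "rendered_prompt_preview": rendered_prompt[:preview_chars],
    }


record = {
    "cost_usd": str(cost_usd("example-model", 12_000, 800)),
    **prompt_fields("You are a travel agent. Help the user rebook a cancelled flight."),
}
```

The preview is still raw text, so it should be taken after redaction, not before.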

Sampling: The Cost Lever Most Teams Miss

Span volume on an agent system at scale will dominate your observability bill if you keep everything. The standard answer is tail sampling: a collector buffers all spans for a trace, and at the end decides whether to keep it based on properties of the complete trace.

A sampling policy that works in practice:

  • 100% of traces with outcome != success on any span
  • 100% of traces above a cost or latency threshold
  • 100% of traces flagged by an online eval (see below) as low quality
  • A small uniform sample (say 5% to 10%) of the remainder, for baseline distributional metrics

Head sampling, which decides at trace start, cannot make these decisions, because the interesting properties of an agent trace only emerge once it finishes. Plan for tail sampling from the start.
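
The policy above can be written as a single decision function that runs once the trace is complete. This is a sketch of the logic only, not a collector configuration; the span dictionary keys, the thresholds, and the flagged_low_quality input are assumptions for illustration.

```python
import hashlib


def keep_trace(trace_id: str, spans: list[dict], *,
               cost_ceiling_usd: float = 0.50,
               latency_ceiling_ms: int = 30_000,
               flagged_low_quality: bool = False,
               baseline_rate: float = 0.05) -> bool:
    """Decide, after the trace has finished, whether to keep it."""
    # 1. Any span that did not succeed: always keep.
    if any(s["outcome"] != "success" for s in spans):
        return True
    # 2. Expensive or slow traces: always keep.
    total_cost = sum(s.get("cost_usd", 0.0) for s in spans)
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if total_cost > cost_ceiling_usd or duration_ms > latency_ceiling_ms:
        return True
    # 3. Flagged by an online eval: always keep.
    if flagged_low_quality:
        return True
    # 4. Uniform baseline sample. Hashing the trace_id makes the decision
    #    stable, so every collector instance agrees on the same trace.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16) / 2**32
    return bucket < baseline_rate


healthy = [{"outcome": "success", "cost_usd": 0.01, "start_ms": 0, "end_ms": 900}]
failed = healthy + [{"outcome": "tool_fail", "cost_usd": 0.0, "start_ms": 900, "end_ms": 950}]
```

None of the first three checks can be evaluated at trace start, which is the practical argument against head sampling here.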


The Eval Loop: Observability's Missing Half

Runtime observability tells you what happened. It can't tell you whether what happened was right.

For that, you need evaluation. Treating evals as a first class part of your observability stack, not a side project, is one of the highest leverage architectural decisions you can make. It's also one that's easy to defer until it's expensive to retrofit.

A serviceable eval setup has three layers, not one.

1. A graded test set (offline). Real examples from your domain, ideally drawn from production traces, with expected behavior labeled. 50 to 100 examples is a reasonable seed, but for regression detection with LLM as judge at meaningful statistical confidence (catching a 5% drop without false alarming weekly) you should plan to grow toward several hundred. The cases the agent gets wrong in production are the cases you want most in the eval set.

2. Online (shadow) evals. A stratified sample of live production traces, scored continuously by the same rubric used offline. This is what catches drift you didn't anticipate, surfaces new failure modes, and feeds new cases back into the offline set. The offline set tells you about regressions on known failure modes; online evals tell you about unknown ones.

3. Counterfactual replay. Captured traces are a goldmine. You can replay the same prompts and state through a candidate model or prompt version offline, diff the outputs, and ship with much higher confidence. This is the bridge between runtime observability and the eval loop, and it only works if your traces capture enough state to replay a call faithfully (which is why generation parameters in the span schema are not optional). Even with a fixed seed and temperature=0, some providers do not guarantee identical output, so diff on decisions such as tool choice and arguments rather than on raw text.
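
A minimal sketch of counterfactual replay, comparing tool choice between the captured run and a candidate. The captured record layout and the candidate_stub function are hypothetical; a real candidate would call the new model or prompt version with the captured prompt and generation parameters.

```python
from typing import Callable


def replay(captured: list[dict],
           candidate: Callable[[str, dict], str]) -> list[dict]:
    """Re-run captured calls through a candidate and report changed decisions."""
    diffs = []
    for call in captured:
        new_choice = candidate(call["rendered_prompt"], call["params"])
        if new_choice != call["tool_choice"]:
            diffs.append({
                "span_id": call["span_id"],
                "before": call["tool_choice"],
                "after": new_choice,
            })
    return diffs


def candidate_stub(prompt: str, params: dict) -> str:
    """Stand-in for a candidate model; picks a tool from a keyword (assumption)."""
    return "issue_refund" if "refund" in prompt.lower() else "cancel_booking"


captured_calls = [
    {"span_id": "a1", "rendered_prompt": "Cancel my booking",
     "params": {"temperature": 0, "seed": 7}, "tool_choice": "cancel_booking"},
    {"span_id": "b2", "rendered_prompt": "Where is my refund?",
     "params": {"temperature": 0, "seed": 7}, "tool_choice": "lookup_refund"},
]

changed = replay(captured_calls, candidate_stub)
```

Here the candidate would issue a refund where production only looked one up, which is exactly the kind of change you want a human to see before it ships.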

Three common scoring methods, in increasing complexity:

  • Exact match for tasks with a single correct answer (right tool, right argument, right value)
  • Rubric based human scoring for tasks where multiple answers can be valid
  • LLM as judge for scale: a separate model scores outputs against a rubric. This technique has documented biases: position bias (preferring the first option in pairwise comparisons), verbosity bias (preferring longer answers), and self preference (preferring outputs from the judge's own model family). Mitigate with order randomization, conciseness instructions, and judging by multiple models; treat the judge's scores as directional, not absolute, and cross check periodically against human labels.
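
One way to handle position bias in pairwise judging is a stricter cousin of order randomization: ask the judge in both orders and discard verdicts that flip with position. The sketch below assumes a judge callable that returns "first" or "second"; both stub judges are invented to show the behaviour and do not call a model.

```python
from typing import Callable, Optional


def judge_pair(judge: Callable[[str, str], str], a: str, b: str) -> Optional[str]:
    """Return 'a' or 'b' if the verdict survives swapping the order, else None."""
    verdict_ab = judge(a, b)  # a shown first
    verdict_ba = judge(b, a)  # b shown first
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else None


def always_first(x: str, y: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_policy(x: str, y: str) -> str:
    """Stub judge that looks at content rather than position."""
    return "first" if "policy" in x else "second"


biased = judge_pair(always_first, "answer one", "answer two")
stable = judge_pair(prefers_policy, "see the refund policy", "no idea")
```

The rate at which judge_pair returns None is itself worth tracking: it is a cheap estimate of how much of the judge's signal is position noise. The cost is two judge calls per comparison.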

A trigger. Run the offline eval set on every prompt change, every model version change, every tool description change, and on a periodic schedule. A drop in eval score is a regression signal, the only one that catches silent quality failures.
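
Whether a drop is signal or noise depends heavily on the size of the eval set. A rough sketch using the normal approximation for the difference between two pass rates follows; it treats the two runs as independent samples and ignores judge noise, both simplifying assumptions, and a paired comparison on the same cases would be tighter.

```python
import math


def diff_margin(p_before: float, p_after: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin on the difference between two pass rates,
    each measured on n examples (normal approximation, independent runs)."""
    se = math.sqrt(p_before * (1 - p_before) / n + p_after * (1 - p_after) / n)
    return z * se


def is_regression(p_before: float, p_after: float, n: int) -> bool:
    """True when the observed drop exceeds the noise margin."""
    return (p_before - p_after) > diff_margin(p_before, p_after, n)


margin_100 = diff_margin(0.80, 0.75, 100)
margin_600 = diff_margin(0.80, 0.75, 600)
```

Under these assumptions, a drop from 80% to 75% sits well inside the roughly 11.6 point margin at 100 examples and only clears the roughly 4.7 point margin at around 600, which is why a seed set is fine for getting started but not for gating releases on small regressions.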

Without evals, you have monitoring without truth. You'll see the dashboards stay green while quality silently degrades.


Metrics That Actually Matter

Once you have spans flowing, the metrics worth dashboarding are different from what you'd track for a classical service. They slice the three axes into something you can put on a dashboard: quality and behavioral metrics track the behavioral axis, latency and reliability track the structural axis, and cost tracks the economic one.

Quality

  • Task success rate (defined per task type, scored by eval set or human review)
  • Eval score over time (per prompt version, per model version, broken out by offline and online)
  • Drift: distribution shift in outputs week over week. Concretely, KL divergence on tool selection distributions, and embedding based clustering of outputs to detect new clusters.
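
The tool selection drift check is small enough to sketch in full. The tool names and counts below are invented, and the additive smoothing constant is an assumption; it exists so a tool that was never selected in one window does not produce a division by zero.

```python
import math
from collections import Counter


def distribution(tool_calls: list[str], tools: list[str], alpha: float = 1.0) -> dict:
    """Smoothed selection probabilities over a fixed tool vocabulary."""
    counts = Counter(tool_calls)
    total = len(tool_calls) + alpha * len(tools)
    return {t: (counts[t] + alpha) / total for t in tools}


def kl_divergence(p: dict, q: dict) -> float:
    """KL(p || q) in nats; p is the current window, q is the baseline."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


TOOLS = ["search", "book", "escalate"]
last_week = ["search"] * 70 + ["book"] * 25 + ["escalate"] * 5
this_week = ["search"] * 40 + ["book"] * 25 + ["escalate"] * 35

baseline = distribution(last_week, TOOLS)
current = distribution(this_week, TOOLS)
drift = kl_divergence(current, baseline)
```

The absolute number means little on its own. What matters is how it compares with the week to week values you saw while the system was healthy, so chart it for a while before you alert on it.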

Cost

  • Cost per session
  • Cost per resolved task (not per call; a session that needs 4 attempts to succeed costs roughly four times as much as one that succeeds on the first). This is the single most useful economic metric in an agent system, because it disambiguates "agent is cheap" from "agent is cheap and effective."
  • Token burn per turn over time
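
A short worked example shows why cost per resolved task and cost per session can point in opposite directions. The session figures are invented for illustration.

```python
from typing import Optional


def cost_per_session(sessions: list[dict]) -> float:
    return sum(s["cost_usd"] for s in sessions) / len(sessions)


def cost_per_resolved_task(sessions: list[dict]) -> Optional[float]:
    """Total spend divided by resolved sessions; None if nothing was resolved."""
    resolved = sum(1 for s in sessions if s["resolved"])
    if resolved == 0:
        return None
    return sum(s["cost_usd"] for s in sessions) / resolved


# Agent X: cheap per session, but resolves 2 of 10.
agent_x = [{"cost_usd": 0.01, "resolved": i < 2} for i in range(10)]
# Agent Y: three times the cost per session, but resolves 9 of 10.
agent_y = [{"cost_usd": 0.03, "resolved": i < 9} for i in range(10)]

x_resolved = cost_per_resolved_task(agent_x)
y_resolved = cost_per_resolved_task(agent_y)
```

Agent X looks three times cheaper per session and comes out more expensive per resolved task (about $0.05 against about $0.033), because unresolved sessions still cost money.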

Latency

  • P50, P95, P99, broken out per stage (retrieval, planning, tool execution, response generation)
  • End to end session latency

Reliability

  • Parse fail rate (the model returned output your parser couldn't handle). Note that constrained or structured decoding (OpenAI Structured Outputs, Anthropic tool_use, Gemini schema mode, grammar constrained decoding in open source runtimes) has materially lowered this rate. If yours is meaningful, switching on structured outputs is usually a faster fix than teaching your parser to forgive the model's creativity.
  • Retry rate per tool
  • Saga compensation rate

Behavioral

  • Tool selection distribution (which tools the agent reaches for, and how that drifts)
  • Clarification question rate (is the agent asking too much, or too little?)

You won't need all of these on day one. The minimum viable set is: task success rate, cost per resolved task, P95 latency, tool selection distribution, and parse fail rate. The rest are second pass.


Alerting on Probabilistic Systems

This is where classical alerting habits cause the most pain.

In deterministic systems, you often alert on individual events: error rate spiked, latency crossed threshold, queue depth exceeded. Those alerts work because the signal is binary.

In probabilistic systems, alerting on every individual failure of intent is noise. Some percentage of LLM calls will return malformed JSON (even with structured outputs, edge cases exist). Some retrievals will miss the right chunk. Some tool calls will fail. The right question for these is usually "is the distribution changing?", not "did one fail?"

The cleaner framing isn't "thresholds bad, distributions good." It's event level versus window level, and hard guardrails versus soft quality.

  • Event level, page immediately: hard guardrails. A refund tool fired twice, a single session crossed a hard cost ceiling, PII appeared in user visible output, an irreversible action ran without confirmation.
  • Window level, distributional: soft quality. Cost per resolved task drifts up by more than X% week over week; tool A's selection rate doubles overnight; eval score drops by more than Y% on a prompt change; P95 latency on the retrieval stage degrades over multiple windows.

The mistake is using event level alerting for soft quality signals (you'll drown in noise), or window level alerting for hard guardrails (you'll miss the single bad event that matters). Both habits show up, and the fix is to be explicit about which category each signal belongs in. The dashboards have to come first; the alerts come after you understand the shape of normal.
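
The two categories translate into two different kinds of check. In this sketch the tool name issue_refund, the span layout, and the 20% threshold are all hypothetical; the point is only that one function inspects individual events and the other compares aggregates between windows.

```python
from collections import Counter


def duplicate_refunds(tool_spans: list[dict]) -> list[str]:
    """Event level: idempotency keys for which a refund succeeded more than once.
    Any result here should page immediately."""
    executed = Counter(
        s["idempotency_key"]
        for s in tool_spans
        if s["tool_name"] == "issue_refund" and s["outcome"] == "success"
    )
    return [key for key, n in executed.items() if n > 1]


def window_drifted(current: float, baseline: float, max_relative_change: float) -> bool:
    """Window level: has an aggregate moved more than the allowed fraction?"""
    return abs(current - baseline) / baseline > max_relative_change


spans = [
    {"tool_name": "issue_refund", "idempotency_key": "order-17", "outcome": "success"},
    {"tool_name": "issue_refund", "idempotency_key": "order-17", "outcome": "success"},
    {"tool_name": "issue_refund", "idempotency_key": "order-42", "outcome": "success"},
    {"tool_name": "lookup_order", "idempotency_key": "order-42", "outcome": "success"},
]

page_now = duplicate_refunds(spans)
# Cost per resolved task: $0.050 last week, $0.065 this week, 20% tolerance.
cost_alert = window_drifted(0.065, 0.050, max_relative_change=0.20)
```

A single malformed response never reaches either function. It shows up only as a small move in a window level rate, which is where it belongs.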


The Privacy and PII Tax

Agent observability has a privacy problem that classical observability mostly doesn't.

When you log full prompts and responses, you capture whatever the user typed. That may include PII, credentials accidentally pasted in, account numbers, medical information, or anything else a user might share with a system they're treating as conversational. The model itself may also reflect or paraphrase that information back in its output, so output redaction matters as much as input redaction.

The mistake is to leave this to redaction scattered through application code. Every codepath that emits telemetry then has to remember to scrub, and the first one that forgets puts raw data in your backend. The right place is the observability boundary, the layer where spans are emitted to your telemetry backend. That layer should:

  • Detect and redact common PII patterns before storage, on both inputs and outputs
  • Hash user identifiers consistently across spans (so debugging still works without exposing identity)
  • Drop or hash full prompts in hot storage; keep them in a separately governed cold store with stricter access controls and shorter retention
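
A minimal sketch of the first two responsibilities, using only the standard library. The two regular expressions are deliberately simple illustrations and will miss plenty of real PII; the key handling is also an assumption, since in production the HMAC key would come from a secrets manager rather than a literal.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; real deployments need a broader, tested set.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact(text: str) -> str:
    """Replace matched spans with a label. Apply to inputs and outputs alike."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def hash_user_id(user_id: str, key: bytes) -> str:
    """Keyed hash: stable across spans, not reversible without the key."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


KEY = b"example-key-load-from-a-secret-store"  # placeholder, not a real secret
clean = redact("Mail jane@example.com, card 4111 1111 1111 1111 please")
user_ref = hash_user_id("user-8841", KEY)
```

A keyed hash is used rather than a bare SHA-256 because user identifiers are often guessable, and an unkeyed hash of a guessable value can be reversed by enumeration.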

A caveat worth stating clearly: pattern based PII detection is a floor, not a ceiling. It has known false negative rates on free form text, and on PII paraphrased by the model in output. For regulated domains (health, finance, legal), pair it with stricter input controls, ML based detectors (Microsoft Presidio, AWS Comprehend PII, cloud provider equivalents), and an explicit toggle to opt in or opt out of message content capture. The OTel GenAI semconv already separates content capture from structural attributes for exactly this reason.

This isn't optional. It's a compliance posture decision that becomes very expensive to retrofit. Build it before you scale.


What to Build First

A pragmatic order of operations for an agent observability stack:

  1. OpenTelemetry tracing around every LLM call and tool call, using the GenAI semantic conventions where they apply
  2. A cost and token tracker that attributes spend per session and per task type (this connects directly to the bounded cost architecture in part 5)
  3. A redaction layer at the observability boundary before scaling traffic
  4. Tail sampling in the collector before your hot storage bill outruns the project
  5. A graded offline eval set of 50 to 100 examples (growing toward several hundred), with regression alerts via LLM as judge on prompt and model changes
  6. Online or shadow evals scoring a stratified sample of production traces, feeding new failure cases back into the offline set
  7. Distribution dashboards for cost per resolved task, tool selection rate, eval score, parse fail rate, and P95 latency per stage
  8. Event level guardrail alerts for hard limits (cost ceilings, irreversibility, PII leakage), and window level distributional alerts for soft quality, only after you understand normal

The temptation is to build dashboards first, because dashboards are satisfying and span instrumentation is not. Resist it. The dashboards are worthless without the spans flowing through them. Get the trace data right, then the rest follows.


The Architect's Mental Model

The mental model that holds up:

Treat an agent as a probabilistic batch process whose outputs need runtime monitoring, offline scoring, and online scoring. Runtime gives you the dashboard. Offline evals give you regression detection against known failure modes. Online evals catch the unknown ones. None of the three alone is enough.

Classical observability often collapses these into one because in many deterministic systems, runtime success and correctness are tightly coupled. In agent systems, they aren't. A green dashboard with a degraded eval score means the system is running fine and getting worse. You will not catch that without all three layers.

The shape that holds up in practice: treat observability as a three axis problem (behavioral, structural, economic), instrument with open standards from day one, sample at the tail, redact at the boundary, and pair every dashboard with an eval set, offline and online.

In deterministic systems, you watch the failures.

In probabilistic systems, you watch the shape of success.


This is part 4 of Architecting Agents, a series on building agentic AI systems that survive production.
