You can't debug what you can't trace

A refactor lands. Three agents shipped it — one read the diff and summarized the intent, one rewrote the call sites, one ran the test suite and reported it green — and every one of them logged success. The logs agree: files read, model called, tests invoked, patch applied, no errors. A week later production throws on an input no test covered, the rollback is ugly, and the post-mortem cannot say which agent's decision was the wrong one. The on-call engineer scrolls through a flat stream of timestamped events for an hour and learns nothing, because the failure is not in any single event. It is in the space between them.

That space is exactly what a log cannot show you and a trace can. As agent systems go from one model call to a coordinated mesh of them, the gap between "we have logging" and "we can debug this" becomes the gap between a system you operate and a system that fails after the demo.

Why event logs stop working

Event-based logging was built for a world of discrete, centrally orchestrated steps. A request comes in, a handful of known things happen in a known order, you log each one, and when something breaks you read the line where it broke. That model holds right up until control stops being central.

A multi-agent system has no center. Agents pass partial results to each other, make decisions based on what they received, and the interesting failures are emergent: information that was correct when agent A produced it is gone by the time agent C needs it; two agents wait on each other; one agent's mistake quietly poisons a shared context that three downstream agents then trust. None of those produce an error line. Each agent did its job and logged success. The log is a pile of true statements that together explain nothing.

Published research on agent failures puts hard numbers on this — execution traces across many frameworks show failure rates that would be alarming in any other software, clustering into system-design problems, inter-agent misalignment, and verification gaps. The common thread is that the failures live in coordination, and coordination is invisible to per-component logging.

What a trace actually is

A trace is not a fancier log. It is a different data structure: a tree, not a stream.

Each unit of work is a span (an LLM generation, a retrieval, a tool call), and every span carries its own ID and its parent's ID alongside its timing, inputs, and outputs. Spans that belong to the same request share one trace ID. The IDs are the whole trick: they are enough to reconstruct the entire causal shape of a request across every agent and tool boundary, and the rest of the span's payload fills in the detail. You can see what called what, in what order, with what inputs, and where the time and the tokens went. A flat log tells you that fifteen things happened. A trace tells you that this span caused those three, which fed that one, which is where the context got truncated.
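To make the tree concrete, here is a minimal sketch of rebuilding it from a flat list of spans. The `Span` class, the `build_tree` and `render` helpers, and the span names are all hypothetical, chosen for illustration; a real tracing backend stores more per span, but the linking step relies on the same three IDs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique to this unit of work
    parent_id: Optional[str]  # None marks the root span
    name: str
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link a flat list of spans from one trace into a tree via parent IDs."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented line per span, parents before children."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans, in the arbitrary order a collector might receive them.
flat = [
    Span("t1", "s3", "s2", "tool:run_tests"),
    Span("t1", "s1", None, "request:refactor"),
    Span("t1", "s2", "s1", "agent:rewrite_call_sites"),
    Span("t1", "s4", "s1", "agent:summarize_diff"),
]
tree = build_tree(flat)
print("\n".join(render(tree)))
```

Arrival order does not matter: the parent IDs alone recover that the test run happened inside the call-site rewrite, which is the relationship a flat, timestamp-sorted stream cannot express.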

Two things turn a trace tree from a timing diagram into a debugging instrument:

Semantic context on the span. Not just "retrieval ran," but what it retrieved, why this tool was chosen, the confidence the agent attached, and which memory it read. The sequence shows what happened; the semantic context shows why the agent thought it should.
Version stamps on the span. Which prompt version, which tool config, which agent revision was live when this span ran. The moment your agents start adapting from feedback, this is the only thing that lets you connect a new failure to the change that caused it — but that is its own discussion.
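As a rough illustration of both ideas, the sketch below attaches semantic context and version stamps to spans as plain key-value attributes, then groups failed spans by one stamp. Every attribute name here (`agent.confidence`, `prompt.version`, and so on) is an assumption made up for this example, not a standard schema, and `failures_by` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    ok: bool
    attributes: dict = field(default_factory=dict)


def failures_by(spans: list[Span], key: str) -> dict:
    """Count failed spans grouped by one attribute, such as a version stamp."""
    counts: dict = {}
    for s in spans:
        if not s.ok:
            value = s.attributes.get(key, "<unstamped>")
            counts[value] = counts.get(value, 0) + 1
    return counts


spans = [
    Span("retrieval", ok=True, attributes={
        # Semantic context: what was fetched, why, and how sure the agent was.
        "retrieval.query": "callers of parse_config",
        "retrieval.doc_ids": ["d17", "d42"],
        "agent.tool_choice_reason": "need call sites before rewriting",
        "agent.confidence": 0.62,
        "memory.key_read": "refactor/plan",
        # Version stamps: what was live when this span ran.
        "prompt.version": "v7",
        "tool.config": "retriever-2",
        "agent.revision": "a3",
    }),
    Span("generation", ok=True, attributes={"prompt.version": "v7"}),
    Span("generation", ok=False, attributes={"prompt.version": "v8"}),
    Span("generation", ok=False, attributes={"prompt.version": "v8"}),
]

print(failures_by(spans, "prompt.version"))  # {'v8': 2}
```

With the stamp on every span, "failures started after the prompt changed" becomes a query instead of a hunch.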

The failure taxonomy you're actually debugging

Once you can see the tree, the recurring multi-agent failure modes stop being mysteries and become recognizable shapes:

Context collapse: the window fills with intermediate output, and a critical earlier value (a confidence score, a flag) is truncated before it reaches the agent that needed it.
Cascading errors: one agent's wrong-but-plausible output becomes another agent's trusted input, and the error compounds downstream.
Coordination deadlocks: two agents each wait on the other, so neither proceeds.
Retry storms: a miscoordination triggers retries that multiply across agents until they look like a load spike.
Context pollution: one agent writes bad data into shared memory and every reader inherits it.

A flat log renders all of these as "everything succeeded, output was wrong." A trace renders each as a specific, fixable picture.
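Two of these shapes are simple enough to detect mechanically once the span data exists. The sketch below is illustrative only: it assumes you can extract a "who is waiting on whom" map and a list of (parent span ID, span name) pairs from your traces, and the function names, agent names, and threshold are all hypothetical.

```python
from collections import Counter


def find_wait_cycle(waits_on: dict[str, str]) -> list[str]:
    """Return the agents forming a wait cycle, or [] if there is none.

    waits_on maps each blocked agent to the single agent it is waiting for.
    """
    for start in waits_on:
        path: list[str] = []
        seen: set[str] = set()
        current = start
        # Follow the chain of waits until it ends or revisits an agent.
        while current in waits_on and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_on[current]
        if current == start:
            return path
    return []


def find_retry_storms(
    spans: list[tuple[str, str]], threshold: int = 5
) -> list[tuple[str, str, int]]:
    """Flag (parent_id, name) pairs that occur at least `threshold` times."""
    counts = Counter(spans)
    return [(parent, name, n) for (parent, name), n in counts.items() if n >= threshold]


# Coordination deadlock: planner and coder each wait on the other.
waits = {"planner": "coder", "coder": "planner", "tester": "coder"}
print(find_wait_cycle(waits))  # ['planner', 'coder']

# Retry storm: the same tool call repeated six times under one parent span.
children = [("s1", "tool:search")] * 6 + [("s1", "llm:generate")]
print(find_retry_storms(children))  # [('s1', 'tool:search', 6)]
```

Neither check is possible against a flat log, because both depend on relationships between events (who waits on whom, which spans share a parent) rather than on any single event.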

Observability is part of the build, not an add-on

The mistake we see most is treating tracing as something you bolt on after the agent works in a demo. By then it is too late and too expensive — you are retrofitting instrumentation into a system whose failures you already cannot see. Observability is architecture. The decision about what a span captures, where the trace boundaries are, and how context is propagated across agents is made at design time or not made well at all. It is the same lesson as putting a verifier in the runtime and writing the 3 a.m. runbook before 3 a.m.: the system that survives production is the one that was built to be looked inside.
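Context propagation is the piece most often left for later, so here is a minimal sketch of it. The string format follows the W3C Trace Context `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, flags); the `handoff` envelope and the helper names are hypothetical, and a production system would normally lean on a tracing library rather than hand-rolling this.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, random trace ID and span ID, sampled flag 01."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for this agent's work."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def handoff(payload: dict, traceparent: str) -> dict:
    """Hypothetical message envelope: the trace context travels with the payload."""
    return {"traceparent": traceparent, "payload": payload}


# Agent A starts the trace and hands work to agent B.
a_ctx = new_traceparent()
message = handoff({"task": "rewrite call sites"}, a_ctx)

# Agent B continues the same trace; the parent of its span is A's span ID.
b_parent_span = message["traceparent"].split("-")[2]
b_ctx = child_traceparent(message["traceparent"])

print("A:", a_ctx)
print("B:", b_ctx, "(parent span:", b_parent_span + ")")
```

The design decision hiding in this sketch is that every inter-agent message has a slot for the context. If the envelope has no such slot on day one, each agent starts a fresh trace and the tree falls apart at exactly the boundaries where the interesting failures live.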

"The AI messed up" is not a debuggable statement. It is the sound of a team that shipped a multi-agent system with logs instead of traces and is now doing archaeology instead of engineering.

So build the tracing in from the first span. The instrument you skip at design time is exactly the one you reach for during the incident — and by then it is too late to add.