The Compounding Errors Problem: Why Multi-Agent Systems Fail and the Architecture That Fixes It

Disclaimer: The best practices and architecture we are about to explain are not theoretical. This article was deliberately researched and drafted using a multi-agent pipeline that implements the interventions needed to fix the compounding-errors problem.


There is a specific kind of failure that doesn’t appear in demos. You build a multi-agent system. Each component works, you test it, you’re satisfied. The document analyser hits 92% accuracy. The data extractor rarely hallucinates. The reasoning agent produces coherent outputs when you run it alone. You wire them together and run the full workflow end to end. The success rate lands somewhere between 20% and 40%. Nothing specific broke. Nothing you can point to and fix with a better prompt or a smarter model. The pipeline failed because the steps were chained.

This is error compounding. It is not an implementation bug. It is arithmetic.

Understanding it, and building around it, is the central challenge of production multi-agent systems in 2026. Not capability. Not model quality. Architecture.


The Arithmetic Nobody Talks About

The mathematics is straightforward but its implications are routinely ignored. If you have a pipeline of *n* sequential steps, and each step succeeds with probability *p*, the probability that the full pipeline succeeds is p^n. That is the individual reliabilities multiplied together. Not averaged. Multiplied.

At 95% per-step accuracy across ten steps: 0.95^10 ≈ 60%.

At 90%: 0.90^10 ≈ 35%.

At 85% — which is a strong result for a complex reasoning task — 0.85^10 ≈ 20%.

A pipeline where each agent is individually impressive can succeed one time in five. The compounding is multiplicative regardless of where you start on the accuracy curve, which means no amount of model improvement fully solves the structural problem. Wand.ai’s production analysis puts it more viscerally: a 1% per-token error rate — an error rate that feels almost negligible — compounds to 87% cumulative failure by token 200.
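The arithmetic is easy to check directly. A minimal sketch in plain Python, assuming only the independent-steps model described above:

```python
def pipeline_success(p: float, n: int) -> float:
    """Probability that n sequential steps all succeed when each
    step succeeds independently with probability p."""
    return p ** n

# Per-step accuracies that look strong still collapse over a ten-step chain.
for p in (0.95, 0.90, 0.85):
    print(f"p = {p:.2f}, 10 steps -> {pipeline_success(p, 10):.0%}")

# The per-token framing: a 1% error rate compounded over 200 tokens.
cumulative_failure = 1 - pipeline_success(0.99, 200)
print(f"1% per-token error, 200 tokens -> {cumulative_failure:.0%} failure")
```

The same three-line function reproduces both the ten-step collapse and the 87% per-token figure, which is the point: nothing exotic is happening, only multiplication.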

But the mathematics of sequential multiplication is only the first layer of the problem.

The 2025 paper “The Illusion of Diminishing Returns” from researchers at MPI-INF and TU Kaiserslautern isolates a second mechanism they call self-conditioning: when an LLM’s context window contains its own previous errors, it becomes measurably more likely to produce further errors. The model reads its earlier output, interprets the established pattern as correct, and builds on it. One wrong step doesn’t just fail in isolation — it degrades the epistemic context for every step that follows. The degradation is not linear. It accelerates.


“Failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason.”

The models aren’t intellectually incapable of the task. Given the right context (correct prior steps, no accumulated errors) they can reason through complex problems reliably. The failure is in execution under contaminated context. And because each step adds potential contamination, longer chains become progressively harder to complete cleanly regardless of model capability.

METR’s empirical work on long-horizon task completion maps the consequence. Frontier models achieve near 100% success on tasks taking humans less than four minutes. On tasks requiring more than four hours, success rates fall below 10%. The 50% reliability horizon — the task duration where current frontier agents succeed half the time — has been doubling roughly every seven months for six years. Claude 3.7 Sonnet sat at around 55 minutes in early 2025 (METR, 2025). That progress is real. It also means the four-hour wall is very real, and most enterprise workflows – the kind that actually automate meaningful knowledge work – don’t fit inside 55 minutes.


The Topology Tax

Sequential compounding is one problem. Multi-agent systems introduce a second one that compounds the first in ways the sequential model doesn’t capture: error amplification across independent agents.

A 2025 paper from a cross-institutional team, “Towards a Science of Scaling Agent Systems,” ran 180 controlled experiments across four agentic benchmarks and three LLM families to derive empirical scaling laws for multi-agent coordination. The headline result is precise: independent, decentralised agent architectures amplify errors 17.2 times compared to a single-agent baseline. Centralised coordination — a single orchestrating agent delegating to specialised sub-agents — contains this to 4.4 times.

The mechanism is unchecked propagation. In a peer-to-peer or “bag of agents” architecture, Agent B receives Agent A’s output without intermediate verification. If A produced an error, B reasons from that error. If B produces a further error, C reasons from both. The errors are not isolated events. They are epistemic inputs that shape all downstream reasoning. The researchers describe the fully-flat peer discussion pattern as one that “amplifies faults”: a finding consistent across all four benchmarks and all three model families they tested.
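The mechanism can be illustrated with a toy Monte Carlo model. The error rates below are assumptions chosen for illustration, not figures from the paper; the structural point is the difference between unchecked handoffs and verified ones, plus the self-conditioning asymmetry (a contaminated context raises the downstream error rate):

```python
import random

def run_chain(steps: int, base_err: float, contaminated_err: float,
              catch_rate: float, trials: int = 20_000, seed: int = 0) -> float:
    """Toy pipeline model: an agent errs with base_err on clean input and
    with contaminated_err once any upstream error sits in its context.
    A verifier at each handoff catches an error with probability catch_rate.
    Returns the fraction of fully clean runs."""
    rng = random.Random(seed)
    clean = 0
    for _ in range(trials):
        contaminated = False
        for _ in range(steps):
            erred = rng.random() < (contaminated_err if contaminated else base_err)
            if erred and rng.random() < catch_rate:
                erred = False  # caught at the handoff; does not propagate
            contaminated = contaminated or erred
        clean += not contaminated
    return clean / trials

unchecked = run_chain(8, base_err=0.05, contaminated_err=0.20, catch_rate=0.0)
verified = run_chain(8, base_err=0.05, contaminated_err=0.20, catch_rate=0.96)
```

With these assumed numbers, the verified chain completes cleanly far more often than the unchecked one, even though the individual agents are identical in both runs.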

More damaging to the intuition that more agents mean better performance:


“Multi-agent variants degraded sequential reasoning performance by 39 to 70%, challenging the assumption that more agents always improve outcomes.”

The 2025 MAST study pressed further, analysing 1,600 execution traces across seven popular multi-agent frameworks: AutoGen, MetaGPT, ChatDev, and others. Researchers found 14 distinct failure modes, and what makes their taxonomy useful is the order in which the categories appear. System design failures come first, creating the conditions before a single task runs: agents given conflicting instructions, unclear role boundaries, or insufficient context to recognise when they’re out of their depth. These architectural weaknesses set up the second category, inter-agent misalignment, where agents interpret shared task context differently and produce outputs that are locally consistent but globally contradictory, making assumptions about each other’s work that neither has validated. And those misaligned outputs flow into the third category without interruption: task verification failures, where completed steps are not checked before their outputs enter the next agent’s context, so errors that should have been caught at an intermediate review are instead passed downstream as authoritative inputs.

The critical finding is that these categories are not independent. They cascade. A misaligned agent produces an output that the next agent accepts as authoritative, and a verification failure that should have caught the problem is never triggered because the system wasn’t designed to look for it. Failure rates on state-of-the-art open-source multi-agent systems ranged from 41% to 86.7%. The failures are systematic, not random. They cluster into recognisable patterns. And that means they are designable around.

The evaluation layer compounds this further: most teams cannot tell whether their pipeline is actually improving, because their metrics are designed to measure the wrong thing. The ReliabilityBench study found that pass@1 metrics – the standard reporting format – overestimate real reliability by 20 to 40%, because they measure whether an agent succeeds *once* given optimal conditions, not whether it succeeds *reliably* under production variability. The CLEAR framework analysis found agents completing the same task with a 60% success rate on one day and 25% on another. Teams building production pipelines in 2026 are largely optimising for a metric that doesn’t measure what they care about. They can invest months in agent development, achieve measurable improvement on their benchmark, and deploy a system that performs worse in production than the numbers suggested, not because they were careless, but because the measurement instrument was mismatched to the deployment reality.


The Six Sigma Insight: Breaking the Chain

If error compounding is structural, so is the solution. The research converges on three architectural interventions that work not by making individual agents more accurate, though that helps too, but by changing how errors propagate through the system.


Decompose into Atomic, Verifiable Units

Error compounding requires an unbroken chain. You break it by decomposing complex tasks into a directed acyclic graph (DAG) of atomic sub-tasks, each small enough that its output can be independently evaluated before it enters any downstream context.

This is the architectural foundation of the Six Sigma Agent framework, published in January 2026. The mathematical result is striking: if individual atomic tasks have error rate *p*, and you sample *n* independent outputs per task and select by consensus, the system error rate becomes O(p^⌈n/2⌉). Error falls *exponentially* with redundancy rather than compounding multiplicatively with chain length.

The concrete consequence: with a 5% per-action error rate and five parallel samples per step, consensus voting reduces system error to 0.11% — a 45x improvement. At thirteen parallel agents with dynamic scaling triggered when consensus is weak, the architecture achieves 3.4 defects per million opportunities, the Six Sigma manufacturing standard. The 14,700x reliability improvement over single-agent execution they report is not a magic number. It is the direct consequence of two design decisions: atomic decomposition stops error propagation between DAG branches, and consensus voting replaces single-sample probabilism with controlled redundancy.
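The consensus arithmetic is checkable with a straightforward binomial calculation, assuming genuinely independent samples:

```python
from math import comb

def consensus_error(p: float, n: int) -> float:
    """Probability that a majority of n independent samples is wrong,
    given per-sample error rate p (n odd, so no ties)."""
    need = n // 2 + 1  # wrong votes required for a wrong majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(f"5% error, 5 samples  -> {consensus_error(0.05, 5):.4%}")
print(f"5% error, 13 samples -> {consensus_error(0.05, 13):.6%}")
```

At five samples this lands at roughly 0.12%, the same order as the framework's reported 0.11%, and at thirteen samples the error drops by several further orders of magnitude: the exponential fall with redundancy, made concrete.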

That said, thirteen parallel agents per pipeline step is not cost-neutral, and any honest treatment of this architecture needs to say so directly. At current frontier model pricing, full redundancy at every step is not viable for most workflows. The practical application is selective deployment: apply the full consensus protocol to high-stakes, low-frequency decision points — the steps where a wrong output would corrupt every downstream branch — and use lighter verification (a single Inspector rather than consensus voting) for routine steps where errors are recoverable. The reliability-per-dollar curve is steep in both directions, which means the architecture scales from a lightweight two-agent Inspector setup to the full thirteen-agent protocol depending on what each step costs if it fails.

The consensus mechanism also only works if the parallel samples are genuinely independent. Using the same model with the same temperature and the same prompt context produces correlated outputs; they share failure modes. The framework requires diverse models or meaningfully varied prompting, because correlated agents voting on a wrong answer will reach consensus on the wrong answer.

LangGraph provides the production infrastructure for this pattern with its checkpoint-based execution model. State is persisted at each step; failed steps restart from a known-good checkpoint rather than from scratch; the graph structure makes branch independence explicit. The pattern replaces the implicit hope that steps won’t fail with the explicit design assumption that some will, and the architecture contains the consequence.
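The checkpoint pattern itself is small enough to sketch. The code below is a generic illustration of the idea, not LangGraph's actual API; `verify` stands in for whatever schema check or Inspector guards the handoff:

```python
from typing import Callable

State = dict
Step = Callable[[State], State]

def run_with_checkpoints(steps: list[Step], verify: Callable[[State], bool],
                         state: State, max_retries: int = 3) -> State:
    """Run steps in order; persist state after each verified step and
    retry a failed step from the last known-good checkpoint rather than
    from its own error-contaminated output."""
    for i, step in enumerate(steps):
        checkpoint = dict(state)  # last known-good state
        for _ in range(max_retries):
            candidate = step(dict(checkpoint))  # always restart clean
            if verify(candidate):
                state = candidate  # commit: this becomes the next checkpoint
                break
        else:
            raise RuntimeError(f"step {i} failed verification {max_retries} times")
    return state

# Minimal usage: two steps that each add a field, with a trivial check.
result = run_with_checkpoints(
    steps=[lambda s: {**s, "draft": "v1"}, lambda s: {**s, "review": "ok"}],
    verify=lambda s: len(s) > 0,
    state={},
)
```

The design decision that matters is in the inner loop: a retry never sees the failed candidate, only the checkpoint, which is exactly the "explicit design assumption that some steps will fail" described above.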


Install Adversarial Verification Agents

Atomic decomposition handles structural propagation. But not all errors are detectable through schema validation or format checking. Reasoning errors, factual errors, and the subtle drift between what a task required and what an agent interpreted are a different category — they require an agent positioned specifically to find flaws in another agent’s output.

This is the Inspector pattern. The 2025 ICML paper “On the Resilience of LLM-Based Multi-Agent Collaboration” studied it formally: a verification agent placed after each primary agent in the pipeline, with an explicit adversarial mandate – not to review passively, but to identify errors, inconsistencies, and failures of task adherence. The recovery rate: 96.4% of errors introduced by faulty agents were caught before propagating downstream. The same study found closed-loop architectures incorporating adversarial checking neutralise more than 40% of faults that would otherwise compound.

The design requirement that makes or breaks this pattern is independence. An Inspector that uses the same model, the same context, and the same prompt as the primary agent will share its failure modes. Two instances of the same model reasoning from the same context will hallucinate on the same inputs. Independence here means: different model, different instructional framing, explicitly adversarial mandate, and ideally different underlying training data. The value is not intelligence, it is epistemic position. The Inspector reviews from outside the primary agent’s error context rather than inside it.
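As a sketch, the pattern is a loop, not a framework. `primary` and `inspector` below are placeholders for calls to two *different* models; the toy stubs in the usage exist only to make the control flow concrete:

```python
from typing import Callable, Tuple

def verified_step(task: str,
                  primary: Callable[[str], str],
                  inspector: Callable[[str, str], Tuple[bool, str]],
                  max_revisions: int = 2) -> str:
    """Inspector pattern: primary and inspector should be backed by
    different models with different instructional framing so they do
    not share failure modes. inspector returns (ok, critique)."""
    output = primary(task)
    for _ in range(max_revisions):
        ok, critique = inspector(task, output)
        if ok:
            return output
        # Revise against the critique rather than trusting the flawed output.
        output = primary(f"{task}\nAddress this critique: {critique}")
    raise RuntimeError("output failed adversarial review")

# Toy stubs: the primary gets it wrong until a critique arrives.
toy_primary = lambda t: "4" if "critique" in t else "5"
toy_inspector = lambda t, o: (o == "4", "check the arithmetic")
answer = verified_step("what is 2 + 2?", toy_primary, toy_inspector)
```

Note that the adversarial mandate lives entirely in how `inspector` is prompted and which model backs it; the control flow is deliberately boring.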

The devil’s advocate pattern in deliberative reasoning systems is structurally identical. Rather than asking an agent to confirm a hypothesis, you instantiate a second agent whose explicit role is to construct the strongest possible argument against it. Confirmation bias operates at the model level as well as the human level — a model asked to evaluate its own output will find it more credible than a model asked to challenge that output. The adversarial framing inverts the incentive structure.


Match Topology to Task Structure

The third intervention happens before pipeline execution: you choose your coordination architecture based on the structure of the task, not on what sounds architecturally sophisticated.

The scaling laws paper provides a practical framework. Parallelisable tasks, where sub-problems are genuinely independent, benefit substantially from centralised multi-agent coordination, with 80.8% performance improvement in controlled experiments. Sequential reasoning tasks, where each step depends on the previous one, degrade by 39 to 70% in multi-agent settings compared to single-agent execution. Tool-heavy tasks perform worse under multi-agent architectures at fixed compute budgets because coordination overhead exceeds the benefit of specialisation.

The framework predicts optimal coordination strategy from observable task properties with 87% accuracy on held-out configurations. The practical decision rule: before adding agents, classify the task. If sub-problems are independent, centralise the orchestration and parallelise the execution. If the problem is fundamentally sequential, use a single agent with extended chain-of-thought, not a committee. If the task is tool-heavy with a fixed compute budget, a well-prompted single agent will likely outperform a specialised team.
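The decision rule can be written down directly. This is a heuristic distillation of the findings above, not the paper's fitted 87%-accurate predictor:

```python
def choose_topology(task_structure: str) -> str:
    """Map observable task structure to a coordination architecture.
    A heuristic sketch of the scaling-laws decision rule."""
    recommendations = {
        "parallelisable": "centralised orchestrator delegating to parallel sub-agents",
        "sequential": "single agent with extended chain-of-thought",
        "tool_heavy": "single well-prompted agent (coordination overhead dominates)",
    }
    # Default to the simplest architecture until measurement justifies more.
    return recommendations.get(task_structure, "single agent")

print(choose_topology("sequential"))
```

The default branch is the quiet part of the rule: when you cannot classify the task, the evidence favours starting with one agent, not a committee.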

The intuition here cuts against a prevalent industry assumption that more specialisation is always better. It isn’t. It’s better when the task decomposes into genuinely independent parallel problems. It’s worse when the task is sequential and the specialisation mostly creates coordination overhead and additional error propagation surfaces.


The Zartis Research Pipeline as a Live Case Study

The architecture described above is not theoretical. This article was researched and drafted using a multi-agent pipeline that implements all three interventions deliberately. Walking through it concretely illustrates what these principles look like in practice — and it makes the argument more honest, because what follows is not what the pipeline is supposed to do, but what we learned by building it and watching where it broke.

The pipeline runs through several stages of discovery and synthesis before any content is produced. Discovery agents search arXiv, GitHub, and the web, scoring and ranking sources by relevance and empirical quality. A connection-mapper explores adjacent fields — control theory, distributed systems, cognitive science — because the most useful insights tend to arrive from unexpected directions. A synthesis engine then works across all discovered sources to generate original patterns and frameworks. Synthesis here is deliberately distinguished from summarisation. Summarisation reduces; synthesis creates.

At this point in the pipeline, there is a temptation to hand the synthesis to a content generator and be done with it. This is the single-agent pattern, and in this pipeline it is explicitly rejected. Instead, three adversarial agents run in sequence before any content is produced — and the reason for this sequence was learned through failures, not designed from first principles.

What we found is that synthesis agents, left unchallenged, produce confident-sounding conclusions that rest on hidden assumptions. The first adversarial agent — structured as a devil’s advocate — runs an assumption audit: surface the hidden premises in every major claim, rate them by likelihood of being false against impact if false, and require the content to either address or remove all claims rated critical. For this article, that included challenging whether the 17.2x error amplification figure was a laboratory artefact that wouldn’t hold in production, whether the Six Sigma approach’s cost implications were being addressed honestly (they weren’t, in the early draft), and whether using this pipeline as the illustration was genuinely illuminating or self-promotional. The most demanding phase is steelmanning: the devil’s advocate is instructed to represent opposing views at their strongest. The current consensus being challenged — that multi-agent systems with proper specialisation outperform single agents on complex tasks — is a real position with real evidence behind it. The article only earns the right to challenge it by engaging with the steelman version, not a weakened caricature.

The claim-verifier runs next. What the devil’s advocate surfaces are claims that are weak or unsubstantiated, but weakness is not the same as falsity, and the claim-verifier’s job is to distinguish the two. It runs a five-phase process: extract every claim from the synthesis, search local research sources, search external sources where local verification fails, categorise each claim into one of eight verification statuses, and generate actionable corrections with exact citations. The distinction between “verified,” “verifiable but uncited,” “verifiable with modification,” and “unverifiable” matters enormously – the last category means the claim must be removed, because if it doesn’t exist in any source, it is an invention. The pipeline’s internal target is 95% citation coverage with every statistic traceable to a primary source. This is not pedantry. Every uncited number in a blog about AI reliability is the article making itself untrustworthy while arguing that systems need to be trustworthy.
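The status-to-action mapping is the load-bearing part of that process. A sketch covering the four statuses named above (the pipeline's remaining four statuses are not enumerated here):

```python
from enum import Enum

class ClaimStatus(Enum):
    VERIFIED = "verified"
    VERIFIABLE_UNCITED = "verifiable but uncited"
    VERIFIABLE_MODIFIED = "verifiable with modification"
    UNVERIFIABLE = "unverifiable"

def required_action(status: ClaimStatus) -> str:
    """A claim that exists in no source is an invention: it must be removed."""
    if status is ClaimStatus.UNVERIFIABLE:
        return "remove the claim"
    if status is ClaimStatus.VERIFIED:
        return "keep as written"
    if status is ClaimStatus.VERIFIABLE_UNCITED:
        return "add the citation"
    return "modify the claim to match the source, then cite it"
```

The useful property of an explicit enum is that no claim can exit the verifier without landing in exactly one bucket with exactly one prescribed action.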

The prose-critic evaluates the draft on four dimensions: voice consistency, AI pattern detection, unique angle, and publication readiness. The framing question it applies is deliberately personal: *“Would I publish this under my name without edits?”* The AI pattern detection dimension runs a specific red-flag checklist: generic openings, performative transitions, conclusion patterns, and what the system calls “list-itis” — the tendency for AI-generated content to present arguments as bullet lists rather than developing them as prose. These patterns are signals that the content was generated without genuine editorial judgment. The unique angle dimension asks a harder question: could this article have been written by anyone who Googled the topic? A summary of existing research on error compounding could be. An article that connects the 17.2x amplification finding to the Inspector pattern’s 96.4% recovery rate to the Six Sigma mathematical proof — and then uses a running pipeline as the concrete illustration of those principles — has a reasonable claim to saying something that isn’t already written.

What this pipeline represents architecturally is a direct implementation of the three principles: the synthesis engine is the primary agent, each adversarial agent is an Inspector positioned adversarially after it, and the pipeline is decomposed into checkpointed stages where the output of each stage is verified before becoming the context for the next. The devil’s advocate resets the epistemic context by forcing engagement with the strongest counter-arguments. The claim-verifier resets it again by requiring every claim to trace to external evidence. The prose-critic resets it a third time by evaluating the draft from the reader’s perspective rather than the writer’s. Errors do not propagate unchecked from synthesis to published article because the pipeline is designed specifically to interrupt them — not by making the individual agents more capable, but by positioning verification at every point where contamination could occur.


What the Field Is Learning

The METR data projects that the 50% reliability horizon for frontier agents — around 55 minutes in early 2025 — will reach four hours by approximately 2027 and approach day-long autonomous task execution by the end of this decade. That improvement comes from scaling model capability and compute: larger models self-condition less severely, extended chain-of-thought reasoning reduces error accumulation, and the mathematical relationship between per-step accuracy and task length means that incremental accuracy improvements produce disproportionate gains in what can be completed reliably.

But scaling alone won’t close the production reliability gap, and the research is increasingly clear about why. The 2026 paper “A Control-Theoretic Foundation for Agentic Systems” makes a precise analogy: agent retry loops that don’t converge exhibit the mathematical signature of *integral windup*, a classical control systems failure mode first documented in PID controller design. In a PID controller, integral windup occurs when the integrator accumulates error signal faster than the system can correct it, causing the controller to over-compensate and oscillate rather than settle. In an agent retry loop, the analogous failure is an agent that accumulates context from each failed attempt — error-contaminated reasoning that it treats as additional evidence — and produces increasingly divergent outputs rather than converging on a solution. The fix in control engineering is not to increase the controller’s power. It is to design anti-windup mechanisms that reset the integrator when the correction exceeds the plant’s range. The agent equivalent is checkpoint-based context reset: rather than retrying with accumulated error context, restart from the last verified-good state. The agent reliability problem, framed this way, is not a capability deficit — it is an architecture deficit.
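The anti-windup fix translates into a few lines. The sketch below shows only the reset behaviour; `attempt` stands in for a model call, and the checkpoint is whatever verified-good context the pipeline last committed:

```python
from typing import Callable, Optional

def retry_with_reset(attempt: Callable[[dict], object],
                     verify: Callable[[object], bool],
                     checkpoint: dict,
                     max_attempts: int = 3):
    """Anti-windup retry: every attempt starts from the last verified-good
    context rather than from accumulated failed attempts, so
    error-contaminated reasoning never re-enters the loop."""
    for _ in range(max_attempts):
        result = attempt(dict(checkpoint))  # fresh copy each try: no windup
        if verify(result):
            return result
    return None  # escalate to a human or switch strategy; do not keep looping

# Minimal usage: an attempt that succeeds on its second try.
calls = {"n": 0}
def flaky_attempt(ctx: dict) -> int:
    calls["n"] += 1
    return calls["n"]

value = retry_with_reset(flaky_attempt, lambda r: r >= 2, checkpoint={})
```

The windup variant would append each failed `result` back into the context before retrying; the fix is precisely that the failed result is discarded, just as an anti-windup controller discards the accumulated integral.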

The evaluation problem sits alongside this: SWE-bench Verified saturated in late 2025, with the top five models within 1.3 percentage points of each other. Its successor was immediately necessary because the original had exhausted its signal. These benchmarks measure single-task, best-case performance and say almost nothing about multi-session reliability, cost efficiency under load, or consistency across runs. The CLEAR framework proposes measuring reliability distribution rather than average performance — what is the distribution of outcomes when this agent runs this task 100 times, not just whether it succeeded once? It has not yet become a field standard, but it represents the evaluation equivalent of the adversarial Inspector: measuring what actually matters for production rather than what is easiest to score.
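Measuring that distribution needs no framework at all. The sketch below assumes a callable task runner; the interface is illustrative, not CLEAR's actual API:

```python
import random
import statistics

def reliability_profile(run_task, trials: int = 100) -> dict:
    """Run the same task repeatedly and report the outcome distribution,
    rather than a single pass@1 result."""
    outcomes = [1 if run_task() else 0 for _ in range(trials)]
    return {
        "success_rate": sum(outcomes) / trials,
        "stdev": statistics.pstdev(outcomes),  # run-to-run variability
    }

# A seeded stand-in for an agent that succeeds roughly 60% of the time.
rng = random.Random(1)
profile = reliability_profile(lambda: rng.random() < 0.6)
```

A pass@1 report would collapse this into a single bit; the profile exposes exactly the day-to-day variance the CLEAR analysis observed.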

The manufacturing analogy from the Six Sigma framework is equally instructive. Manufacturing engineering did not achieve six-sigma quality by building better individual components. It achieved it by designing systems where defects were caught before propagating, where processes were decomposed into auditable units, and where verification was structural rather than optional. The reliability engineering discipline for multi-agent systems is younger than two years old. It is converging, from multiple directions, on the same set of principles.

The OWASP Top 10 for Agentic Applications, released in December 2025, represents another convergence signal: the professional security community identified prompt injection in 73% of assessed production deployments as the leading vulnerability. The adversarial verification pattern is directly relevant here too — an agent that checks for injection attempts in tool outputs before passing those outputs downstream is doing the same structural work as an Inspector checking for reasoning errors. The Inspector pattern is not a reliability pattern specifically; it is a pipeline integrity pattern. The property you are enforcing — reasoning correctness, injection resistance, factual citation — is a parameter. The structure is the same.


The Design Principle

The argument above reduces to a single, deployable principle: reliability in multi-agent systems comes from the structure of verification, not the capability of individual agents.

A 95% accurate agent in an unverified ten-step chain succeeds 60% of the time. The same agent in a pipeline with adversarial checkpoints after every few steps — where errors are caught, flagged, and corrected before they contaminate downstream context — sustains reliability across chains an order of magnitude longer. The Inspector pattern’s 96.4% fault recovery rate is achievable not because the Inspector is smarter than the primary agent, but because it reviews from outside the primary agent’s error context. The Six Sigma framework’s 14,700x improvement is achievable not because the individual agents are 14,700 times better, but because the system architecture exponentiates the effect of incremental individual accuracy through controlled redundancy and atomic decomposition.


The Conclusion

The uncomfortable implication for teams that have been optimising models and prompts in isolation: agent quality is a necessary condition for pipeline reliability, not a sufficient one. You can have the best individual model performance in your industry and still build a pipeline that fails on eight-step workflows. The failure won’t surface in unit tests of individual agents. It will surface in production, at the point where no single agent can be blamed, because the system as a whole was never designed to contain failure — and at each handoff, you were betting that the error didn’t happen yet. The longer the chain, the worse that bet becomes — unless you designed the chain to catch you when it does.

At Zartis, we help organisations design multi-agent systems and pipelines that transform the way they work, with strategies that are disciplined, measurable, and aligned to product goals. Reach out today to discuss how we can help your team!


Sources:

  • “Towards a Science of Scaling Agent Systems” (Kim et al., 2025, arXiv:2512.08296)
  • “Why Do Multi-Agent LLM Systems Fail?” (Cemri et al., 2025, arXiv:2503.13657)
  • “The Six Sigma Agent” (Patel et al., 2026, arXiv:2601.22290)
  • “The Illusion of Diminishing Returns” (Sinha et al., 2025, arXiv:2509.09677)
  • “Measuring AI Ability to Complete Long Tasks” (METR, 2025)
  • “On the Resilience of LLM-Based Multi-Agent Collaboration” (arXiv:2408.00989)
  • “Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems” (arXiv:2511.14136)
  • “A Control-Theoretic Foundation for Agentic Systems” (arXiv:2603.10779)
  • OWASP Top 10 for Agentic Applications (December 2025)
