Frontier context windows now run to a million tokens. Multi-step agents are still failing at step twelve. The capacity problem and the quality problem are not the same problem — and most engineering organisations solved one while the other got harder to see.
The Upgrade That Closed the Wrong Ticket
Eighteen months ago, the standard architectural response to agent unreliability in long workflows was a larger context window. The logic was straightforward: agents lost track of instructions, forgot earlier conclusions, and produced incoherent late-stage outputs — clearly a capacity problem, clearly solved by more capacity. The frontier labs obliged with 128k, then 200k, then million-token windows. Engineering teams updated their configurations, marked the context ticket closed, and moved on.
The financial document analysis workflow at a mid-market lending firm ran twelve steps — ingest filings, extract metrics, reconcile against prior periods, flag anomalies, produce a risk summary. Before the context upgrade, it regularly ran out of window mid-execution. After the upgrade, it ran to completion reliably. Success rate, measured by task completion: 100%.
Six months later, a compliance audit compared a sample of outputs against analyst-reviewed benchmarks. Late-stage reasoning — the reconciliation steps, the anomaly flags, the risk characterisations — diverged from expected results at roughly twice the rate of short-run outputs. The agent wasn’t failing to finish. It was finishing with conclusions that had been quietly corrupted somewhere in the middle.
The window was wide open. The view was blurry from step four onward, and nobody was looking.
A Degradation Mode That Is Not About Capacity
Context rot is a quality failure, not a capacity failure. The distinction matters because the response to each is completely different.
The Morph LLM research team tested eighteen frontier models — GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and fifteen others — specifically to separate these two failure modes. They loaded each model with contexts of increasing length, well within each model’s stated window, and measured output quality throughout the range. Every single model degraded as context grew. Not at the limit. Throughout. The degradation began early, was measurable at moderate context lengths, and worsened progressively. No model was immune. The finding was not that models fail when context overflows; it was that context length itself is a quality variable, independent of capacity.
The underlying mechanism was documented earlier by Liu et al. in 2023, studying what they called the “lost in the middle” effect. Models perform significantly better on information presented at the beginning or end of a context than on identical information buried in the interior. The finding replicated across model families. In a static document, “the middle” is a fixed location you can reason around. In a growing context, nothing stays at the beginning. Every new token pushes prior content one position deeper. The original goal statement, the initial instructions, the foundational constraints — they migrate into the interior as the context grows. They don’t disappear from the window. They slide into the worst positional slot for attention.
Expanding the context window made the middle larger. That is the opposite of a fix.
The Contamination That Compounds
Long-document retrieval has a context problem. Multi-step agents have a different one, and the difference is what engineering teams have been systematically missing.
A document context is passive. You can audit it before inference, position critical information deliberately, trim what degrades performance. The context is fixed before the model touches it. You have design leverage.
In an agentic workflow, the context is the execution history. At step one, it contains the goal and the first tool call. At step five, it contains the outputs of steps one through four — intermediate conclusions, retrieved data, model-generated reasoning, errors. At step twelve, it contains everything the agent has tried, concluded, and gotten wrong. The context isn’t a document the agent is reading. It’s a transcript the agent is conditioning on. And unlike a document, it’s contaminated in proportion to how much went wrong earlier.
The research group at MPI-INF studying long-horizon agent execution (arXiv:2509.09677) named the mechanism precisely: self-conditioning. When a model’s context contains its own prior outputs, it treats those outputs as established ground truth rather than as evidence to be weighed. Earlier conclusions don’t get re-evaluated at each subsequent step. They get reinforced. An incorrect intermediate finding at step four isn’t one error. It is in context at step five, step six, step twelve — and at each step, the model reads it the way it reads any other fact in its input. The error ages into authority.
The paper’s framing cuts through a common misunderstanding: “Failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason.” The model is capable of correct reasoning at step twelve. The failure is not intellectual incapacity. The failure is that contaminated context is shaping every inference the model makes, and the model has no mechanism to identify which parts of its own prior output are reliable and which aren’t. It reads its own history the way a good analyst reads a primary source — with baseline trust.
JetBrains Research documented the operational form of this in December 2025: agents in extended workflows lose accurate tracking of their own goal state not because the goal statement left the context window, but because it became diluted by accumulated intermediate content. The goal was present. It had just been buried in the middle by everything the agent had done since. This is context rot at the goal level — not a memory failure but a positioning failure caused by dynamic context growth.
The Instrumentation Gap That Larger Windows Created
Here is the part that is counterintuitive and important: larger context windows didn’t just fail to fix context rot. They made it systematically harder to detect.
When an 8k-token window filled, the agent failed visibly. The overflow was a legible error. Engineering teams investigated context management, built chunking logic, thought carefully about what to retain. The capacity failure drove a response because it was impossible to ignore.
A 128k-token window doesn’t produce that forcing function. The agent runs to completion. It returns an output in the correct format. It passes whatever structural validation you have. On any individual run, the late-stage reasoning degradation is subtle enough to survive review. The failure isn’t loud; it’s statistical. And statistical failures only surface if you are running the right measurements.
Most teams aren’t. The LangChain 2025 survey of 1,340 agent practitioners found quality and reliability ranked as the top barrier to production deployment — yet standard evaluation practice still runs agents on short, clean, isolated tasks. This is not sloppiness. It is a methodology designed for a different problem. Short-task evaluation was adequate when the dominant failure was capacity: an overflow fails loudly and structurally, wherever it happens. It is systematically wrong for detecting quality degradation over long workflows, because the degradation doesn’t appear on short tasks.
METR’s empirical data on long-horizon task completion maps the production consequence precisely. Frontier agents achieve near 100% success on tasks taking humans under four minutes. On tasks requiring more than four hours, success rates fall below 10%. The 50% reliability horizon — where agents succeed half the time — was around 55 minutes in early 2025, doubling approximately every seven months. The financial document analysis workflow described above typically runs forty to seventy minutes. It sits at the reliability boundary not because the model isn’t capable but because workflow duration is long enough for context rot to erode late-stage reasoning quality in ways that short evaluations will never surface.
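Those numbers reduce to simple doubling arithmetic. A back-of-envelope sketch, assuming nothing beyond the figures quoted above (the function name and the smooth exponential are illustrative, not METR’s model):

```python
# 50% reliability horizon: ~55 minutes in early 2025, doubling roughly
# every seven months (METR figures cited above). Illustrative projection only.
def reliability_horizon_minutes(months_since_early_2025: float) -> float:
    return 55 * 2 ** (months_since_early_2025 / 7)

print(round(reliability_horizon_minutes(0)))    # 55, early 2025
print(round(reliability_horizon_minutes(7)))    # 110, one doubling later
print(round(reliability_horizon_minutes(14)))   # 220, two doublings later
```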
The instrumentation was never redesigned to look for what replaced the capacity problem. Larger windows appeared to solve the problem. The monitoring gap formed in the space where the original problem used to be. The failures became invisible precisely when they got worse.
The Signature of Silent Failure
Context rot in production doesn’t look like failure. That is what makes it operationally dangerous.
The workflow completes. The output is well-formed. Content filters find nothing to flag. The short-run evaluation samples — the brief, clean tasks that make up most of your evaluation set — look correct. Your aggregate accuracy metric is fine.
The degradation is in late-stage reasoning on extended runs. An early step establishes an intermediate finding — a cost basis, an anomaly classification, a regulatory flag. That finding was subtly wrong. In a short workflow, the error would have been the final output and caught in review. In a twelve-step workflow, it propagates. Subsequent steps treat the early finding as a verified input. They build on it. The final output doesn’t contain one error. It contains the compounded downstream consequences of one early error that aged into authority.
The study at arXiv:2505.16067 measured the effect directly: memory management strategy — how prior agent outputs are represented in later context — significantly affects agent behaviour across long workflows. What the agent keeps in context about its own prior reasoning is not an implementation default. It is the primary variable determining whether late-stage outputs are reliable. Verbatim retention of every prior step is the worst option. Structured summarisation with explicit uncertainty representation is measurably better. Most production agents use the former because nobody explicitly chose the latter.
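To make the contrast concrete, here is a minimal sketch of the structured alternative. None of this is the paper’s implementation; the schema, field names, confidence labels, and example values are assumptions. The point is the shape: each prior step is carried as a compact record with its own uncertainty label, rather than replayed verbatim.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    """Structured summary of one agent step, kept in later context
    instead of the step's full verbatim transcript (illustrative schema)."""
    step: int
    conclusion: str   # one-sentence finding, not the full reasoning trace
    evidence: str     # pointer to the source, not the source itself
    confidence: str   # explicit uncertainty: "verified" | "inferred" | "assumed"

def render_history(records: list[StepRecord]) -> str:
    # Later steps see a compact, uncertainty-labelled history rather than
    # every prior token. Low-confidence findings stay visibly provisional,
    # so the model is less likely to read them as established fact.
    return "\n".join(
        f"Step {r.step} [{r.confidence}]: {r.conclusion} (source: {r.evidence})"
        for r in records
    )

history = [
    StepRecord(3, "Cost basis for Q2 is $4.1M", "10-Q filing, p. 12", "verified"),
    StepRecord(4, "Q2 anomaly likely a reporting-lag artefact", "model inference", "inferred"),
]
print(render_history(history))
```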
The reason this mode stays invisible in aggregate metrics is straightforward. Post-hoc audits are sampled. Sample design is calibrated for random errors. If long-run outputs degrade systematically but long runs represent 20% of your evaluation set, the systematic late-stage bias is diluted in the aggregate. The number looks fine. The bias is real.
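The dilution is easy to see with made-up but representative numbers (only the 20% share comes from the paragraph above; the accuracy figures are hypothetical):

```python
# Hypothetical scores illustrating how a systematic late-stage problem
# hides inside an aggregate metric when long runs are a minority of the set.
short_run_accuracy = 0.95   # 80% of the evaluation set: brief, clean tasks
long_run_accuracy  = 0.75   # 20% of the evaluation set, systematically degraded
aggregate = 0.8 * short_run_accuracy + 0.2 * long_run_accuracy
print(round(aggregate, 2))  # 0.91, a number nobody escalates
```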
The Decision That Hasn’t Been Made
Here is the strategic reframe, stated plainly: the context window decision and the context quality decision are not the same decision. Most engineering organisations have made the first one. Almost none have made the second.
The context window decision is about capacity. Which model, what maximum length, what retrieval architecture when workflows exceed the window. This is largely made. Frontier models have sufficient capacity for most enterprise workflows. Continuing to optimise context capacity is diminishing-returns engineering. The teams responsible for it are right to have moved on.
The context quality decision is about what accumulates in context and how that accumulation is managed as a reliability variable. Which prior outputs get retained verbatim versus summarised. How uncertainty in intermediate conclusions is explicitly represented so that later steps can weight it appropriately. Whether workflows that cross the 55-minute reliability boundary include checkpoint-based context resets that strip accumulated contamination rather than carrying it forward. How goal state is maintained in early-context position despite the growing volume of intermediate content pushing it toward the middle. None of this is a question about the model. It is a question about the architecture surrounding the model.
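What that looks like as code is unglamorous. The sketch below is an assumption-laden illustration, not a reference to any framework: the function, its thresholds, and the dict schema are invented, but each branch maps to one of the decisions above — goal re-pinned at the top, recent steps summarised with uncertainty tags, checkpoint resets that collapse accumulated history instead of carrying it forward.

```python
def build_step_context(goal: str,
                       history: list[dict],
                       checkpoint_every: int = 10,
                       max_retained_steps: int = 6) -> str:
    """Assemble the context for the next agent step (illustrative sketch).

    Each branch corresponds to one of the context-quality decisions above:
    re-pin the goal early, summarise rather than retain verbatim, and reset
    at checkpoints instead of carrying contamination forward.
    """
    current_step = len(history) + 1

    if current_step % checkpoint_every == 0:
        # Checkpoint reset: collapse the whole history into one digest line
        # instead of carrying every prior step's content forward wholesale.
        digest = "; ".join(f"step {h['step']}: {h['summary']}" for h in history)
        carried = [f"CHECKPOINT DIGEST (re-verify before relying on it): {digest}"]
    else:
        # Recent steps only, as short uncertainty-tagged summaries rather
        # than verbatim transcripts.
        carried = [
            f"Step {h['step']} [{h['confidence']}]: {h['summary']}"
            for h in history[-max_retained_steps:]
        ]

    return "\n\n".join([
        f"GOAL (restated): {goal}",   # early-context position on every step
        "PRIOR FINDINGS:",
        *carried,
        f"CURRENT STEP: {current_step}",
    ])
```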
The practical test for where your organisation stands: run your longest production workflow — not a short validation task, the full forty-step, ninety-minute, multi-document version — and compare the quality of reasoning at step ten to the quality of reasoning at step two on the same workflow. Not a different run. The same run. If you cannot make that comparison systematically, you do not have visibility into whether context rot is affecting your production outputs. The evaluation methodology that was appropriate for detecting capacity failures will not surface this.
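As a sketch of what that systematic comparison could look like, under assumptions: the transcript schema and score_step are placeholders for whatever grading your team already uses (analyst benchmark, rubric, LLM judge). What matters is that quality is keyed to step position within a run, not collapsed into a single completion metric.

```python
from collections import defaultdict

def evaluate_run(run_transcript: list[dict], score_step) -> dict[int, float]:
    """Score each step of a single workflow run against its reference.

    run_transcript: one run's steps, each carrying the step's output and the
    reference it should be judged against (schema is illustrative).
    score_step: your existing grader. Returns quality keyed by step number.
    """
    return {
        step["index"]: score_step(step["output"], step["reference"])
        for step in run_transcript
    }

def quality_by_step(all_runs: list[list[dict]], score_step) -> dict[int, float]:
    # Aggregate across runs *by step position*, so late-step degradation
    # is visible instead of being diluted into a single accuracy number.
    totals, counts = defaultdict(float), defaultdict(int)
    for run in all_runs:
        for idx, score in evaluate_run(run, score_step).items():
            totals[idx] += score
            counts[idx] += 1
    return {idx: totals[idx] / counts[idx] for idx in sorted(totals)}
```

Plotting quality_by_step for your longest workflow gives you the step-two-versus-step-ten comparison directly; a downward slope is context rot made visible.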
This is the gap that expanded context windows created by appearing to close the context problem. The teams that upgraded their windows, tested their agents on short tasks, and found everything working were not being careless. They were using the right methodology for the wrong problem. The capacity problem gave clear signals — hard failures, explicit errors, detection that was easy to design for. The quality problem is statistical, progressive, and only visible if you measure across the right dimension.
The ownership question is equally sharp. The team that made the context window decision typically owns model selection and infrastructure. The team that needs to make the context quality decision typically owns agent architecture and workflow reliability. In most organisations, these decisions have been treated as one decision, made by the same people, and the second half was never explicitly assigned. Context quality management has no owner because the problem was never named as distinct from the problem that was already solved.
What the Architecture Actually Needs to Solve
The Morph LLM finding that every one of eighteen frontier models degrades progressively with context length is not a statement about current model limitations. It is a statement about how attention-based architectures interact with accumulated context. The next generation of models will have longer windows and better long-context performance. They will still degrade as context grows. The property is architectural, not parametric.
That means the relevant engineering question is not which model handles long context best. The relevant question is how to design workflows so that the context at any given step contains only what the model can reason reliably over — and so that errors from earlier steps are summarised, uncertainty-weighted, and positioned correctly, rather than retained verbatim and left to age into false authority.
This is not a model-selection problem. It is a context management problem — a different category of work than the one that occupied engineering teams when the constraint was capacity. It requires treating the agent’s prior output as a managed input to subsequent steps rather than as a free-form accumulation. It requires evaluation infrastructure that measures quality as a function of workflow step number, not just accuracy on task completion. And it requires someone in the organisation to own the question: at step twelve, is this agent’s reasoning reliable?
The window is wide enough. That question has been answered. The one that remains open is whether the view from step twelve is still clear — and whether your organisation is currently able to see the answer.
Sources:
- Morph LLM, “Context Rot: Why LLMs Degrade as Context Grows”
- Liu et al., “Lost in the Middle: How Language Models Use Long Contexts” (2023), arXiv:2307.03172
- MPI-INF, “Measuring Long Horizon Execution in LLMs”, arXiv:2509.09677
- JetBrains Research Blog, “Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents” (December 2025)
- Xiong et al., “How Memory Management Impacts LLM Agents” (2025), arXiv:2505.16067
- LangChain, “State of AI Agents” survey (2025, n=1,340)