AI Agent Cost Optimisation: Why Token Cost Is the Wrong Number to Optimise

The most common failure mode in agent cost optimisation is not inefficiency; it is optimising the wrong number with impressive precision. A team spends three weeks implementing context compression, dynamic model routing, and prompt caching. The API bill drops 40%. Two months later, the quarterly cost review reveals that total operating cost for the workflow increased — not because the optimisation failed, but because reduced context precision triggered a 12-point drop in task success rate, and each failed task required an analyst to spend twenty minutes in remediation. The token savings were real. They were also irrelevant. This is the pattern that production teams honest about their economics keep rediscovering: inference cost and operating cost are not the same number, and the optimisation community has been working on the cheaper one.

At $200 per day in inference tokens with 70% task success, a typical document-processing agent generates roughly $20,000 per day in analyst remediation cost — at $80/hour loaded cost, five minutes per failure, across 10,000 daily tasks. The API dashboard shows the $200. The $20,000 is distributed across Jira tickets, team calendars, and manager frustration. It does not appear in the LLMOps dashboard. The dashboard is not lying; it is just showing you the smaller number.
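That $20,000 figure is straightforward to reproduce. A few lines, using the example's assumptions rather than measured data:

```python
daily_tasks = 10_000
success_rate = 0.70
remediation_minutes = 5        # analyst time per failed task
analyst_hourly_rate = 80       # loaded cost, $/hour

failures = daily_tasks * (1 - success_rate)               # 3,000 failed tasks/day
remediation_hours = failures * remediation_minutes / 60   # 250 analyst-hours/day
remediation_cost = remediation_hours * analyst_hourly_rate

print(f"Daily remediation cost: ${remediation_cost:,.0f}")
# => Daily remediation cost: $20,000
```

A 100:1 ratio against the $200 token bill, before counting the manager frustration.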

The entire discourse about agent token optimisation — context compression, model routing, caching strategies — is solving the right category of problem with the wrong objective function. The optimisation literature almost universally targets inference cost per call. The number that actually governs whether an agent system is economically viable is something else entirely: reliability-adjusted cost per task. The gap between these two metrics is where organisations are quietly losing money at scale.

This article is scoped to custom-built and API-accessed agent deployments — the class of system where token economics are visible and architectural decisions are yours to make. If your agents are delivered through solutions such as Salesforce Agentforce, Microsoft 365 Copilot, or ServiceNow’s embedded AI, you are buying capacity on a subscription model and the framework below applies differently. For everyone else building agents on top of LLM APIs, what follows is an argument about getting the cost model right before scale makes the error expensive.

 

The Metric Nobody Is Tracking

The economically correct unit of analysis for a production agent system is not cost per token, nor cost per API call. It is total operating cost per successfully completed task. The formula is unglamorous but precise:

 

```
total_cost_per_task = (token_cost + infrastructure_cost
                       + failure_rate × human_remediation_cost_per_failure)
                      / reliability_rate
```

 

Most teams measure the first term of the numerator and ignore everything else. This is a category error with predictable consequences.

Consider two concrete architectures for the same document processing workflow. Architecture A is token-frugal: compressed prompt, no structured output validation, everything routed through a small model. Cost per call: $0.01. Task success rate: 70%. Each failure requires 10 minutes of analyst review at $100/hour loaded cost. At 10,000 tasks per day, the arithmetic is: $100 in tokens, plus 3,000 failures at $16.67 each — total daily cost of roughly $50,100, of which $50,000 is human oversight. Architecture B is token-expensive: larger model, validation steps, structured output contracts. Cost per call: $0.05. Task success rate: 95%. At 10,000 tasks per day: $500 in tokens, plus 500 failures at $16.67 — total daily cost of roughly $8,835. Architecture B costs five times as much in tokens and roughly one-sixth as much to operate.

This inversion is not unusual. It is what happens when inference costs run under $1/M tokens for capable models and human knowledge worker time costs $80–150/hour. The ratio of human oversight cost to token cost in most production deployments runs somewhere between 20:1 and 200:1. LangChain’s 2025 survey of 1,340 agent engineering practitioners found that quality and reliability were by far the top reported barriers to production deployment — not token cost. The cost problem is real; it is just not where the dashboards point.

The reason engineering teams optimise tokens rather than total operating cost is not incompetence — it is information architecture. Token spend appears on a single API bill, updated daily, with line-item breakdown. Human remediation cost is distributed: analyst time across Jira tickets, team sync conversations, manager escalations, re-run workflows. Aggregating it requires intentional measurement. Most teams have not built that measurement, which means they are optimising the thing they can see. The first step in agent cost engineering is instrumenting human escalation rate and remediation time with the same rigor applied to token consumption. Everything else is downstream.

 

```python

def reliability_adjusted_cost_per_task(
    tokens_per_task: float,
    cost_per_token: float,
    reliability_rate: float,
    human_remediation_minutes: float,
    human_hourly_rate: float = 100.0
) -> dict:
    """
    Compute total operating cost per successfully completed task.
    This is the number your architecture should optimize — not cost per call.
    """
    if not (0 < reliability_rate <= 1.0):
        raise ValueError("reliability_rate must be between 0 and 1")
 
    failure_rate = 1.0 - reliability_rate
    token_cost = tokens_per_task * cost_per_token
    remediation_cost = failure_rate * (human_remediation_minutes / 60) * human_hourly_rate

    # The divisor here is the key insight: low reliability multiplies all costs.
    # A 70% success rate means every dollar of token spend costs you $1.43 in
    # inference-per-success before counting remediation at all.

    inference_cost_per_success = token_cost / reliability_rate
    total_cost_per_success = inference_cost_per_success + (remediation_cost / reliability_rate)
    return {
        "token_cost_per_call": token_cost,
        "inference_cost_per_success": round(inference_cost_per_success, 4),
        "remediation_cost_per_task": round(remediation_cost, 4),
        "total_cost_per_successful_task": round(total_cost_per_success, 4),
        "human_to_token_ratio": round(remediation_cost / token_cost, 1)
    }

# Architecture A: cheap but brittle
arch_a = reliability_adjusted_cost_per_task(
    tokens_per_task=5000,
    cost_per_token=0.000002,  # $2/M tokens
    reliability_rate=0.70,
    human_remediation_minutes=10
)

# Architecture B: expensive but reliable
arch_b = reliability_adjusted_cost_per_task(
    tokens_per_task=15000,
    cost_per_token=0.000002,
    reliability_rate=0.95,
    human_remediation_minutes=10
)
print(f"Arch A total cost per success: ${arch_a['total_cost_per_successful_task']}")
# => Arch A total cost per success: $7.1571
 
print(f"Arch B total cost per success: ${arch_b['total_cost_per_successful_task']}")
# => Arch B total cost per success: $0.9088
 
print(f"Arch A human:token ratio: {arch_a['human_to_token_ratio']}:1")
# => Arch A human:token ratio: 500.0:1

```

 

The ratio at the bottom of Arch A’s output is the number most engineering teams have never computed. When it sits above 10:1, architectural decisions that trade reliability for token efficiency are almost certainly making the system more expensive, not less. At 500:1, you could triple the token cost and still come out ahead by routing to a model with better task accuracy.

 

The Architecture Tax — But Not the One You Think

 

Once you have the right metric, the architectural patterns that actually drive overhead come into focus. There are five patterns that compound across nearly every production agent system, and they generate overhead through distinct mechanisms that require distinct fixes.

The first is context avalanche. In a naive ReAct loop, each tool call appends its full output to the context window. By step eight or ten, the model is re-reading thousands of tokens of prior reasoning, raw tool outputs, and accumulated error messages on every subsequent inference call. Token consumption grows quadratically with chain length — not linearly, as most cost estimates assume. A ten-step workflow where each step adds 500 tokens of tool output accumulates approximately 22,500 tokens in context overhead alone, before counting reasoning tokens. This is a working-memory tax, not an intelligence cost. The model is re-processing history it already interpreted.
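The quadratic claim is worth verifying directly. A minimal sketch of the accumulation, using the 500-token figure from the example:

```python
def naive_context_overhead(steps: int, tokens_per_step: int) -> int:
    """Tokens re-read across a naive ReAct loop: call k re-reads the tool
    outputs of all k-1 prior steps, so the total grows quadratically with
    chain length, not linearly."""
    return sum(k * tokens_per_step for k in range(steps))

print(naive_context_overhead(10, 500))  # 22500 — matches the ten-step example
print(naive_context_overhead(20, 500))  # 95000 — doubling the chain roughly quadruples the overhead
```

Doubling chain length roughly quadruples the re-read overhead, which is why long chains punish naive accumulation so severely.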

The fix is context compression as a first-class architectural concern: summarise completed subtask outputs rather than appending raw tool results verbatim, and re-inject a compressed goal-state summary at each major checkpoint rather than relying on attention patterns to preserve goal awareness across a long context. Treat the context window as a budget that must balance, not a ledger that only accumulates.

 

```python
from dataclasses import dataclass, field
from typing import Callable
 
@dataclass
class CompressedContextManager:
    """
    Replace naive context accumulation with a summarize-on-checkpoint pattern.
    Each completed subtask produces a compressed summary, not a raw transcript.
 
    The design principle: the model doesn't need to re-read what it already
    concluded. It needs the conclusion, not the path to it.
    """
    goal: str
    summarize_fn: Callable[[list[dict]], str]  # your summarization call
    compression_threshold: int = 2000  # tokens before triggering summarization
 
    _steps: list[dict] = field(default_factory=list)
    _compressed_history: list[str] = field(default_factory=list)
 
    def add_step(self, role: str, content: str, token_estimate: int):
        self._steps.append({"role": role, "content": content, "tokens": token_estimate})
 
        accumulated_tokens = sum(s["tokens"] for s in self._steps)
        if accumulated_tokens > self.compression_threshold:
            self._compress_and_checkpoint()
 
    def _compress_and_checkpoint(self):
        summary = self.summarize_fn(self._steps)
        self._compressed_history.append(summary)
        self._steps = []  # clear the window; summary preserves the substance
 
    def build_context(self) -> list[dict]:
        """Return a context that includes compressed history + current goal injection."""
        messages = [{"role": "system", "content": self.goal}]
 
        if self._compressed_history:
            history_text = "\n\n".join(
                f"[Completed phase {i+1}]: {s}"
                for i, s in enumerate(self._compressed_history)
            )
            messages.append({
                "role": "user",
                "content": f"Progress so far:\n{history_text}\n\nContinue with the next step."
            })
 
        messages.extend({"role": s["role"], "content": s["content"]} for s in self._steps)
        return messages
```

 

The second pattern is the speculative reasoning tax. Planning-first and tree-of-thought architectures generate internal reasoning tokens for paths that are ultimately not executed. When a planning agent uses a reasoning model to deliberate over five possible next actions before executing one, it spends four units of reasoning on hypotheticals that produce no output. This overhead only pays off when the task requires genuine uncertainty resolution. For well-specified subtasks with known parameters — a structured database query, a templated document generation, an API call with unambiguous inputs — the reasoning tokens represent pure waste. The model is working hard to confirm something that could have been hardcoded.

 

The fix is model routing by task certainty. Reserve reasoning models for planning under genuine uncertainty. Route well-specified, deterministic subtasks to smaller, cheaper models or to direct tool dispatch. Most frameworks implement this lazily — always the same model for all steps regardless of what each step actually requires.

 

```python

from enum import Enum
 
class TaskCertainty(Enum):
    DETERMINISTIC = "deterministic"   # known inputs, known outputs, no judgment needed
    CONSTRAINED = "constrained"       # bounded decision space, few valid paths
    OPEN = "open"                     # requires genuine reasoning under uncertainty
 
def route_by_certainty(
    task_description: str,
    certainty: TaskCertainty,
    available_models: dict
) -> str:
    """
    Select model based on task certainty, not habit.
    Most frameworks default to the same model for every step — this is rarely optimal.
    """
    routing_table = {
        TaskCertainty.DETERMINISTIC: available_models["small"],   # e.g. gpt-4o-mini
        TaskCertainty.CONSTRAINED: available_models["standard"],  # e.g. gpt-4o
        TaskCertainty.OPEN: available_models["reasoning"],        # e.g. o3, claude-opus
    }
    return routing_table[certainty]
 

# Example: a multi-step research workflow

# Step 1: formulate search queries  ->  OPEN (requires judgment)
# Step 2: execute web search        ->  DETERMINISTIC (tool call with known schema)
# Step 3: extract structured data   ->  CONSTRAINED (bounded output format)
# Step 4: synthesize findings       ->  OPEN (requires judgment)
 
# Without routing: all steps use the reasoning model
# naive_cost = 4_steps × 8000_tokens × $0.015_per_1k = $0.48
 
# With routing: only judgment steps use the expensive model
# routed_cost = (2 × 8000 × $0.015) + (1 × 2000 × $0.0001) + (1 × 3000 × $0.002)
# routed_cost = $0.24 + $0.0002 + $0.006 = ~$0.246
# ~49% reduction with zero capability loss on the deterministic steps
```

 

The third pattern, tool schema verbosity, is the least appreciated of the five and one of the highest-leverage targets. System prompts that include every available tool definition for every call waste input tokens on descriptions that provide no marginal value for the current reasoning step. In systems with twenty to fifty registered tools — increasingly common with the MCP ecosystem’s 5,800+ available servers — a flat tool inclusion policy adds two thousand to ten thousand tokens per inference call in overhead, even when the task step could only plausibly invoke two or three of those tools.

 

The fix is dynamic tool registration: a lightweight pre-step that classifies the current reasoning phase and exposes only the relevant tool subset to the main reasoning call.

 

```python

def build_dynamic_tool_manifest(
    all_tools: list[dict],
    task_phase: str,
    phase_tool_map: dict[str, list[str]]
) -> list[dict]:
    """
    Instead of passing all tool schemas on every call, expose only the tools
    plausibly relevant to the current task phase.
 
    In a 50-tool system this can reduce input token overhead by 4,000-8,000
    tokens per call with no change in agent behavior on in-scope tasks.
    """
    relevant_tool_names = set(phase_tool_map.get(task_phase, []))
 
    if not relevant_tool_names:
        # Fail safely: if phase is unknown, expose all tools rather than none
        return all_tools
 
    return [t for t in all_tools if t["name"] in relevant_tool_names]
 
# Example usage — a code review agent with phase-specific tools
PHASE_TOOL_MAP = {
    "discovery":   ["read_file", "list_directory", "search_codebase"],
    "analysis":    ["read_file", "run_linter", "search_codebase"],
    "suggestion":  ["read_file", "write_file", "create_comment"],
    "validation":  ["run_tests", "run_linter", "read_file"],
}
 
# discovery phase: 3 tools exposed instead of 20+
# saves ~8,500 tokens of tool schema per call in a typical MCP-connected setup
```

 

The fourth and fifth patterns — retry amplification without damping, and coordinator overhead in naive multi-agent setups — deserve more careful treatment because they are both nonlinear in their cost impact. The multi-agent context problem has its own compounding geometry that changes the calculation entirely.

 

The Multi-Agent Multiplication Problem

Multi-agent architectures introduce a cost pathology that does not appear in single-agent systems because the overhead mechanism is multiplicative, not additive. In a naive peer-to-peer configuration — AutoGen-style group chats without a strong orchestrator — each agent receives the accumulated conversation history and reasons over the full task context independently. An N-agent system with K messages exchanged costs O(N × K × average_message_length) in context overhead, versus O(K × average_message_length) for a single agent accomplishing the same work.

The concrete arithmetic: a document analysis workflow with 8,000 tokens of shared context delegated to six sub-agents generates 48,000 tokens of context loading before any agent produces a single token of output. If each sub-agent then produces a 500-token response that gets broadcast back to the group, the next round of reasoning starts with 8,000 + 6 × 500 = 11,000 tokens per agent — 66,000 tokens per round. By round three, the coordination overhead alone exceeds what a single capable agent on a long context would need for the entire task. You have built a system that is more expensive than the one it replaced, not because the agents are bad but because the architecture requires every agent to know everything everyone else knows.
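That fan-out arithmetic, sketched with the figures from the example (6 agents, 8k shared context, 500-token responses):

```python
def broadcast_round_tokens(n_agents: int, shared_context: int,
                           response_tokens: int, round_num: int) -> int:
    """Context loaded across all agents in one round of a naive full-broadcast
    group chat: each agent re-reads the shared context plus every response
    from all prior rounds."""
    per_agent = shared_context + (round_num - 1) * n_agents * response_tokens
    return n_agents * per_agent

print(broadcast_round_tokens(6, 8_000, 500, 1))  # 48000 — initial fan-out
print(broadcast_round_tokens(6, 8_000, 500, 2))  # 66000 — round two, per the text
print(broadcast_round_tokens(6, 8_000, 500, 3))  # 84000 — growing every round
```

The per-round overhead grows without bound as long as every response is broadcast to every agent.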

The error dynamics compound this. Research on multi-agent scaling found that naive independent (peer-to-peer) configurations can amplify errors up to 17.2x relative to a single agent — not because each agent is less capable, but because independent errors compound rather than cancel. This finding comes from a controlled study across 180 configurations without validation gates, and production systems with proper guardrails behave differently. But the token consumption pattern holds regardless of error handling: the coordination overhead is structural, not incidental.

The architectural remedy is not to abandon multi-agent patterns but to change how context flows through them. Hierarchical orchestration, where a single orchestrator constructs minimal, task-specific context for each sub-agent rather than broadcasting full conversation state, changes the scaling behavior from multiplicative to additive. A coding sub-agent dispatched to implement a specific function needs the function specification, the relevant existing code, and the output schema — not the discovery conversation, the planning deliberation, or the outputs of parallel agents working on unrelated modules.

 

```python

from dataclasses import dataclass
from typing import Any
 
@dataclass
class SubAgentContext:
    """
    Construct the minimum viable context for a sub-agent invocation.
    Sub-agents do not need the orchestrator's full conversation history —
    they need their task specification and relevant data, nothing more.
 
    The principle is borrowed from API design: expose what's needed at the
    interface, not what exists in the implementation.
    """
    task_spec: str
    relevant_data: dict[str, Any]
    output_schema: dict
    constraints: list[str]

    def to_messages(self) -> list[dict]:
        """Build a focused context for this specific sub-agent role."""
        system_msg = {
            "role": "system",
            "content": (
                f"You are a specialized sub-agent. Your task:\n{self.task_spec}\n\n"
                f"Output must conform to this schema:\n{self.output_schema}\n\n"

                f"Constraints: {'; '.join(self.constraints)}"
            )
        }
        user_msg = {
            "role": "user",
            "content": f"Relevant context:\n{self.relevant_data}\n\nProceed."
        }
        return [system_msg, user_msg]
 
    def estimate_token_count(self) -> int:
        """Rough estimate — use a tokenizer in production."""
        content = str(self.task_spec) + str(self.relevant_data) + str(self.output_schema)
        return len(content) // 4  # rough chars-to-tokens approximation
 
def orchestrate_with_scoped_context(
    full_plan: dict,
    sub_agent_roles: list[str],
    invoke_agent_fn
) -> dict[str, Any]:
    """
    Orchestrate sub-agents with scoped, minimal context rather than
    broadcasting full state to every participant.

    In a 6-agent system with 8k shared context, this typically reduces
    total context tokens from ~48k (full fan-out) to ~12k (scoped delegation).
    """
    results = {}
    for role in sub_agent_roles:
        # Extract only what this role needs — not the full plan
        role_spec = full_plan["role_assignments"][role]
        role_data = {k: full_plan["shared_data"][k] for k in role_spec["required_data"]}
        context = SubAgentContext(
            task_spec=role_spec["task"],
            relevant_data=role_data,
            output_schema=role_spec["output_schema"],
            constraints=role_spec.get("constraints", [])
        )
        results[role] = invoke_agent_fn(role, context.to_messages())
    return results
```

 

There is a question worth sitting with before building any multi-agent system: if you had unlimited context capacity, would you still want multiple agents? When the answer is no — when the architecture exists solely to split a task that would fit in a single context if that context were large enough — you are building a workaround for a temporary constraint and locking in permanent coordination overhead. Claude’s 1M-token context and similar expansions from other providers are actively eliminating the class of use case where multi-agent was a context-overflow necessity. The remaining justifications for multi-agent are genuine parallelism, genuine domain specialisation, and genuinely independent subtask execution. “Our task was too big for the context window” is not surviving as a justification, and architectures built on it are now carrying debt.

 

Five Variables, One Ordering

The concerns this article has worked through — task fit, reliability, context avalanche, multi-agent fan-out, tool schema verbosity, retry behavior, model routing — do not exist in isolation. They interact, and examining them together reveals a coherent structure: five variables that collectively determine total operating cost, interacting multiplicatively rather than additively.

Task fit is the first and the one most often skipped in architectural reviews. The question is whether the task you are asking the agent to perform is appropriate for LLM reasoning at all. A task that could be handled by a deterministic SQL query, a rule-based classifier, or a templated transformation is not an agent task — it is a task masquerading as one. The formalisability test is direct: can success be defined without human judgment? If yes, can that success be achieved with deterministic tools at a fraction of the cost? LangChain’s survey of agent practitioners found that approximately 92.5% of production agents deliver output to humans rather than to other software — meaning humans are serving as the reliability backstop. This is often a signal that success criteria were never formalised, not that agents are inherently unreliable. You cannot engineer your way out of a misspecified task. A perfectly efficient agent that reliably produces the wrong kind of output is not a cost problem; it is an architectural one.

Reliability rate is the second variable, and because it sits in the denominator of the cost function, it is the highest-leverage term in the formula. A 10-point improvement in reliability at 70% success rate reduces total cost by more than the equivalent reduction in token consumption — often by an order of magnitude — because every failure generates remediation cost that dwarfs the inference spend. There is a compounding mathematics here that is not intuitive: 85% per-step reliability across a ten-step workflow produces roughly 20% end-to-end success. That is not a survey statistic or a vendor claim. It is multiplicative probability — 0.85^10 — and it applies regardless of how carefully you have optimised the context window.
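The compounding is a two-line computation, not a benchmark:

```python
def end_to_end_success(per_step_reliability: float, steps: int) -> float:
    """Multiplicative probability: every step must succeed for the task to succeed."""
    return per_step_reliability ** steps

print(f"{end_to_end_success(0.85, 10):.1%}")  # 19.7% — roughly one task in five
print(f"{end_to_end_success(0.95, 10):.1%}")  # 59.9% — still failing two in five
print(f"{end_to_end_success(0.99, 10):.1%}")  # 90.4%
```

Getting a ten-step workflow to 90% end-to-end requires 99% per-step reliability, which is why validation gates matter more than any token optimisation.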

Architecture overhead is the third variable — the specific patterns detailed in the previous sections. Context accumulation, tool schema verbosity, speculative reasoning on deterministic steps, coordinator overhead in multi-agent setups. This is where most optimisation literature focuses, and it matters, but only after task fit and reliability are addressed. Reducing architectural overhead without first fixing task fit and reliability shifts cost from a visible line item to an invisible one. The LLMOps dashboard goes green. The total operating cost increases. This pattern — call it Dashboard Optimisation: the local improvement that produces a global cost increase by moving spend from observable infrastructure to unobservable human time — is common enough that it deserves a name and a dedicated check in any architecture review.

Context management is the fourth variable, and it splits across three distinct scale axes that produce three distinct cost pathologies. Volume scale — many short, similar tasks — has a pathology centered on redundant loading: the same system prompt, the same tool schemas, the same few-shot examples loaded fresh for each task when they could be cached. Complexity scale — few tasks requiring many sequential steps — has the context avalanche pathology. Time scale — single tasks running for hours or days — introduces goal drift, where the original task specification becomes diluted over long contexts and the agent begins optimising for something other than the original objective. Each scale axis requires categorically different strategy. Applying volume-scale solutions (prompt caching, batching) to complexity-scale problems provides no benefit. Applying complexity-scale solutions (context compression checkpoints) to volume-scale problems adds latency overhead without savings. The mismatch between scale axis and applied solution is one of the most common sources of optimisation work that produces no measurable improvement.

Execution environment is the fifth variable and often the highest-impact, lowest-effort lever: provider pricing tiers, model selection per step, prompt caching configuration, and for high-volume stable workloads, fine-tuned smaller models. Anthropic’s cache_control and OpenAI’s automatic prompt caching can achieve 50–90% cost reduction on repeated prompts with minimal architectural change. This is often the right first move — not a complex architectural redesign.
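As a concrete illustration, marking a stable prefix as cacheable with Anthropic's cache_control looks roughly like the sketch below. This shows the payload shape only; the model name is illustrative, and current parameters and minimum cacheable prefix sizes are in the provider documentation.

```python
def build_cached_request(system_prompt: str, tool_schemas: list[dict],
                         user_message: str) -> dict:
    """Build a Messages API payload where the stable prefix (tool schemas +
    system prompt) carries a cache breakpoint; only the per-task user
    message varies between calls."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model name
        "max_tokens": 1024,
        "tools": tool_schemas,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Everything up to and including this block is cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

For a volume-scale workload with thousands of short, similar tasks per day, this one structural change often outperforms weeks of context-compression work.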

The optimisation sequence these five variables imply is strict, and violating it has quantifiable costs. Fix task fit before reliability — because optimising retry logic for a task that should be a SQL query is pure waste. Fix reliability before architecture — because the 500:1 human:token ratio means every architectural token saving is overwhelmed by a single percentage point of reliability improvement. Match context and execution decisions to the specific scale axis you are actually operating at. This sequence is not a stylistic preference. It is a consequence of the cost formula’s structure: lower-numbered terms absorb higher-order optimisation without converting it to value.

 

The Retry Loop Nobody Stabilised

The most structurally interesting of the five architectural patterns is retry amplification, and it has a precise theoretical explanation that the agent engineering community has been approaching from entirely the wrong angle.

A retry loop in an agent system is a closed-loop feedback controller. The agent executes a step, observes the outcome, and adjusts the next attempt based on the observed error. This is the textbook definition of a feedback control system, and it is governed by the same stability criteria that control engineers have been working with since the 1940s.

The fundamental instability condition for a feedback loop is positive self-conditioning without sufficient damping. In an agent retry loop, this is the default behavior: the model generates a reasoning trace, produces an incorrect tool call, the error message enters the context window, and the next reasoning trace is conditioned on both the original context and the error description. Without explicit intervention, this is a high-gain loop with no derivative term — the control-theoretic conditions for oscillation. The agent generates increasingly elaborate wrong strategies, each retry accumulating more error context and conditioning subsequent attempts on a growing history of failure. Practitioners observe this as agents “spinning” on failed steps, consuming tokens on strategies that diverge rather than converge.

The PID controller solved this for physical systems by responding to the rate of error change (the D term), not just error magnitude (the P term). The agent equivalent is: respond to failure pattern, not just failure presence. If the same error occurs on three consecutive retries, the correct action is not to retry with the same strategy — it is to switch strategies or escalate. The damping mechanisms that prevent oscillation in a retry loop are: a step-level retry depth limit (not just a global limit on the overall workflow), a context modification on retry (compress and summarise the failure rather than appending the full error verbatim), and a strategy-switch trigger that detects non-convergence before the token budget is exhausted.

 

```python

import time
from enum import Enum
from typing import Callable, Any
 
class RetryState(Enum):
    CONVERGING = "converging"
    OSCILLATING = "oscillating"
    ESCALATE = "escalate"
 
class DampedRetryController:
    """
    A retry policy modeled as a damped feedback controller.

    The naive retry pattern (retry N times with backoff) has no derivative term:
    it responds to failure presence but not to failure pattern. This controller
    adds damping by detecting non-convergence (oscillation) and switching
    strategy rather than repeating the same approach.
 
    This is the PID controller's D-term applied to agent retry logic.
    """
 
    def __init__(
        self,
        max_retries_per_step: int = 3,
        oscillation_window: int = 2,
        backoff_base: float = 1.5
    ):
        self.max_retries = max_retries_per_step
        self.oscillation_window = oscillation_window
        self.backoff_base = backoff_base
        self._error_history: list[str] = []
 
    def _classify_failure_pattern(self) -> RetryState:
        """
        The D-term: detect whether failures are converging toward a solution
        or oscillating around the same error type.
        """
        if len(self._error_history) < self.oscillation_window:
            return RetryState.CONVERGING
 
        recent = self._error_history[-self.oscillation_window:]
        # If all recent errors are of the same type, we are oscillating
        if len(set(e.split(":")[0] for e in recent)) == 1:
            return RetryState.OSCILLATING
 
        return RetryState.CONVERGING
 
    def _compress_error_context(self, errors: list[str]) -> str:
        """
        On retry, summarize failures rather than appending them verbatim.
        Appending raw errors conditions the next attempt toward the same failure
        mode. A summary breaks the positive feedback loop.
        """
        error_types = {}
        for e in errors:
            key = e.split(":")[0]
            error_types[key] = error_types.get(key, 0) + 1
 
        summary_parts = [f"{etype} ({count}x)" for etype, count in error_types.items()]
        return f"Previous attempts failed with: {', '.join(summary_parts)}. Try a different approach."
 
    def execute_with_damping(
        self,
        action_fn: Callable[..., Any],
        fallback_fn: Callable[..., Any],
        *args,
        **kwargs
    ) -> Any:
        """
        Execute action_fn with damped retry logic.
        Switches to fallback_fn on detected oscillation.
        Escalates (raises) if max retries exhausted.
        """
        # Reset history per call so a previous step's failures cannot
        # trigger a false oscillation signal on this one.
        self._error_history.clear()
        attempt = 0
        while attempt < self.max_retries:
            try:
                return action_fn(*args, **kwargs)
            except Exception as e:
                error_key = f"{type(e).__name__}: {str(e)[:50]}"
                self._error_history.append(error_key)
 
                pattern = self._classify_failure_pattern()
 
                if pattern == RetryState.OSCILLATING:
                    # Switch strategy: try the fallback rather than repeating
                    compressed = self._compress_error_context(self._error_history)
                    return fallback_fn(*args, error_context=compressed, **kwargs)
 
                # Exponential backoff — the P-term
                wait = self.backoff_base ** attempt
                time.sleep(wait)
                attempt += 1
 
        # Max retries exhausted: escalate, do not silently fail
        raise RuntimeError(
            f"Step failed to converge after {self.max_retries} attempts. "
            f"Error pattern: {self._compress_error_context(self._error_history)}. "
            f"Human escalation required."
        )
```

 

The backoff addresses the P-term: grow the wait time with the attempt count. The oscillation detection and strategy switch address the D-term: respond to the failure pattern changing, not just to failure being present. The explicit escalation instead of silent failure addresses what aviation engineers call the go/no-go gate: there is a point past which autonomous retry is no longer rational, and the system must surface this to a human rather than consuming more budget on increasingly improbable recovery attempts.

The connection to distributed systems is direct and the agent community has mostly missed it. The circuit breaker pattern, standardised in microservices engineering through Netflix’s Hystrix and Michael Nygard’s Release It!, solves a structurally identical problem: preventing cascading failure when a downstream dependency is unreliable. After N consecutive failures, stop attempting and fail fast until recovery is confirmed. In an agent tool-call chain, an unreliable external API at step three generates the same cascading dynamics as an unreliable microservice. The circuit breaker logic — open, closed, half-open state machine — applies without modification. Rather than importing a thirty-year-old solution, the agent engineering community has been rediscovering it piecemeal under the label “retry problem,” which is a category error that resets the solution clock back to zero.
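To make the mapping concrete, here is a minimal sketch of the closed/open/half-open state machine applied to an unreliable tool call. The class name, thresholds, and string states are illustrative, not drawn from Hystrix or any specific library:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker for an unreliable agent tool call.

    CLOSED:    calls pass through; consecutive failures are counted.
    OPEN:      calls fail fast until a cool-down period expires.
    HALF_OPEN: one probe call is allowed; success closes the circuit,
               failure re-opens it.
    """

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # allow a single probe call
            else:
                raise RuntimeError("Circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        # Any success fully resets the breaker.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Wrapping the step-three API call in a breaker like this turns "retry until budget exhausted" into "fail fast and route around the dependency", which is exactly the cascading-failure containment the microservices version provides.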

 

The Honest Admission

Here is what the data will not let you avoid: optimising architecture without first fixing task fit and reliability does not just leave savings on the table; it can make your economics worse in absolute terms.

When you reduce context window usage by 30% through compression while simultaneously reducing precision on a judgment-heavy task, you have shifted cost from a visible line item to an invisible one. The LLMOps dashboard goes green. The total operating cost increases. The team declares a successful optimisation and moves on. Dashboard Optimisation — optimising the metric you can see while making the metric you cannot see worse — is the dominant failure mode of agent cost engineering, and it is invisible precisely because the expensive part (human remediation time) is distributed across systems that were never instrumented as part of the cost model.
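The arithmetic behind that invisible shift is simple enough to write down. The following sketch computes reliability-adjusted cost per task using the illustrative figures from the document-processing example earlier in this article (the numbers are the article's example, not benchmarks):

```python
def total_operating_cost(
    tasks_per_day: int,
    inference_cost_per_task: float,
    success_rate: float,
    remediation_minutes: float,
    analyst_rate_per_hour: float,
) -> dict[str, float]:
    """Reliability-adjusted cost: inference spend plus the expected
    human remediation cost of every failed task."""
    inference = tasks_per_day * inference_cost_per_task
    failures = tasks_per_day * (1 - success_rate)
    remediation = failures * (remediation_minutes / 60) * analyst_rate_per_hour
    total = inference + remediation
    return {
        "inference": inference,
        "remediation": remediation,
        "total": total,
        "cost_per_task": total / tasks_per_day,
    }


# The article's example: the dashboard shows $200/day; the operating
# cost is $20,200/day once analyst remediation is counted.
costs = total_operating_cost(
    tasks_per_day=10_000,
    inference_cost_per_task=0.02,  # $200 / 10,000 tasks
    success_rate=0.70,
    remediation_minutes=5,
    analyst_rate_per_hour=80,
)
```

A 40% cut to the inference term saves $80 a day; a 12-point drop in the success rate adds roughly $8,000 a day to the remediation term. That asymmetry is the whole argument.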

The statistics circulating in this space deserve calibration. The figure that “96% of organisations find costs higher than expected” originates from vendor surveys with selection bias toward organisations experiencing problems — the organisations where agents are quietly working well are underrepresented. The “280x cost decline in 2 years” comparison lacks matched baselines across task types and model families. The 85% per-step reliability mathematics is not a survey statistic — it is multiplicative probability, and it applies regardless of what any vendor claims about their system’s quality. The RPA-versus-LLM evidence is real: stable, structured enterprise workflows consistently show RPA outperforming LLM agents on reliability, at a fraction of the inference cost. These are the foundations worth building on.
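The multiplicative point is worth making concrete. Assuming independent steps, end-to-end success is the product of per-step reliabilities, so a respectable-looking 85% per step collapses quickly over a workflow:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """End-to-end success of a chain of independent steps is the
    product of per-step reliabilities: p ** n."""
    return per_step ** steps


# 85% per step: ~52% across 4 steps, ~27% across 8, ~14% across 12.
for n in (1, 4, 8, 12):
    print(f"{n:2d} steps: {end_to_end_success(0.85, n):.0%}")
```

No vendor claim changes this curve; only raising per-step reliability or shortening the chain does.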

There is also an honest reckoning due on optimisation investment itself. The engineering time required to implement sophisticated context management, multi-tier model routing, dynamic tool registration, and damped retry controllers is real. At low volume — hundreds of tasks per day — this investment may not recover its cost against savings in tokens that are already approaching commodity prices. Frontier-class capability that cost $30 per million tokens in 2023 is available below $1 per million tokens in early 2026. The economic case for architectural token optimisation weakens as token prices fall. What retains its value regardless of token price is the reliability engineering: error containment, validation gates, retry damping. These address human time, which does not follow the same cost curve as silicon inference. Token optimisation may have a maturity date; reliability engineering does not.
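The recovery question reduces to a break-even calculation. A hedged sketch, with every figure invented for illustration (engineering rate, volume, and token savings are assumptions, not measurements):

```python
def breakeven_days(
    engineering_hours: float,
    engineering_rate_per_hour: float,
    tokens_saved_per_task: int,
    price_per_million_tokens: float,
    tasks_per_day: int,
) -> float:
    """Days until an optimisation's engineering cost is recovered by
    token savings. Returns infinity if daily savings are zero."""
    investment = engineering_hours * engineering_rate_per_hour
    daily_saving = (
        tasks_per_day * tokens_saved_per_task / 1_000_000 * price_per_million_tokens
    )
    return investment / daily_saving if daily_saving else float("inf")


# Three engineer-weeks of context-management work at low volume,
# against sub-dollar-per-million token prices: $12,000 of engineering
# against $2.50/day of savings never pays back on any useful horizon.
days = breakeven_days(
    engineering_hours=120,
    engineering_rate_per_hour=100,
    tokens_saved_per_task=5_000,
    price_per_million_tokens=1.0,
    tasks_per_day=500,
)
```

Run the same function with reliability-engineering inputs, where the savings term is analyst hours at $80 rather than tokens at $1 per million, and the payback period shrinks by orders of magnitude, which is the asymmetry the paragraph above describes.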

 

What Survives the Cheapening

The specific optimisation patterns in this article carry expiration dates. Context compression is valuable now because context re-processing is expensive now. As inference costs continue falling and context windows continue expanding — Claude already offers one million token contexts — some of the architectural complexity described above becomes scaffolding for constraints that are dissolving. A multi-agent architecture designed to split a task that fits in a single large context is carrying coordination overhead that becomes indefensible as context windows grow.

What does not dissolve is the diagnostic frame: measure total operating cost including human time; ask whether the task belongs to an agent before selecting a model; set reliability targets based on what failure costs, not what success looks like. These questions outlast any particular optimisation technique because they address the structure of the problem, not the current price of compute.

The METR research tracking AI task completion reliability found that the time horizon over which agents can complete tasks with 50% success has been doubling approximately every seven months across the last six years. If that trajectory holds, a meaningful portion of the reliability overhead you are engineering today is on a trajectory to be solved at the model layer by 2028 — and some of the oversight patterns you have hardcoded as permanent will become unnecessary. The right response is not to avoid building oversight infrastructure, but to build it in a way that can be decommissioned: keep the reliability instrumentation in place so you can tell when the model improvements have actually earned the autonomy, rather than assuming they have.

The agents that work in production share a structural characteristic: somewhere early in their design, someone asked three questions without flinching at the answers. Is this task appropriate for an agent, or is a deterministic alternative cheaper and more reliable? What does a failure cost, in analyst time, in downstream data quality, in customer impact? What does it take to make success reliable enough that the economics work? The architecture follows from honest answers to those questions. The specific patterns — context compression, model routing, damped retries — are implementations of answers, not the answers themselves. That distinction is what determines whether the optimisation work builds toward something durable or merely makes the dashboard look better.

 

Key Sources:

  • LangChain, State of Agent Engineering 2025 (survey of 1,340 practitioners)
  • Why Do Multi-Agent LLM Systems Fail? (arXiv:2503.13657)
  • Towards a Science of Scaling Agent Systems (arXiv:2512.08296)
  • The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (arXiv:2509.09677)
  • Are LLM Agents the New RPA? (arXiv:2509.04198)
  • METR, Measuring AI Ability to Complete Long Tasks (2025)
  • Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems (arXiv:2511.14136)
