LLM Inference Optimisation

You Don’t Have a Compute Problem. You Have a Bytes-Per-Token Problem.

Autoregressive LLM decode is memory-bandwidth-bound, not compute-bound — and once you internalize that single physical fact, most major inference optimizations reveal themselves as different geometric attacks on the same bottleneck: bytes moved per generated token.

The following scenario is representative of what happens when teams optimize for the wrong metric — and it plays out more often than it should.

 

The GPU Upgrade That Didn’t Help

A team we worked with recently ran into this problem — a textbook example of optimising the wrong constraint:

The team had done everything right. Three months of profiling showed their LLaMA-based service was slow — p50 decode latency hovering around 1.8 seconds, users visibly waiting for responses. The fix seemed obvious: upgrade from A100s to H100s. The H100 delivers roughly 3x the FLOP/s of the A100 on FP16 workloads. They expected something close to that 3x speedup. What they got was 1.3x.

Their GPU utilization dashboard had been showing 85%, a number that felt like a properly loaded system. In reality, that dashboard was measuring the wrong thing.

GPU utilization tracks compute core activity — how busy the tensor cores are. But LLM decode is not waiting on compute. It is waiting on memory. Every decode step requires streaming the model’s entire set of weights out of high-bandwidth memory (HBM) into the compute cores. For a 7B parameter model in FP16, that means moving approximately 14 gigabytes of weights per output token. For 70B in FP16, it is 140 gigabytes streamed per decode step, regardless of batch size — and at batch size one, that is 140 gigabytes for every single token.

The A100 has 2.0 TB/s of HBM bandwidth. The H100 has 3.35 TB/s. The speedup for memory-bound workloads is not 3x from the FLOP/s upgrade; it is 1.68x from the bandwidth upgrade. The team got 1.3x because their workload was partially able to use the H100’s compute advantages, but the bandwidth gain was the binding factor. They had paid for faster compute when the workload was starved for bytes; what they actually needed was a wider pipe between memory and the compute cores.

 

This mistake is so common, it has become the default error mode for teams scaling LLM inference. Teams buy compute when the bottleneck is bandwidth. They optimize matrix multiplications when the bottleneck is bytes moved. They add GPU cores when the constraint is gigabytes transferred.

Most effective LLM inference optimization techniques — quantization, speculative decoding, paging, routing, disaggregation — trace back to a single physical law: decode throughput scales with memory bandwidth, not with compute. The entire optimization landscape becomes coherent when you hold that fact as the organizing principle. The question is never “how do we get more FLOP/s?” The question is “how do we move fewer bytes, or move the same bytes for more output?”

 

The Physics

To understand why most optimizations work the way they do, you need to understand arithmetic intensity — a measure of how many floating-point operations a workload performs per byte of data transferred from memory.

The roofline model captures this cleanly. On one axis, arithmetic intensity (FLOP per byte). On the other axis, attainable performance (FLOP/s). Every GPU sits under a roofline defined by two ceilings: its peak compute (FLOP/s) and its peak memory bandwidth (bytes/s) multiplied by arithmetic intensity. Below the ridge point on the arithmetic intensity axis, performance is bounded by bandwidth — no matter how fast the compute cores, the data pipeline chokes. Above that ridge point, performance is bounded by compute — bandwidth is no longer the problem.

The H100 SXM5’s ridge point sits at ~295 FLOP/byte in BF16 (989 TFLOPS dense against 3.35 TB/s of HBM3) and ~590 FLOP/byte in FP8. Autoregressive decode runs at ~1 FLOP/byte in BF16 and ~2 FLOP/byte in FP8 — roughly 295× below the ridge regardless of precision, because lowering precision moves the ridge and the workload together.

 

Prefill — the phase where the model processes your input tokens in parallel — is compute-bound (DistServe, Zhong et al., OSDI 2024, arXiv:2401.09670). Arithmetic intensity scales roughly as (batch × sequence length) / bytes_per_weight, which for typical prompt lengths puts it in the hundreds of FLOP/byte, comfortably above the H100’s ~295 FLOP/byte ridge point in BF16. This is why FlashAttention’s kernel efficiency matters for prefill, why tensor parallelism helps for long prompts, and why high-FLOP/s GPUs provide real benefits during this phase.

Decode — the phase where the model generates output tokens autoregressively, one at a time — has an arithmetic intensity of approximately 1 FLOP/byte in BF16. Not the ~1,024 of a 1K-token prefill. One. That is roughly 295× below the H100’s BF16 ridge — two full orders of magnitude — and the ratio doesn’t change when you switch precision, because lowering bytes-per-weight moves decode and the ridge together. Batching helps, but reaching the compute regime in BF16 would require hundreds of concurrent requests, not tens. In practice, well before that, the KV cache takes over as the dominant memory-bandwidth bottleneck. Decode doesn’t escape the bandwidth regime; it just changes which bandwidth is binding.

 

The physical formula is simple: decode throughput (tokens/second) ≈ GPU memory bandwidth (bytes/second) / model size (bytes/parameter × parameters). For a 7B model in FP16 on an A100:

```
14 GB / 2.0 TB/s = 7 ms per token minimum (theory)
```

 

That 7 ms is the hard floor, regardless of how many FLOP/s the A100 has. No software optimisation, no kernel tuning, no architecture change makes the A100 generate tokens faster than its bandwidth allows unless you reduce the bytes-per-weight. This is not a benchmark result. It is arithmetic.

 

AWQ INT4 quantisation reduces the 7B model from 14 GB to roughly 4 GB (including quantisation scales):

 

```
4 GB / 2.0 TB/s = 2 ms per token minimum
```

 

The ~3× to ~4× decode speedups AWQ measures across consumer GPUs (Lin et al., MLSys 2024, arXiv:2306.00978) are not surprising — they are the direct arithmetic consequence of shrinking the numerator. The formula predicted them before a single GPU was measured.

Two important scoping notes. First: arithmetic intensity scales with batch size. Batching 32 concurrent decode requests multiplies arithmetic intensity from 1 FLOP/byte to roughly 32 FLOP/byte, shifting the workload closer to compute-bound territory. The memory-bandwidth bottleneck analysis applies precisely to latency-sensitive serving where batch sizes at p50 latency targets are typically 4 to 16. For offline bulk inference where you can queue hundreds of requests, the workload shifts toward compute-bound, and optimisation priorities change accordingly — compute parallelism and tensor operations matter more than quantisation byte-reduction at those batch depths. The diagnostic question before any hardware or software decision is therefore: what is your actual p50 batch size under production load? If the answer is below 16, you are bandwidth-bound and every byte you can eliminate from model weights translates directly to decode throughput. If the answer is climbing into the tens and toward the hundreds, the workload is shifting toward the compute-bound regime and needs a different framework entirely.

Second: the Blackwell B200 at 8 TB/s of HBM bandwidth will deliver roughly 2.4x the H100’s decode throughput from bandwidth alone — before any algorithmic optimization. That is not a compute decision. It is a bandwidth investment. The trend line from HBM2 to HBM3 to HBM3e tells the same story the roofline model does: memory bandwidth is where the AI hardware industry knows the real bottleneck lives. The GPU vendors are voting with their roadmaps.
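To make the first scoping note concrete, the sketch below plugs the roofline arithmetic into a few lines of Python. It is a simplification: it assumes weight traffic dominates (the KV cache is ignored, which is optimistic for long contexts), and the A100 figures used (2.0 TB/s, ~312 TFLOPS FP16) are published peaks rather than achievable rates. The function name and structure are illustrative, not from any library.

```python
def decode_regime(params_b, bytes_per_weight, batch, hbm_tbs, peak_tflops):
    """Rough roofline diagnostic for one decode step (weights only, no KV cache)."""
    weight_bytes = params_b * 1e9 * bytes_per_weight       # bytes streamed per forward pass
    flops = 2 * params_b * 1e9 * batch                     # ~2 FLOPs per weight per token in the batch
    intensity = flops / weight_bytes                       # achieved FLOP per byte moved
    ridge = peak_tflops / hbm_tbs                          # GPU ridge point in FLOP/byte
    step_ms = max(weight_bytes / (hbm_tbs * 1e12),         # bandwidth-bound time
                  flops / (peak_tflops * 1e12)) * 1e3      # compute-bound time
    bound = "bandwidth" if intensity < ridge else "compute"
    return intensity, ridge, step_ms, bound

# 7B FP16 on an A100 (2.0 TB/s, ~312 TFLOPS FP16) at batch 8:
print(decode_regime(params_b=7, bytes_per_weight=2, batch=8, hbm_tbs=2.0, peak_tflops=312))
# -> intensity ~8 FLOP/byte against a ridge of ~156: firmly bandwidth-bound, ~7 ms per step
```

Run it with your own p50 batch size and model size; the regime it reports is the answer to the diagnostic question above.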

 

The Memory Topology Stack

With the physics established, the optimization landscape resolves into six distinct layers of the memory hierarchy, each addressed by a different class of technique. Understanding which layer you are attacking — and whether it is currently the binding constraint — is the difference between compounding gains and wasted engineering quarters.

 

Layer 0: Weight Storage — AWQ INT4

The deepest layer is how much the model weighs. A 70B FP16 model weighs 140 GB. Fitting it on a single A100 80GB is impossible. Fitting it on four A100s means paying for four GPUs and absorbing their combined bandwidth cost per token. INT4 quantisation cuts the weight to 35 GB, fitting on a single A100 with room left for KV cache and batch accumulation.

AWQ (arXiv:2306.00978) solves the historical problem with INT4: naive quantisation destroyed accuracy because it treated all weights equally. The key insight of AWQ is that approximately 1% of weight channels are “salient” — they correspond to large activation magnitudes, and quantising them coarsely causes disproportionate accuracy loss. AWQ identifies these channels by examining activation statistics on a small unlabelled calibration set, then scales them up before quantisation (preserving effective precision where it matters most) without requiring mixed-precision kernels. The result: 4x memory reduction, more than 3× decode throughput (up to 3.9× on some hardware), and minimal accuracy degradation across general-purpose benchmarks — validated at MLSys 2024, where AWQ won Best Paper.

 

Layer 1: Weight Transfer — Batch Amortization

The second layer is how efficiently the bandwidth cost is amortised. When you serve a single user at a time, all 14 GB (or 4 GB after AWQ) is transferred to generate one token for one person. When you serve 8 users simultaneously, the same transfer generates 8 tokens — effectively 8x the throughput from the same bandwidth spend. This is why continuous batching is not optional. It is the mechanism by which bandwidth cost gets amortised across the requests your system is actually trying to serve. That said, batching is constrained by available VRAM, not just by demand — each concurrent request needs its own KV cache, and at sufficient context lengths, that capacity limit sets the practical ceiling on batch size.

 

Layer 2: KV Cache Allocation — PagedAttention

As sequences grow, the model must cache the key-value attention states for all previously generated tokens. In naive serving, this KV cache is pre-allocated as a contiguous block sized to the maximum possible sequence length. For a request whose actual sequence is 512 tokens against a 4,096-token maximum, 87.5% of that allocation sits empty — dead VRAM that cannot be used for batch accumulation or other requests.

PagedAttention (arXiv 2309.06180, the paper behind vLLM) applies the classic OS page table insight to GPU memory. Rather than contiguous pre-allocation, the KV cache is stored in fixed-size pages mapped through a per-sequence page table. Actual physical memory is allocated only when pages are needed. The result: 60-80% of previously wasted VRAM is recovered, enabling 2-4x more concurrent requests on the same hardware with near-zero overhead. The technique is exact — it changes no model behavior, introduces no approximation, and is now the default in every serious serving framework.

 

Layer 3: KV Recomputation — RadixAttention Prefix Caching

In any production system, requests share large prefixes: the system prompt, tool definitions for agents, few-shot examples. Without prefix caching, each new request recomputes the attention states for these shared tokens from scratch — burning TTFT on work already done for the previous thousand requests.

RadixAttention (arXiv 2312.07104, implemented in SGLang) maintains a radix tree of cached KV states indexed by token sequence. Incoming requests match against tree prefixes via O(k) prefix lookup where k is the incoming prefix length; any shared prefix reuses the cached KV without recomputation. For agentic workloads with 2,000-token tool definitions, this eliminates those 2,000 tokens of prefill computation from every single request, delivering substantial TTFT reduction on prefix-heavy workloads (up to 6.4× end-to-end throughput per the SGLang paper). For a system handling 1,000 requests per minute, that is 2 million tokens per minute of computation eliminated.
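A minimal sketch of the lookup idea helps make the mechanism concrete. This is a plain token-level prefix trie, not SGLang’s actual compressed radix tree, and the class and function names are invented for illustration: incoming token sequences walk the tree and reuse whatever cached KV prefix already exists, so only the unmatched suffix needs prefill.

```python
class PrefixNode:
    """Toy token-level prefix tree; SGLang's real structure is a compressed radix tree."""
    def __init__(self):
        self.children = {}        # token id -> PrefixNode
        self.kv_handle = None     # reference to the cached KV pages for this prefix

def insert(root, tokens, kv_handle):
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, PrefixNode())
    node.kv_handle = kv_handle

def match_prefix(root, tokens):
    """Return how many leading tokens already have cached KV; only the rest need prefill."""
    node, matched = root, 0
    for tok in tokens:
        if tok not in node.children:
            break
        node = node.children[tok]
        matched += 1
    return matched

root = PrefixNode()
insert(root, [1, 2, 3, 4, 5], kv_handle="kv-pages-for-system-prompt")
print(match_prefix(root, [1, 2, 3, 9]))   # -> 3 tokens reused, 1 token left to prefill
```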

 

Layer 4: Attention Kernel — FlashAttention

The attention computation itself is memory-bound at the kernel level: standard attention materialises the full N×N score matrix in HBM, requiring O(N²) HBM reads and writes that dominate runtime for long sequences. FlashAttention (arXiv 2205.14135) tiles the Q, K, V matrices into SRAM-resident blocks, computes partial attention within SRAM, and accumulates running softmax statistics — the N×N matrix never touches HBM. This converts the attention operation from HBM-bound to SRAM-bound, delivering 2-4x wall-clock speedup and enabling 16k+ sequence lengths without the quadratic memory explosion.
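The accumulation trick is easier to see in scalar code than in CUDA. Below is a toy single-query version in NumPy: not the fused kernel, just the online-softmax bookkeeping that lets each K/V tile be processed and discarded without ever materialising the full score vector. All names are illustrative.

```python
import numpy as np

def flash_attention_row(q, K, V, block=64):
    """Toy single-query attention using online softmax over K/V tiles (FlashAttention idea)."""
    d = q.shape[-1]
    running_max, denom = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        scores = (k_blk @ q) / np.sqrt(d)             # scores for this tile only
        new_max = max(running_max, scores.max())
        rescale = np.exp(running_max - new_max)       # correct previously accumulated partials
        probs = np.exp(scores - new_max)
        denom = denom * rescale + probs.sum()
        acc = acc * rescale + probs @ v_blk
        running_max = new_max
    return acc / denom

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64)), rng.normal(size=64)
scores = (K @ q) / 8.0
reference = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
print(np.allclose(flash_attention_row(q, K, V), reference))   # -> True
```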

Worth noting: FlashAttention solves O(N²) attention computation memory — it does not solve O(N) KV cache storage. For very long contexts (128k+), the KV cache for a 70B model approaches 335 GB, far exceeding any single GPU. FlashAttention alone does not solve this problem; KV cache compression via kvpress is required separately. The two are complementary, not synonymous.
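The sizing arithmetic behind that figure is worth having at hand. The sketch below assumes full multi-head attention for a 70B-class model (80 layers, 64 heads of dimension 128, FP16), which is roughly what the ~335 GB figure above implies; grouped-query attention variants shrink this by the ratio of query heads to KV heads. The function is illustrative, not from kvpress or any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, per head, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token, per_token * n_tokens

# 70B-class model, full multi-head attention, 128k-token context:
per_token, total = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, n_tokens=131_072)
print(f"{per_token / 2**20:.1f} MiB per token, {total / 1e9:.0f} GB per sequence")
# -> ~2.5 MiB per token, ~344 GB for a single 128k sequence: the same order as the figure above
```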

 

Layer 5: Request Routing — RouteLLM and TALE

The outermost layer is whether expensive compute fires at all. For any production system with heterogeneous query difficulty — which is most production systems — routing is the highest-leverage optimisation that does not require touching serving infrastructure.

RouteLLM (arXiv:2406.18665) trains a lightweight classifier on preference pairs to route queries by difficulty, sending easy queries to a cheap small model and hard queries to the expensive frontier model. TALE (arXiv:2412.18547) applies related logic to reasoning costs: it estimates a per-query token budget from problem difficulty and caps reasoning length accordingly, cutting reasoning-token output by ~60–70% with under 5% accuracy loss.

 

The Techniques in Depth

 

AWQ INT4: The Arithmetic of Compression

The activation-aware insight of AWQ deserves more attention than the benchmark headline. Naive INT4 — rounding all weights to 16 discrete levels — causes large accuracy drops because the rounding error for any channel is bounded by the scale of the largest weight in that group. If a channel has weights ranging from -0.8 to +0.8, INT4 precision is ±0.05, which is tolerable. If a channel is multiplied against activations with outlier magnitudes, its quantisation error gets amplified in the output — and those are the channels where INT4 precision becomes catastrophic.

AWQ identifies these high-activation channels using activation statistics on 128-512 unlabelled samples — not labels, not task-specific data, just representative input text. It multiplies these channels’ weights by a scaling factor before quantisation (so they occupy more of the INT4 range) and applies the inverse scaling to the corresponding activations at inference time. The mathematical operation is equivalent; the numerical stability is dramatically better. The “salient 1%” finding is consistent with the broader insight that neural network weight distributions are not uniform in their sensitivity — targeted precision allocation delivers better accuracy per bit than uniform quantisation.
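A toy numerical illustration of the scaling trick, in NumPy. This is not the AutoAWQ implementation and the scale value is arbitrary; it only shows why scaling a salient channel up before round-to-nearest INT4, then folding the inverse scale back out, buys that channel finer effective resolution.

```python
import numpy as np

def int4_rtn(v):
    """Symmetric 4-bit round-to-nearest over one quantisation group."""
    scale = np.abs(v).max() / 7
    return np.clip(np.round(v / scale), -8, 7) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=128)     # one group of weights
salient = np.array([3, 17])           # pretend these channels see outlier activations

naive_err = np.abs(int4_rtn(w) - w)   # every channel gets the same quantisation step

s = np.ones(128)
s[salient] = 4.0                      # AWQ-style: scale salient channels up before rounding...
awq_err = np.abs(int4_rtn(w * s) / s - w)   # ...then fold the inverse scale back out

print("salient-channel error, naive :", naive_err[salient])
print("salient-channel error, scaled:", awq_err[salient])
# The scaled variant gives the salient channels ~4x finer effective resolution, at the
# price of a slightly coarser step for the rest of the group; AWQ searches for the
# per-channel scales that minimise the overall activation-weighted output error.
```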

The production implication is direct: a 70B FP16 model requires 4x A100 80GB GPUs at roughly $12-16/hour cloud cost. AWQ INT4 reduces it to one A100 80GB at $3-4/hour — 4x GPU cost reduction from a 30-minute offline quantisation pipeline.

One concession that must be made explicitly: the “sub-1% accuracy degradation” figure is measured on MMLU and similar general-purpose benchmarks with a generic calibration set. For domain-specific precision tasks — biomedical entity extraction, financial regulatory document analysis, legal clause interpretation — the salient weight map derived from generic text may not reflect the activation distribution of your specialised corpus. Degradation can be significantly higher without a domain-specific calibration set. If you are operating in a regulated or precision-sensitive domain, validate AWQ on representative task samples before production deployment. The fallback — INT8 via SmoothQuant with domain-specific calibration — provides better precision at 2x rather than 4x compression.

 

PagedAttention / vLLM: The OS Insight

The analogy in the PagedAttention paper (arXiv 2309.06180) is instructive: early operating systems allocated contiguous physical memory to processes. When a process needed 100MB but only 30MB was available contiguously, allocation failed even if 70MB sat free across fragmented blocks. Virtual memory with page tables solved this by decoupling logical address space from physical allocation, allowing non-contiguous physical pages to appear contiguous to the process.

PagedAttention applies this insight to GPU memory management for KV caches. The LLM serving system pre-allocates fixed-size “blocks” (e.g., 16 or 32 tokens per block) rather than worst-case contiguous sequences. As a request’s sequence grows, it requests additional blocks from the allocator. Blocks can be anywhere in VRAM; the page table tracks their locations. When a request completes, its blocks are returned to the pool immediately.

The 60-80% memory waste elimination is not a marginal improvement — it is the difference between serving 4 concurrent requests and serving 12 on the same GPU. That 3x concurrency improvement translates directly to 3x throughput at the same hardware cost. For a self-hosted deployment, this effect alone often eliminates the need for additional GPU hardware.
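A toy allocator makes the mechanism concrete. This is illustrative only (vLLM’s block manager is considerably more involved, with copy-on-write for beam search and prefix sharing): a pool of fixed-size physical blocks, a per-sequence block table, and allocation that happens only when a sequence actually grows into a new block.

```python
class PagedKVAllocator:
    """Toy paged KV allocator: fixed-size blocks handed out on demand."""
    def __init__(self, total_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(total_blocks))      # ids of free physical blocks in VRAM
        self.tables = {}                           # seq_id -> block table (logical -> physical)

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position % self.block_size == 0:        # sequence has grown into a new block
            if not self.free:
                raise MemoryError("no free KV blocks: preempt or queue the request")
            table.append(self.free.pop())
        return table

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))   # blocks return to the pool immediately

# A 512-token sequence against a 4,096-token maximum: 32 blocks used, not 256 reserved.
alloc = PagedKVAllocator(total_blocks=1024)
for position in range(512):
    alloc.append_token("req-1", position)
print(len(alloc.tables["req-1"]), "blocks of", alloc.block_size, "tokens")   # -> 32 blocks of 16 tokens
```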

 

EAGLE-3 Speculative Decoding: Multiple Tokens Per Weight-Load Cycle

Speculative decoding’s core insight is that decode’s bandwidth cost is per-forward-pass, not per-token. The target model must load all its weights once per forward pass. If that forward pass generates 5 tokens instead of 1, the effective bandwidth cost per token is reduced 5-fold.

EAGLE-3 (arXiv:2503.01840) achieves this with a small draft model (a fraction of the target model’s parameters) that predicts multiple candidate tokens cheaply, organised as a branching token tree rather than a linear sequence. The target model then performs a single parallel forward pass to verify the entire tree — accepting candidates whose probability exceeds the rejection sampling threshold, rejecting and resampling at the first divergence. Because EAGLE-3 uses rejection sampling, the output distribution is provably identical to the target model’s distribution. No approximation is made; no quality is traded away.

At low temperature with structured output (code generation, JSON extraction, tool calling), draft acceptance rates are high — candidate tokens closely match what the target model would generate — and EAGLE-3 achieves 5-6.5x latency reduction (arXiv:2503.01840). At high temperature with creative open-ended generation, the draft and target distributions diverge, and practical speedup is 1.5-2.5x. Speedup also degrades at high batch sizes — at batch 64, EAGLE-3’s throughput improvement drops to ~1.4×. Speculative decoding is a latency tool for single-user or low-batch serving, not a throughput multiplier for high-concurrency production.
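The relationship between acceptance rate and speedup can be written down directly. Under simplifying assumptions (a linear draft chain of depth k rather than EAGLE-3’s token tree, independent per-token acceptance probability a < 1, and draft cost small relative to a target forward pass), the expected number of tokens produced per target weight-load is a truncated geometric series. Treat it as a floor; tree drafting does better.

```python
def expected_tokens_per_pass(accept_rate, draft_depth):
    """Expected accepted draft tokens plus the one token the target pass always yields,
    assuming independent per-token acceptance on a linear draft chain (accept_rate < 1)."""
    a, k = accept_rate, draft_depth
    return (1 - a ** (k + 1)) / (1 - a)    # 1 + a + a^2 + ... + a^k

for a in (0.5, 0.7, 0.9):
    print(f"acceptance {a:.0%}: ~{expected_tokens_per_pass(a, draft_depth=5):.1f} tokens per weight load")
# -> ~2.0 at 50%, ~2.9 at 70%, ~4.7 at 90%: high-acceptance structured output amortises
#    each stream of target weights across several tokens; low-acceptance creative text does not
```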

 

The vLLM integration is a single flag:

 

```bash
# vLLM speculative decoding (verify exact flag syntax against current vLLM docs):
vllm serve <target-model> \
  --speculative-config '{"model": "<eagle3-draft-model>", "num_speculative_tokens": 5}' \
  --quantization awq
```

 

 

Pre-trained EAGLE-3 draft models are available for Llama 3.1 8B and 70B, Mistral, and Qwen families. For custom fine-tuned models, a new draft model must be trained — a one-time offline process, but a gap in the current tooling ecosystem.

 

DistServe P/D Disaggregation: Hardware Matched to Physics

DistServe (arXiv:2401.09670) extends the prefill-decode asymmetry insight to its logical conclusion: if prefill and decode have fundamentally different hardware requirements, they should run on fundamentally different hardware.

Co-location forces a compromise. A GPU handling both phases must balance its resources between compute-intensive prefill (benefits from high FLOP/s and tensor parallelism) and bandwidth-intensive decode (benefits from high HBM bandwidth and batch depth). During prefill-heavy periods, decode batches stall. During decode-heavy periods, long prefills queue behind them, spiking TTFT. The result is a system that is mediocre at both phases rather than optimal at either.

Disaggregation separates the serving system into a prefill pool and a decode pool, each sized and equipped for its workload. The prefill pool processes input sequences and transfers the resulting KV cache to the decode pool, which handles generation. The measured results: 7.4x more requests served within SLO, or equivalently, 12.6x tighter SLO attainment for the same request volume.

A real constraint must accompany this number. KV cache transfer between pools happens over the network fabric, and even a 200 Gb/s (25 GB/s) InfiniBand link is nearly two orders of magnitude slower than HBM at 2 TB/s. For long-context requests, per-request KV cache can reach tens of GB, making transfer times over InfiniBand non-trivial and sometimes comparable to the decode TTFT savings. In shared-fabric environments this network bottleneck can negate the benefit entirely. P/D disaggregation belongs in the advanced infrastructure tier: teams at 1M+ requests per month, with dedicated network fabric, on serving stacks with mature P/D disaggregation support (maturity varies across frameworks).
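The constraint is easy to quantify with the same KV-cache sizing arithmetic used earlier. The sketch below assumes full multi-head attention for a 70B-class model and that the entire prompt’s KV must cross the fabric once per request; the dimensions and the 200 Gb/s figure are illustrative assumptions.

```python
def kv_transfer_ms(n_layers, n_kv_heads, head_dim, prompt_tokens, fabric_gbps, bytes_per_elem=2):
    """Time to ship one request's prefill KV cache from the prefill pool to the decode pool."""
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * prompt_tokens
    fabric_bytes_per_s = fabric_gbps * 1e9 / 8          # Gb/s -> bytes/s
    return kv_bytes / fabric_bytes_per_s * 1e3          # milliseconds

# 70B-class model (80 layers, full MHA), 8k-token prompt, 200 Gb/s fabric:
print(f"{kv_transfer_ms(80, 64, 128, 8192, 200):.0f} ms per request")
# -> ~860 ms of pure transfer time, which is why disaggregation wants dedicated fabric
#    and benefits heavily from GQA and KV-cache compression on the transfer path.
```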

 

RouteLLM: The Routing Layer Nobody Thinks to Build First

RouteLLM (arXiv:2406.18665) from the LM-Sys team operationalises a simple observation: most production queries do not require frontier model capability. A request to “reformat this JSON to CSV,” “summarise this paragraph,” or “classify this support ticket” will get an indistinguishable answer from a small self-hosted model at a fraction of the cost of a frontier API model — often an order of magnitude or more in price difference per token. The question is not whether to use cheaper models — of course you should. The question is which requests they can handle reliably.

RouteLLM trains a matrix factorisation router on Bradley-Terry preference data: pairs of (question, which model answered correctly). This approach learns difficulty from outcome signals rather than surface features, making it more robust to query distribution shifts than discriminative classifiers. The router adds approximately 1ms of overhead at inference time — negligible. The threshold parameter controls the strong/weak split and is adjustable at runtime without redeployment.

The benchmark figure — 40-85% cost reduction with 95%+ of GPT-4 quality — is derived primarily from MMLU, which is structured and domain-labelled. Production queries are unstructured, ambiguous, and multi-intent in ways that MMLU does not capture. The honest production estimate is 15-40% cost reduction for most teams. Start with a 20% routing threshold (20% of traffic to weak model), measure quality regression on your specific task distribution, and expand the threshold empirically. The savings compound quickly once calibrated: routing 40% of traffic to a substantially cheaper model produces meaningful blended-cost reductions, compounding across request volume.

 

Code Examples

 

AWQ Quantization Pipeline

Before you can deploy a quantised model, you need to quantise it. The AutoAWQ pipeline (casper-hansen/AutoAWQ) handles this in under 30 minutes on a single GPU:

 

```python

from awq import AutoAWQForCausalLM

from transformers import AutoTokenizer

 

model_path = 'meta-llama/Meta-Llama-3.1-8B-Instruct'

quant_path = './llama3.1-8b-awq-int4'

 

model = AutoAWQForCausalLM.from_pretrained(model_path)

tokenizer = AutoTokenizer.from_pretrained(model_path)

 

quant_config = {

    'zero_point': True,

    'q_group_size': 128,   # 128 is standard; 64 for higher accuracy

    'w_bit': 4,

    'version': 'GEMM'      # GEMM for throughput, GEMV for single-request latency

}

 

# Calibration with 128 samples — no labels required

model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)

tokenizer.save_pretrained(quant_path)

 

# Result: 16GB → ~5GB, ~3-4× decode speedup, sub-1% accuracy loss on general benchmarks

# Deploy: vllm serve ./llama3.1-8b-awq-int4 --quantization awq

```

 

vLLM: AWQ + PagedAttention + Prefix Caching

The production-standard deployment combines quantisation, paged memory, and prefix caching in a single initialisation. Note: pass a pre-quantised AWQ checkpoint — `quantization='awq'` does not quantise on the fly.

 

 

```python

from vllm import LLM, SamplingParams

 

llm = LLM(

    model='hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4',

    quantization='awq',           # AWQ INT4: 4x memory reduction

    gpu_memory_utilization=0.90,  # Reserve 90% of VRAM for model + KV cache

    enable_prefix_caching=True,   # Hash-based prefix cache for repeated system prompts

    max_model_len=32768

)

 

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

prompts = [

    'Summarise the following document: ...',

    'Extract key entities from: ...'

]

outputs = llm.generate(prompts, sampling_params)

 

# For the OpenAI-compatible server:

# vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \

#   --quantization awq \

#   --enable-prefix-caching \

#   --gpu-memory-utilization 0.90

```

 

SGLang RadixAttention for Agentic Workloads

For systems with repeated tool definitions or system prompts, SGLang’s RadixAttention provides substantial TTFT reduction on prefix-heavy workloads (up to 6.4× end-to-end throughput per the SGLang paper) with zero code changes at the model level. The radix tree maintains KV states across requests automatically.

 

Note: SGLang’s current recommended interface is the OpenAI-compatible server; the decorator API shown here remains supported but is less commonly used in new deployments.

 

 

```python

import sglang as sgl

 

@sgl.function

def structured_extraction(s, document):

    s += sgl.system('You are a data extraction assistant.')

    s += sgl.user(f'Extract from: {document}')

    s += sgl.assistant(

        sgl.gen(

            'extraction',

            max_tokens=512,

            # Regex constraint for structured output — near-zero overhead

            regex=r'{"entity": "[A-Za-z ]+", "type": "(person|org|location)"}'

        )

    )

 

# Batched execution — RadixAttention reuses cached KV across all calls

# sharing the same system prompt prefix

state = structured_extraction.run(

    document='Apple CEO Tim Cook announced...',

    backend=sgl.RuntimeEndpoint('http://localhost:30000')

)

 

# Server launch:

# python -m sglang.launch_server \

#   --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \

#   --port 30000 \

#   --attention-backend flashinfer

```

 

RouteLLM: Drop-In Routing

RouteLLM wraps the OpenAI client interface entirely — the calling code does not change:

 

 

```python

from routellm.controller import Controller

 

client = Controller(

    routers=['mf'],  # matrix factorisation router: best cost-quality tradeoff

    strong_model='gpt-4o',

    weak_model='meta-llama/Meta-Llama-3.1-8B-Instruct',

)

 

# Identical API to openai.chat.completions.create

response = client.chat.completions.create(

    model='router-mf-0.11875',   # threshold encoded in model name

    messages=[{'role': 'user', 'content': 'What is the capital of France?'}]

)

 

# The response.model field tells you which model was actually called

# Start conservative: 0.11875 routes ~40% to weak model on RouteLLM's benchmark

# distribution; your traffic may route differently

# Expand threshold empirically as you validate quality on your task distribution

print(f'Routed to: {response.model}')

```

 

 

The Honest Cost Decomposition

The headline claim floating around AI infrastructure discussions — “you can reduce inference costs 100-500x” — is technically defensible and practically misleading. It is a theoretical ceiling obtained by stacking every optimisation’s peak number from its best-case benchmark, ignoring that each optimisation shifts the bottleneck and that subsequent optimisations therefore operate on a changed system. The honest figure for most engineering teams, applying a well-sequenced stack over three months, is 4-20x.

 

Here is what that looks like in practice, decomposed by tier:

 

| Tier | Techniques | Expected Gain | When It Applies | Complexity |
| --- | --- | --- | --- | --- |
| Tier 1: Universal | AWQ INT4 + PagedAttention (vLLM) | 3–5x | Every self-hosted workload | Low — one quantization run + deploy to vLLM |
| Tier 2a: Prefix-heavy | RadixAttention (SGLang) | Additional 2–4x on TTFT | Agentic systems, repeated system prompts | Low — switch serving backend |
| Tier 2b: Latency-sensitive | EAGLE-3 speculative decoding | Additional 3–5x latency (structured output) | Interactive, low-temperature, structured generation; single-user or low-batch serving | Medium — draft model + vLLM flag |
| Tier 2c: Routing opportunity | RouteLLM query routing | Additional 15–40% on blended cost | Heterogeneous query distribution | Low — drop-in controller wrapper |
| Tier 3: Scale | P/D disaggregation (DistServe) | Additional 7x on requests-within-SLO | 1M+ requests/month, dedicated fabric | High — infrastructure redesign |

 

The sequencing rule is as important as the techniques themselves. Apply routing before any self-hosting investment — there is no point in building a highly optimised serving stack for requests that should have gone to a cheap hosted API. Apply PagedAttention before quantisation — recovering VRAM fragmentation changes the effective batch size, which changes AWQ’s marginal benefit calculation. Apply EAGLE-3 after quantisation — a smaller quantised model produces a more favourable draft-to-target compute ratio. Diagnose after each step. The bottleneck that was binding before one step is often not the bottleneck after it.

The self-hosting break-even requires an honest accounting. The steelmanned case for managed APIs: cloud providers amortise GPU hardware across thousands of customers at 80-95% utilisation. A self-hosted cluster at p50 loads typically runs 20-40% utilisation during off-peak hours, paying for idle capacity. The operational overhead — MLOps engineers, CUDA debugging, quantisation calibration and validation, model update pipelines, SLO monitoring, on-call rotations — carries a fully-loaded cost of $400-800K per engineer per year. For teams generating less than $500K in annual LLM API spend, self-hosting the full optimisation stack is often more expensive in total cost than paying managed API pricing. The 4-20x inference-level saving does not survive the engineering overhead tax at smaller scale.

The break-even for serious self-hosting investment typically sits around $500K-$1M in annual API spend. Below that threshold, the highest-ROI moves are routing (RouteLLM on managed endpoints), careful model tier selection (using smaller models for simpler queries), and exploiting prompt caching offered by managed providers. Above that threshold, the full self-hosted stack becomes economically compelling — a single engineer week deploying vLLM with AWQ can deliver $200K+ in annualised savings for a team at that spend level, making the engineering investment trivially justified.

One figure to anchor the calculation: self-hosted Llama 3.1 8B on a single A100, running with AWQ INT4, typically costs $0.10-0.30 per million output tokens in cloud GPU time. GPT-4o-mini is priced at $0.60 per million output tokens. That 2-6× compute-cost gap is real. The total-cost gap narrows once infrastructure and operations are included — often to 1.5-3× at typical scale. Still compelling at scale. Not compelling at $50K/year API spend.
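The per-million-token figure follows from the same bandwidth arithmetic plus an hourly GPU price. The sketch below is a rough model, not a quote: the $3.50/hour A100 rate, the 60% effective-bandwidth factor, and the batch size are all assumptions to vary for your own deployment.

```python
def cost_per_million_tokens(model_gb, hbm_tbs, gpu_usd_per_hour, batch=8, efficiency=0.6):
    """Rough $/1M output tokens for a bandwidth-bound decode workload.
    efficiency discounts the theoretical bandwidth floor; batch amortises each weight stream."""
    tokens_per_s = (hbm_tbs * 1e12 / (model_gb * 1e9)) * efficiency * batch
    seconds_per_million = 1e6 / tokens_per_s
    return gpu_usd_per_hour * seconds_per_million / 3600

# Llama 3.1 8B quantised to ~5 GB, on an A100 (2.0 TB/s) at an assumed $3.50/hour:
print(f"${cost_per_million_tokens(model_gb=5, hbm_tbs=2.0, gpu_usd_per_hour=3.50):.2f} per 1M output tokens")
# -> roughly $0.50 at batch 8; deeper batching or cheaper GPU-hour pricing pushes this
#    into the $0.10-0.30 range quoted above, while low utilisation pushes it the other way.
```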

 

The Acceptance-Rate Universality

There is a pattern running through EAGLE-3, RouteLLM, and TALE that is easy to miss because the three papers address seemingly different problems. Strip away the domain-specific language and the same mathematical structure emerges in all three.

Each system proposes a cheap candidate (draft token, weak model answer, compressed reasoning path). Each system evaluates the candidate against a quality signal (target model probability, strong model outcome, estimated query complexity). Each system accepts the cheap candidate when the threshold is met and falls back to — or expands — the expensive alternative when it is not. The economic gain in each case is proportional to how often the cheap path suffices.

EAGLE-3 and RouteLLM share this structure directly: cheap candidate, verify, accept or fall back. TALE is closer in spirit — it estimates how much expensive computation a query warrants — but compresses a single path rather than verifying against a fallback.
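Stripped to its skeleton, the shared structure fits in a few lines. This is a conceptual sketch, not an algorithm from any of the three papers; the function names are placeholders.

```python
def tiered_inference(query, cheap_propose, quality_check, expensive_fallback):
    """The pattern shared by speculative decoding, routing, and budget capping:
    propose cheaply, verify against a quality signal, pay for the expensive path only on failure."""
    candidate = cheap_propose(query)        # draft tokens / weak-model answer / short reasoning budget
    if quality_check(query, candidate):     # verification pass / router score / difficulty estimate
        return candidate                    # the cheap path, where the economic gain lives
    return expensive_fallback(query)        # target model / frontier model / full reasoning budget
```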

EAGLE-3’s acceptance rate is driven by output entropy: low-entropy, structured generation (code, JSON, templates) produces high acceptance rates because the draft model predicts the constrained output well. RouteLLM’s routing rate is driven by query difficulty: simple, well-structured queries route to the weak model at high rates. TALE’s budget reduction is driven by query complexity: simpler queries receive smaller token budgets, compressing the reasoning trace rather than replacing it.

The compounding insight is that the same workload signal — low entropy, structured output, simple routing — makes all three compound simultaneously. A system generating structured JSON responses to straightforward API queries will achieve high EAGLE-3 acceptance rates, high RouteLLM routing rates, and meaningful TALE reasoning-token reduction at once. The three mechanisms are not independent choices for independent workloads. They compound when the workload properties that drive the cheap path are the same across layers.

If you are building an agentic system that generates structured outputs from function calls — a document processing pipeline, a classification service, a structured data extraction API — you are sitting at the intersection where all three mechanisms deliver peak returns. The queries are structured, the outputs are constrained, the reasoning need is low, and the patterns repeat. EAGLE-3’s draft acceptance rate is high. RouteLLM routes aggressively to the cheap model. TALE compresses reasoning budgets for the predictable requests. The compounding from Tier 1 (AWQ + PagedAttention) through these three mechanisms is where the top of the 4-20× range of realistic production gains actually lives.

If you are building a general-purpose creative chatbot, none of the three will deliver their benchmark numbers — and that is exactly what the math predicts. High-temperature creative generation means low draft acceptance rates, unpredictable query difficulty for routing, and genuine reasoning needs for complex multi-turn interactions. The techniques still provide some benefit, but the headline figures in the papers apply to a different workload profile than yours. Understanding which end of this spectrum your workload sits on is the single most useful calibration exercise before allocating engineering time to any of these techniques.

 

Research Gaps Worth Watching

The research-to-production gap for core techniques has closed. AWQ ships in vLLM, TGI, and llama.cpp. PagedAttention is the default. EAGLE-3 integrates via a single flag. RouteLLM is an OpenAI-compatible drop-in. The remaining gaps are real and worth watching.

 

TALE has no production implementation. The paper (arXiv:2412.18547) demonstrates ~60-70% reduction in reasoning-token output with under 5% accuracy loss on math benchmarks — but the per-query reasoning-budget estimation is not packaged as a deployable service. RouteLLM partially fills this gap by routing on model capability, but TALE’s per-query reasoning-budget estimation (deciding how many reasoning tokens to allocate within a single model call) has no standalone implementation. The first team to package this as an OpenAI-compatible middleware service has a genuine first-mover opportunity.

 

Adaptive speculative token budget. EAGLE-3 uses a fixed number of speculative tokens (typically 5). The optimal number is workload-dependent: structured outputs benefit from more speculative tokens (higher acceptance rates, better amortisation), while high-temperature creative generation benefits from fewer (lower wasted computation on rejected tokens). The problem is a newsvendor problem — you are deciding in advance how much “inventory” (speculative tokens) to generate against an uncertain “demand” (acceptance rate), with an asymmetric cost structure (under-provisioning loses latency benefit, over-provisioning wastes compute). A learned adaptive budget that adjusts speculative depth based on real-time acceptance rate signals would extract significant additional efficiency from speculative decoding deployments.
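As a sketch of what such a controller could look like (a hypothetical illustration, not an existing feature of vLLM, EAGLE, or any serving framework): track a running acceptance estimate per workload and pick the chain depth that maximises expected tokens per unit of draft-plus-verification cost.

```python
def choose_draft_depth(accept_rate, draft_cost_ratio=0.05, max_depth=10):
    """Pick the chain depth k that maximises expected tokens per unit of cost,
    where one target verification pass costs 1 and each drafted token costs draft_cost_ratio."""
    def value(k):
        expected_tokens = (1 - accept_rate ** (k + 1)) / (1 - accept_rate)
        return expected_tokens / (1 + k * draft_cost_ratio)
    return max(range(1, max_depth + 1), key=value)

class AdaptiveSpeculativeBudget:
    """Exponentially weighted acceptance tracker driving the per-request depth choice."""
    def __init__(self, initial_rate=0.7, alpha=0.5):
        self.accept_rate, self.alpha = initial_rate, alpha

    def update(self, accepted, proposed):
        self.accept_rate += self.alpha * (accepted / proposed - self.accept_rate)

    def depth(self):
        return choose_draft_depth(self.accept_rate)

budget = AdaptiveSpeculativeBudget()
budget.update(accepted=5, proposed=5)   # structured output: every draft accepted
print(budget.depth())                   # -> deeper speculation pays off (hits the cap here)
budget.update(accepted=1, proposed=5)   # creative generation: drafts mostly rejected
print(budget.depth())                   # -> depth drops, wasting less compute on rejected drafts
```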

 

Domain-specific AWQ calibration tooling. The gap between AWQ’s “sub-1% general benchmark degradation” and its potential domain-specific degradation is well understood in the research community. What does not exist is standardised tooling for domain-specific calibration dataset construction, salient-weight detection on domain-shifted corpora, and automated quality gates that validate degradation before production deployment. Teams in regulated industries — healthcare, legal, financial services — currently navigate this manually: collect representative samples, run quantised and unquantised models side by side, measure task-specific metrics, iterate on calibration datasets. It is slow and error-prone. The problem is not that AWQ fails on domain-specific tasks by default; it is that the failure mode is silent, systematic bias rather than obvious catastrophic degradation. A model that has quietly lost 3% accuracy on biomedical entity classification will pass generic benchmark evaluation and still fail in production. Tooling that automates domain-shift detection in quantised model quality would significantly lower the barrier to INT4 deployment in high-value precision contexts.

 

Conclusion

The organising question for any LLM inference infrastructure decision is: which layer of the memory hierarchy is the binding constraint right now?

If the answer is weight storage — the model does not fit or requires too many GPUs — the decision is AWQ INT4 before any other conversation. If the answer is VRAM fragmentation limiting batch size — the decision is PagedAttention, and you should already be running vLLM. If the answer is repeated prefill computation on shared context — the decision is RadixAttention, and you should evaluate SGLang. If the answer is per-token latency for interactive users — the decision is EAGLE-3 for structured workloads. If the answer is the blended cost of expensive API calls on heterogeneous traffic — the decision is RouteLLM, and it should be evaluated before any serving infrastructure investment.

The compounding gains are real. A team that moves from a naive managed-API deployment to self-hosted open models with Tier 1 (AWQ + PagedAttention) plus one matched Tier 2 technique against their primary bottleneck will see 4-10× cost reduction within three months — at scales above the break-even threshold discussed earlier. The 100-500x ceiling exists for idealised agentic pipelines simultaneously constrained on every dimension — it is not the expectation for most teams, and representing it as such sets up expensive disappointments.

The more important insight is the conceptual one. Every memory bandwidth optimisation you apply compounds with every other one, because they all reduce the same quantity: bytes per output token. The team that adds H100s expecting 3× decode speedup will get ~1.7× from the bandwidth ratio alone. The team that runs AWQ first will get 3-4× on the cheaper hardware. The difference between those outcomes is not engineering skill or hardware budget. It is knowing which problem you are solving.

 

Technique Reference

| Technique | Paper | Repo | Primary Benefit | Best For | Production Maturity |
| --- | --- | --- | --- | --- | --- |
| AWQ INT4 | arXiv:2306.00978 | casper-hansen/AutoAWQ | 4x memory, 3-4× decode | Universal self-hosted | Very High — vLLM integrated |
| PagedAttention | arXiv:2309.06180 | vllm-project/vllm | 2-4x throughput | Universal — eliminate fragmentation | Very High — de facto standard |
| RadixAttention | arXiv:2312.07104 | sgl-project/sglang | Up to 6.4× throughput on prefix-heavy workloads | Agentic, repeated prefixes | High — production-ready |
| EAGLE-3 | arXiv:2503.01840 | SafeAILab/EAGLE | 3-6.5x latency | Interactive, structured output, code | High — vLLM flag available |
| DistServe P/D | arXiv:2401.09670 | vllm-project/vllm | 7.4x requests/SLO | 1M+ req/month, scale | Medium — support varies across serving frameworks |
| RouteLLM | arXiv:2406.18665 | lm-sys/RouteLLM | 15-40% API cost | Mixed-difficulty query distributions | High — OpenAI-compatible drop-in |
| FlashAttention | arXiv:2205.14135 | (built into vLLM/SGLang) | 2-4x attention kernel | Long context, prefill-heavy | Very High — auto-enabled |
| SmoothQuant | arXiv:2211.10438 | (integrated into serving frameworks) | W8A8 prefill speedup | Prefill-dominated, precision-sensitive | High — production standard |
| kvpress | — | huggingface/kvpress | 2-4x KV compression | Long context 32k+ | Medium — official HF library |
| TALE | arXiv:2412.18547 | GeniusHTX/TALE | ~60-70% reasoning-token reduction | Reasoning model deployments | Low — research only |

 

Figures cited: AWQ memory and speedup (Lin et al., MLSys 2024, arXiv:2306.00978); PagedAttention throughput (arXiv:2309.06180); EAGLE-3 latency (arXiv:2503.01840); DistServe SLO (Zhong et al., OSDI 2024, arXiv:2401.09670); RouteLLM cost reduction (arXiv:2406.18665); RadixAttention throughput (arXiv:2312.07104); TALE reasoning-token reduction (arXiv:2412.18547).
