Running a language model in a Docker container is straightforward. Running it at scale – handling 500 concurrent requests, autoscaling with demand, recovering gracefully from node failures, and not spending $40k a month on idle GPUs – is a different problem entirely.
The standard Kubernetes playbook doesn’t transfer cleanly to LLM workloads. The resource model is different, the startup characteristics are different, and the autoscaling signals you’ve been using for stateless web services will actively mislead you here. Most teams discover this not by reading documentation, but by watching their cluster do the wrong thing in production.
This guide covers the specific decisions (serving framework selection, autoscaling strategy, GPU resource configuration, cost optimization, and observability) that determine whether your LLM deployment runs well or runs expensive.
Why the Standard Kubernetes Playbook Breaks
Three things about LLMs behave differently from every other workload you’ve run on Kubernetes.
GPU is the binding resource, and it lies. CPU and memory HPA, the default autoscaling mechanism, is meaningless for LLM inference. GPU utilization sits at 100% whether your instance is efficiently serving 50 concurrent requests or saturated and dropping them. The metric you’ve always used to know whether to scale tells you nothing useful here.
Static batching wastes most of your GPU. Traditional inference servers batch requests at arrival, process the whole batch, and return results. The problem is that language model sequences have variable lengths; a batch of 32 requests might finish at 10 different times, leaving the GPU executing padding tokens until the last sequence completes. Studies show static batching wastes 60-70% of GPU capacity on workloads with variable-length sequences. This is the norm, not the exception.
Pod startup takes 20 minutes, not 20 seconds. A containerized web service starts in seconds. A 7B parameter model can take 5-10 minutes to load from network storage into GPU memory; a 70B parameter model can take 20 minutes or more. The Kubernetes autoscaling assumption, that you can absorb a traffic spike by spinning up new pods while the old ones handle the load, collapses when “spinning up a new pod” means a 20-minute wait.
Understanding these three constraints shapes every decision that follows.
The Serving Framework Decision
The first architectural decision is which inference serving framework to run. There are three main options, and the right answer depends on how many models you’re serving and what else is in your fleet.
vLLM is the right default for the vast majority of teams. It introduced PagedAttention — a memory management technique inspired by operating system virtual memory paging — which reduced KV-cache memory waste from 60-80% under naive allocation to under 4%. The practical effect is dramatically larger batch sizes per GPU and commensurately higher throughput: up to 24x improvement over HuggingFace Transformers on equivalent hardware. Stripe, serving 50 million daily API calls, achieved a 73% inference cost reduction and cut its GPU fleet to one-third of its previous size after switching to vLLM. vLLM also exposes an OpenAI-compatible REST API out of the box, which means most applications built against the OpenAI API format can be pointed at a self-hosted vLLM endpoint with minimal changes (subject to feature compatibility, e.g. structured output or context window size).
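To make that concrete, here is a minimal sketch of a vLLM Deployment exposing the OpenAI-compatible API. The model name, image tag, and resource sizes are illustrative assumptions, not a production recipe:

```yaml
# Minimal vLLM Deployment sketch. Model, image tag, and sizes are
# assumptions — adjust for your model and GPU.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - --model
            - mistralai/Mistral-7B-Instruct-v0.3
            - --gpu-memory-utilization
            - "0.85"
          ports:
            - containerPort: 8000   # OpenAI-compatible REST API
          resources:
            limits:
              nvidia.com/gpu: 1     # exclusive access to one GPU
```

A Service in front of port 8000 gives existing OpenAI-client code a drop-in base URL.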
KServe is a serving platform that sits above inference engines like vLLM and Triton rather than replacing them. It adds a standardized control plane: model versioning, traffic splitting for canary deployments, ModelMesh for serving hundreds of models on shared GPU infrastructure, and Kubernetes-native lifecycle management. The CNCF accepted KServe as an Incubating project in November 2025, which signals long-term governance stability. The tradeoff is operational complexity: KServe requires Knative and Istio, adding three to four extra control plane components to manage. If you’re serving one to five large language models for a focused use case, that overhead is not justified. If you have ten or more models deployed across multiple teams with governance, canary, and traffic routing requirements, it is.
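As a sketch of what that standardized control plane looks like in practice, here is a minimal InferenceService resource. The model format name and storage URI are assumptions (KServe's Hugging Face runtime backs onto vLLM for generative models; exact runtime names depend on your KServe version):

```yaml
# Minimal KServe InferenceService sketch. modelFormat and storageUri
# are assumptions — verify against your installed serving runtimes.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-chat
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface          # runtime that uses vLLM under the hood
      storageUri: pvc://model-store/llama-3-8b
      resources:
        limits:
          nvidia.com/gpu: 1
```

The point of the abstraction is that canary traffic splitting and versioned rollouts then become declarative fields on this resource rather than hand-rolled Deployment surgery.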
NVIDIA Triton (rebranded Dynamo-Triton in March 2025) excels when your inference fleet is heterogeneous; when you’re running LLMs alongside vision models, speech models, or embedding models and want a single serving platform across all of them. For LLM inference specifically, Triton uses vLLM as a backend engine, so it doesn’t deliver fundamentally different performance; it delivers a unified management layer at the cost of additional configuration complexity. TensorRT-LLM backend achieves near-bare-metal GPU performance on NVIDIA hardware, validated in MLPerf Inference v4.1. If you need commercial SLA coverage, NVIDIA AI Enterprise wraps Triton with support contracts.
The practical decision matrix:
| Scenario | Recommended |
|---|---|
| 1–5 LLMs, focused use case | vLLM + KEDA |
| 10+ models, multi-team governance | KServe + vLLM runtime |
| Heterogeneous fleet (LLMs + vision + embeddings) | Triton + vLLM backend |
| Commercial support required | NVIDIA AI Enterprise (Dynamo-Triton) |
Autoscaling That Actually Works: KEDA
The correct tool for autoscaling LLM workloads on Kubernetes is KEDA (Kubernetes Event-Driven Autoscaling), and the correct metric is inference queue depth, specifically `vllm:num_requests_waiting` per replica.
Standard HPA scales on CPU utilization or memory pressure. As established earlier, GPU inference workloads saturate GPU compute while CPU stays comparatively low. By the time CPU is high enough to trigger HPA, your inference queue is already deep and your users are already experiencing degraded latency. You’re scaling reactively to the wrong signal.
Queue depth scales proactively to the right signal. When `vllm:num_requests_waiting` per replica exceeds your threshold, requests are accumulating faster than they’re being processed – that’s when you need more capacity, before the latency impact reaches users.
A minimal KEDA ScaledObject for vLLM:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaler
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1    # Never 0 in production
  maxReplicaCount: 10
  cooldownPeriod: 300   # 5 minutes — critical for LLMs
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_requests_waiting
        query: |
          sum(vllm:num_requests_waiting) /
          count(vllm:num_requests_waiting)
        threshold: "5"  # Scale when avg queue depth > 5 per replica
```
Two configuration decisions here that are easy to get wrong.
`minReplicaCount: 1`, not zero. Scale-to-zero is attractive on paper: no idle traffic, no idle GPU cost. In practice, the first request after a scale-to-zero event waits for pod startup plus model weight loading: 5-20 minutes depending on model size. For any user-facing application, that wait is fatal. Scale-to-zero belongs in development environments and batch offline processing, not production serving.
`cooldownPeriod: 300`. This tells KEDA to wait 5 minutes after traffic drops before scaling down. Without it, KEDA will scale down as soon as the queue empties – tearing down pods that just finished loading model weights, wasting the startup cost, and leaving you understaffed for the next traffic surge. 300 seconds is a reasonable minimum for most model sizes; increase it for 70B+ models where startup takes longer.
The vLLM Production Stack Helm chart (from the `vllm-project/production-stack` repository, backed by IBM Research) includes a preconfigured KEDA ScaledObject as of v0.1.9. If you’re starting fresh, this is a reasonable baseline to build from.
GPU Memory Is the Real Constraint
GPU memory, not compute, is what limits how many requests your deployment can handle simultaneously. Getting resource allocation right requires understanding where that memory goes.
The quick sizing rule: multiply model parameter count in billions by two to get the minimum GPU memory in gigabytes for model weights at FP16 precision. A 7B model needs at minimum 14GB. A 70B model needs 140GB, which means either two A100 80GB GPUs running tensor-parallel, or two H100 80GB GPUs. Beyond the weights, KV cache needs an additional 20-30% of GPU memory at your target concurrency. Tune `--gpu-memory-utilization` in vLLM (default 0.90, safe range 0.80-0.90) to control how aggressively memory is allocated.
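Applied to the 70B case, the sizing rule translates into vLLM launch arguments roughly like the following sketch (model name and values are illustrative):

```yaml
# Container args for a 70B FP16 model (~140GB of weights) split across
# two 80GB GPUs with tensor parallelism. Values are illustrative.
args:
  - --model
  - meta-llama/Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - "2"                     # 2 x 80GB = 160GB: 140GB weights + KV cache headroom
  - --gpu-memory-utilization
  - "0.85"                  # within the 0.80-0.90 safe range above
resources:
  limits:
    nvidia.com/gpu: 2
```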
For GPU allocation strategy, there are three options:
- Full GPU allocation (`nvidia.com/gpu: 1`) gives a pod exclusive access to one physical GPU. This is the right choice for 7B+ production LLMs: the full memory pool is available for PagedAttention’s KV cache management, and there are no noisy-neighbor effects.
- MIG (Multi-Instance GPU) partitions a single high-end data center GPU (those that support MIG, such as the A100) into up to seven hardware-isolated instances, each with dedicated compute, memory bandwidth, and L2 cache. For example, an A100 40GB can be split into seven 5GB instances, three 10GB instances, or two 20GB instances. The isolation is real (hardware-level, not software-enforced) which matters for multi-tenant environments with compliance requirements. The limitation is that MIG profiles must be configured before workloads run; changing them requires draining the node.
- Time-slicing exposes multiple virtual GPUs from a single physical GPU via software context switching. It’s cheap to configure, delivers 4-8x pod density improvement, and has zero isolation guarantees. A memory leak or OOM in one pod can kill all pods sharing the same physical GPU. Time-slicing belongs in development clusters and non-critical batch workloads.
Note for Azure AKS users specifically: MIG partitioning is not supported on Azure Linux nodes and is incompatible with the AKS cluster autoscaler. Plan accordingly.
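The three allocation strategies above surface in pod specs as different extended resource names. A sketch, with the caveat that MIG profile names depend on the GPU model and how the NVIDIA GPU Operator is configured:

```yaml
# How the three strategies appear in a pod's resource requests.
# MIG profile names (e.g. 3g.20gb on an A100 40GB) depend on the
# GPU model and operator configuration.
resources:
  limits:
    nvidia.com/gpu: 1            # full GPU: exclusive access
# --- or, with MIG partitioning enabled: ---
#   nvidia.com/mig-3g.20gb: 1    # one hardware-isolated 20GB instance
# --- or, with time-slicing enabled: ---
#   nvidia.com/gpu: 1            # one of N virtual slices, no isolation
```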
The Cost Optimization Playbook
LLM inference costs have been declining at roughly 10x per year since 2022; GPT-4 equivalent capability that cost $20 per million tokens in 2022 costs around $0.40 today. The trajectory continues. But that trajectory doesn’t help with this month’s GPU bill, so here are the levers available now, ranked by impact.
- Spot instances: 60-90% savings. GPU spot instances are the highest-impact cost lever available, but they are not a clear-cut choice. A spot instance can be interrupted when load is high in a given datacenter, and this must be actively mitigated. The viability depends on instance type. G5 and G6 instances (A10G and L4 GPUs) on AWS have good spot availability and are appropriate for most 7B-30B model deployments. P5 instances (H100) have severely constrained spot availability. For those, Capacity Block Reservations are the more predictable option. Whatever instance type, use PodDisruptionBudgets to ensure Kubernetes doesn’t interrupt active inference pods at moments that degrade user experience, and implement fallback logic to redirect traffic to on-demand instances during spot interruptions.
- Continuous batching: 85% per-token cost reduction. This is not a configuration knob, it’s the architecture vLLM uses by default. The comparison point is static batching, which most older inference setups use. At a batch size of 32 concurrent requests, continuous batching reduces per-token cost by 85% compared to static batching, at a 20% latency increase. If you’re running anything other than vLLM or another continuous-batching engine, switching is the highest-ROI infrastructure change available.
- Quantization: right-size for your quality requirements. INT8 quantization halves GPU memory requirements and doubles throughput with under 1% accuracy degradation on most benchmarks, appropriate for most enterprise workloads. INT4 quantization (AWQ or GPTQ format) reduces memory by 75% and delivers 2.7x throughput improvement, enabling a 70B model to fit on a single A100 80GB that would otherwise require two. The tradeoff is measurable (though small) quality degradation, which matters more in regulated industries like Fintech or MedTech. Test against your specific evaluation criteria before deploying quantized models in production.
- Semantic caching: 61-68% API call reduction. For workloads with repetitive query patterns (documentation assistants, internal knowledge bases, support bots, etc.) semantic caching at the application layer reduces the number of requests that reach the inference engine. GPTCache and similar libraries intercept semantically similar queries and return cached responses without model inference. The cost reduction is real; the applicability depends on your query distribution.
- Managed inference API vs self-hosted: know your breakeven. For Azure specifically, the self-hosted breakeven point against Azure OpenAI API pricing is approximately 22 million words per day or $200-250k per year in total infrastructure cost. Below that volume, the managed API is almost certainly cheaper when you factor in engineering time, GPU procurement, and operational overhead.
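The PodDisruptionBudget mentioned under spot instances is a small amount of YAML. A sketch, assuming the deployment is labeled `app: vllm` — note that a PDB only governs voluntary disruptions (node drains, autoscaler consolidation); actual spot reclamation still needs the on-demand fallback capacity described above:

```yaml
# Keep at least one vLLM replica serving through voluntary disruptions.
# Selector assumes pods are labeled app: vllm.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
```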
AWS EKS vs Azure AKS: The LLM-Specific Differences
Both platforms run vLLM on GPUs. The differences that matter for LLM workloads are narrower than the platform marketing suggests.
- Node provisioning speed. Karpenter on EKS provisions new GPU nodes in approximately 60 seconds. The AKS Cluster Autoscaler takes 5-15 minutes. For LLM workloads where you’re already dealing with 5-20 minute model loading times, this difference is less dramatic than it sounds, but it does affect how aggressively you can allow the cluster to scale down during low-traffic periods.
- MIG support. MIG works normally on EKS. On AKS, MIG partitioning is not supported on Azure Linux nodes and is incompatible with the AKS cluster autoscaler. If hardware-level GPU isolation is a compliance requirement, EKS has the advantage.
- Spot VM naming. Azure retired Low-Priority VMs in March 2026. If you’re on AKS and still using Low-Priority VM node pools, you need to migrate to Spot VMs now; they’ve been the correct option for some time.
- Reference architectures. AWS provides an official quickstart for vLLM on EKS in the EKS user guide. For AKS, cloudthrill.ca published a well-regarded end-to-end Terraform walkthrough for deploying the vLLM Production Stack on AKS with KEDA autoscaling. The infrastructure as code covers GPU node pools, the vLLM Helm chart, and KEDA configuration in one place.
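To illustrate the provisioning point, a hedged sketch of a Karpenter GPU NodePool on EKS. Instance families, the taint, and the EC2NodeClass name are assumptions; field names follow the Karpenter v1 API and should be checked against your installed version:

```yaml
# Karpenter NodePool sketch for GPU inference nodes on EKS.
# Instance families and nodeClassRef name are assumptions.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-inference
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]          # A10G / L4: good spot availability
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # prefer spot, fall back to on-demand
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule            # keep non-GPU pods off these nodes
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 300s              # mirror the KEDA cooldown period
```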
Observability: The Three Metrics That Matter
Most teams instrument LLM deployments with whatever their existing Kubernetes monitoring covers – CPU, memory, pod restarts. None of these tell you whether your inference system is healthy.
The three metrics that actually matter, and how to collect them:
- GPU utilization and memory via DCGM Exporter. Deploy the NVIDIA DCGM Exporter as a DaemonSet. It exposes per-GPU Prometheus metrics: `dcgm_fi_dev_gpu_util` (compute utilization, target 60-80% sustained), `dcgm_fi_dev_fb_free` and `dcgm_fi_dev_fb_used` (free and used GPU memory), and `dcgm_fi_dev_mem_copy_util` (memory bandwidth). Alert on KV cache utilization above 85% — at that point, vLLM starts rejecting new requests.
- Inference performance via vLLM `/metrics`. vLLM exposes a Prometheus-compatible metrics endpoint out of the box. The critical metrics: `vllm:time_to_first_token_seconds` (TTFT — how long before the first token appears; target p95 under 500ms for chat, under 5 seconds for RAG), `vllm:time_per_output_token_seconds` (TPOT — generation speed; target p95 under 50ms per token), `vllm:num_requests_waiting` (queue depth — your KEDA scaling trigger), and `vllm:gpu_cache_usage_perc` (KV cache fill percentage).
- Cost-per-token as a first-class metric. This is the metric CTOs should be watching and most teams don’t track. The formula is straightforward:
```
cost_per_token = (gpu_instance_hourly_cost / 3600) / tokens_per_second
```
Implement this as a Prometheus recording rule. It makes the impact of every infrastructure optimization — quantization, batching configuration, spot instance migration — directly visible in cost terms rather than abstract throughput numbers.
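A sketch of that recording rule, assuming an A100 on-demand price is hardcoded into the expression (you could instead expose the hourly cost as its own metric) and using vLLM's `vllm:generation_tokens_total` counter for throughput:

```yaml
# Prometheus recording rule sketch for cost per generated token.
# The 4.10 $/hour figure is an illustrative hardcoded instance price.
groups:
  - name: llm-cost
    rules:
      - record: llm:cost_per_token
        expr: |
          (4.10 / 3600)
          /
          sum(rate(vllm:generation_tokens_total[5m]))
```

Graphed next to deployment annotations, this makes a quantization rollout or spot migration show up as a visible step change in dollars, not just tokens per second.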
SLO reference targets for production LLM serving:
| Metric | Target |
|---|---|
| TTFT p95 (chat) | < 500ms |
| TTFT p95 (RAG) | < 5s |
| TPOT p95 | < 50ms/token |
| GPU utilization (sustained) | 60-80% |
| Queue depth | < 10 requests |
| KV cache utilization | < 85% |
| Error rate | < 0.1% |
Start Here
If you’re building from scratch or diagnosing an underperforming deployment:
- Start with vLLM + KEDA. Don’t evaluate KServe or Triton until you’ve validated your use case needs what they provide. Most teams don’t.
- Set `minReplicaCount: 1` and `cooldownPeriod: 300`. These two KEDA settings prevent the most common production failures before they happen.
- Size GPU memory correctly before launch. Use the `params_B × 2` rule for weights, add 25% for KV cache, set `--gpu-memory-utilization` to 0.85. Wrong resource limits are the most common cause of vLLM OOM crashes.
- Deploy DCGM Exporter and wire vLLM metrics to Prometheus on day one. You will not be able to diagnose production issues without visibility into GPU memory, queue depth, and TTFT.
- Test quantization before production. INT8 is almost always safe. INT4 needs validation against your quality criteria, but the cost case is compelling.
- Evaluate spot instances for your instance type. G5/G6 availability is good. P5 is constrained. Check current spot pricing and interruption rates in your target region before committing.
LLM infrastructure is still maturing quickly. The vLLM Production Stack, llm-d’s disaggregated prefill/decode architecture, and the Kubernetes Gateway API Inference Extension are all moving fast — what represents best practice today will look different in twelve months. The fundamentals here — correct autoscaling signals, GPU memory management, and cost visibility — will remain stable even as the tooling evolves.
Zartis builds and scales AI infrastructure for engineering teams in Fintech, MedTech, and enterprise software. If you’re working through any of the challenges in this post, get in touch.
References
- Efficient Memory Management for LLM Serving with PagedAttention (https://arxiv.org/abs/2309.06180). arXiv:2309.06180, 2023.
- DistServe: Disaggregating Prefill and Decoding (https://arxiv.org/abs/2401.09670). arXiv:2401.09670, 2024.
- vLLM Production Stack (https://github.com/vllm-project/production-stack). IBM Research / vLLM Project, 2025.
- llm-d: Kubernetes-Native Distributed LLM Inference (https://github.com/llm-d/llm-d). Red Hat / Google / IBM / NVIDIA, 2025.
- KServe v0.15 — CNCF Incubating (https://www.cncf.io/blog/2025/11/11/kserve-becomes-a-cncf-incubating-project/). CNCF, 2025.
- Scaling LLMs with NVIDIA Triton and TensorRT-LLM on Kubernetes (https://developer.nvidia.com/blog/scaling-llms-with-nvidia-triton-and-nvidia-tensorrt-llm-using-kubernetes/). NVIDIA Developer Blog, 2024.
- Autoscaling GPU Workloads with KEDA and HPA (https://dasroot.net/posts/2026/02/autoscaling-gpu-workloads-keda-hpa/). dasroot.net, February 2026.
- vLLM on Amazon EKS Quickstart (https://docs.aws.amazon.com/eks/latest/userguide/ml-realtime-inference-llm-inference-vllm.html). AWS Documentation, 2025.
- vLLM Production Stack on Azure AKS with Terraform (https://cloudthrill.ca/vllm-production-stack-on-aks-terraform). cloudthrill.ca, 2025.
- Scaling LLM Inference: Multi-Node on Amazon EKS (https://aws.amazon.com/blogs/hpc/scaling-your-llm-inference-workloads-multi-node-deployment-with-tensorrt-llm-and-triton-on-amazon-eks/). AWS HPC Blog, 2024.
- KEDA Autoscaling for vLLM (https://docs.vllm.ai/projects/production-stack/en/latest/use_cases/autoscaling-keda.html). vLLM Documentation, 2025.
- Best Practices: Autoscaling LLM Inference on GKE (https://docs.cloud.google.com/kubernetes-engine/docs/best-practices/machine-learning/inference/autoscaling). Google Cloud, 2025.
- Cost-Optimized GPU Deployments for LLMs on AWS EKS (https://medium.com/@thevowelsofx/cost-optimized-gpu-deployments-for-llms-on-aws-eks-e1024e38fc2b). Medium, 2024.