The Complete Guide to Scaling LLM Inference on Kubernetes in 2026
> Running LLMs in production? Learn the definitive 2026 Kubernetes stack for AI inference — vLLM, KServe, llm-d, Kueue, and GPU scheduling with real YAML configs. Cut costs, boost throughput, and stop guessing.
H1: Why Kubernetes Won the AI Inference War in 2026
Let me paint you a picture. It's 3 AM. Your single-node vLLM instance just OOM-killed on a 70B parameter model because a traffic spike hit while you were sleeping. Your GPU utilisation graph looks like a heart monitor — 95% one minute, 12% the next. You're losing money on idle silicon and losing users on latency spikes.
I lived this at AutoBlogging.Pro when we moved from rented GPU VMs to a self-orchestrated inference layer. The fix wasn't bigger GPUs. It was better orchestration.
By mid-2026, Kubernetes has become the undisputed runtime for production AI inference. Not because it's trendy, but because the ecosystem matured. Between the Gateway API Inference Extension (GA since February 2026), Kubernetes Dynamic Resource Allocation (DRA) going stable alongside NVIDIA's DRA driver, and the CNCF accepting llm-d into the sandbox, the stack is now enterprise-ready.
This guide is the architecture I wish I had six months ago. No marketing fluff. Just YAML, numbers, and production scars.
H2: The 2026 AI Inference Stack — What Actually Works
Before we write a single manifest, let's map the battlefield. The modern inference stack on Kubernetes has five layers:
| Layer | Tool | Purpose | Maturity |
|---|---|---|---|
| Inference Engine | vLLM | PagedAttention, continuous batching, OpenAI-compatible API | Production |
| Serving Framework | KServe | Model lifecycle, autoscaling, canary deploys, multi-model endpoints | Production |
| Distributed Router | llm-d | Prefix-cache-aware routing, KV-cache optimisation, multi-node load balancing | CNCF Sandbox |
| Scheduler | Kueue | GPU/batch workload scheduling, fair sharing, queueing | Production |
| GPU Management | NVIDIA GPU Operator + DRA | Driver management, MIG partitioning, topology-aware allocation | Production |
The mental model: vLLM runs the model. KServe orchestrates it. llm-d routes traffic intelligently. Kueue decides who gets the GPU. The GPU Operator keeps the hardware honest.
H2: vLLM on Kubernetes — The Foundation
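vLLM isn't just a Python server. In 2026, it's the default inference engine for production LLM serving because of PagedAttention, a memory management technique that stores the KV cache in fixed-size blocks, much like virtual memory pages. Because blocks are allocated on demand instead of reserving contiguous space for the maximum sequence length, KV-cache waste drops from roughly 60-80% in earlier serving systems (per the vLLM authors) to a few percent.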
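To make the paging idea concrete, here is a toy block allocator in Python. It is a sketch of the bookkeeping only, not vLLM's implementation: the class name `PagedKVCache`, its methods, and the block size of 16 tokens are assumptions chosen for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV block; an assumed value for this sketch


class PagedKVCache:
    """Toy allocator: each sequence owns a list of fixed-size physical blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        """Grow a sequence, taking new blocks only when the last one is full."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_needed = -(-new_len // BLOCK_SIZE)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; a real scheduler would preempt")
        self.block_tables[seq_id] = table + [self.free_blocks.pop() for _ in range(extra)]
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated token slots that hold no token (internal fragmentation)."""
        allocated = sum(len(t) for t in self.block_tables.values()) * BLOCK_SIZE
        return allocated - sum(self.seq_lens.values())
```

The point of the design is that waste is bounded by one partially filled block per sequence, regardless of how long the sequence could eventually grow.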
H3: Deployment Manifest
Here's the minimal vLLM deployment I run on GKE A100 nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.3
          args:
            - --model
            - meta-llama/Llama-3.1-70B-Instruct
            - --tensor-parallel-size
            - "4"                 # must match the GPU count requested below
            - --gpu-memory-utilization
            - "0.92"
            - --max-num-seqs
            - "256"
            - --enable-prefix-caching
          resources:
            limits:
              nvidia.com/gpu: "4"
          ports:
            - containerPort: 8000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120   # raise this if weight loading takes longer
            periodSeconds: 30
          volumeMounts:
            - name: shm               # tensor-parallel workers communicate through shared memory
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory            # the container default /dev/shm is too small for this
```
H3: Key Configuration Flags Explained
| Flag | What It Does | Why It Matters |
|---|---|---|
| --tensor-parallel-size 4 | Shards each layer's weights across 4 GPUs | Required for 70B+ models on A100s |
| --gpu-memory-utilization 0.92 | Lets vLLM use 92% of each GPU's VRAM for weights, activations, and KV cache | Leaves headroom for CUDA overhead without wasting silicon |
| --max-num-seqs 256 | Max concurrent sequences per batch | Higher = better throughput, but watch latency P99 |
| --enable-prefix-caching | Reuses KV cache for shared prefixes | Critical for RAG / multi-turn chat workloads |
Production tip: Never set --gpu-memory-utilization to 1.0. CUDA allocates scratch space dynamically. You'll hit OOM during peak batch sizes.
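A quick back-of-the-envelope calculation shows what the 0.92 setting buys. The sketch below estimates how many tokens of KV cache fit on the 4x A100 node once the weights are loaded. The model shape (80 layers, 8 KV heads, head dimension 128) is Llama-3.1-70B's published configuration; the 70e9 parameter count is rounded, and the 4 GiB per-GPU allowance for activations and CUDA overhead is an assumption, not a measured value.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_mem_gib: float, num_gpus: int, utilization: float,
                      param_count: float, weight_bytes: int,
                      per_token_bytes: int, overhead_gib_per_gpu: float) -> int:
    """Tokens of KV cache that fit after weights and overhead are subtracted."""
    budget = gpu_mem_gib * GIB * num_gpus * utilization
    weights = param_count * weight_bytes
    overhead = overhead_gib_per_gpu * GIB * num_gpus
    return int((budget - weights - overhead) // per_token_bytes)


# Llama-3.1-70B in 16-bit precision on 4x A100 80GB.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
capacity = kv_token_capacity(
    gpu_mem_gib=80, num_gpus=4, utilization=0.92,
    param_count=70e9, weight_bytes=2,
    per_token_bytes=per_token,
    overhead_gib_per_gpu=4.0,  # assumed allowance, tune from real measurements
)
print(f"{per_token / 1024:.0f} KiB per token, about {capacity:,} cached tokens")
```

Under these assumptions each token costs 320 KiB and the node holds somewhere around 480k cached tokens, which is the pool that --max-num-seqs 256 concurrent sequences have to share.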
H2: KServe — The Model Serving Control Plane
Raw vLLM deployments are fine for one model. Run five models with canary rollouts, A/B testing, and autoscaling? You need KServe.
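KServe's 2026 killer feature is LLMInferenceService, a custom resource that wraps vLLM with Kubernetes-native governance. The manifest below uses the long-standing, general-purpose InferenceService kind (serving.kserve.io/v1beta1), which is the simplest starting point.
H3: InferenceService Manifest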
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b-chat
  namespace: inference
  annotations:
    serving.kserve.io/autoscalerClass: kpa.autoscaling.knative.dev
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      runtime: kserve-llm-d-runtime
      storageUri: hf://meta-llama/Llama-3.1-70B-Instruct
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          nvidia.com/gpu: "4"
    minReplicas: 1
    maxReplicas: 6
    containerConcurrency: 64
```
H3: What KServe Handles for You
- Scale-to-zero: With minReplicas set to 0, KServe parks the pod when no requests hit for 60s (the manifest above pins minReplicas to 1, so it always keeps one replica warm). The next request triggers a cold start (~8s on vLLM with model pre-loaded to host RAM).
- Canary rollouts: Shift 10% of traffic to a new model version. Roll back if P50 latency regresses.
- Multi-model endpoints: One ingress, multiple models. Route by header or payload field.
- Token-aware routing: The Gateway API Inference Extension (v1.3.1) routes based on expected KV cache hit rate.
The win: KServe turns vLLM from a container into a service mesh citizen.
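The canary logic is simple enough to sketch in a few lines. The Python below is an illustration of the idea, not KServe's code (in KServe the split itself is configured declaratively, for example with the predictor's canaryTrafficPercent field). The function names and the 10% latency tolerance are assumptions.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Deterministically send `canary_percent` of request ids to the canary."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


def should_roll_back(stable_p50_ms: float, canary_p50_ms: float,
                     tolerance: float = 0.10) -> bool:
    """Roll back when canary P50 latency exceeds stable P50 by more than `tolerance`."""
    return canary_p50_ms > stable_p50_ms * (1 + tolerance)


# Route 10,000 synthetic requests and check the observed split.
hits = sum(pick_version(f"req-{i}", 10) == "canary" for i in range(10_000))
print(f"canary share: {hits / 100:.1f}%")
print("roll back:", should_roll_back(stable_p50_ms=100, canary_p50_ms=115))
```

Hashing the request id (rather than drawing a random number) keeps retries of the same request on the same version, which makes latency comparisons between the two versions cleaner.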
H2: llm-d — Distributed Intelligence Beyond Single-Node
Single-node vLLM tops out at 8 GPUs (H100 NVLink domain). What happens when you need to serve a 405B model or load-balance across 20 nodes? That's where llm-d (Kubernetes-native distributed inference) enters.
llm-d is now a CNCF sandbox project. It provides:
- Prefix-cache-aware routing — Sends requests to the node that already has the prompt in KV cache
- Disaggregated serving — Separates prefill (compute-heavy) from decode (memory-heavy) across different GPU pools
- Predictive latency scheduling — Routes based on queue depth + estimated token generation time
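The first and third routing ideas above can be combined into a single scoring function. This is a minimal sketch under stated assumptions, not llm-d's scheduler: the `Replica` type, the token-tuple representation of cached prefixes, and the `queue_penalty` weight are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    queue_depth: int = 0
    cached_prefixes: list = field(default_factory=list)  # tuples of token ids


def shared_prefix_len(a, b) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens, replicas, queue_penalty: int = 8) -> Replica:
    """Pick the replica with the best (cache hit length - queue cost) score."""
    def score(replica: Replica) -> int:
        hit = max(
            (shared_prefix_len(prompt_tokens, p) for p in replica.cached_prefixes),
            default=0,
        )
        return hit - queue_penalty * replica.queue_depth

    return max(replicas, key=score)


warm = Replica("pod-a", queue_depth=0, cached_prefixes=[(1, 2, 3, 4)])
cold = Replica("pod-b", queue_depth=0)
print(route((1, 2, 3, 9), [warm, cold]).name)  # pod-a: 3 cached tokens beat 0
```

The queue penalty matters: a replica with a warm cache but a deep queue can still lose to an idle one, which is the trade-off a latency-aware router has to make.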
H3: llm-d Architecture
┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Client    │──────│    llm-d    │──────│   Prefill   │
│   Request   │      │   Router    │      │ Pool (H100) │
└─────────────┘      └─────────────┘      └─────────────┘
                            │
                            ▼
                     ┌─────────────┐
                     │   Decode    │
                     │ Pool (A100) │
                     └─────────────┘
H3: Deploying llm-d with KServe
```yaml
apiVersion: llm-d.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: disaggregated-llama-405b
spec:
  predictor:
    prefill:
      runtime: vllm-prefill
      resources:
        limits:
          nvidia.com/gpu: "8"
      nodeSelector:
        node-type: h100-prefill
    decode:
      runtime: vllm-decode
      resources:
        limits:
          nvidia.com/gpu: "4"
      nodeSelector:
        node-type: a100-decode
    minReplicas: 2
    maxReplicas: 12
  router:
    type: prefix-cache-aware
    kvCacheAffinity: true
```
Real talk: llm-d is where the field is heading. If you're serving 100+ concurrent users on a single model, disaggregated serving keeps long prefills from stalling token generation for everyone else, which is what makes sub-200ms time-to-first-token (TTFT) achievable without burning cash on over-provisioned H100s.
H2: GPU Scheduling with Kueue — Stop the Free-for-All
Kubernetes' default scheduler treats a GPU like a generic resource. It doesn't understand that a training job needs 8 GPUs on the same node with NVLink, while an inference job can tolerate 2 GPUs anywhere.
H3: Kueue ClusterQueue for Mixed Workloads
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
    - coveredResources:
        - "nvidia.com/gpu"
        - "cpu"
        - "memory"
      flavors:
        - name: a100-80gb
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 32
            - name: "cpu"
              nominalQuota: 512
            - name: "memory"
              nominalQuota: 2Ti
        - name: h100-80gb
          resources:
            - name: "nvidia.com/gpu"
              nominalQuota: 16
            - name: "cpu"
              nominalQuota: 256
            - name: "memory"
              nominalQuota: 1Ti
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
```
H3: LocalQueue for Team Isolation
```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: inference-team
  namespace: inference
spec:
  clusterQueue: gpu-cluster-queue
```
H3: Workload Priority and Preemption
```yaml
# WorkloadPriorityClass keeps value and description at the top level (no spec), like PriorityClass.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: realtime-inference
value: 1000
description: "Production inference; preempt training if needed"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch-training
value: 100
description: "Training jobs; yield to inference"
```
The magic: Kueue lets inference workloads preempt training jobs whenever a higher-priority inference workload is waiting for GPU quota. Your users get served. Your fine-tuning job is requeued and resumes later, provided it checkpoints. No manual intervention.
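The core of priority-based preemption fits in a short function. The sketch below is a deliberately simplified model of the decision, not Kueue's algorithm (which also weighs cohorts, borrowing, and admission order). The `Workload` type and `select_victims` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int
    gpus: int


def select_victims(incoming: Workload, running: list,
                   free_gpus: int) -> Optional[list]:
    """Return workloads to evict so `incoming` fits, [] if it already fits,
    or None if evicting every lower-priority workload still is not enough."""
    needed = incoming.gpus - free_gpus
    if needed <= 0:
        return []
    # Only strictly lower priorities may be evicted; try the lowest first,
    # and among equals prefer the one that frees the most GPUs.
    candidates = sorted(
        (w for w in running if w.priority < incoming.priority),
        key=lambda w: (w.priority, -w.gpus),
    )
    victims = []
    for workload in candidates:
        victims.append(workload)
        needed -= workload.gpus
        if needed <= 0:
            return victims
    return None


running = [
    Workload("train-a", priority=100, gpus=8),
    Workload("train-b", priority=100, gpus=2),
    Workload("serve-x", priority=1000, gpus=4),
]
new_replica = Workload("serve-y", priority=1000, gpus=4)
print([w.name for w in select_victims(new_replica, running, free_gpus=0)])
```

Here the new inference replica evicts train-a and leaves serve-x untouched, because equal-priority workloads are never candidates.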
H2: Dynamic Resource Allocation (DRA) and MIG — Squeeze Every Dollar
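In 2026, GPU sharing isn't optional. It's economics. Dynamic Resource Allocation (DRA) is a core Kubernetes API that went GA in Kubernetes 1.34, and NVIDIA's DRA driver uses it to hand out full GPUs and MIG slices through claims instead of opaque integer counts.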
H3: MIG Partitioning for Multi-Tenant Inference
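```yaml
# DRA claims use the core resource.k8s.io API, not a vendor API group.
# The device class and the "profile" attribute are published by NVIDIA's DRA
# driver; confirm the exact names against the driver version you run.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: mig-3g-40gb
spec:
  devices:
    requests:
      - name: mig-3g-40gb
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
            - cel:
                expression: 'device.attributes["gpu.nvidia.com"].profile == "3g.40gb"'
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model-inference
spec:
  replicas: 1                    # replicas would share this one claim; use a ResourceClaimTemplate for one slice per pod
  selector:
    matchLabels:
      app: small-model-inference
  template:
    metadata:
      labels:
        app: small-model-inference
    spec:
      resourceClaims:            # bind the claim into the pod
        - name: mig-3g-40gb
          resourceClaimName: mig-3g-40gb
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.7.3
          resources:
            claims:              # hand the claimed device to this container
              - name: mig-3g-40gb
```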
H3: Cost Comparison — Full GPU vs. MIG
| Scenario | GPU Config | Monthly Cost | Utilisation |
|---|---|---|---|
| 7B model, low traffic | Full A100 | $2,500 | 18% |
| 7B model, low traffic | MIG 3g.40gb | $900 | 72% |
| 70B model, high traffic | Full A100 x4 | $10,000 | 85% |
| 70B model, high traffic | MIG 7g.80gb | N/A (needs full GPU) | — |
Rule of thumb: Models under 13B parameters run beautifully on MIG slices. Anything larger needs full GPU + tensor parallelism.
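The table and the rule of thumb can both be reduced to a little arithmetic. The sketch below uses the monthly costs and utilisation figures from the table above; the `fits_on_slice` check and its 30% reserve for KV cache are assumptions for illustration, not a sizing guarantee.

```python
GIB = 1024 ** 3


def effective_monthly_cost(monthly_cost: float, utilisation: float) -> float:
    """Cost per fully utilised GPU-month: what you pay divided by what you use."""
    return monthly_cost / utilisation


def fits_on_slice(param_count: float, slice_gib: float,
                  dtype_bytes: int = 2, kv_reserve: float = 0.30) -> bool:
    """True if the weights fit while leaving `kv_reserve` of the slice for KV cache."""
    weights_gib = param_count * dtype_bytes / GIB
    return weights_gib <= slice_gib * (1 - kv_reserve)


full = effective_monthly_cost(2500, 0.18)  # 7B model on a full A100
mig = effective_monthly_cost(900, 0.72)    # same model on a MIG 3g.40gb slice
print(f"full A100: ${full:,.0f} per utilised GPU-month")
print(f"MIG slice: ${mig:,.0f} per utilised GPU-month ({full / mig:.1f}x cheaper)")

for billions in (7, 13, 70):
    verdict = "fits" if fits_on_slice(billions * 1e9, slice_gib=40) else "needs full GPUs"
    print(f"{billions}B in 16-bit on a 40 GB slice: {verdict}")
```

Measured this way the MIG slice is roughly an order of magnitude cheaper per unit of useful work, and the 13B cutoff falls out of the weights-plus-reserve check.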
H2: Autoscaling — From Reactive to Predictive
Kubernetes HPA with CPU metrics is useless for LLMs. The right signals are:
- GPU utilisation — Trigger scale-up at 75%, scale-down at 30%
- Request queue depth — If vLLM's internal queue > 16, add replicas
- KV cache hit rate — Low hit rate = traffic pattern mismatch = consider prefix-aware routing
- Token generation latency (P95) — Scale before users complain
H3: Custom Metrics HPA with Prometheus Adapter
```yaml
# Metric names below are whatever your Prometheus Adapter rules expose;
# vLLM's own queue gauge is exported as vllm:num_requests_waiting.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_gpu_utilization_percent
        target:
          type: AverageValue
          averageValue: "75"
    - type: Pods
      pods:
        metric:
          name: vllm_request_queue_length
        target:
          type: AverageValue
          averageValue: "16"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
```
Critical: Scale-down stabilization of 300s prevents flapping. LLM cold starts are expensive. Don't thrash.
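It helps to see how the HPA turns those targets into a replica count. The sketch below applies the documented HPA rule, desired = ceil(current x metric / target), takes the largest proposal across metrics, and caps the step at 2 pods as in the scaleUp policy above. It ignores stabilization windows and the scale-down policy, and the function names are hypothetical.

```python
import math


def desired_replicas(current: int, metric_value: float, target: float,
                     tolerance: float = 0.1) -> int:
    """HPA core rule: scale proportionally unless within tolerance of target."""
    ratio = metric_value / target
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


def hpa_decision(current: int, metrics: list, min_replicas: int = 1,
                 max_replicas: int = 10, max_step_up: int = 2) -> int:
    """metrics: list of (observed average, target) pairs; largest proposal wins."""
    want = max(desired_replicas(current, value, target) for value, target in metrics)
    want = min(want, current + max_step_up)  # scaleUp policy: at most 2 pods per period
    return max(min_replicas, min(max_replicas, want))


# 2 replicas, GPU util at 90% (target 75) and queue depth 48 (target 16).
print(hpa_decision(2, [(90, 75), (48, 16)]))
```

The queue-depth metric alone asks for 6 replicas, but the step cap holds the first move to 4, which is why a deep queue takes several periods to clear.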
H2: Security and Observability — The Production Checklist
H3: Network Policies
Isolate inference pods. They hold model weights and may process sensitive prompts.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-isolate
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-llama-70b
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: gateway
      ports:
        - protocol: TCP
          port: 8000
```
H3: Monitoring Stack
| Component | Tool | Metric |
|---|---|---|
| GPU metrics | DCGM Exporter | GPU util, memory, temperature, NVLink bandwidth |
| Inference metrics | vLLM Prometheus endpoint | TTFT, TPOT, queue depth, batch size |
| Cluster metrics | kube-prometheus-stack | Node util, pod scheduling latency |
| Log aggregation | Loki + Grafana | Error rates, model load failures |
Alert I always set: vllm_gpu_memory_usage_percent > 95 for 2m. OOM is coming. Evacuate.
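The "for 2m" part of that alert is what keeps it from firing on a single noisy sample. As a rough illustration of the semantics (a simplified model, not Prometheus's evaluator), the hypothetical `is_firing` helper below only fires once the threshold has been breached continuously for the hold period.

```python
def is_firing(samples, threshold: float = 95.0, hold_seconds: int = 120) -> bool:
    """samples: (timestamp_seconds, value) pairs in ascending time order.

    Fires only if value stays above threshold for at least hold_seconds.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of the current breach
            if timestamp - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below threshold resets the clock
    return False


sustained = [(0, 96), (30, 97), (60, 96), (90, 98), (120, 99)]
with_dip = [(0, 96), (30, 94), (60, 96), (90, 97), (120, 98)]
print(is_firing(sustained), is_firing(with_dip))
```

The second series spends just as long near the limit but dips once, so the clock restarts and the alert stays quiet.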
H2: FAQ — Kubernetes LLM Inference in 2026
H3: Why not just use a managed API like OpenAI?
Cost and control. At 10M tokens/day, self-hosted vLLM on spot GPU instances is 60-70% cheaper. Plus you control data residency, model versions, and fine-tuned weights.
H3: Do I need Kubernetes for a single model on one GPU?
No. Docker + systemd is fine for prototyping. Kubernetes pays off at 2+ models or any autoscaling requirement.
H3: What's the difference between KServe and vLLM production-stack?
vLLM production-stack is a batteries-included Helm chart for single-cluster vLLM. KServe is a model-agnostic serving framework with enterprise features (canary, auth, monitoring). Use KServe for heterogeneous models; use production-stack for quick vLLM-only deployments.
H3: Can I run training and inference on the same cluster?
Yes, with Kueue. Use separate ClusterQueues with preemption policies. Mark inference as higher priority. Training jobs yield GPU gracefully.
H3: Is MIG supported on H100s?
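Yes, and it's better than A100 MIG. Both generations support up to 7 MIG instances per GPU, but H100's second-generation MIG gives each instance more compute and memory bandwidth and adds confidential-computing support at the instance level, which strengthens multi-tenant isolation.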
H3: How do I handle model weight storage?
Use Fluid or a ReadWriteMany PVC backed by high-throughput NFS/parallel filesystem. For multi-zone, sync weights to each zone's object storage bucket and mount via CSI.
H3: What's the latency cost of KServe + llm-d vs. raw vLLM?
~3-5ms added per hop (router + envoy). For user-facing chat, this is invisible. For high-frequency trading or real-time code completion, run raw vLLM with a lightweight load balancer.
H2: Conclusion — Build the Fleet, Not the Boat
If there's one lesson from running AutoBlogging.Pro's inference layer at scale: a single powerful GPU node is a liability. It fails. It bottlenecks. It burns cash when idle.
A Kubernetes-orchestrated fleet of vLLM + KServe + llm-d + Kueue is an asset. It scales elastically. It shares hardware across teams. It routes intelligently. It fails gracefully.
The 2026 stack is no longer experimental. Red Hat ships it. CoreWeave runs it. Google Cloud's GKE has it in one-click. The primitives are stable. The only question is whether your architecture is ready.
Next Steps:
- Deploy vLLM on Kubernetes with the official production-stack
- Read the KServe LLMInferenceService docs
- Explore llm-d on the CNCF sandbox
Written by Essa Mamdani — AI Engineer, Software Architect, and builder of AutoBlogging.Pro. For more infrastructure deep-dives, subscribe to the newsletter.