© 2025 ESSA MAMDANI

Artificial Intelligence

The Complete Guide to Scaling LLM Inference on Kubernetes in 2026

> Running LLMs in production? Learn the definitive 2026 Kubernetes stack for AI inference — vLLM, KServe, llm-d, Kueue, and GPU scheduling with real YAML configs. Cut costs, boost throughput, and stop guessing.

Verified by Essa Mamdani


Tags: technical, tutorial, deep-dive, devops, ai-infrastructure


H1: Why Kubernetes Won the AI Inference War in 2026

Let me paint you a picture. It's 3 AM. Your single-node vLLM instance just OOM-killed on a 70B parameter model because a traffic spike hit while you were sleeping. Your GPU utilisation graph looks like a heart monitor — 95% one minute, 12% the next. You're losing money on idle silicon and losing users on latency spikes.

I lived this at AutoBlogging.Pro when we moved from rented GPU VMs to a self-orchestrated inference layer. The fix wasn't bigger GPUs. It was better orchestration.

By mid-2026, Kubernetes has become the undisputed runtime for production AI inference. Not because it's trendy, but because the ecosystem matured. Between the Gateway API Inference Extension (GA since February 2026), Kubernetes Dynamic Resource Allocation (DRA) going stable with NVIDIA shipping a DRA driver for its GPUs, and the CNCF accepting llm-d into the sandbox, the stack is now enterprise-ready.

This guide is the architecture I wish I had six months ago. No marketing fluff. Just YAML, numbers, and production scars.


H2: The 2026 AI Inference Stack — What Actually Works

Before we write a single manifest, let's map the battlefield. The modern inference stack on Kubernetes has five layers:

| Layer | Tool | Purpose | Maturity |
| --- | --- | --- | --- |
| Inference Engine | vLLM | PagedAttention, continuous batching, OpenAI-compatible API | Production |
| Serving Framework | KServe | Model lifecycle, autoscaling, canary deploys, multi-model endpoints | Production |
| Distributed Router | llm-d | Prefix-cache-aware routing, KV-cache optimisation, multi-node load balancing | CNCF Sandbox |
| Scheduler | Kueue | GPU/batch workload scheduling, fair sharing, queueing | Production |
| GPU Management | NVIDIA GPU Operator + DRA | Driver management, MIG partitioning, topology-aware allocation | Production |

The mental model: vLLM runs the model. KServe orchestrates it. llm-d routes traffic intelligently. Kueue decides who gets the GPU. The GPU Operator keeps the hardware honest.


H2: vLLM on Kubernetes — The Foundation

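vLLM isn't just a Python server. In 2026, it's the default inference engine for production LLM serving because of PagedAttention: a memory management technique that stores the KV cache in fixed-size blocks, much like virtual memory pages. The vLLM authors report that earlier serving systems waste 60-80% of KV-cache memory to fragmentation and over-reservation, and that paging cuts the waste to under 4%.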
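To make the paging idea concrete, here is a toy block allocator in Python. It sketches the bookkeeping only and is not vLLM's implementation: the class name, method names, and the block size of 16 tokens are illustrative assumptions.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (assumed value for this sketch)


@dataclass
class PagedKVCache:
    """Toy allocator: each sequence maps to fixed-size blocks, like memory pages."""
    num_blocks: int
    free_blocks: list[int] = field(default_factory=list)
    block_tables: dict[str, list[int]] = field(default_factory=dict)
    seq_lens: dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.free_blocks = list(range(self.num_blocks))

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        """Grow a sequence, taking a new block only when the last one is full."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_needed = -(-new_len // BLOCK_SIZE)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; a real scheduler would queue or preempt")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return every block of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots: at most BLOCK_SIZE - 1 per sequence."""
        return sum(len(t) * BLOCK_SIZE - self.seq_lens[s]
                   for s, t in self.block_tables.items())


cache = PagedKVCache(num_blocks=8)
cache.append_tokens("req-a", 20)  # 2 blocks, 12 unused slots in the second one
cache.append_tokens("req-b", 16)  # exactly 1 block, nothing unused
print(cache.wasted_slots())       # 12
```

The point of the sketch: waste is bounded by one partial block per sequence, instead of a full max-length reservation per request.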

H3: Deployment Manifest

Here's the minimal vLLM deployment I run on GKE A100 nodes:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.3
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --tensor-parallel-size
          - "4"
          - --gpu-memory-utilization
          - "0.92"
          - --max-num-seqs
          - "256"
          - --enable-prefix-caching
        resources:
          limits:
            nvidia.com/gpu: "4"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        # Tensor-parallel workers exchange data through shared memory
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          # Raise this (or add a startupProbe) if pods restart while weights load
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: shm
        emptyDir:
          medium: Memory

H3: Key Configuration Flags Explained

| Flag | What It Does | Why It Matters |
| --- | --- | --- |
| --tensor-parallel-size 4 | Shards each layer's weights across 4 GPUs | A 70B model in FP16 (~140 GB of weights) does not fit on a single 80 GB A100 |
| --gpu-memory-utilization 0.92 | Uses 92% of GPU VRAM | Leaves headroom for CUDA overhead without wasting silicon |
| --max-num-seqs 256 | Max concurrent sequences in batching | Higher = better throughput, but watch latency P99 |
| --enable-prefix-caching | Reuses KV cache for shared prefixes | Critical for RAG / multi-turn chat workloads |

Production tip: Never set --gpu-memory-utilization to 1.0. CUDA allocates scratch space dynamically. You'll hit OOM during peak batch sizes.
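To see how these flags interact, here is a back-of-the-envelope memory budget in Python. It is a rough sketch under stated assumptions: exactly 70B parameters in FP16, 80 GiB usable per GPU, and the published Llama 3.1 70B shape (80 layers, 8 KV heads, head dimension 128). It ignores activations and CUDA graph memory, so real numbers will be lower.

```python
GIB = 1024 ** 3


def kv_cache_budget(params_billion, bytes_per_param, n_gpus, gpu_mem_gib,
                    mem_utilization, n_layers, n_kv_heads, head_dim,
                    kv_dtype_bytes=2):
    """Estimate how many tokens of KV cache fit once the weights are loaded."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    budget_bytes = n_gpus * gpu_mem_gib * GIB * mem_utilization
    # Each token stores one K and one V vector per layer per KV head.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_dtype_bytes
    kv_pool_bytes = budget_bytes - weight_bytes
    return {
        "weights_gib": weight_bytes / GIB,
        "kv_pool_gib": kv_pool_bytes / GIB,
        "kv_bytes_per_token": kv_bytes_per_token,
        "max_cached_tokens": int(kv_pool_bytes // kv_bytes_per_token),
    }


est = kv_cache_budget(70, 2, n_gpus=4, gpu_mem_gib=80, mem_utilization=0.92,
                      n_layers=80, n_kv_heads=8, head_dim=128)
print(est)
print("tokens per sequence at 256 concurrent:", est["max_cached_tokens"] // 256)
```

Under these assumptions the KV pool holds a little over half a million tokens, which works out to roughly 2,000 tokens per sequence if all 256 slots are busy. That is the trade-off behind --max-num-seqs: more slots means less cache per request.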


H2: KServe — The Model Serving Control Plane

Raw vLLM deployments are fine for one model. Run five models with canary rollouts, A/B testing, and autoscaling? You need KServe.

KServe's 2026 killer feature is LLMInferenceService, a custom resource that wraps vLLM with Kubernetes-native governance. For a single model, the long-standing InferenceService resource shown here covers the same ground.

H3: InferenceService Manifest

yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b-chat
  namespace: inference
  annotations:
    # Knative Pod Autoscaler settings (KServe serverless mode)
    autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
    autoscaling.knative.dev/target-utilization-percentage: "70"
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      # Assumes a ServingRuntime with this name is installed in the cluster
      runtime: kserve-llm-d-runtime
      storageUri: hf://meta-llama/Llama-3.1-70B-Instruct
      # Resources belong to the model container, so they nest under model
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          nvidia.com/gpu: "4"
    minReplicas: 1
    maxReplicas: 6
    containerConcurrency: 64

H3: What KServe Handles for You

  1. Scale-to-zero: with minReplicas: 0, KServe parks the pod when no requests arrive for 60s. The next request triggers a cold start (~8s on vLLM with the model pre-loaded to host RAM). The manifest above sets minReplicas: 1, which keeps one replica warm instead.
  2. Canary rollouts: set canaryTrafficPercent: 10 to shift 10% of traffic to a new model version. Roll back if P50 latency regresses.
  3. Multi-model endpoints: one ingress, multiple models. Route by header or payload field.
  4. Token-aware routing: the Gateway API Inference Extension (v1.3.1) routes based on expected KV cache hit rate.

The win: KServe turns vLLM from a container into a service mesh citizen.
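The canary rule from the list above ("roll back if P50 latency regresses") is easy to state precisely. Here is a minimal Python sketch of that decision; the 10% regression threshold and the 30-sample minimum are hypothetical values, and the function is not part of KServe.

```python
from statistics import median


def should_roll_back(baseline_ms, canary_ms, max_regression=0.10, min_samples=30):
    """Roll back when the canary's P50 latency exceeds the baseline P50 by more
    than max_regression. Returns False until enough canary samples exist."""
    if len(canary_ms) < min_samples:
        return False  # too little data to judge the canary either way
    return median(canary_ms) > median(baseline_ms) * (1 + max_regression)


baseline = [100.0] * 50
print(should_roll_back(baseline, [120.0] * 50))  # 20% slower at P50: roll back
print(should_roll_back(baseline, [105.0] * 50))  # within the 10% budget: keep going
```

In practice you would feed this from the latency histograms your metrics stack already collects, and probably check an upper percentile alongside P50.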


H2: llm-d — Distributed Intelligence Beyond Single-Node

Single-node vLLM tops out at 8 GPUs (H100 NVLink domain). What happens when you need to serve a 405B model or load-balance across 20 nodes? That's where llm-d (Kubernetes-native distributed inference) enters.

llm-d is now a CNCF sandbox project. It provides:

  • Prefix-cache-aware routing: sends requests to the node that already has the prompt prefix in its KV cache
  • Disaggregated serving: separates prefill (compute-bound) from decode (memory-bandwidth-bound) across different GPU pools
  • Predictive latency scheduling: routes based on queue depth + estimated token generation time
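Prefix-cache-aware routing is the least obvious of the three, so here is a small Python sketch of the idea. It is not llm-d's scorer: the chained block hashing, the 4-token block size, and the "longest cached prefix first, then shortest queue" rule are simplifying assumptions for illustration.

```python
import hashlib

BLOCK_TOKENS = 4  # tiny block size so the demo stays readable (assumed)


def prefix_block_hashes(tokens, block_tokens=BLOCK_TOKENS):
    """Hash each full block together with everything before it (chained)."""
    hashes, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % block_tokens
    for start in range(0, full, block_tokens):
        running.update(repr(tokens[start:start + block_tokens]).encode())
        hashes.append(running.hexdigest())
    return hashes


def pick_replica(tokens, replicas):
    """replicas maps name -> {"cached": set of block hashes, "queue": int}.
    Prefer the longest cached prefix; break ties on the shorter queue."""
    hashes = prefix_block_hashes(tokens)

    def score(name):
        hit = 0
        for h in hashes:
            if h not in replicas[name]["cached"]:
                break
            hit += 1
        return (-hit, replicas[name]["queue"], name)

    return min(replicas, key=score)


system_prompt = list(range(8))  # two full blocks shared by every chat turn
replicas = {
    "pod-a": {"cached": set(prefix_block_hashes(system_prompt)), "queue": 5},
    "pod-b": {"cached": set(), "queue": 1},
}
print(pick_replica(system_prompt + [42, 43, 44, 45], replicas))  # pod-a
print(pick_replica([99, 98, 97, 96], replicas))                  # pod-b
```

A production router would weigh cache hits against queue depth rather than ranking them strictly, but the shape is the same: a cache hit skips prefill work, so it can be worth joining a longer queue.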

H3: llm-d Architecture

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Client    │──────│  llm-d      │──────│ Prefill     │
│   Request   │      │  Router     │      │ Pool (H100) │
└─────────────┘      └─────────────┘      └─────────────┘
                           │
                           ▼
                     ┌─────────────┐
                     │ Decode      │
                     │ Pool (A100) │
                     └─────────────┘

H3: Deploying llm-d with KServe

yaml
# Illustrative shape only. LLMInferenceService is a KServe resource; the field
# names below are not verified against its CRD, so check them with
# kubectl explain llminferenceservice.spec before applying.
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: disaggregated-llama-405b
spec:
  predictor:
    prefill:
      runtime: vllm-prefill
      resources:
        limits:
          nvidia.com/gpu: "8"
      nodeSelector:
        node-type: h100-prefill
    decode:
      runtime: vllm-decode
      resources:
        limits:
          nvidia.com/gpu: "4"
      nodeSelector:
        node-type: a100-decode
      minReplicas: 2
      maxReplicas: 12
  router:
    type: prefix-cache-aware
    kvCacheAffinity: true

Real talk: llm-d is where the field is heading. If you're serving 100+ concurrent users on a single model, disaggregated serving is usually the most practical way to hold sub-200ms time-to-first-token (TTFT) without burning cash on over-provisioned H100s.


H2: GPU Scheduling with Kueue — Stop the Free-for-All

Kubernetes' default scheduler treats a GPU like a generic resource. It doesn't understand that a training job needs 8 GPUs on the same node with NVLink, while an inference job can tolerate 2 GPUs anywhere.

H3: Kueue ClusterQueue for Mixed Workloads

yaml
# Assumes ResourceFlavor objects named a100-80gb and h100-80gb already exist
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}
  resourceGroups:
  - coveredResources:
    - "nvidia.com/gpu"
    - "cpu"
    - "memory"
    flavors:
    - name: a100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 32
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 2Ti
    - name: h100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
      - name: "cpu"
        nominalQuota: 256
      - name: "memory"
        nominalQuota: 1Ti
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

H3: LocalQueue for Team Isolation

yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: inference-team
  namespace: inference
spec:
  clusterQueue: gpu-cluster-queue

H3: Workload Priority and Preemption

yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: realtime-inference
# value and description are top-level fields; this kind has no spec block
value: 1000
description: "Production inference: preempt training if needed"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch-training
value: 100
description: "Training jobs: yield to inference"

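The magic: Kueue lets a pending higher-priority inference workload preempt running training jobs once the queue's GPU quota is exhausted. Your users get served. Your fine-tuning job is requeued and restarts when quota frees up, so checkpoint regularly to resume instead of starting over. No manual intervention.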
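If the preemption rule feels abstract, this toy Python model shows the decision for a single GPU quota. It is a deliberate simplification and not Kueue's algorithm (no cohorts, flavors, or borrowing); the Workload class and admit function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    gpus: int
    priority: int


def admit(pending: Workload, admitted: list[Workload],
          quota: int) -> tuple[list[Workload], list[Workload]]:
    """Return (admitted_after, preempted). Only strictly lower-priority
    workloads may be evicted, lowest priority first."""
    used = sum(w.gpus for w in admitted)
    victims: list[Workload] = []
    candidates = sorted((w for w in admitted if w.priority < pending.priority),
                        key=lambda w: w.priority)
    for victim in candidates:
        if used + pending.gpus <= quota:
            break  # enough room already, stop evicting
        victims.append(victim)
        used -= victim.gpus
    if used + pending.gpus > quota:
        return admitted, []  # still does not fit: the workload stays queued
    remaining = [w for w in admitted if w not in victims]
    return remaining + [pending], victims


running = [Workload("train-a", 16, 100), Workload("train-b", 16, 100)]
after, evicted = admit(Workload("chat-api", 4, 1000), running, quota=32)
print([w.name for w in after], [w.name for w in evicted])
```

With a 32-GPU quota fully used by two training jobs, the 4-GPU inference workload evicts one of them and is admitted; a second training job at the same priority would simply wait.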


H2: Dynamic Resource Allocation (DRA) and MIG — Squeeze Every Dollar

In 2026, GPU sharing isn't optional. It's economics. Dynamic Resource Allocation (DRA), the Kubernetes API that NVIDIA's DRA driver plugs into, went GA in Kubernetes 1.34, and it's a game-changer.

H3: MIG Partitioning for Multi-Tenant Inference

yaml
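# Each pod gets its own claim for one MIG 3g.40gb slice.
# Device class and attribute names follow NVIDIA's DRA driver examples;
# verify them against the driver version installed in your cluster.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: mig-3g-40gb
spec:
  spec:
    devices:
      requests:
      - name: mig-3g-40gb
        exactly:
          deviceClassName: mig.nvidia.com
          selectors:
          - cel:
              expression: device.attributes["gpu.nvidia.com"].profile == "3g.40gb"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: small-model-inference
  template:
    metadata:
      labels:
        app: small-model-inference
    spec:
      # Bind the template to the pod, then reference it from the container
      resourceClaims:
      - name: mig-3g-40gb
        resourceClaimTemplateName: mig-3g-40gb
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.3
        resources:
          claims:
          - name: mig-3g-40gb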

H3: Cost Comparison — Full GPU vs. MIG

| Scenario | GPU Config | Monthly Cost | Utilisation |
| --- | --- | --- | --- |
| 7B model, low traffic | Full A100 | $2,500 | 18% |
| 7B model, low traffic | MIG 3g.40gb | $900 | 72% |
| 70B model, high traffic | Full A100 x4 | $10,000 | 85% |
| 70B model, high traffic | MIG 7g.80gb | N/A (needs full GPU) | N/A |

Rule of thumb: Models under 13B parameters run beautifully on MIG slices. Anything larger needs full GPU + tensor parallelism.
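The rule of thumb is mostly arithmetic, which this Python sketch spells out. It assumes FP16 weights (2 bytes per parameter) and a hypothetical sizing rule that weights should take at most 70% of a slice so the rest is left for KV cache; the dollar figures are the ones from the table above.

```python
def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GB (FP16/BF16 by default)."""
    return params_billion * bytes_per_param


def fits_on_slice(params_billion: float, slice_gb: float,
                  weight_fraction: float = 0.7) -> bool:
    """True if the weights leave (1 - weight_fraction) of the slice for KV cache."""
    return weights_gb(params_billion) <= slice_gb * weight_fraction


def monthly_saving_pct(full_cost: float, slice_cost: float) -> float:
    """Percentage saved by moving from a full GPU to a MIG slice."""
    return 100 * (1 - slice_cost / full_cost)


for size in (7, 13, 70):
    print(f"{size}B on a 40 GB slice: {fits_on_slice(size, 40)}")
print(f"saving for the 7B case: {monthly_saving_pct(2500, 900):.0f}%")
```

A 13B model needs about 26 GB for weights, which just clears the 28 GB line on a 40 GB slice. A 70B model needs about 140 GB, so no single slice or single GPU holds it.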


H2: Autoscaling — From Reactive to Predictive

Kubernetes HPA with CPU metrics is useless for LLMs. The right signals are:

  1. GPU utilisation — Trigger scale-up at 75%, scale-down at 30%
  2. Request queue depth — If vLLM's internal queue > 16, add replicas
  3. KV cache hit rate — Low hit rate = traffic pattern mismatch = consider prefix-aware routing
  4. Token generation latency (P95) — Scale before users complain

H3: Custom Metrics HPA with Prometheus Adapter

yaml
# The metric names below are whatever your Prometheus Adapter rules expose.
# vLLM itself exports vllm:num_requests_waiting, and DCGM Exporter exports
# DCGM_FI_DEV_GPU_UTIL, so map those to these names in the adapter config.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "75"
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_length
      target:
        type: AverageValue
        averageValue: "16"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

Critical: Scale-down stabilization of 300s prevents flapping. LLM cold starts are expensive. Don't thrash.
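It helps to know what the HPA actually computes from those targets. This Python sketch follows the documented rule, desired = ceil(current * metric / target), taking the largest proposal across metrics and skipping changes inside the default 10% tolerance. The function names are illustrative, and the stabilization windows are left out for brevity.

```python
import math


def desired_replicas(current, metrics, min_replicas, max_replicas, tolerance=0.1):
    """metrics is a list of (observed_average, target_average) pairs."""
    proposals = []
    for observed, target in metrics:
        ratio = observed / target
        if abs(ratio - 1.0) <= tolerance:
            proposals.append(current)  # close enough to target: no change
        else:
            proposals.append(math.ceil(current * ratio))
    return max(min_replicas, min(max_replicas, max(proposals)))


def apply_scale_up_policy(current, desired, max_pods_per_period=2):
    """Mirror the 'Pods: 2 per 60s' scaleUp policy from the manifest above."""
    return min(desired, current + max_pods_per_period)


# 2 replicas, GPU at 90% (target 75) and 40 queued requests per pod (target 16)
want = desired_replicas(2, [(90, 75), (40, 16)], min_replicas=1, max_replicas=10)
print(want)                            # 5: the queue metric dominates
print(apply_scale_up_policy(2, want))  # 4: capped at +2 pods this period
```

The queue-depth metric asks for 5 replicas while GPU utilisation only asks for 3, which is why queue depth tends to be the faster signal during a traffic spike.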


H2: Security and Observability — The Production Checklist

H3: Network Policies

Isolate inference pods. They hold model weights and may process sensitive prompts.

yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-isolate
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-llama-70b
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Label set automatically on every namespace; assumes the gateway
          # runs in a namespace called "gateway"
          kubernetes.io/metadata.name: gateway
    ports:
    - protocol: TCP
      port: 8000

H3: Monitoring Stack

| Component | Tool | Metric |
| --- | --- | --- |
| GPU metrics | DCGM Exporter | GPU util, memory, temperature, NVLink bandwidth |
| Inference metrics | vLLM Prometheus endpoint | TTFT, TPOT, queue depth, batch size |
| Cluster metrics | kube-prometheus-stack | Node util, pod scheduling latency |
| Log aggregation | Loki + Grafana | Error rates, model load failures |

Alert I always set: vllm_gpu_memory_usage_percent > 95 for 2m. OOM is coming. Evacuate.


H2: FAQ — Kubernetes LLM Inference in 2026

H3: Why not just use a managed API like OpenAI?

Cost and control. At 10M tokens/day, self-hosted vLLM on spot GPU instances is 60-70% cheaper. Plus you control data residency, model versions, and fine-tuned weights.

H3: Do I need Kubernetes for a single model on one GPU?

No. Docker + systemd is fine for prototyping. Kubernetes pays off at 2+ models or any autoscaling requirement.

H3: What's the difference between KServe and vLLM production-stack?

vLLM production-stack is a batteries-included Helm chart for single-cluster vLLM. KServe is a model-agnostic serving framework with enterprise features (canary, auth, monitoring). Use KServe for heterogeneous models; use production-stack for quick vLLM-only deployments.

H3: Can I run training and inference on the same cluster?

Yes, with Kueue. Give inference a higher WorkloadPriorityClass (or its own ClusterQueue in a shared cohort) and enable preemption. Preempted training jobs are evicted and requeued, so checkpoint often.

H3: Is MIG supported on H100s?

Yes, and it's better than A100 MIG. Both generations support up to 7 MIG instances per GPU, but H100's second-generation MIG gives each instance more compute and memory bandwidth and adds confidential-computing isolation per instance.

H3: How do I handle model weight storage?

Use Fluid or a ReadWriteMany PVC backed by high-throughput NFS/parallel filesystem. For multi-zone, sync weights to each zone's object storage bucket and mount via CSI.

H3: What's the latency cost of KServe + llm-d vs. raw vLLM?

~3-5ms added per hop (router + envoy). For user-facing chat, this is invisible. For high-frequency trading or real-time code completion, run raw vLLM with a lightweight load balancer.


H2: Conclusion — Build the Fleet, Not the Boat

If there's one lesson from running AutoBlogging.Pro's inference layer at scale: a single powerful GPU node is a liability. It fails. It bottlenecks. It burns cash when idle.

A Kubernetes-orchestrated fleet of vLLM + KServe + llm-d + Kueue is an asset. It scales elastically. It shares hardware across teams. It routes intelligently. It fails gracefully.

The 2026 stack is no longer experimental. Red Hat ships it. CoreWeave runs it. Google Cloud's GKE has it in one-click. The primitives are stable. The only question is whether your architecture is ready.


Written by Essa Mamdani — AI Engineer, Software Architect, and builder of AutoBlogging.Pro. For more infrastructure deep-dives, subscribe to the newsletter.
