June 14, 2026


Artificial Intelligence

The Complete Guide to Scaling LLM Inference on Kubernetes in 2026

> Running LLMs in production? Learn the definitive 2026 Kubernetes stack for AI inference — vLLM, KServe, llm-d, Kueue, and GPU scheduling with real YAML configs. Cut costs, boost throughput, and stop guessing.


Verified by Essa Mamdani


H1: Why Kubernetes Won the AI Inference War in 2026

Let me paint you a picture. It's 3 AM. Your single-node vLLM instance just OOM-killed on a 70B parameter model because a traffic spike hit while you were sleeping. Your GPU utilisation graph looks like a heart monitor — 95% one minute, 12% the next. You're losing money on idle silicon and losing users on latency spikes.

I lived this at AutoBlogging.Pro when we moved from rented GPU VMs to a self-orchestrated inference layer. The fix wasn't bigger GPUs. It was better orchestration.

By mid-2026, Kubernetes has become the undisputed runtime for production AI inference. Not because it's trendy, but because the ecosystem matured. Between the Gateway API Inference Extension (GA since February 2026), Kubernetes Dynamic Resource Allocation (DRA) going stable with NVIDIA driver support, and the CNCF accepting llm-d into the sandbox, the stack is now enterprise-ready.

This guide is the architecture I wish I had six months ago. No marketing fluff. Just YAML, numbers, and production scars.

H2: The 2026 AI Inference Stack — What Actually Works

Before we write a single manifest, let's map the battlefield. The modern inference stack on Kubernetes has five layers:

Layer	Tool	Purpose	Maturity
Inference Engine	vLLM	PagedAttention, continuous batching, OpenAI-compatible API	Production
Serving Framework	KServe	Model lifecycle, autoscaling, canary deploys, multi-model endpoints	Production
Distributed Router	llm-d	Prefix-cache-aware routing, KV-cache optimisation, multi-node load balancing	CNCF Sandbox
Scheduler	Kueue	GPU/batch workload scheduling, fair sharing, queueing	Production
GPU Management	NVIDIA GPU Operator + DRA	Driver management, MIG partitioning, topology-aware allocation	Production

The mental model: vLLM runs the model. KServe orchestrates it. llm-d routes traffic intelligently. Kueue decides who gets the GPU. The GPU Operator keeps the hardware honest.

H2: vLLM on Kubernetes — The Foundation

vLLM isn't just a Python server. In 2026, it's the default inference engine for production LLM serving because of PagedAttention, a memory management technique that stores the KV cache in fixed-size blocks, much like virtual memory pages. Allocating blocks on demand, instead of reserving one contiguous region per request, cuts KV-cache memory waste from ~50% to near-zero.
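To make the paging idea concrete, here is a minimal sketch that compares reserving KV-cache slots for a request's maximum length against allocating fixed-size blocks on demand. The block size of 16 tokens, the 2048-token limit, and the request lengths are assumptions chosen for illustration, and the code models only the bookkeeping, not vLLM's actual allocator.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV-cache block (assumed for illustration)


def contiguous_slots(max_len: int) -> int:
    """Slots reserved when a request pre-allocates for its maximum length."""
    return max_len


def paged_slots(actual_len: int, block_tokens: int = BLOCK_TOKENS) -> int:
    """Slots reserved when blocks are handed out on demand, one at a time."""
    return math.ceil(actual_len / block_tokens) * block_tokens


def waste_fraction(reserved: int, used: int) -> float:
    """Share of reserved slots that hold no token."""
    return (reserved - used) / reserved


# Hypothetical batch: every request may grow to 2048 tokens, few actually do.
actual_lengths = [130, 700, 45, 1500]
max_len = 2048

used = sum(actual_lengths)
contiguous = sum(contiguous_slots(max_len) for _ in actual_lengths)
paged = sum(paged_slots(n) for n in actual_lengths)

print(f"contiguous waste: {waste_fraction(contiguous, used):.1%}")
print(f"paged waste:      {waste_fraction(paged, used):.1%}")
```

With paging, the only waste is the unfilled tail of each request's last block, which is why the figure stays small regardless of how far requests fall short of the maximum length.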

H3: Deployment Manifest

Here's the minimal vLLM deployment I run on GKE A100 nodes:

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama-70b
  template:
    metadata:
      labels:
        app: vllm-llama-70b
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-a100-80gb
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.3
        args:
          - --model
          - meta-llama/Llama-3.1-70B-Instruct
          - --tensor-parallel-size
          - "4"
          - --gpu-memory-utilization
          - "0.92"
          - --max-num-seqs
          - "256"
          - --enable-prefix-caching
        resources:
          limits:
            nvidia.com/gpu: "4"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-token
              key: token
        # Tensor-parallel workers exchange data through shared memory;
        # the container default for /dev/shm is too small for that.
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          # Raise this (or add a startupProbe) if downloading and loading
          # the weights takes longer than the probe allows.
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: shm
        emptyDir:
          medium: Memory

H3: Key Configuration Flags Explained

Flag	What It Does	Why It Matters
`--tensor-parallel-size 4`	Shards each layer's weight matrices across 4 GPUs	Required for 70B+ models on A100s
`--gpu-memory-utilization 0.92`	Uses 92% of GPU VRAM	Leaves headroom for CUDA overhead without wasting silicon
`--max-num-seqs 256`	Max concurrent sequences in batching	Higher = better throughput, but watch latency P99
`--enable-prefix-caching`	Reuses KV cache for shared prefixes	Critical for RAG / multi-turn chat workloads

Production tip: Never set --gpu-memory-utilization to 1.0. CUDA allocates scratch space dynamically. You'll hit OOM during peak batch sizes.
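A rough way to see what the memory flag buys you is to estimate how many tokens of KV cache fit after the weights are loaded. The sketch below assumes Llama 3.1 70B's published shape (80 layers, 8 KV heads, head dimension 128), 16-bit weights and cache, and 80 GiB per GPU. It ignores activations and other runtime overhead, so treat the result as an upper bound rather than what vLLM will report.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_budget_bytes(gpus: int, gpu_mem_gib: float, utilization: float,
                    weight_bytes: float) -> float:
    """Memory left for KV cache after the weights, within the utilization cap."""
    return gpus * gpu_mem_gib * GIB * utilization - weight_bytes


# Assumed model shape and precision (not taken from the manifest above).
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
budget = kv_budget_bytes(gpus=4, gpu_mem_gib=80, utilization=0.92,
                         weight_bytes=70e9 * 2)
max_cached_tokens = int(budget // per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV budget:          {budget / GIB:.1f} GiB")
print(f"tokens that fit:    {max_cached_tokens:,}")
```

Dividing the token count by your typical context length gives a ceiling on how many sequences can be resident at once, which is a useful sanity check on `--max-num-seqs`.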

H2: KServe — The Model Serving Control Plane

Raw vLLM deployments are fine for one model. Run five models with canary rollouts, A/B testing, and autoscaling? You need KServe.

KServe's 2026 killer feature is LLMInferenceService — a custom resource that wraps vLLM with Kubernetes-native governance.

H3: LLMInferenceService Manifest

yaml

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b-chat
  namespace: inference
  annotations:
    serving.kserve.io/autoscalerClass: kpa.autoscaling.knative.dev
    serving.kserve.io/targetUtilizationPercentage: "70"
spec:
  predictor:
    # Replica bounds and concurrency are predictor-level fields.
    # minReplicas: 1 keeps one pod warm; set it to 0 to allow scale-to-zero.
    minReplicas: 1
    maxReplicas: 6
    containerConcurrency: 64
    model:
      modelFormat:
        name: huggingface
      # Must match a ServingRuntime or ClusterServingRuntime installed in the cluster.
      runtime: kserve-llm-d-runtime
      storageUri: hf://meta-llama/Llama-3.1-70B-Instruct
      # Resources belong to the model container, so they sit under `model`.
      resources:
        limits:
          nvidia.com/gpu: "4"
        requests:
          nvidia.com/gpu: "4"

H3: What KServe Handles for You

Scale-to-zero — When no requests hit for 60s, KServe parks the pod. Next request triggers cold start (~8s on vLLM with model pre-loaded to host RAM).
Canary rollouts — Shift 10% traffic to a new model version. Roll back if P50 latency regresses.
Multi-model endpoints — One ingress, multiple models. Route by header or payload field.
Token-aware routing — The Gateway API Inference Extension (v1.3.1) routes based on expected KV cache hit rate.
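The canary rule in the list above boils down to comparing median latency between the two versions. Here is a minimal sketch of that decision, assuming you already have latency samples in milliseconds for both versions. The 10% regression threshold and the function name are hypothetical, and this is not how KServe itself decides anything: it only shifts the traffic, and the rollback logic is yours to wire up.

```python
from statistics import median


def should_roll_back(baseline_ms: list[float], canary_ms: list[float],
                     max_regression: float = 0.10) -> bool:
    """True when the canary's P50 exceeds the baseline's by more than max_regression."""
    return median(canary_ms) > median(baseline_ms) * (1 + max_regression)


# Hypothetical samples collected while the canary takes 10% of traffic.
baseline = [100.0, 110.0, 120.0]
healthy_canary = [105.0, 112.0, 118.0]
slow_canary = [130.0, 140.0, 150.0]

print(should_roll_back(baseline, healthy_canary))
print(should_roll_back(baseline, slow_canary))
```

In practice you would want enough samples on the canary side for the median to be stable before acting on it.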

The win: KServe turns vLLM from a container into a service mesh citizen.

H2: llm-d — Distributed Intelligence Beyond Single-Node

Single-node vLLM tops out at 8 GPUs (H100 NVLink domain). What happens when you need to serve a 405B model or load-balance across 20 nodes? That's where llm-d (Kubernetes-native distributed inference) enters.

llm-d is now a CNCF sandbox project. It provides:

Prefix-cache-aware routing — Sends requests to the node that already has the prompt in KV cache
Disaggregated serving — Separates prefill (compute-heavy) from decode (memory-heavy) across different GPU pools
Predictive latency scheduling — Routes based on queue depth + estimated token generation time
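To illustrate the first and third ideas together, the sketch below picks the replica that already holds the longest matching token prefix and breaks ties by queue depth. It is a simplified model under stated assumptions, not llm-d's implementation: the `Replica` type, the token-tuple cache representation, and the scoring order are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    queue_depth: int
    # Token sequences whose KV cache this replica is assumed to still hold.
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)


def shared_prefix_len(a: tuple[int, ...], b: tuple[int, ...]) -> int:
    """Number of leading tokens the two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_replica(prompt: tuple[int, ...], replicas: list[Replica]) -> Replica:
    """Prefer the longest cached prefix, then the shortest queue, then name."""
    def score(r: Replica) -> tuple[int, int, str]:
        best = max((shared_prefix_len(prompt, p) for p in r.cached_prefixes),
                   default=0)
        return (-best, r.queue_depth, r.name)

    return min(replicas, key=score)


replicas = [
    Replica("decode-a", queue_depth=5, cached_prefixes=[(1, 2, 3)]),
    Replica("decode-b", queue_depth=0),
]

print(pick_replica((1, 2, 3, 4), replicas).name)  # cache hit wins
print(pick_replica((9, 9), replicas).name)        # no hit, shortest queue wins
```

A production router would weigh cache affinity against load rather than ranking them strictly, since always chasing the cache can pile requests onto one busy replica.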

H3: llm-d Architecture

architecture.map

┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│   Client    │──────│  llm-d      │──────│ Prefill     │
│   Request   │      │  Router     │      │ Pool (H100) │
└─────────────┘      └─────────────┘      └─────────────┘
                           │
                           ▼
                     ┌─────────────┐
                     │ Decode      │
                     │ Pool (A100) │
                     └─────────────┘

H3: Deploying llm-d with KServe

yaml

# Field names differ between releases; validate against the CRD installed
# in your cluster with `kubectl explain` before applying.
apiVersion: llm-d.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: disaggregated-llama-405b
spec:
  predictor:
    prefill:
      runtime: vllm-prefill
      resources:
        limits:
          nvidia.com/gpu: "8"
      nodeSelector:
        node-type: h100-prefill
    decode:
      runtime: vllm-decode
      resources:
        limits:
          nvidia.com/gpu: "4"
      nodeSelector:
        node-type: a100-decode
      minReplicas: 2
      maxReplicas: 12
  router:
    type: prefix-cache-aware
    kvCacheAffinity: true

Real talk: llm-d is where the field is heading. If you're serving 100+ concurrent users on a single model, disaggregated architecture is often the most practical way to maintain sub-200ms time-to-first-token (TTFT) without burning cash on over-provisioned H100s.

H2: GPU Scheduling with Kueue — Stop the Free-for-All

Kubernetes' default scheduler treats a GPU like a generic resource. It doesn't understand that a training job needs 8 GPUs on the same node with NVLink, while an inference job can tolerate 2 GPUs anywhere.

H3: Kueue ClusterQueue for Mixed Workloads

yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-cluster-queue
spec:
  namespaceSelector: {}  # admit workloads from every namespace
  resourceGroups:
  - coveredResources:
    - "nvidia.com/gpu"
    - "cpu"
    - "memory"
    flavors:
    # Each flavor name must match an existing ResourceFlavor object.
    - name: a100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 32
      - name: "cpu"
        nominalQuota: 512
      - name: "memory"
        nominalQuota: 2Ti
    - name: h100-80gb
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 16
      - name: "cpu"
        nominalQuota: 256
      - name: "memory"
        nominalQuota: 1Ti
  queueingStrategy: BestEffortFIFO
  preemption:
    # Only takes effect once this queue is placed in a cohort.
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority

H3: LocalQueue for Team Isolation

yaml

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: inference-team
  namespace: inference
spec:
  clusterQueue: gpu-cluster-queue

H3: Workload Priority and Preemption

yaml

# `value` and `description` are top-level fields; there is no `spec`.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: realtime-inference
value: 1000
description: "Production inference: preempt training if needed"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: batch-training
value: 100
description: "Training jobs: yield to inference"

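The magic: when a higher-priority inference workload is waiting for quota, Kueue preempts lower-priority training jobs to make room. Your users get served. The evicted fine-tuning job is requeued and restarts once capacity frees up, so checkpoint regularly. No manual intervention.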
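The preemption behaviour can be sketched as a small selection problem: given a GPU quota and the running workloads, which lower-priority jobs must be evicted so the pending one fits? The code below is a simplified model and not Kueue's actual algorithm. The `Workload` type, the job names, and the "lowest priority first, then smallest job" ordering are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int
    gpus: int


def select_victims(running: list[Workload], pending: Workload,
                   quota: int) -> Optional[list[Workload]]:
    """Return workloads to evict so `pending` fits, or None if it cannot fit."""
    free = quota - sum(w.gpus for w in running)
    victims: list[Workload] = []
    # Only strictly lower-priority workloads may be evicted.
    candidates = sorted(
        (w for w in running if w.priority < pending.priority),
        key=lambda w: (w.priority, w.gpus),
    )
    for w in candidates:
        if free >= pending.gpus:
            break
        victims.append(w)
        free += w.gpus
    return victims if free >= pending.gpus else None


running = [
    Workload("train-a", priority=100, gpus=16),
    Workload("train-b", priority=100, gpus=12),
    Workload("chat", priority=1000, gpus=4),
]

inference = Workload("chat-2", priority=1000, gpus=4)
more_training = Workload("train-c", priority=100, gpus=4)

print([w.name for w in select_victims(running, inference, quota=32)])
print(select_victims(running, more_training, quota=32))
```

The second call returns `None` because a training job has nothing of lower priority to evict, so it simply waits in the queue.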

H2: Dynamic Resource Allocation (DRA) and MIG — Squeeze Every Dollar

In 2026, GPU sharing isn't optional. It's economics. Dynamic Resource Allocation (DRA), the Kubernetes API that NVIDIA's DRA driver plugs into, went GA in Kubernetes 1.34, and it's a game-changer.

H3: MIG Partitioning for Multi-Tenant Inference

yaml

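# DRA claims use the core resource.k8s.io API. The device class and the
# attribute names below follow the NVIDIA DRA driver's examples and may
# differ by driver version; check the ResourceSlices in your cluster.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: mig-3g-40gb
spec:
  devices:
    requests:
    - name: mig-3g-40gb
      exactly:
        deviceClassName: mig.nvidia.com
        selectors:
        - cel:
            expression: device.attributes["gpu.nvidia.com"].profile == "3g.40gb"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-model-inference
spec:
  selector:
    matchLabels:
      app: small-model-inference
  template:
    metadata:
      labels:
        app: small-model-inference
    spec:
      # Bind the claim to the pod, then reference it from the container.
      resourceClaims:
      - name: mig-3g-40gb
        resourceClaimName: mig-3g-40gb
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.7.3
        resources:
          claims:
          - name: mig-3g-40gb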

H3: Cost Comparison — Full GPU vs. MIG

Scenario	GPU Config	Monthly Cost	Utilisation
7B model, low traffic	Full A100	$2,500	18%
7B model, low traffic	MIG 3g.40gb	$900	72%
70B model, high traffic	Full A100 x4	$10,000	85%
70B model, high traffic	MIG 7g.80gb	N/A (needs full GPU)	—

Rule of thumb: Models under 13B parameters run beautifully on MIG slices. Anything larger needs full GPU + tensor parallelism.
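One way to read the 7B rows of the table is to divide the monthly bill by the utilisation, which gives the price of a month of GPU time you actually use. The sketch below does that arithmetic with the table's own figures; the function name is just a label for this illustration.

```python
def cost_per_utilised_month(monthly_cost: float, utilisation: float) -> float:
    """Monthly cost scaled up to what a fully used month would effectively cost."""
    return monthly_cost / utilisation


full_a100 = cost_per_utilised_month(2500, 0.18)
mig_slice = cost_per_utilised_month(900, 0.72)

print(f"Full A100:   ${full_a100:,.0f} per utilised GPU-month")
print(f"MIG 3g.40gb: ${mig_slice:,.0f} per utilised GPU-month")
print(f"Ratio:       {full_a100 / mig_slice:.1f}x")
```

The sticker price drops by roughly two thirds, but the effective price per unit of useful work drops by far more, because the smaller slice is also kept busier.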

H2: Autoscaling — From Reactive to Predictive

Kubernetes HPA with CPU metrics is useless for LLMs. The right signals are:

GPU utilisation — Trigger scale-up at 75%, scale-down at 30%
Request queue depth — If vLLM's internal queue > 16, add replicas
KV cache hit rate — Low hit rate = traffic pattern mismatch = consider prefix-aware routing
Token generation latency (P95) — Scale before users complain

H3: Custom Metrics HPA with Prometheus Adapter

yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-gpu-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-llama-70b
  minReplicas: 1
  maxReplicas: 10
  metrics:
  # Both metric names must be published through Prometheus Adapter rules.
  # vLLM's own series are prefixed `vllm:` (for example
  # vllm:num_requests_waiting); GPU utilisation usually comes from DCGM.
  - type: Pods
    pods:
      metric:
        name: vllm_gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "75"
  - type: Pods
    pods:
      metric:
        name: vllm_request_queue_length
      target:
        type: AverageValue
        averageValue: "16"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2            # add at most 2 pods per minute
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1            # remove at most 1 pod every 2 minutes
        periodSeconds: 120

Critical: Scale-down stabilization of 300s prevents flapping. LLM cold starts are expensive. Don't thrash.
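It helps to know how the HPA turns those targets into a replica count. For each metric it computes `ceil(currentReplicas * observed / target)` and then takes the largest proposal. The sketch below reproduces that core formula only. It leaves out the controller's tolerance band, stabilization windows, and rate policies, and the observed values are made up for the example.

```python
import math


def desired_replicas(current: int, metrics: list[tuple[float, float]],
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """metrics holds (observed_average, target_average) pairs, one per metric."""
    proposals = [math.ceil(current * observed / target)
                 for observed, target in metrics]
    # The metric asking for the most replicas wins, within the min/max bounds.
    return max(min_replicas, min(max_replicas, max(proposals)))


# Hypothetical readings: GPU utilisation at 90 (target 75), queue at 40 (target 16).
print(desired_replicas(current=2, metrics=[(90, 75), (40, 16)]))
```

Here the queue metric dominates and asks for 5 replicas. With a scale-up policy that adds at most 2 pods per 60 seconds, the Deployment would get there in steps rather than in one jump.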

H2: Security and Observability — The Production Checklist

H3: Network Policies

Isolate inference pods. They hold model weights and may process sensitive prompts.

yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-isolate
  namespace: inference
spec:
  podSelector:
    matchLabels:
      app: vllm-llama-70b
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Set automatically on every namespace, unlike a custom `name` label.
          kubernetes.io/metadata.name: gateway
    ports:
    - protocol: TCP
      port: 8000

H3: Monitoring Stack

Component	Tool	Metric
GPU metrics	DCGM Exporter	GPU util, memory, temperature, NVLink bandwidth
Inference metrics	vLLM Prometheus endpoint	TTFT, TPOT, queue depth, batch size
Cluster metrics	kube-prometheus-stack	Node util, pod scheduling latency
Log aggregation	Loki + Grafana	Error rates, model load failures

Alert I always set: vllm_gpu_memory_usage_percent > 95 for 2m. OOM is coming. Evacuate.
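The "for 2m" part of that alert matters as much as the threshold: the condition has to hold continuously for the whole window, so a brief spike does not page anyone. The sketch below models that rule over a list of timestamped samples. It is an illustration of the semantics, not Prometheus code, and the 30-second scrape interval in the example data is an assumption.

```python
def alert_fires(samples: list[tuple[float, float]],
                threshold: float = 95.0, hold_seconds: float = 120.0) -> bool:
    """samples are (timestamp_seconds, value) pairs sorted by time."""
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of this breach
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None  # any dip below the threshold resets the clock
    return False


sustained = [(0, 96), (30, 97), (60, 96), (90, 98), (120, 97)]
with_dip = [(0, 96), (30, 97), (60, 90), (90, 98), (120, 97)]

print(alert_fires(sustained))
print(alert_fires(with_dip))
```

The second series never fires because the dip at 60 seconds restarts the window, leaving only 30 seconds of continuous breach by the last sample.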

H2: FAQ — Kubernetes LLM Inference in 2026

H3: Why not just use a managed API like OpenAI?

Cost and control. At 10M tokens/day, self-hosted vLLM on spot GPU instances is 60-70% cheaper. Plus you control data residency, model versions, and fine-tuned weights.

H3: Do I need Kubernetes for a single model on one GPU?

No. Docker + systemd is fine for prototyping. Kubernetes pays off at 2+ models or any autoscaling requirement.

H3: What's the difference between KServe and vLLM production-stack?

vLLM production-stack is a batteries-included Helm chart for single-cluster vLLM. KServe is a model-agnostic serving framework with enterprise features (canary, auth, monitoring). Use KServe for heterogeneous models; use production-stack for quick vLLM-only deployments.

H3: Can I run training and inference on the same cluster?

Yes, with Kueue. Use separate ClusterQueues with preemption policies. Mark inference as higher priority. Training jobs yield GPU gracefully.

H3: Is MIG supported on H100s?

Yes, and it's better than A100 MIG. Both generations support up to 7 MIG instances per GPU, but the H100's second-generation MIG gives each instance more compute and memory bandwidth, with stronger multi-tenant performance isolation.

H3: How do I handle model weight storage?

Use Fluid or a ReadWriteMany PVC backed by high-throughput NFS/parallel filesystem. For multi-zone, sync weights to each zone's object storage bucket and mount via CSI.

H3: What's the latency cost of KServe + llm-d vs. raw vLLM?

~3-5ms added per hop (router + envoy). For user-facing chat, this is invisible. For high-frequency trading or real-time code completion, run raw vLLM with a lightweight load balancer.

H2: Conclusion — Build the Fleet, Not the Boat

If there's one lesson from running AutoBlogging.Pro's inference layer at scale: a single powerful GPU node is a liability. It fails. It bottlenecks. It burns cash when idle.

A Kubernetes-orchestrated fleet of vLLM + KServe + llm-d + Kueue is an asset. It scales elastically. It shares hardware across teams. It routes intelligently. It fails gracefully.

The 2026 stack is no longer experimental. Red Hat ships it. CoreWeave runs it. Google Cloud's GKE has it in one-click. The primitives are stable. The only question is whether your architecture is ready.


Written by Essa Mamdani — AI Engineer, Software Architect, and builder of AutoBlogging.Pro. For more infrastructure deep-dives, subscribe to the newsletter.
