Kubernetes Autoscaling Without the Surprises
HPA, VPA, and Cluster Autoscaler operate on three different axes. Confuse them and you get 2 a.m. pages. Get them right and the cluster breathes on its own.
Autoscaling is one of those Kubernetes features that looks trivial in a demo and quietly causes 2 a.m. pages in production. The gap is almost never the controller itself; it's the request/limit hygiene, the metrics pipeline, and the assumptions baked into the thresholds.
There are three autoscalers, they operate on different axes, and confusing them is the root of most scaling pain. Let's separate them cleanly, then watch one of them work.
Three autoscalers, three axes
- Horizontal Pod Autoscaler (HPA) changes the number of pods based on a metric, typically CPU, memory, or a custom/external signal like queue depth.
- Vertical Pod Autoscaler (VPA) changes the requests and limits of pods: it rightsizes them. Running VPA in Auto mode and HPA on CPU at the same time fights itself; keep VPA in Off/recommendation mode there.
- Cluster Autoscaler changes the number of nodes when pods can't be scheduled; it's what makes HPA's new pods actually land somewhere.
HPA is only as good as your requests
HPA on CPU computes utilisation as usage ÷ request. If your CPU requests are wrong, your scaling is wrong. Garbage requests in, garbage replicas out.
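To make the usage ÷ request relationship concrete, here is a minimal sketch. The function name and the 350m usage figure are illustrative assumptions, and it looks at a single pod for simplicity, whereas the controller works from an average across pods. The same real load reads as three very different utilisation figures depending only on the request.

```python
def cpu_utilisation_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilisation as HPA sees it: usage divided by request, as a percentage."""
    if request_millicores <= 0:
        raise ValueError("a positive CPU request is needed to compute utilisation")
    return 100.0 * usage_millicores / request_millicores


usage = 350  # observed CPU per pod in millicores (hypothetical figure)

for request in (250, 500, 1000):
    pct = cpu_utilisation_percent(usage, request)
    print(f"request={request}m -> utilisation={pct:.0f}%")
# request=250m -> utilisation=140%
# request=500m -> utilisation=70%
# request=1000m -> utilisation=35%
```

Against a 70% target, the under-sized request looks like an emergency and the over-sized one looks idle, even though the pod is doing exactly the same work.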
Watch an HPA breathe
The simulator below runs the real HPA arithmetic. Drag traffic up and the controller computes a desired replica count from observed CPU and your target, clamped between minReplicas and maxReplicas. Toggle Live traffic to watch it react to a load wave on its own.
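The arithmetic itself is small enough to sketch. The version below follows the formula in the Kubernetes HPA documentation, desired = ceil(current × currentMetric ÷ targetMetric), with the documented default tolerance of 0.1 and a clamp to the min/max bounds. The function and parameter names are illustrative, and the sketch leaves out details the real controller handles, such as unready pods and missing metrics.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilisation: float,
                     target_utilisation: float,
                     min_replicas: int,
                     max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Desired replica count from observed vs. target utilisation, clamped."""
    ratio = current_utilisation / target_utilisation
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: leave the replica count alone.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp between minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


# Bounds and target taken from the article's example HPA: min 2, max 12, target 70%.
print(desired_replicas(4, 105, 70, 2, 12))   # 6  (load is 1.5x the target)
print(desired_replicas(4, 73, 70, 2, 12))    # 4  (within tolerance, no change)
print(desired_replicas(10, 140, 70, 2, 12))  # 12 (wants 20, capped at max)
print(desired_replicas(4, 20, 70, 2, 12))    # 2  (wants 2, which is also the floor)
```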
A production-grade HPA
The defaults are not production defaults. The behavior block is the part most teams skip and the part that prevents replica thrashing: scale up fast, scale down slow.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api   # metadata.name is required; this name is illustrative
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: checkout-api }
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react to spikes immediately
      policies: [ { type: Percent, value: 100, periodSeconds: 30 } ]
    scaleDown:
      stabilizationWindowSeconds: 300   # cool down slowly, avoid flapping
      policies: [ { type: Percent, value: 50, periodSeconds: 60 } ]
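To see what those behavior settings allow, here is a rough sketch of the two mechanisms: a Percent policy caps how far the replica count can move within one period, and the scale-down stabilization window acts on the highest recommendation seen during the window. All function names are illustrative, the rounding is an assumption, and the model treats each period as one discrete step, which is simpler than what the real controller does.

```python
import math


def scale_up_ceiling(replicas_at_period_start: int, percent: int = 100) -> int:
    """Most replicas reachable within one period under a Percent scale-up policy."""
    return math.ceil(replicas_at_period_start * (1 + percent / 100))


def scale_down_floor(replicas_at_period_start: int, percent: int = 50) -> int:
    """Fewest replicas reachable within one period under a Percent scale-down policy."""
    return math.ceil(replicas_at_period_start * (1 - percent / 100))


def stabilized_scale_down(recommendations_in_window: list[int]) -> int:
    """Scale-down stabilization: act on the highest recent recommendation."""
    return max(recommendations_in_window)


def scale_up_path(start: int, desired: int) -> list[int]:
    """Replica count after each 30 s period while scaling up toward `desired`."""
    path, replicas = [start], start
    while replicas < desired:
        replicas = min(desired, scale_up_ceiling(replicas))
        path.append(replicas)
    return path


def scale_down_path(start: int, desired: int) -> list[int]:
    """Replica count after each 60 s period while scaling down toward `desired`."""
    path, replicas = [start], start
    while replicas > desired:
        nxt = max(desired, scale_down_floor(replicas))
        if nxt == replicas:  # cannot shrink further under this policy
            break
        replicas = nxt
        path.append(replicas)
    return path


print(scale_up_path(2, 12))              # [2, 4, 8, 12]
print(scale_down_path(12, 2))            # [12, 6, 3, 2]
print(stabilized_scale_down([4, 9, 5]))  # 9
```

Under these assumptions, going from 2 to 12 replicas takes three 30-second periods, while dropping back takes three 60-second periods that only begin once the 300-second window stops containing higher recommendations. That asymmetry is the "scale up fast, scale down slow" shape.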
The pitfalls that actually page you
- No metrics server / wrong metric. HPA needs a working metrics pipeline. CPU is a poor proxy for many web workloads; scale on requests-per-second or queue depth via custom metrics when CPU and load decouple.
- Cluster Autoscaler can't keep up. HPA asks for 10 pods, but if nodes take 90 seconds to join, your pods sit Pending through the spike. Over-provision a small buffer of warm capacity for latency-critical paths.
- Missing PodDisruptionBudgets. Aggressive scale-down plus a node drain can take you below quorum. A PDB protects minimum availability.
- Slow start, fast traffic. If a pod needs 40 seconds to warm up but starts taking traffic at second 2, you autoscale into a wall of cold errors. Keep readiness probes honest: a pod should report ready only once it can actually serve traffic.
Key takeaways
- HPA scales pods, VPA rightsizes them, Cluster Autoscaler scales nodes; don't conflate them.
- HPA-on-CPU is only as accurate as your CPU requests; rightsize first.
- Use the behavior block: scale up fast, scale down slow, to stop flapping.
- Pair autoscaling with PDBs, honest readiness probes, and warm node capacity.