Kubernetes

Kubernetes Autoscaling Without the Surprises

HPA, VPA, and Cluster Autoscaler operate on three different axes. Confuse them and you get 2 a.m. pages. Get them right and the cluster breathes on its own.

Cloud X Ops TeamDevOps & SRE Consultancy
May 21, 2026
8 min read

Autoscaling is one of those Kubernetes features that looks trivial in a demo and quietly causes 2 a.m. pages in production. The gap is almost never the controller itself, it's the request/limit hygiene, the metrics pipeline, and the assumptions baked into the thresholds.

There are three autoscalers, they operate on different axes, and confusing them is the root of most scaling pain. Let's separate them cleanly, then watch one of them work.

Three autoscalers, three axes

  • Horizontal Pod Autoscaler (HPA) changes the number of pods based on a metric, typically CPU, memory, or a custom/external signal like queue depth.
  • Vertical Pod Autoscaler (VPA) changes the requests and limits of pods, it rightsizes them. Running VPA in Auto mode and HPA on CPU at the same time fights itself; keep VPA in Off/recommendation mode there.
  • Cluster Autoscaler changes the number of nodes when pods can't be scheduled, it's what makes HPA's new pods actually land somewhere.

HPA is only as good as your requests

HPA on CPU computes utilisation as usage ÷ request. If your CPU requests are wrong, your scaling is wrong. Garbage requests in, garbage replicas out.

Watch an HPA breathe

The simulator below runs the real HPA arithmetic. Drag traffic up and the controller computes a desired replica count from observed CPU and your target, clamped between minReplicas and maxReplicas. Toggle Live traffic to watch it react to a load wave on its own.

HPA Simulator
Requests / sec200
Target CPU70%
Replica range2, 12
RPS / pod120
Observed CPU0%
Stable
Pods2 running
Live, desired replicas = ceil(current × observedCPU / targetCPU)

A production-grade HPA

The defaults are not production defaults. The behavior block is the part most teams skip and the part that prevents replica thrashing, scale up fast, scale down slow.

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  scaleTargetRef: { kind: Deployment, name: checkout-api }
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 70 } }
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0      # react to spikes immediately
      policies: [ { type: Percent, value: 100, periodSeconds: 30 } ]
    scaleDown:
      stabilizationWindowSeconds: 300    # cool down slowly, avoid flapping
      policies: [ { type: Percent, value: 50, periodSeconds: 60 } ]

The pitfalls that actually page you

  1. No metrics server / wrong metric. HPA needs a working metrics pipeline. CPU is a poor proxy for many web workloads, scale on requests-per-second or queue depth via custom metrics when CPU and load decouple.
  2. Cluster Autoscaler can't keep up. HPA asks for 10 pods, but if nodes take 90 seconds to join, your pods sit Pending through the spike. Over-provision a small buffer of warm capacity for latency-critical paths.
  3. Missing PodDisruptionBudgets. Aggressive scale-down plus a node drain can take you below quorum. A PDB protects minimum availability.
  4. Slow start, fast traffic. If a pod needs 40 seconds to warm up but starts taking traffic at second 2, you autoscale into a wall of cold errors. Get readiness probes honest.

Key takeaways

  • HPA scales pods, VPA rightsizes them, Cluster Autoscaler scales nodes, don't conflate them.
  • HPA-on-CPU is only as accurate as your CPU requests; rightsize first.
  • Use the behavior block: scale up fast, scale down slow, to stop flapping.
  • Pair autoscaling with PDBs, honest readiness probes, and warm node capacity.

Autoscaling that doesn't page you at 2 a.m.?

We tune HPA behavior, rightsize requests, and pair it with warm capacity and disruption budgets so your clusters scale predictably under real load.

Review my cluster
connect.sh
SECURE
cloudxops@client:~$ kubectl get hpa -A
# Auditing autoscaling config...
[OK] behavior blocks tuned (up:fast / down:slow)
[INFO] CPU requests rightsized via VPA recommendations
[READY] PDBs + warm pool in place
$
Connection strength