The average Kubernetes cluster runs at 13% CPU utilization and 20% memory utilization. That means 87% of provisioned CPU sits idle. Three autoscalers exist to close that gap — VPA, HPA, and KEDA — and each attacks the problem from a different angle. VPA shrinks oversized pods. HPA adds and removes pod replicas. KEDA extends HPA with event-driven triggers and the ability to scale to zero.
Choosing the wrong autoscaler does not just leave money on the table. It creates production incidents. VPA evicts pods to resize them, which restarts JVM applications cold. HPA thrashes replica counts without stabilization windows, causing 5xx errors during scale-down. KEDA’s scale-to-zero adds cold start latency that breaks SLA commitments. The cost savings are real, but only if the autoscaler matches the workload.
## Your Cluster Is Running at 13% CPU. The Autoscaler Choice Determines What Happens Next
The CNCF 2024 Kubernetes Benchmark Report analyzed 4,000 clusters. Average CPU utilization: 13%. Average memory utilization: 20%. Memory overprovisioning by cloud provider: Azure at 65%, AWS at 58%, GCP at 53%. Large clusters with 30,000+ CPUs reach 44% utilization — better, but still more than half idle.

The waste is not hypothetical. An estimated 35% of Kubernetes spending goes to overprovisioned resources. A ZeonEdge case study documented a reduction from $50,000 to $22,000 per month — 56% — using VPA right-sizing as a core component. The autoscaler choice is the primary lever for closing the gap between what you provision and what you use.
## How Each Autoscaler Works
VPA (Vertical Pod Autoscaler) adjusts CPU and memory requests on individual pods. The recommender component maintains a decaying histogram of actual usage over an 8-day window with a 24-hour half-life. When current requests diverge from recommendations, VPA evicts the pod and recreates it with updated resource values via the admission controller. VPA does not add pods. It makes existing pods the right size.
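As a sketch, a VPA object targeting a hypothetical Deployment looks like this (the `payment-api` name and the min/max bounds are illustrative, not from the source):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # hypothetical workload name
  updatePolicy:
    updateMode: "Auto"         # evict and recreate pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:              # cap recommendations below node capacity
        cpu: "2"
        memory: 4Gi
```

The `maxAllowed` block matters in practice: without it, recommendations can exceed what any node can schedule.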
HPA (Horizontal Pod Autoscaler) scales the number of pod replicas. It polls metrics every 15 seconds, compares observed CPU or memory utilization against a target threshold, and adjusts the replica count. When utilization exceeds 70% (a common target), HPA adds replicas. When it drops, HPA removes them. The scaling loop takes 2-4 minutes end-to-end from metric observation to new pod serving traffic.
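A minimal HPA manifest for the common 70% CPU target might look like this (the `web-api` name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # hypothetical workload name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas above 70% average CPU
```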
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 70+ external event sources. Instead of scaling on CPU utilization, KEDA scales on queue depth (SQS, Kafka, RabbitMQ), custom Prometheus queries, cron schedules, or HTTP request rates. KEDA creates and manages an HPA resource internally. Its defining capability: scaling to zero replicas when no events are present, and back to one or more when events arrive.
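A sketch of a KEDA ScaledObject that scales a hypothetical `queue-worker` Deployment on SQS queue depth, down to zero when the queue is empty (the queue URL, region, and thresholds are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker         # hypothetical Deployment name
  minReplicaCount: 0           # scale to zero when no messages are present
  maxReplicaCount: 30
  cooldownPeriod: 300          # seconds to wait before scaling back to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/work-queue
      queueLength: "5"         # target messages per replica
      awsRegion: "us-east-1"
```

KEDA generates and manages the underlying HPA from this object, which is why a hand-written HPA on the same Deployment conflicts with it.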

## The Numbers: Cost Savings, Latency, and Reaction Speed
| Dimension | VPA | HPA | KEDA |
|---|---|---|---|
| What it scales | Pod CPU/memory requests | Pod replica count | Replica count (including to zero) |
| Cost savings range | 30-40% compute reduction | 30-50% vs static provisioning | 50-70% for idle workloads |
| Reaction speed | 24-48 hours warmup for recommendations | 2-4 minutes end-to-end | 2-5 seconds cold start (simple apps) |
| Scale to zero | No | No (minimum 1 replica) | Yes |
| Latency impact | Pod eviction + restart | P95 latency: 398ms to 750ms during scaling | 15-60+ seconds cold start for heavy apps |
| Data source | 8-day usage histogram | CPU/memory metrics (15s poll) | 70+ event sources |
| Best workload type | Memory-bound, stable patterns | CPU-bound web/API servers | Event-driven, batch, intermittent |
The numbers expose a fundamental tradeoff. VPA delivers the deepest per-pod savings but cannot react to real-time spikes — it is backward-looking by design. HPA responds within minutes but adds latency during scaling events. KEDA offers the most aggressive cost savings through scale-to-zero but introduces cold start latency that can break SLA commitments for latency-sensitive services.
A concrete example: a queue processor handles 10,000 messages during business hours and zero overnight. With HPA, the minimum replica runs 24/7 — roughly 128 idle hours per week (16 overnight hours on weekdays plus the 48-hour weekend). With KEDA scaling to zero during those windows, the 128 hours cost nothing. At $0.05 per pod-hour, that is $6.40 per week per service. Across 50 microservices, that is $16,640 per year from one configuration change.
## The Compatibility Matrix: What You Can and Cannot Combine
| Combination | Works? | Why |
|---|---|---|
| VPA + HPA on same metric (CPU) | No | Death spiral: VPA lowers requests, HPA sees higher utilization %, scales up, VPA lowers again |
| VPA + HPA on different metrics | Yes | HPA scales horizontally on CPU, VPA right-sizes memory with `controlledResources: ["memory"]` |
| KEDA + manual HPA | No | KEDA creates its own HPA internally — two HPAs on the same deployment cause erratic scaling |
| KEDA + VPA | Yes | KEDA handles replica count, VPA handles resource requests — different dimensions, no conflict |
| VPA + HPA + KEDA all three | Partial | KEDA replaces HPA; combine KEDA + VPA on different resources only |
The death spiral between VPA and HPA on the same metric is the most common production mistake. VPA reduces CPU requests based on historical usage. This lowers the denominator in HPA’s utilization calculation. HPA now sees artificially high utilization and scales up replicas. More replicas distribute load, reducing per-pod usage. VPA sees lower per-pod usage and reduces requests further. The cycle repeats until pods are tiny and replica count is enormous.
The safe combination: HPA (or KEDA) for horizontal scaling decisions, VPA for memory right-sizing only. Set VPA’s controlledResources to ["memory"] so it ignores CPU entirely. This lets HPA own the CPU scaling decision without interference.
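In manifest form, the safe split looks roughly like this. The VPA below (names illustrative) manages only memory, leaving CPU requests untouched for HPA to scale against:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # same Deployment the HPA targets
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # never touch CPU; HPA owns it
```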
## Production Gotchas That Will Bite You
| Autoscaler | Gotcha | Symptom | Fix |
|---|---|---|---|
| VPA | Eviction cascades | JVM loses JIT-compiled code, Redis drops cache, WebSocket sessions disconnect | Use VPA in “Off” mode for recommendations only; apply during maintenance windows for stateful workloads |
| VPA | Recommendations exceed node capacity | Pods become permanently unschedulable | Set maxAllowed in VPA policy to stay within node instance type limits |
| VPA | JVM heap mismatch | Container gets more memory but JVM heap flags unchanged — JVM does not use it | Update JVM -Xmx flags in deployment spec when VPA increases memory |
| HPA | Thrashing without stabilization | 3-second traffic spike scales from 3 to 10 replicas and back, causing 5xx errors | Set behavior.scaleDown.stabilizationWindowSeconds: 300 and scaleUp.stabilizationWindowSeconds: 60 |
| HPA | Slow response to traffic spikes | 2-4 minutes from spike to new pod serving | Pre-scale before known traffic events; use KEDA cron scaler for predictable patterns |
| KEDA | Cold start on scale from zero | First requests get 503 until pod passes readiness probe | Set minReplicaCount: 1 for latency-sensitive services; accept scale-to-zero only for async workloads |
| KEDA | Cooldown misconfiguration | Pods scale to zero during brief pauses between events | Increase cooldownPeriod beyond the maximum gap between events (default: 300 seconds) |
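The stabilization-window fix from the HPA thrashing row is expressed through the `behavior` field of the autoscaling/v2 API (the deployment name and policy values here are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # ignore spikes shorter than 1 minute
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing replicas
      policies:
      - type: Percent
        value: 50                       # remove at most half the replicas
        periodSeconds: 60               # per minute
```

The asymmetry is deliberate: scale up quickly enough to absorb real load, scale down slowly enough that a brief lull does not trigger the 3-to-10-and-back oscillation.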
The VPA eviction problem deserves emphasis. When VPA decides a pod needs different resources, it evicts the pod. The new pod starts fresh. For a JVM application, this means losing all JIT-compiled bytecode — the application runs interpreted for minutes until the JIT warms up again. For a Redis instance, the in-memory cache is gone, causing a thundering herd of cache misses. For a WebSocket server, every connected client disconnects.
The fix: use VPA in `Off` mode. It still generates recommendations; apply them manually during maintenance windows or CI deploys rather than letting VPA evict pods autonomously. Kubernetes 1.35+ supports in-place pod resizing, which eliminates the eviction requirement — but adoption is still early.
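A recommendation-only VPA is a one-line change to `updateMode` (the `jvm-service` name is illustrative); the recommendations can then be read with `kubectl describe vpa`:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: jvm-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jvm-service          # hypothetical JVM workload
  updatePolicy:
    updateMode: "Off"          # recommend only; never evict pods
```

`kubectl describe vpa jvm-service-vpa` then shows the target and bound values, which can be copied into the deployment spec (and the `-Xmx` flags) on your own schedule.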
## The Decision Framework: Match Your Workload to Your Autoscaler

- **CPU-bound workloads** (web servers, API gateways, compute services): HPA with a CPU target of 60-70%. Add VPA in memory-only mode for right-sizing.
- **Memory-bound workloads** (JVM applications, caches, data processing): VPA in `Off` or `Auto` mode. Avoid VPA for stateful workloads where eviction causes data loss.
- **Event-driven workloads** (queue processors, batch jobs, webhooks, scheduled tasks): KEDA with the appropriate scaler. Scale to zero for intermittent workloads; use the KEDA cron scaler for predictable traffic patterns.
- **Mixed workloads**: KEDA replaces HPA for the horizontal scaling decision. Add VPA with `controlledResources: ["memory"]` for vertical right-sizing. This combination covers both dimensions without conflict.
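For the predictable-pattern case, a KEDA cron trigger sketch (the schedule, timezone, and names are assumptions) pre-scales a service for business hours and lets it drop to zero overnight:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler
spec:
  scaleTargetRef:
    name: report-service       # hypothetical Deployment name
  minReplicaCount: 0           # zero replicas outside the window
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5       # scale up at 08:00, Mon-Fri
      end: 0 18 * * 1-5        # scale down at 18:00, Mon-Fri
      desiredReplicas: "5"     # replicas held during the window
```

Because the scale-up happens before traffic arrives, this sidesteps both the HPA's 2-4 minute reaction lag and KEDA's cold start penalty.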
The 87% of idle CPU in your cluster is not a monitoring problem. It is an autoscaling problem. VPA closes the gap between what pods request and what they use. HPA matches replica count to actual traffic. KEDA eliminates cost for workloads that spend most of their time idle. Pick the autoscaler that matches your workload pattern, combine them safely where needed, and watch for the gotchas that turn cost savings into production incidents. The numbers are clear — 30-70% savings are achievable. The engineering is in choosing the right tool and configuring it correctly.