The average Kubernetes cluster runs at 13% CPU utilization and 20% memory utilization. That means 87% of provisioned CPU sits idle. Three autoscalers exist to close that gap — VPA, HPA, and KEDA — and each attacks the problem from a different angle. VPA shrinks oversized pods. HPA adds and removes pod replicas. KEDA extends HPA with event-driven triggers and the ability to scale to zero.
Choosing the wrong autoscaler does not just leave money on the table. It creates production incidents. VPA evicts pods to resize them, which restarts JVM applications cold. HPA thrashes replica counts without stabilization windows, causing 5xx errors during scale-down. KEDA’s scale-to-zero adds cold start latency that breaks SLA commitments. The cost savings are real, but only if the autoscaler matches the workload.
## Your Cluster Is Running at 13% CPU. The Autoscaler Choice Determines What Happens Next
The CNCF 2024 Kubernetes Benchmark Report analyzed 4,000 clusters. Average CPU utilization: 13%. Average memory utilization: 20%. Memory overprovisioning by cloud provider: Azure at 65%, AWS at 58%, GCP at 53%. Large clusters with 30,000+ CPUs reach 44% utilization — better, but still more than half idle.

The waste is not hypothetical. An estimated 35% of Kubernetes spending goes to overprovisioned resources. A ZeonEdge case study documented a reduction from $50,000 to $22,000 per month — 56% — using VPA right-sizing as a core component. The autoscaler choice is the primary lever for closing the gap between what you provision and what you use.
## How Each Autoscaler Works
VPA (Vertical Pod Autoscaler) adjusts CPU and memory requests on individual pods. The recommender component maintains a decaying histogram of actual usage over an 8-day window with a 24-hour half-life. When current requests diverge from recommendations, VPA evicts the pod and recreates it with updated resource values via the admission controller. VPA does not add pods. It makes existing pods the right size.
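As a sketch, a VPA object targeting a hypothetical Deployment looks like this (the `payment-api` name and the min/max bounds are illustrative, not from the source):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # hypothetical workload name
  updatePolicy:
    updateMode: "Auto"         # evict and recreate pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:              # cap recommendations below node capacity
        cpu: "2"
        memory: 4Gi
```

The `maxAllowed` block matters in practice: without it, recommendations can exceed what any node can schedule.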
HPA (Horizontal Pod Autoscaler) scales the number of pod replicas. It polls metrics every 15 seconds, compares observed CPU or memory utilization against a target threshold, and adjusts the replica count. When utilization exceeds 70% (a common target), HPA adds replicas. When it drops, HPA removes them. The scaling loop takes 2-4 minutes end-to-end from metric observation to new pod serving traffic.
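A minimal HPA manifest for the common 70% CPU target might look like this (the `web-api` name and replica bounds are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # hypothetical workload name
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add replicas above 70% average CPU
```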
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with 70+ external event sources. Instead of scaling on CPU utilization, KEDA scales on queue depth (SQS, Kafka, RabbitMQ), custom Prometheus queries, cron schedules, or HTTP request rates. KEDA creates and manages an HPA resource internally. Its defining capability: scaling to zero replicas when no events are present, and back to one or more when events arrive.
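A sketch of a KEDA ScaledObject that scales a hypothetical `queue-worker` Deployment on SQS queue depth, down to zero when the queue is empty (the queue URL, region, and thresholds are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker         # hypothetical Deployment name
  minReplicaCount: 0           # scale to zero when no messages are present
  maxReplicaCount: 30
  cooldownPeriod: 300          # seconds to wait before scaling back to zero
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/work-queue
      queueLength: "5"         # target messages per replica
      awsRegion: "us-east-1"
```

KEDA generates and manages the underlying HPA from this object, which is why a hand-written HPA on the same Deployment conflicts with it.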

## The Numbers: Cost Savings, Latency, and Reaction Speed
| Dimension | VPA | HPA | KEDA |
|---|---|---|---|
| What it scales | Pod CPU/memory requests | Pod replica count | Replica count (including to zero) |
| Cost savings range | 30-40% compute reduction | 30-50% vs static provisioning | 50-70% for idle workloads |
| Reaction speed | 24-48 hours warmup for recommendations | 2-4 minutes end-to-end | 2-5 seconds cold start (simple apps) |
| Scale to zero | No | No (minimum 1 replica) | Yes |
| Latency impact | Pod eviction + restart | P95 latency: 398ms to 750ms during scaling | 15-60+ seconds cold start for heavy apps |
| Data source | 8-day usage histogram | CPU/memory metrics (15s poll) | 70+ event sources |
| Best workload type | Memory-bound, stable patterns | CPU-bound web/API servers | Event-driven, batch, intermittent |
The numbers expose a fundamental tradeoff. VPA delivers the deepest per-pod savings but cannot react to real-time spikes — it is backward-looking by design. HPA responds within minutes but adds latency during scaling events. KEDA offers the most aggressive cost savings through scale-to-zero but introduces cold start latency that can break SLA commitments for latency-sensitive services.
A concrete example: a queue processor handles 10,000 messages during business hours and zero overnight. With HPA, the minimum replica runs 24/7 — roughly 128 idle hours per week (16 overnight hours on weekdays plus the 48-hour weekend). With KEDA scaling to zero during those windows, the 128 hours cost nothing. At $0.05 per pod-hour, that is $6.40 per week per service. Across 50 microservices, that is $16,640 per year from one configuration change.
## The Compatibility Matrix: What You Can and Cannot Combine
| Combination | Works? | Why |
|---|---|---|
| VPA + HPA on same metric (CPU) | No | Death spiral: VPA lowers requests, HPA sees higher utilization %, scales up, VPA lowers again |
| VPA + HPA on different metrics | Yes | HPA scales horizontally on CPU, VPA right-sizes memory with `controlledResources: ["memory"]` |
| KEDA + manual HPA | No | KEDA creates its own HPA internally — two HPAs on the same deployment cause erratic scaling |
| KEDA + VPA | Yes | KEDA handles replica count, VPA handles resource requests — different dimensions, no conflict |
| VPA + HPA + KEDA all three | Partial | KEDA replaces HPA; combine KEDA + VPA on different resources only |
The death spiral between VPA and HPA on the same metric is the most common production mistake. VPA reduces CPU requests based on historical usage. This lowers the denominator in HPA’s utilization calculation. HPA now sees artificially high utilization and scales up replicas. More replicas distribute load, reducing per-pod usage. VPA sees lower per-pod usage and reduces requests further. The cycle repeats until pods are tiny and replica count is enormous.
The safe combination: HPA (or KEDA) for horizontal scaling decisions, VPA for memory right-sizing only. Set VPA’s controlledResources to ["memory"] so it ignores CPU entirely. This lets HPA own the CPU scaling decision without interference.
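In manifest form, the safe split looks roughly like this. The VPA below (names illustrative) manages only memory, leaving CPU requests untouched for HPA to scale against:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # same Deployment the HPA targets
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      controlledResources: ["memory"]   # never touch CPU; HPA owns it
```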
## Production Gotchas That Will Bite You
| Autoscaler | Gotcha | Symptom | Fix |
|---|---|---|---|
| VPA | Eviction cascades | JVM loses JIT-compiled code, Redis drops cache, WebSocket sessions disconnect | Use VPA in “Off” mode for recommendations only; apply during maintenance windows for stateful workloads |
| VPA | Recommendations exceed node capacity | Pods become permanently unschedulable | Set maxAllowed in VPA policy to stay within node instance type limits |
| VPA | JVM heap mismatch | Container gets more memory but JVM heap flags unchanged — JVM does not use it | Update JVM -Xmx flags in deployment spec when VPA increases memory |
| HPA | Thrashing without stabilization | 3-second traffic spike scales from 3 to 10 replicas and back, causing 5xx errors | Set behavior.scaleDown.stabilizationWindowSeconds: 300 and scaleUp.stabilizationWindowSeconds: 60 |
| HPA | Slow response to traffic spikes | 2-4 minutes from spike to new pod serving | Pre-scale before known traffic events; use KEDA cron scaler for predictable patterns |
| KEDA | Cold start on scale from zero | First requests get 503 until pod passes readiness probe | Set minReplicaCount: 1 for latency-sensitive services; accept scale-to-zero only for async workloads |
| KEDA | Cooldown misconfiguration | Pods scale to zero during brief pauses between events | Increase cooldownPeriod beyond the maximum gap between events (default: 300 seconds) |
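The stabilization-window fix from the HPA thrashing row is expressed through the `behavior` field of the autoscaling/v2 API (the deployment name and policy values here are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # ignore spikes shorter than 1 minute
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes before removing replicas
      policies:
      - type: Percent
        value: 50                       # remove at most half the replicas
        periodSeconds: 60               # per minute
```

The asymmetry is deliberate: scale up quickly enough to absorb real load, scale down slowly enough that a brief lull does not trigger the 3-to-10-and-back oscillation.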
The VPA eviction problem deserves emphasis. When VPA decides a pod needs different resources, it evicts the pod. The new pod starts fresh. For a JVM application, this means losing all JIT-compiled bytecode — the application runs interpreted for minutes until the JIT warms up again. For a Redis instance, the in-memory cache is gone, causing a thundering herd of cache misses. For a WebSocket server, every connected client disconnects.
The fix: use VPA in `Off` mode. It still generates recommendations; apply them manually during maintenance windows or CI deploys rather than letting VPA evict pods autonomously. Kubernetes 1.35+ supports in-place pod resizing, which eliminates the eviction requirement — but adoption is still early.
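A recommendation-only VPA is a one-line change to `updateMode` (the `jvm-service` name is illustrative); the recommendations can then be read with `kubectl describe vpa`:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: jvm-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: jvm-service          # hypothetical JVM workload
  updatePolicy:
    updateMode: "Off"          # recommend only; never evict pods
```

`kubectl describe vpa jvm-service-vpa` then shows the target and bound values, which can be copied into the deployment spec (and the `-Xmx` flags) on your own schedule.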
## The Decision Framework: Match Your Workload to Your Autoscaler

- **CPU-bound workloads** (web servers, API gateways, compute services): HPA with a CPU target of 60-70%. Add VPA in memory-only mode for right-sizing.
- **Memory-bound workloads** (JVM applications, caches, data processing): VPA in `Off` or `Auto` mode. Avoid VPA for stateful workloads where eviction causes data loss.
- **Event-driven workloads** (queue processors, batch jobs, webhooks, scheduled tasks): KEDA with the appropriate scaler. Scale to zero for intermittent workloads; use the KEDA cron scaler for predictable traffic patterns.
- **Mixed workloads**: KEDA replaces HPA for the horizontal scaling decision. Add VPA with `controlledResources: ["memory"]` for vertical right-sizing. This combination covers both dimensions without conflict.
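For the predictable-pattern case, a KEDA cron trigger sketch (the schedule, timezone, and names are assumptions) pre-scales a service for business hours and lets it drop to zero overnight:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler
spec:
  scaleTargetRef:
    name: report-service       # hypothetical Deployment name
  minReplicaCount: 0           # zero replicas outside the window
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5       # scale up at 08:00, Mon-Fri
      end: 0 18 * * 1-5        # scale down at 18:00, Mon-Fri
      desiredReplicas: "5"     # replicas held during the window
```

Because the scale-up happens before traffic arrives, this sidesteps both the HPA's 2-4 minute reaction lag and KEDA's cold start penalty.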
The 87% of idle CPU in your cluster is not a monitoring problem. It is an autoscaling problem. VPA closes the gap between what pods request and what they use. HPA matches replica count to actual traffic. KEDA eliminates cost for workloads that spend most of their time idle. Pick the autoscaler that matches your workload pattern, combine them safely where needed, and watch for the gotchas that turn cost savings into production incidents. The numbers are clear — 30-70% savings are achievable. The engineering is in choosing the right tool and configuring it correctly.