The average mid-size production EC2 fleet runs at 12 to 23 percent utilization. The remaining 77 to 88 percent is idle compute that runs continuously, bills continuously, and produces nothing. On a 200-instance fleet of m5.xlarge equivalents, that idle slice is worth $150,000 to $250,000 a year. The numbers come from AWS Trusted Advisor and the Flexera 2025 State of the Cloud report, and they have not moved much in five years.
Right-sizing should fix this. Most teams treat it as judgement-driven: a senior engineer looks at peak CPU on a dashboard and picks an instance. Peak is the wrong signal. A workload with one weekly traffic spike to 80% CPU and a baseline of 12% gets sized for the 80% peak, leaving it over-provisioned by roughly 5x for the 167 hours a week it is not spiking. Multiply across 200 instances and that is where the $200k lives.
The fix is to stop using vibes. Five CloudWatch numbers per instance are enough to derive the right instance type from a lookup table. Same workload shape always picks the same instance. The procedure is deterministic, repeatable, and does not need a senior engineer’s approval. This post is the recipe, the lookup, and the 7-day sentinel that keeps right-sizing from reversing the first time someone says “this feels slow.”
The pattern composes with VPA, HPA, and KEDA scaling. Right-sizing sets the per-instance baseline; the autoscalers handle the variation around it.
12% utilization, 100% billing
The reverse-cost property is what makes right-sizing high-impact: waste and utilization run in opposite directions. The instances that waste the most money are not the ones with high utilization. They are the ones with low utilization. High-utilization instances are running flat-out and are by definition appropriately sized. Low-utilization instances are over-provisioned and continuously paying for headroom they do not use.
| Average CPU utilization | Share of fleet | Share of spend that is recoverable |
|---|---|---|
| Above 60% (right-sized) | 5-15% | <5% |
| 30-60% (mildly over) | 25-35% | 15-25% |
| 10-30% (significantly over) | 35-45% | 50-65% |
| Below 10% (extreme over) | 15-25% | 25-35% |
The 15-25% of the fleet running below 10% CPU is where the extreme savings live. These are the instances that got provisioned for a launch that never came, a workload that moved to a different platform, or a “what if traffic 10x” scenario that did not happen. Right-sizing them first concentrates the saving without touching the rest.
The five numbers
Five CloudWatch metrics per instance, computed over a 30-day window:
| Metric | What it tells you | CloudWatch query |
|---|---|---|
| p50 CPU | Typical-day load; baseline for sizing | Statistics: p50, Period: 60, MetricName: CPUUtilization |
| p99 CPU | Worst case the workload sees | Statistics: p99, Period: 60 |
| p50 memory | Baseline memory footprint (CloudWatch agent required) | MetricName: mem_used_percent, Statistics: p50 |
| p99 memory | Worst memory pressure | MetricName: mem_used_percent, Statistics: p99 |
| Burst duration | How long the worst case actually lasts | Custom: count consecutive 1-min samples above 85% CPU |
EC2 publishes CPU metrics at 5-minute resolution by default; detailed monitoring brings that down to 60 seconds. Over 30 days that is 43,200 data points per metric, plenty for stable percentile calculation. (CloudWatch keeps one-minute datapoints for 15 days before rolling them up to 5-minute resolution, so pull the window early or accept coarser points for the older half.) The memory metric requires the CloudWatch unified agent installed on the instance; without it, memory is invisible to AWS and you have to fall back to instance-internal collection.
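A minimal boto3 sketch of pulling the CPU percentiles for one instance (the instance ID is a placeholder; memory is the same call against the CWAgent namespace and mem_used_percent). It fetches daily p50/p99 datapoints and summarizes them client-side, which stays inside the GetMetricStatistics 1,440-datapoint cap and sidesteps the minute-level retention limit:

```python
import boto3
from datetime import datetime, timedelta, timezone

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
    StartTime=start,
    EndTime=end,
    Period=86400,                       # one datapoint per day, 30 over the window
    ExtendedStatistics=["p50", "p99"],  # percentiles live here, not in Statistics
)
days = resp["Datapoints"]
p50s = sorted(d["ExtendedStatistics"]["p50"] for d in days)
p50 = p50s[len(p50s) // 2]                               # median daily p50
p99 = max(d["ExtendedStatistics"]["p99"] for d in days)  # worst day's p99
print(f"p50={p50:.1f}%  p99={p99:.1f}%")
```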
The burst duration is the one most teams skip and it is the one that makes the recipe work. Two workloads can have identical p50 and p99 CPU but completely different shapes. One bursts to p99 for 30 seconds at a time twice a day; the other sustains p99 for 2 hours during business peak. The right instance type is different for each.
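Burst duration is not a built-in statistic. A sketch of the custom computation, given a raw per-minute CPU series (for example, the Average statistic at Period=60):

```python
def longest_burst_minutes(samples: list[float], threshold: float = 85.0) -> int:
    """Longest run of consecutive 1-minute samples above the CPU threshold."""
    longest = current = 0
    for cpu in samples:
        current = current + 1 if cpu > threshold else 0
        longest = max(longest, current)
    return longest

# 12% baseline with one 40-minute business peak -> 40
print(longest_burst_minutes([12.0] * 600 + [92.0] * 40 + [12.0] * 600))
```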
Burst duration is the missing axis
The T-family versus M-family question hinges on burst duration. T-family instances accumulate CPU credits during low utilization and spend them during bursts. M-family instances do not; they have full CPU available continuously, billed at a higher rate per hour.
Credit math for the popular t3.medium (2 vCPUs, earns 24 CPU credits per hour, 576-credit cap; one credit is one vCPU at 100% for one minute, so the earn rate exactly covers a 20% baseline):
A workload whose p99 burst pins both vCPUs for 2 minutes once a day spends about 4 credits per burst and earns 576 a day. The balance stays pinned at the cap. t3.medium is the right instance.
A workload that runs near the 20% baseline and then sustains full CPU for 30 minutes every business peak spends about 60 credits per peak, while the baseline load already consumes the entire earn rate. The balance drains roughly 50 credits a day against the 576-credit cap; inside two weeks the credits are gone, and the instance throttles back to the 20% baseline mid-burst, causing user-facing latency degradation and pages. t3.medium is the wrong instance. m5.large at roughly 2x the hourly rate but with no credit math is correct.
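Both scenarios are easy to check with a minute-by-minute simulation. A sketch using the t3.medium constants above (throttling itself is not modeled; a balance stuck at zero means the burst stalls):

```python
EARN_PER_MIN = 24 / 60  # t3.medium earns 24 credits per hour
CAP = 576               # maximum banked credits
VCPUS = 2               # 1 credit = 1 vCPU-minute at 100%

def balance_after(burst_minutes: int, baseline_util: float, days: int = 14) -> float:
    """Credit balance after `days` of one daily burst to 100% CPU."""
    balance = float(CAP)  # start with a full bank
    for _ in range(days):
        for minute in range(24 * 60):
            util = 1.0 if minute < burst_minutes else baseline_util
            balance += EARN_PER_MIN - util * VCPUS
            balance = max(0.0, min(CAP, balance))
    return balance

print(balance_after(burst_minutes=2, baseline_util=0.12))   # pinned at the cap
print(balance_after(burst_minutes=30, baseline_util=0.20))  # zero before day 14
```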
| Burst duration (p99 sustained) | Family | Why |
|---|---|---|
| <5 minutes, infrequent | T3 / T4g | Credits comfortably cover bursts |
| 5-30 minutes, daily | T3 / T4g with monitoring | Credit balance under pressure; alarm on credit drain |
| 30+ minutes, sustained | M5 / M6i / M7i | Credit math is hostile; M-family avoids it |
| Steady high CPU | C5 / C6i (compute-optimized) | High CPU at lower per-vCPU price |
| Memory-bound (p99 mem >70%) | R5 / R6i (memory-optimized) | Memory ratio matters more than CPU |
The lookup table
Given the five numbers, instance selection is deterministic.
| p99 CPU | Burst duration | p99 memory | Recommended class |
|---|---|---|---|
| <40% | any | <50% | t3.small or smaller |
| 40-70% | <30 min | 50-70% | t3.medium / t3.large |
| 40-70% | 30+ min | <70% | m5.large / m5.xlarge |
| 70-90% | any | <70% | M5 sized to p99 / 0.85 |
| 70-90% | any | 70-85% | R5 sized to memory |
| >90% sustained | any | any | C5 or specialized; investigate workload |
The “size to p99 / 0.85” rule keeps a 15% headroom above worst-case observed CPU. This is enough to absorb noise, leave room for periodic batch jobs, and survive a 1.5x traffic spike before HPA kicks in. Teams that size to p99 exactly run into pages on minor anomalies; teams that size to p99 / 0.5 are back to over-provisioning.
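The table transcribes directly into code. A sketch of the lookup as a hypothetical helper, not an AWS API; the fall-through for low-CPU, high-memory shapes is a judgment call the table leaves open:

```python
def recommend(p99_cpu: float, burst_minutes: float, p99_mem: float) -> str:
    """Deterministic instance-class lookup from the observed workload shape."""
    if p99_cpu > 90:
        return "C5 or specialized; investigate workload"
    if p99_cpu >= 70:
        if p99_mem >= 70:
            return "R5 sized to memory"
        return "M5 sized to p99 / 0.85"    # 15% headroom above worst case
    if p99_cpu >= 40:
        if burst_minutes >= 30:
            return "m5.large / m5.xlarge"  # credit math is hostile
        return "t3.medium / t3.large"
    if p99_mem < 50:
        return "t3.small or smaller"
    return "t3.medium / t3.large"          # low CPU, memory above the t3.small band
```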
The same lookup applies to Kubernetes node pools. The five numbers are computed at the node level instead of per-pod, the burst duration is the cluster-wide burst (since pods are bin-packed), and the recommended instance is the node-pool default.
The 7-day sentinel
Right-sizing without rollback is a one-way street. The first time a senior engineer says “this feels slow,” the change reverses, and the team learns to never right-size again. The sentinel is what keeps the data winning.
The procedure, with a sketch of the revert decision after the list:
- Deploy the smaller instance to a single canary instance in the affected ASG
- Watch p99 application latency on that instance for 7 days
- If latency degrades by more than 5% versus the control group, auto-revert to the original instance
- If latency stays within 5%, roll out to the rest of the ASG
- After full rollout, watch p99 latency for another 7 days; auto-revert if degraded
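A sketch of the decision at the core of the sentinel; the latency inputs would come from whatever the metrics pipeline already measures:

```python
DEGRADATION_THRESHOLD = 0.05  # 5%: wider than noise, narrower than user-visible

def sentinel_verdict(canary_p99_ms: float, control_p99_ms: float) -> str:
    """Compare 7-day p99 latency on the canary against the control group."""
    degradation = (canary_p99_ms - control_p99_ms) / control_p99_ms
    if degradation > DEGRADATION_THRESHOLD:
        return "revert"   # terminate the canary; the ASG relaunches on the original type
    return "promote"      # roll the smaller instance out to the rest of the ASG

assert sentinel_verdict(212.0, 200.0) == "revert"   # 6% worse: the data says no
assert sentinel_verdict(206.0, 200.0) == "promote"  # 3% worse: within noise
```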
The 5% threshold is the right number because it is wider than measurement noise (typically 1-2%) but narrower than user-perceptible degradation (typically 10%+). The 7-day window catches weekly traffic patterns, including the Sunday batch jobs that often cause the surprise burst that broke previous right-sizing attempts.
The sentinel runs as an autonomous closed loop. Detect: a latency-degradation alarm. Decide: roll back the right-size. Act: terminate the smaller instance and let the ASG re-launch on the original type. Verify: latency returns to baseline. The auto-revert is what makes right-sizing safe to ship at fleet scale.
Where the $200k actually lives
Run the recipe across the fleet, sort by potential saving descending, and 80% of the dollars come from 20% of the instances. Those instances are easy to identify: they have p99 CPU below 30%, p99 memory below 50%, and current instance class at xlarge or larger.
The CloudWatch Metrics Insights query is short: group by InstanceId, compute the 30-day average of CPUUtilization, sort ascending, limit 50 (max CPU and avg mem_used_percent come from parallel queries joined on InstanceId, since Metrics Insights returns one aggregate per query). The top 50 instances from this query are the right-sizing candidates worth touching first, and because the list skews to the xlarge-and-larger offenders, the per-instance saving is large: dropping an m5.xlarge ($140 per month on demand) to m5.large saves $70 a month, and the 2xlarge and 4xlarge entries dropped two sizes save $210 to $420 each. Across the 50, that compounds to roughly $150,000 to $250,000 a year: the idle slice from the opening paragraph.
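The same sweep in boto3, as a sketch (assumes EC2 describe and CloudWatch read permissions; memory would be a second pass against the CWAgent namespace):

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(days=30)

averages = {}
for page in ec2.get_paginator("describe_instances").paginate():
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            stats = cw.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance["InstanceId"]}],
                StartTime=start,
                EndTime=end,
                Period=86400,            # daily averages are enough for ranking
                Statistics=["Average"],
            )
            points = [d["Average"] for d in stats["Datapoints"]]
            if points:
                averages[instance["InstanceId"]] = sum(points) / len(points)

# Bottom 50 by 30-day average CPU: the first-tranche right-sizing candidates
candidates = sorted(averages.items(), key=lambda kv: kv[1])[:50]
```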
The remaining 150 instances are mostly already right-sized or close to it. Touching them produces less saving and more risk. The reverse-cost property tells you to stop after the worst 20% rather than try to optimize every instance.
Right-sizing without vibes is the same five numbers, the same lookup table, and the same sentinel for every workload. The senior engineer’s judgement is replaced by a procedure that runs to completion in an afternoon, ships measurable saving in 14 days, and reverses safely if any single canary degrades. The $200k a year exists because right-sizing has been treated as a vibes-driven exercise; it does not have to be.

