Stop/Start Is Not Right-Sizing: Two Rule Classes, Not One

The most common form of cloud auto-remediation is stop/start. Find an idle resource, turn it off. Need it later, turn it on. It is real savings, and it is the easy half.

The hard half never goes idle. According to Flexera’s 2025 State of the Cloud report, 27% of cloud spend is wasted, and most of that waste comes from resources that are busy, not idle. An overprovisioned database runs at full price around the clock, serving real traffic on an instance twice the size it needs. A log group keeps years of logs nobody reads. An S3 bucket stores cold data on a hot tier, paying a premium rate for objects read once a quarter. None of those resources is idle, so a shutdown rule never touches them. They sit at the wrong size, the wrong tier, or the wrong retention setting, billing full price every hour of every day. ZopNight v2.0 makes fixing them its own rule class and takes the remediable-rule count to roughly 132 across AWS, Azure, and GCP.

Stop/Start Fixes the Easy Half of Cloud Waste

Start with where the money actually leaks. Per Flexera, inside that 27% of wasted cloud spend, 35% is idle compute and 25% is overprovisioned instances, so idle and overprovisioned together are about 60% of all cloud waste.
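
To make those nested percentages concrete, here is the arithmetic as a short sketch. The annual bill is a hypothetical round number chosen only for illustration; the percentages are the Flexera figures quoted above.

```python
# Hypothetical annual cloud bill, used only to turn percentages into dollars.
annual_spend = 1_000_000

wasted = annual_spend * 27 // 100        # 27% of spend is wasted
idle = wasted * 35 // 100                # 35% of that waste is idle compute
overprovisioned = wasted * 25 // 100     # 25% of that waste is overprovisioned

print(wasted, idle, overprovisioned)     # 270000 94500 67500
```

On this hypothetical bill, a stop/start rule can reach at most the 94,500 in idle compute. The 67,500 in overprovisioned instances needs a different kind of rule.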

Stop/start attacks the idle 35%. That is genuine, and it is the part most tools ship first because it is the safest: powering off a resource nobody is using is easy to justify and easy to reverse. This is the same logic behind automated cloud scheduling for non-prod, where the whole environment is idle overnight.

The overprovisioned 25% is a different animal. It is busy. It serves traffic. It just does so on twice the instance it needs, or with a retention window measured in years, or on a storage class priced for data you read once a quarter. A rule that looks for idleness will never flag it, because by every utilization signal it is in use. That is the trap behind the right-sizing problem: a resource can be fully utilized at the wrong size.

Waste type	Example	Does stop/start fix it?
Idle resource	Dev box left running over the weekend	Yes
Overprovisioned size	RDS on a 2x instance at 30% load	No
Wrong storage tier	Cold objects on S3 Standard	No
Excess retention	CloudWatch Logs kept indefinitely	No
Misconfigured feature	ELB cross-zone off, gp2 instead of gp3	No

Four of those five rows are invisible to stop/start. That is the gap the config rule class fills.
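
A minimal sketch of why an idleness check misses the busy rows. The thresholds, the `Usage` type, and the resource names are all hypothetical, chosen for illustration; they are not ZopNight's actual detection logic.

```python
from dataclasses import dataclass

# Hypothetical thresholds, for illustration only.
IDLE_PEAK_CPU = 5.0         # below this peak, nobody is using the resource
OVERSIZED_PEAK_CPU = 40.0   # below this peak, the resource is in use but too big


@dataclass
class Usage:
    name: str
    avg_cpu: float   # average CPU percent over the lookback window
    peak_cpu: float  # peak CPU percent over the same window


def classify(u: Usage) -> str:
    """Separate 'idle' (a stop/start candidate) from 'overprovisioned' (a config candidate)."""
    if u.peak_cpu < IDLE_PEAK_CPU:
        return "idle"
    if u.peak_cpu < OVERSIZED_PEAK_CPU:
        return "overprovisioned"
    return "right-sized"


print(classify(Usage("dev-box", avg_cpu=1.0, peak_cpu=3.0)))      # idle
print(classify(Usage("orders-db", avg_cpu=30.0, peak_cpu=38.0)))  # overprovisioned
```

The database at 30% load never trips the idle check, which is why it needs a second test with a different threshold and a different remedy.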

Right-Sizing Is a Different Operation Than Power Toggling

Stop/start is a power-state change. It is reversible by definition: the instance you stopped is the instance you start, unchanged. It is also stateless from the resource’s point of view. Nothing about the resource’s shape is different afterward.

A right-sizing or config change is neither reversible by default nor stateless. You are mutating a live resource: shrinking an instance, switching a storage class, lowering a retention window, flipping a database flag. Some of those changes reverse cleanly, and you can flip them back at no cost. Others do not reverse at all, or reverse only at a real cost. Lowering log retention deletes the logs past the new window, and those logs are gone for good. Resizing an instance or changing a static database parameter can force a restart, which is a maintenance event, not a silent tweak. The operation looks small in the console and is anything but small in its consequences.

Because the blast radius is different, the gate must be different. Folding both into one bucket forces a bad choice. Gate everything like the risky changes, and you smother the safe stop/start wins under approval friction. Gate everything like the safe ones, and an unattended config change eventually breaks production. The fix is to split them into two rule classes that share one executor but carry different gates, which is the same separation that makes one-click certified remediation safe to run at all.
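
The split can be sketched as two rule classes that pass through one executor, each carrying its own gate. All names here (`RuleClass`, `Gate`, `Rule`, `execute`, and the example rules) are hypothetical and illustrate the separation only; they are not ZopNight's internal API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class RuleClass(Enum):
    POWER = "power"    # stop/start: a power-state change
    CONFIG = "config"  # right-sizing, tiering, retention, database flags


class Gate(Enum):
    UNATTENDED = "unattended"  # safe to run without a human
    GUIDED = "guided"          # needs an explicit confirmation first


@dataclass
class Rule:
    name: str
    rule_class: RuleClass
    gate: Gate
    action: Callable[[], str]  # the remediation itself, stubbed in this sketch


def execute(rule: Rule, confirmed: bool = False) -> str:
    """One executor for both classes; only the gate check differs per rule."""
    if rule.gate is Gate.GUIDED and not confirmed:
        return f"{rule.name}: waiting for confirmation"
    return rule.action()


stop_idle = Rule("stop-idle-ec2", RuleClass.POWER, Gate.UNATTENDED, lambda: "stopped")
rds_retention = Rule("rds-backup-retention", RuleClass.CONFIG, Gate.GUIDED,
                     lambda: "retention updated")

print(execute(stop_idle))                      # stopped
print(execute(rds_retention))                  # rds-backup-retention: waiting for confirmation
print(execute(rds_retention, confirmed=True))  # retention updated
```

Keeping the gate on the rule rather than in the executor means the safe power rules never pay the approval cost that the config rules require.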

What the Config Rule Class Actually Covers

The new class is broad on purpose, because misconfiguration is broad. It groups into five categories, and it spans all 3 clouds rather than bolting onto AWS alone. Storage tiering is a representative case: Amazon S3 provides several storage classes, and Intelligent-Tiering moves objects between them automatically based on access patterns, so cold data stops paying hot-tier prices without anyone watching it.

Category	Example rules	Clouds
Storage lifecycle	S3 intelligent-tiering, noncurrent-version, deep-archive; EBS snapshot archive; GCS lifecycle	AWS, GCP
Retention	CloudWatch Logs retention; Azure Log Analytics retention	AWS, Azure
Network and cache	ELB cross-zone; CloudFront cache TTL	AWS
Compute config	Lambda timeout and concurrency; ASG floor; EC2 two-tier resize; Cloud Run	AWS, GCP
Database config	RDS backup-retention, PITR, gp3; Azure MySQL; GCP Cloud SQL	AWS, Azure, GCP

A generalized lifecycle template sits under the storage rules, so adding the next tiering policy is configuration rather than a new adapter. The point of the breadth is that the waste stop/start cannot reach, everything outside the idle 35%, is not one problem. It is dozens of small, specific misconfigurations, and you need a rule for each shape. The full set ships with ZopNight v2.0.
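
One way to picture a generalized lifecycle template is a single function that emits an S3 lifecycle rule from a few parameters. This is a sketch under assumptions: the function name, its parameters, and the rule IDs are hypothetical, and only the returned dictionary follows the shape S3 accepts in a bucket lifecycle configuration.

```python
from typing import Any, Dict, Optional


def lifecycle_rule(
    rule_id: str,
    prefix: str = "",
    transition_days: Optional[int] = None,
    storage_class: Optional[str] = None,
    noncurrent_expire_days: Optional[int] = None,
) -> Dict[str, Any]:
    """Build one S3 lifecycle rule; a new policy is a new set of arguments."""
    rule: Dict[str, Any] = {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
    }
    if transition_days is not None and storage_class is not None:
        # Move current objects to a cheaper class after N days.
        rule["Transitions"] = [{"Days": transition_days, "StorageClass": storage_class}]
    if noncurrent_expire_days is not None:
        # Delete old object versions N days after they become noncurrent.
        rule["NoncurrentVersionExpiration"] = {"NoncurrentDays": noncurrent_expire_days}
    return rule


# Three policies from the table above, expressed as configuration only.
config = {
    "Rules": [
        lifecycle_rule("tiering", transition_days=0, storage_class="INTELLIGENT_TIERING"),
        lifecycle_rule("archive", prefix="logs/", transition_days=180,
                       storage_class="DEEP_ARCHIVE"),
        lifecycle_rule("old-versions", noncurrent_expire_days=90),
    ]
}
print(len(config["Rules"]))  # 3
```

A dictionary like `config` is what would be passed as the lifecycle configuration when applying the policy to a bucket; the sketch stops short of making that call.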

Mutating a Live Resource Demands a Confirmation Gate

Breadth without a gate is how autonomous systems cause incidents. So the riskier the change, the more deliberate the gate. In ZopNight v2.0 every database-config rule is GUIDED: it requires a type-to-confirm step and never runs unattended.

The reason is blast radius. Stopping an idle box affects the idle box. Changing a database’s backup retention, recovery window, or storage type touches durability guarantees and can require a reboot. That is not a decision to take silently at 2 AM, even when the recommendation is correct. This is the same instinct behind a blast radius check before any automated action.

Rule example	Blast radius	Gate
Stop an idle EC2 instance	Low, reversible	Unattended
Add an S3 lifecycle policy	Low to medium	Unattended or notify
Lower CloudWatch Logs retention	Medium, deletes old logs	Notify or guided
Change RDS PITR or backup-retention	High, durability impact	Guided, type-to-confirm

The gate is not a brake on autonomy; it is the condition that makes autonomy safe. The model works when the safe rules run unattended and the risky ones sit behind a typed confirmation. It breaks when you collapse the two, because then either the safe wins stall under needless approvals or the durability-affecting changes run silently at 2 AM.
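
A type-to-confirm check can be as small as comparing what the operator typed against the resource identifier. The sketch below assumes the three gate names from the table and a hypothetical `may_run` helper; what exactly the operator must type in ZopNight is not specified here, so the resource ID is an assumption.

```python
from typing import Optional


def may_run(gate: str, resource_id: str, typed: Optional[str] = None) -> bool:
    """Return True if a rule with this gate is allowed to run now."""
    if gate in ("unattended", "notify"):
        # Notify rules still run; the notification is a side effect, not a block.
        return True
    if gate == "guided":
        # The operator must type the exact resource ID; a click is not enough.
        return typed is not None and typed.strip() == resource_id
    raise ValueError(f"unknown gate: {gate}")


print(may_run("unattended", "i-0abc"))                    # True
print(may_run("guided", "prod-orders-db"))                # False
print(may_run("guided", "prod-orders-db", "prod-orders")) # False
print(may_run("guided", "prod-orders-db", "prod-orders-db"))  # True
```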

Cold-Resize: The Race That Kept Resize From Being Autonomous

Right-sizing an EC2 instance is not one action. It is three: stop the instance, change its type, start it again. That sequence is exactly where autonomy usually breaks, and the release fixes two failures that kept it from working.

First, permission. The executor has to perform the stop and the start on the instance’s behalf, mid-resize, without a human approving each power toggle. The release grants this through an internal remediation actor exemption, so the resize owns its own stop/start rather than stalling on a per-step approval.

Second, a state race. AWS reports an instance as stopping before it reports it as stopped. The earlier code tried to resize while the instance was still stopping, the API rejected the request, and the action was left stranded in the stopping state indefinitely. The fix waits for the stopped state before issuing the resize.
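
The corrected ordering can be sketched with the boto3 EC2 client, whose built-in `instance_stopped` waiter polls until the instance has actually reached the stopped state. This is an illustration of the sequence, not the release's code: `cold_resize` is a hypothetical name, and the `ec2` argument is assumed to be a boto3 EC2 client or any object exposing the same four methods.

```python
def cold_resize(ec2, instance_id: str, new_type: str) -> None:
    """Stop, wait for 'stopped', change the type, then start again."""
    ec2.stop_instances(InstanceIds=[instance_id])

    # The step the earlier code skipped: block until the state is 'stopped',
    # not merely 'stopping', before attempting the resize.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2.start_instances(InstanceIds=[instance_id])
```

With a real client this would be called as `cold_resize(boto3.client("ec2"), instance_id, "m5.large")`, where the instance type is only an example. Passing the client in as an argument keeps the sequence testable against a fake.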

This is the unglamorous half of autonomous cloud. The recommendation can be perfect, but if the executor acts one state too early, the action hangs. Right-sizing becomes real only when the rule class, the gate, and the state machine all hold. That discipline is what turns recommendations into a closed-loop remediation you can trust.