Every cost tool ranks cleanup by dollar impact. A 200 USD idle instance sits at the top of the list. An unattached disk that bills 0 today sits at the bottom, below everything, where no one scrolls. So it stays. Then a thousand of them stay.
That is the gap we set out to close. In ZopNight v1.16.0 we shipped 13 new orphan and idle cleanup rules with auto-remediation across AWS and Azure, and we extended them to cover free and zero-cost orphans. The thesis is simple: a resource that bills 0 is not a resource that is safe to ignore, and cost-ranked cleanup is structurally blind to it.
The zero-cost orphan trap is built into how cleanup tools rank work
Call it the zero-cost orphan trap. A cleanup tool sorts findings by current spend, because spend is the number a finance dashboard rewards. Anything scoring 0 falls to the floor of that list. The list gets actioned from the top down, and the floor is never reached.
This is not a bug in any one tool. It is the predictable output of dollar-ranked triage. The expensive head of the distribution gets attention. The long tail of free orphans accumulates, because each one, judged alone, looks like nothing.
The accumulation is the problem. One unattached IP is noise. Four hundred unattached IPs across thirty accounts is an audit finding, a deletion blocker, and a billing event waiting for a trigger. Cost ranking measures each item at the moment it is found, not what the pile becomes. We treat the backlog itself as the liability. That framing is the same one behind our work on killing zombie cloud resources before they compound.
A 0 orphan is not a free orphan, because cost is only one of its three liabilities
The word “free” hides three separate liabilities. Current cost is one axis. The other two are why a 0 finding still earns a remediation.
A zero-cost orphan blocks deletion. An unattached disk or a retained snapshot keeps a parent resource group or volume from being torn down, so cleanup of the expensive thing stalls behind the free thing.
A zero-cost orphan widens the attack surface. An orphaned security group, a dangling DNS record, or an unused access key is a 0 line item and an open door at the same time.
And a zero-cost orphan is latent spend. A disk outside a free tier, a snapshot aging toward a billed retention class, or an IP that starts charging the moment it reattaches all flip from 0 to billed without anyone touching them.
| Orphan class | Bills today | Why it still matters at 0 |
|---|---|---|
| Unattached disk or volume | 0 or free tier | Blocks parent deletion, starts billing once tier limits pass |
| Reserved public IP, unassociated | 0 while held | Charges the moment it attaches or the grace window closes |
| Orphaned security group or firewall rule | 0 | Open path with no owner, a standing audit finding |
| Stale snapshot in retention | 0 in window | Rolls into a billed retention class on schedule |
| Dangling DNS record | 0 | Subdomain takeover surface, points at nothing |
Rank these by today’s bill and every row scores 0. Rank them by liability and every row earns action. Our cloud governance guardrails are meant to catch that gap, not just the spend.
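The difference between the two rankings can be sketched in a few lines. This is an illustration, not ZopNight's scoring: the `Finding` fields, the sample data, and the rule of counting one point per liability are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    # Hypothetical finding shape, invented for this example.
    name: str
    monthly_cost_usd: float
    blocks_deletion: bool
    widens_attack_surface: bool
    latent_spend: bool


def liability_score(f: Finding) -> int:
    """One point per liability present (assumed, unweighted scoring)."""
    return (
        int(f.monthly_cost_usd > 0)
        + int(f.blocks_deletion)
        + int(f.widens_attack_surface)
        + int(f.latent_spend)
    )


def rank_by_cost(findings: List[Finding]) -> List[Finding]:
    # Dollar-ranked triage: every 0 finding sinks to the bottom.
    return sorted(findings, key=lambda f: f.monthly_cost_usd, reverse=True)


def rank_by_liability(findings: List[Finding]) -> List[Finding]:
    # Liability first, current cost only as a tie-breaker.
    return sorted(
        findings,
        key=lambda f: (liability_score(f), f.monthly_cost_usd),
        reverse=True,
    )


FINDINGS = [
    Finding("idle-instance", 200.0, False, False, False),
    Finding("unattached-disk", 0.0, True, False, True),
    Finding("orphaned-security-group", 0.0, False, True, False),
    Finding("dangling-dns-record", 0.0, False, True, False),
]

if __name__ == "__main__":
    print([f.name for f in rank_by_cost(FINDINGS)])
    print([f.name for f in rank_by_liability(FINDINGS)])
```

Under cost ranking the three free orphans tie at 0 and sit below the idle instance. Under the liability ranking each of them scores at least 1, and the unattached disk moves ahead of the instance because it carries two liabilities.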
Thirteen rules across AWS and Azure widen the net to the long tail
The release adds 13 rules. They run across both AWS and Azure, and they now include the free and zero-cost classes above, not only the resources with an obvious price tag.
Covering two clouds matters here because orphan shapes differ by provider. An unattached managed disk in Azure and an available EBS volume in AWS are the same liability with different APIs and different free-tier rules. Detecting both from one rule set means the long tail gets one consistent policy instead of two half-built ones.
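As a rough sketch of what one rule over two providers can look like, the function below maps both shapes onto a single orphan class. The input dictionaries are simplified assumptions loosely modeled on each provider's API response (an EBS volume whose `State` is `available`, an Azure managed disk whose `diskState` is `Unattached`). The function name and the output shape are hypothetical.

```python
from typing import Optional


def normalize_unattached_disk(provider: str, raw: dict) -> Optional[dict]:
    """Return a provider-neutral finding, or None if the disk is not orphaned.

    `raw` is an assumed, simplified record; real API responses carry
    many more fields than the ones read here.
    """
    if provider == "aws":
        # An EBS volume that is not attached to any instance.
        if raw.get("State") == "available":
            return {
                "provider": "aws",
                "class": "unattached_disk",
                "id": raw["VolumeId"],
            }
    elif provider == "azure":
        # A managed disk that is not attached to any VM.
        if raw.get("properties", {}).get("diskState") == "Unattached":
            return {
                "provider": "azure",
                "class": "unattached_disk",
                "id": raw["id"],
            }
    return None


if __name__ == "__main__":
    print(normalize_unattached_disk(
        "aws", {"VolumeId": "vol-1", "State": "available"}))
    print(normalize_unattached_disk(
        "azure", {"id": "disk-1", "properties": {"diskState": "Unattached"}}))
```

Once both records share the `unattached_disk` class, a single policy can decide what happens to them, regardless of which cloud they came from.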
We will be plain about what we do not publish. The release notes do not include per-rule false-positive rates or a recovered-spend figure, and inventing either would be dishonest. The measurable claim is the count and the coverage: 13 rules, AWS and Azure, free orphans now in scope.
Auto-remediation closes the loop, because a finding that waits on a ticket is not remediated
Detection without action is a longer list. The point of auto-remediation is that the loop acts on its own: it detects the orphan, decides against a policy, acts to remove or release it, then verifies the result. No engineer opens a cleanup ticket, so the act step never sits in a queue behind a sprint board.
The ticket queue is where free-orphan findings go to die. A 0 finding never wins prioritization against a feature, so it ages in a backlog forever. Removing the human dispatch step is the whole mechanism. This is the same detect, decide, act, verify shape behind auto-remediating pod crashloops before the on-call pages.

The verify step is the guardrail. If the action did not produce the expected state, the loop holds rather than charging ahead, which keeps a bad rule from cascading. The same instinct drives our blast radius check before any autonomous action runs.
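The detect, decide, act, verify shape, including the hold on a failed verification, can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the product's implementation: `run_loop`, `LoopResult`, and the in-memory inventory in `demo` are all invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class LoopResult:
    remediated: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    held_at: Optional[str] = None  # orphan whose verification failed


def run_loop(
    detect: Callable[[], Iterable[str]],
    decide: Callable[[str], bool],
    act: Callable[[str], None],
    verify: Callable[[str], bool],
) -> LoopResult:
    result = LoopResult()
    for orphan_id in detect():
        if not decide(orphan_id):
            result.skipped.append(orphan_id)  # policy said no
            continue
        act(orphan_id)
        if not verify(orphan_id):
            # Expected state not reached: hold instead of continuing,
            # so a bad rule cannot cascade across the remaining items.
            result.held_at = orphan_id
            break
        result.remediated.append(orphan_id)
    return result


def demo() -> LoopResult:
    inventory = {"ip-1", "ip-2", "ip-3"}
    stuck = {"ip-2"}  # simulates a release call that silently fails

    def detect() -> List[str]:
        return sorted(inventory)  # snapshot, safe to mutate inventory later

    def decide(orphan_id: str) -> bool:
        return True

    def act(orphan_id: str) -> None:
        if orphan_id not in stuck:
            inventory.discard(orphan_id)

    def verify(orphan_id: str) -> bool:
        return orphan_id not in inventory

    return run_loop(detect, decide, act, verify)


if __name__ == "__main__":
    print(demo())
```

In the demo, `ip-1` is released and verified, `ip-2` fails verification, and the loop stops there without ever touching `ip-3`.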
Turn it on for safe classes first, because blast radius is the constraint, not detection
Auto-remediation works when the orphan class is unambiguous and reversible. Unattached IPs, dangling DNS records, and clearly idle disks are safe starting classes, because the decision is mechanical and the action is cheap to undo.
It breaks when the class is ambiguous. A disk detached for an in-flight migration looks identical to an abandoned one, so an over-eager rule deletes live data. The fix is policy, not faith: keep ambiguous classes on a hold-for-review branch and let only the safe classes act automatically. Start narrow, watch the verify step, and widen the auto path as each rule earns trust.
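One way to express that policy is an explicit allowlist, with everything else held for review by default. The sketch below is an assumption-laden illustration: the class names, the `Route` enum, and the `route` function are hypothetical and are not ZopNight configuration.

```python
from enum import Enum
from typing import FrozenSet


class Route(Enum):
    AUTO = "auto"
    HOLD_FOR_REVIEW = "hold_for_review"


# Assumed starting allowlist: classes where the decision is mechanical
# and the action is cheap to undo.
SAFE_CLASSES: FrozenSet[str] = frozenset(
    {"unattached_ip", "dangling_dns_record", "idle_disk"}
)


def route(orphan_class: str, auto_enabled: FrozenSet[str] = SAFE_CLASSES) -> Route:
    """Act automatically only on allowlisted classes.

    Anything not explicitly trusted, including classes the policy has
    never seen, falls through to the review branch.
    """
    if orphan_class in auto_enabled:
        return Route.AUTO
    return Route.HOLD_FOR_REVIEW


if __name__ == "__main__":
    for cls in ("unattached_ip", "detached_disk", "something_new"):
        print(cls, "->", route(cls).value)
```

Widening the auto path is then a one-line change: add a class to the allowlist once its rule has earned trust. The default stays conservative because an unlisted class can never act on its own.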


