Auto-Remediation: One-Click Cloud Action With 20 Certified Rules

Every cost-optimisation product has shipped an “Apply Fix” button at some point. Most teams do not click them. Either the button is a stub that opens a ticket somewhere (and nothing happens until a human reads the ticket), or it calls a cloud SDK without a precondition check, succeeds 60% of the time, and fails the other 40% with errors the operator does not know how to interpret. After two or three opaque failures, the team stops trusting the button. After five, they hide it from the dashboard.

ZopNight v1.7 ships the opposite shape. Click Remediate on a recommendation card and the platform runs a precondition check against live cloud state, optionally routes to an admin for approval, executes the cloud action, validates the result, and writes the whole sequence to the audit log. Twenty rules are certified end-to-end on real AWS, GCP, and Azure with zero residue across the certification runs. Rules that have not been certified render the recommendation card without the Remediate button.

This post walks through what auto-remediation actually does, what the 20 certified rules are, what “certified” means in this context, the three-category error model that tells operators what to do when something fails, and how to use the workflow day to day.

Why most ‘Apply Fix’ buttons go unclicked

A button that has not been validated on the customer’s actual rule + actual cloud + actual resource creates a specific kind of trust problem. The first click works, the second fails, the third works again, and now the operator does not know whether the next click will succeed. By the fourth opaque failure, the team has decided the button is unreliable and reverts to manual remediation.

Pattern	What it does on click	Why it fails the trust check
Stub button (ticketing only)	Opens a ticket in Jira / Linear / GitHub Issues	Nothing actually changes; the recommendation sits open for weeks
SDK passthrough (no precondition)	Calls `aws ec2 stop-instances` (or equivalent) and reports success/failure	No precondition check, so the call fails on stale state, missing permissions, or wrong resource version
Best-effort with rollback (partial)	Tries the action, attempts rollback on failure	Rollback paths are rarely tested; failures leak orphaned resources
End-to-end certified (ZopNight v1.7)	Precondition → action → post-condition → audit log	Each step is observable; failures map to a categorised reason; zero residue verified per rule

ZopNight’s auto-remediation lands in the last row because the engineering work happened before the button shipped, not after operators complained about it. Twenty rules went through certification on real cloud resources across at least three environments per rule. Rules that did not pass certification do not get the button.

What auto-remediation actually does end to end

The Remediate button kicks off a four-step pipeline, with an optional approval gate between the precondition check and the cloud action. Each step is observable in the audit log; each step can fail in a known way.

Precondition check. The rule’s preconditions run against live cloud state. For a “stop idle EC2” rule, this means re-checking that the instance still exists, is still running, still has zero connections, still matches the rule’s filter criteria (env, age, utilisation). If any precondition fails, the action does not run, and the operator sees a banner explaining what changed. Precondition failure is the most common reason for a no-op click, and it is the right behaviour: stale recommendations should not act on stale data.
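As a rough illustration of the "stop idle EC2" example above, the precondition step can be thought of as a pure function over freshly read state. This is a minimal sketch, not ZopNight's implementation: the `InstanceState` and `RuleFilter` shapes, their field names, and the wording of the failure messages are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceState:
    """Live state re-read from the cloud at click time (hypothetical shape)."""
    state: str              # e.g. "running" or "stopped"
    connections: int        # active connections observed right now
    env: str                # environment tag on the instance
    age_days: int
    cpu_utilisation: float  # percent


@dataclass
class RuleFilter:
    """Filter criteria the rule was configured with (hypothetical shape)."""
    env: str
    min_age_days: int
    max_cpu_utilisation: float


def check_preconditions(live: Optional[InstanceState], rule: RuleFilter) -> List[str]:
    """Return every failed precondition; an empty list means the action may run."""
    if live is None:
        return ["instance no longer exists"]
    failures = []
    if live.state != "running":
        failures.append(f"instance is {live.state}, not running")
    if live.connections != 0:
        failures.append(f"instance has {live.connections} active connections")
    if live.env != rule.env:
        failures.append(f"env is {live.env}, rule targets {rule.env}")
    if live.age_days < rule.min_age_days:
        failures.append("instance is younger than the rule's minimum age")
    if live.cpu_utilisation > rule.max_cpu_utilisation:
        failures.append("utilisation is above the rule's idle threshold")
    return failures
```

Returning the full list of reasons, rather than a bare boolean, is what lets a banner explain what changed since the recommendation was generated.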

Approval gate (optional, per rule). An admin can require approval on any rule for any reason. When set, the click queues the action in an approval inbox; an admin reviews the pre-filled context (resource, action, expected effect, blast radius) and clicks Approve or Reject. The action runs only after Approve. The approval gate is opt-in per rule; the default is no approval required on the certified rule set.

Cloud action. The actual API call: stop the instance, scale the service to zero, pause the pool. Idempotent by design (re-running a stop on an already-stopped instance is a no-op, not an error).

Post-condition validation. A few seconds after the action, ZopNight re-queries the resource state. The instance state is now stopped. The service replica count is now zero. The pool is now paused. If the post-condition does not match what the action was supposed to produce, the audit log captures the discrepancy and the operator is paged. In the certification runs across 20 rules, post-condition validation succeeded 100% of the time on first attempt.

Audit log. The full sequence is written: which rule fired, which resource, the precondition result, the approval decision (if any), the action call and its parameters, the post-condition result, the operator who clicked, the timestamp. Compliance teams can replay any remediation; FinOps teams can quantify savings per rule per month.
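The sequence described above can be sketched as a single function that records each step as it goes. This is an illustrative sketch under simplifying assumptions: the `remediate` function, the `AuditLog` class, and the outcome strings are hypothetical names, and approval is modelled as a synchronous callback, whereas the product queues it in an inbox.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class AuditLog:
    """In-memory stand-in for the audit log (hypothetical)."""
    entries: List[dict] = field(default_factory=list)

    def record(self, step: str, **details) -> None:
        self.entries.append({"step": step, **details})


def remediate(
    resource_id: str,
    precondition: Callable[[str], bool],
    action: Callable[[str], None],
    postcondition: Callable[[str], bool],
    audit: AuditLog,
    approve: Optional[Callable[[str], bool]] = None,
) -> str:
    """Run precondition, optional approval, action, and post-condition in order."""
    passed = precondition(resource_id)
    audit.record("precondition", resource=resource_id, passed=passed)
    if not passed:
        return "skipped"  # stale recommendation: do not act on stale data

    if approve is not None:  # the approval gate is opt-in per rule
        approved = approve(resource_id)
        audit.record("approval", resource=resource_id, approved=approved)
        if not approved:
            return "rejected"

    action(resource_id)
    audit.record("action", resource=resource_id)

    valid = postcondition(resource_id)
    audit.record("postcondition", resource=resource_id, passed=valid)
    return "applied" if valid else "discrepancy"


# Usage with an in-memory "cloud" so the sketch runs without credentials.
states = {"i-123": "running"}
log = AuditLog()
outcome = remediate(
    "i-123",
    precondition=lambda r: states.get(r) == "running",
    action=lambda r: states.__setitem__(r, "stopped"),
    postcondition=lambda r: states.get(r) == "stopped",
    audit=log,
)
```

Clicking again on the same resource in this sketch returns "skipped", because the precondition no longer holds once the instance is stopped.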

The 20 certified rules, organised by action class

The 20 certified rules fall into three action classes. The provider coverage is intentional: every action class works on AWS, GCP, and Azure where the resource exists.

Action class	AWS	GCP	Azure
Stop compute	EC2 instance, SageMaker notebook	GCE instance	VM, VMSS
Scale-to-zero	Lambda concurrency, ECS service, Step Functions, SageMaker endpoint	Cloud Run, Vertex endpoint	Container Apps, Function Apps, Azure Batch
Pause service-tier	(none in this batch)	Dataproc, Databricks (managed)	Synapse pool, Cognitive Services, Data Factory, Azure ML compute

Stop compute is the simplest action: send the cloud API a stop signal, the instance transitions to a stopped state, billing for compute halts (storage stays). EC2, GCE, and Azure VMs all support this directly. SageMaker notebook instances expose the same pattern through a different API surface; the rule wraps it.

Scale-to-zero is the right pattern for elastic services where “stop” is not a concept. A Lambda function does not stop; you set its concurrency to zero, which causes new invocations to throttle without invoking the function. Cloud Run and Container Apps work the same way at the service level. An ECS service can have its desired count set to zero. Step Functions is more nuanced (you cannot pause a state machine, but you can prevent new executions from starting); the rule disables the EventBridge schedule that triggers it.
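For the two AWS cases that map to a single API call, the underlying requests look roughly like the sketch below. The boto3 operations `put_function_concurrency` and `update_service` are real; the wrapper function names and the `RecordingClient` stand-in are assumptions added so the example runs without AWS credentials. In real use you would pass `boto3.client("lambda")` and `boto3.client("ecs")` instead.

```python
class RecordingClient:
    """Stand-in for a boto3 client: records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method not defined on the class is treated as an API operation.
        def operation(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return operation


def throttle_lambda(lambda_client, function_name: str) -> None:
    """Set reserved concurrency to zero so new invocations are throttled."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=0,
    )


def scale_ecs_service_to_zero(ecs_client, cluster: str, service: str) -> None:
    """Set the service's desired task count to zero; the service itself remains."""
    ecs_client.update_service(cluster=cluster, service=service, desiredCount=0)


# Usage against the stand-in clients; the resource names are made up.
lambda_client = RecordingClient()
ecs_client = RecordingClient()
throttle_lambda(lambda_client, "nightly-report")
scale_ecs_service_to_zero(ecs_client, "staging", "api")
```

Both calls set a target value rather than toggling state, so repeating them leaves the resource where it already is.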

Pause service-tier is the third pattern, used for managed analytics and ML resources where the cloud provider exposes an explicit pause API. Synapse SQL pools can be paused (compute stops, storage stays). Cognitive Services endpoints can be paused. Azure ML compute, Dataproc, and Databricks all support pause + resume cycles.

The 20 rules together cover most of the cost-recovery action surface a FinOps team works through on a routine basis. Right-sizing rules (which change the instance type rather than stop the resource) are deliberately not in this batch because their precondition and post-condition checks have a different shape; they are queued for a future certification batch.

What certification means

A rule that ships as certified has been through every step of the auto-remediation pipeline on real cloud resources, in at least three different cloud environments, leaving zero residue behind and a complete trace in the audit log.

Certification gate	What it proves
Precondition validates on at least 3 environments	The rule reads live state correctly across customer variation
Action executes successfully end-to-end	The SDK call shape is right and permissions match the rule’s expectations
Post-condition validates within the expected window	The change actually happened and is observable
Audit log captures full sequence	Compliance + forensic replay are possible
Zero residue (no orphaned IAM, leaked SGs, broken dependencies)	The action did not create new cleanup work elsewhere
Error handling produces categorised reasons	Failures map to user_action / transient / system, not “unknown error”
Rollback path tested where applicable	Reverse action works when an admin chooses to undo

A rule that passes all seven gates becomes certified and gets the Remediate button on its recommendation cards. A rule that fails any gate stays uncertified; the recommendation still appears (the detection is independent), but the operator must take the action manually.
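The all-or-nothing nature of the gates can be expressed as a small check. This is a sketch under assumptions: the gate identifiers, the `is_certified` function, and the convention that `None` means "not applicable" (allowed only for the rollback gate, which the table qualifies with "where applicable") are hypothetical, not ZopNight's data model.

```python
from typing import Dict, Optional

# Hypothetical identifiers for the seven gates in the table above.
GATES = (
    "precondition_on_3_environments",
    "action_end_to_end",
    "postcondition_in_window",
    "audit_log_complete",
    "zero_residue",
    "categorised_errors",
    "rollback_tested",
)

# Gates that may legitimately be "not applicable" for some rules.
OPTIONAL_GATES = {"rollback_tested"}


def is_certified(results: Dict[str, Optional[bool]]) -> bool:
    """True only if every gate passed; a missing result counts as not proven."""
    for gate in GATES:
        outcome = results.get(gate, False)
        if outcome is None and gate in OPTIONAL_GATES:
            continue  # not applicable for this rule
        if outcome is not True:
            return False
    return True


def show_remediate_button(results: Dict[str, Optional[bool]]) -> bool:
    """The card always renders; only the button depends on certification."""
    return is_certified(results)
```

Treating a missing result as a failure keeps the default conservative: a rule earns the button by evidence, not by omission.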

The certification process is repeatable. New rules go through the same gates before they earn the button. Existing rules can be re-certified if the underlying cloud API behaviour changes (which has happened with Azure’s pause endpoints twice in the last year). Being uncertified is not a permanent state for a rule; it is the current state pending engineering work.

The three-category error model

When an auto-remediation does not succeed, the operator sees a banner. The banner copy tells them what to do, derived from a three-category classification of the failure.

Category	What it means	Example causes	Operator next step
`user_action`	Something on the customer side needs attention	Missing IAM permission, resource locked by another tool, tag policy blocking the change	Fix the underlying issue and click Remediate again
`transient`	Probably succeeds on retry without operator action	Cloud API rate limit, transient network error, eventual-consistency lag	Wait and click again, or rely on the auto-retry
`system`	ZopNight needs to investigate	Unexpected SDK error, schema drift, internal bug	Escalate (a support link is in the banner)

The categorisation is the difference between a useful error and a useless one. “Failed: AccessDenied” is technically accurate but does not tell the operator that this is a user_action (their IAM role needs ec2:StopInstances) versus a system issue (ZopNight should have asked for that permission at onboarding). The category implies the next step.
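One way to picture the classification is a lookup from raw provider error code to category, with the category driving the banner copy. The sketch below is illustrative only: the specific code lists, the `permission_requested_at_onboarding` flag, and the function names are assumptions, and the flag is just one possible reading of the AccessDenied example above.

```python
from enum import Enum


class Category(str, Enum):
    USER_ACTION = "user_action"
    TRANSIENT = "transient"
    SYSTEM = "system"


# Next steps taken from the table above.
NEXT_STEP = {
    Category.USER_ACTION: "Fix the underlying issue and click Remediate again",
    Category.TRANSIENT: "Wait and click again, or rely on the auto-retry",
    Category.SYSTEM: "Escalate (a support link is in the banner)",
}

# Illustrative groupings of AWS-style error codes; not an exhaustive mapping.
PERMISSION_CODES = {"AccessDenied", "UnauthorizedOperation"}
TRANSIENT_CODES = {"Throttling", "ThrottlingException", "RequestLimitExceeded"}


def classify(error_code: str, permission_requested_at_onboarding: bool = True) -> Category:
    """Map a raw error code to one of the three categories."""
    if error_code in TRANSIENT_CODES:
        return Category.TRANSIENT
    if error_code in PERMISSION_CODES:
        # If onboarding never asked for the permission, the gap is on the
        # platform side rather than the customer's.
        if permission_requested_at_onboarding:
            return Category.USER_ACTION
        return Category.SYSTEM
    return Category.SYSTEM  # anything unrecognised is escalated, never "unknown"


def banner(error_code: str, permission_requested_at_onboarding: bool = True) -> str:
    category = classify(error_code, permission_requested_at_onboarding)
    return f"{category.value}: {NEXT_STEP[category]}"
```

The default branch matters as much as the lookups: an unrecognised code still lands in a category with a defined next step.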

Transient errors auto-retry with backoff. The operator does not see the banner for a transient error until the retry budget is exhausted (typically three attempts over five minutes). In practice, the great majority of transient errors resolve before the operator notices.
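A retry budget of this kind can be sketched in a few lines. The exact schedule is an assumption: the text says "typically three attempts over five minutes", and the sketch picks waits of 100 s and 200 s (300 s in total) to match that; `TransientError` and `run_with_retry` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Raised by an action when the failure is expected to clear on retry."""


def run_with_retry(
    action: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 100.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with exponential backoff.

    With the defaults, the waits are 100 s then 200 s, so three attempts
    span five minutes. Only when the budget is exhausted does the error
    propagate to the caller, which is when a banner would be shown.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable when max_attempts >= 1")
```

Passing `sleep` as a parameter keeps the function testable: a test can record the requested delays instead of actually waiting.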

System errors are rare (under 0.5% of total attempts in production data across the certified rule set). When they happen, the banner includes a one-click “send to ZopDev support” link that ships the audit-log trace and the cloud-API response. The customer does not have to triage the failure.

Admin approval as an opt-in escape valve

The default for the 20 certified rules is no approval required: the click runs the action. For most rules in most environments this is the right default because the actions are reversible (you can start the instance again), the blast radius is bounded (one resource at a time), and the precondition check has already validated live state.

Some customers want an admin in the loop for specific rules. Reasons vary:

Why require approval	Example rule
Compliance / SOC 2 attestation	All production rules require named-approver
Team policy	DBA team requires approval on any database pause
Blast radius management	High-cost rules (savings > $5k) require a second pair of eyes
Rollout caution	Newly enabled rules require approval for the first 30 days

Toggling approval-required is per rule and persistent. Pending approvals show up in a dedicated inbox on the admin’s dashboard. Each approval card pre-fills the resource, the action, the expected savings, the precondition state, and the operator who clicked. Approving is one click; rejecting is one click plus an optional reason that ZopNight stores in the audit log.
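A team deciding which rules to toggle might encode the reasons from the table above as a small policy function. This is a hypothetical helper, not a ZopNight feature: the `RuleContext` fields, the action name `pause_database`, and the thresholds are assumptions lifted from the example rows.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class RuleContext:
    """What a team knows about a rule when choosing its toggle (hypothetical)."""
    environment: str              # e.g. "production", "dev"
    action: str                   # e.g. "stop_instance", "pause_database"
    estimated_savings_usd: float
    enabled_on: date


def approval_reasons(ctx: RuleContext, today: date) -> List[str]:
    """Return the policy reasons that apply; empty means no approval needed."""
    reasons = []
    if ctx.environment == "production":
        reasons.append("compliance: production rules need a named approver")
    if ctx.action == "pause_database":
        reasons.append("team policy: DBA team approves database pauses")
    if ctx.estimated_savings_usd > 5000:
        reasons.append("blast radius: savings above $5k need a second reviewer")
    if (today - ctx.enabled_on).days < 30:
        reasons.append("rollout caution: rule enabled less than 30 days ago")
    return reasons
```

A non-empty result would translate into switching approval-required on for that rule; the reasons themselves are useful context for the approver.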

Approval composes naturally with the closed-loop trust score work. Certification is the binary gate (no button without it). The trust score is the second gate (decides which certified rules need approval based on blast radius + reversibility + confidence). Most rules fall in the auto-action band; high-blast or low-confidence rules can route to approval automatically.

How to use auto-remediation day to day

The most common entry point is the Recommendations page. Open it, filter by rule status (Open / Applied / Auto-Resolved), find the certified rules with the most savings, and click Remediate per recommendation. The first time, the click takes the operator through the precondition + action flow in a visible drawer; subsequent clicks on similar recommendations skip the drawer and run inline.

The Recommendations card row shows three counters: Open (recommendations that have not been actioned), Applied (remediations that have completed successfully), Auto-Resolved (recommendations where the underlying condition resolved without action, e.g., the team manually stopped the instance, or the workload finished). Auto-Resolved is the read-out that proves the system is not over-counting savings; resolved without action does not mean ZopNight saved the money.
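The accounting rule in the last sentence is easy to state in code. In this sketch the record shape, the status strings, and the `summarise` function are assumptions; the point is only that savings are summed over Applied recommendations and never over Auto-Resolved ones.

```python
from collections import Counter
from typing import Dict, List


def summarise(recommendations: List[dict]) -> Dict[str, float]:
    """Count recommendations per status and attribute savings to Applied only."""
    counts = Counter(r["status"] for r in recommendations)
    attributed = sum(
        r["savings_usd"] for r in recommendations if r["status"] == "applied"
    )
    return {
        "open": counts["open"],
        "applied": counts["applied"],
        "auto_resolved": counts["auto_resolved"],
        "attributed_savings_usd": attributed,
    }


# Usage with made-up figures.
recs = [
    {"status": "open", "savings_usd": 40.0},
    {"status": "applied", "savings_usd": 100.0},
    {"status": "applied", "savings_usd": 25.0},
    {"status": "auto_resolved", "savings_usd": 60.0},
]
summary = summarise(recs)
```

Here the 60.0 attached to the auto-resolved recommendation is counted in the Auto-Resolved tally but excluded from attributed savings.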

Auto-remediation also surfaces from two other places. The resource detail drawer (reachable from Atlas, the list view, and the Cost Reports) shows the current open recommendations for that resource with the Remediate button inline. The closed-loop FinOps cron can trigger Remediate on certified rules automatically without an operator click, using the same precondition + action + validation pipeline; the only difference is that the trigger is a schedule or a cost-anomaly signal rather than a button.

For teams just starting with auto-remediation, the safe rollout is to enable approval-required on the first batch for two weeks, watch the approvals come through, build confidence in the precondition checks, then disable approval-required per rule. Most customers reach steady state (no approval needed on certified rules) within a month.

What’s next

The 20 certified rules cover the highest-volume cost-recovery actions across AWS, GCP, and Azure. The next certification batches add coverage in three directions:

Direction	Examples	Why next
Right-sizing rules	Resize EC2 to smaller instance type, downsize RDS tier	Higher savings per action, but precondition + rollback model is different
Schedule-based remediation	Auto-attach a non-prod schedule to newly discovered idle resources	Composes with the existing scheduling product
Cross-account remediation	Stop resources in account A when account B’s budget cap fires	Composes with the budget-brake closed-loop work

Each new batch goes through the same certification gates. The Remediate button surface grows as rules clear certification, not as engineering writes optimistic code.

If you have ZopNight connected to at least one cloud account, the Recommendations page is the right starting point. Filter to certified rules, find the highest-savings card, click Remediate, watch the audit log. The action does what it says or stops with a categorised reason. That posture is what makes the button worth clicking.