The 3am page is rarely about something that needs a human. The on-call gets paged at 03:14 because a pod has crashlooped four times in five minutes. They open Slack, look at the logs, see “OOMKilled” in the exit reason, run kubectl set resources to raise the memory limit, watch it stabilize, go back to bed at 03:42. Twenty-eight minutes of sleep that they will not get back.
That sequence is deterministic. The signal (rate-of-change on restart count plus exit reason equals OOMKilled) maps to one action (raise the memory limit). No judgement was applied. No tribal knowledge was needed. The on-call was a delivery mechanism for kubectl set resources with a five-minute confirmation lag.
Pod crashloop auto-remediation has been talked about for a decade and shipped safely by maybe 5% of SRE teams, because the verify step is hard. Without a strict verify, an auto-remediation that picks the wrong action makes the incident worse: rolling back a fix, doubling memory on a leak, restarting a healthy upstream into a thundering herd. This post describes the closed-loop pattern that does ship safely. It is the same detect-decide-act-verify shape as closed-loop IAM remediation, pointed at the runtime incident layer.
The 3am page that did not need a human
Mid-size SRE teams (running 200 to 2000 pods in production) see 8 to 25 pod-crashloop pages per week. The shape is heavy-tailed: a few weeks have 2 pages, a few have 40, the median is around 12. The MTTR distribution is even more skewed: the median page takes 18 minutes to acknowledge plus another 10 to 27 to resolve, but the long tail (genuine novel bugs) reaches 4 to 6 hours.
The diagnostic split is what makes the closed-loop pattern worth shipping.
| Page category | Share of pages | Typical action | Needs human judgement? |
|---|---|---|---|
| Recent deployment broke something | 25-35% | Roll back the deployment | No |
| OOMKilled (memory bump fixes it) | 20-30% | Raise memory limit by 50% | No |
| Upstream service cascade | 15-25% | Restart upstream service | No |
| Genuine bug or infrastructure issue | 20-40% | Diagnose | Yes |
60 to 80 percent of pages fall in the first three rows. They have one trigger condition each, one fix each, and zero ambiguity in retrospect. They wake the on-call up because no system was watching for the trigger condition and applying the fix on a 5-minute response budget.
A closed-loop that handles those three cases takes the predictable middle out of the on-call queue. The remaining 20-40 percent (genuine bugs, novel failure modes, infrastructure issues) still page a human and that is correct. The goal is not 100% automation; it is removing the deterministic cases.
The three deterministic crashloop actions
The three remediations are narrow on purpose. Each has a specific trigger and a specific kubectl command.
| Trigger | Action | Command |
|---|---|---|
| New deployment landed in last 10 min + crashloop starts | Roll back to previous revision | kubectl rollout undo deployment/<name> |
| Exit reason = OOMKilled, no recent deployment | Raise memory limit by 50% | kubectl set resources deployment/<name> --limits=memory=<new> |
| Upstream service in known-cascade list also showing errors | Restart upstream | kubectl rollout restart deployment/<upstream> |
Three commands. Each is reversible in seconds. Each has a one-line trigger. The remediator role does not need any other permission.
The discipline is in the trigger logic, not in the actions. The “new deployment in last 10 min” check is a query against the Kubernetes API. The “exit reason equals OOMKilled” check is a field on the container status. The “upstream service cascade” check is a join between the failing pod and a small allowlist of known-cascade dependencies (the message bus, the auth service, the primary database proxy). All three are deterministic; none of them need ML.
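For illustration, the “new deployment in the last 10 min” check might look like the sketch below with the kubernetes Python client, treating “a rollout landed recently” as “the deployment's newest ReplicaSet is younger than the window.” The window constant and the in-cluster config call are assumptions, not requirements.

```python
from datetime import datetime, timedelta, timezone
from kubernetes import client, config

RECENT_WINDOW = timedelta(minutes=10)  # the "new deployment" window from the table above

def deployed_recently(namespace: str, deployment: str) -> bool:
    """True if the deployment's newest ReplicaSet was created inside the window,
    i.e. a rollout landed just before the crashloop started."""
    config.load_incluster_config()          # assumes the remediator runs in-cluster
    apps = client.AppsV1Api()
    dep = apps.read_namespaced_deployment(deployment, namespace)
    selector = ",".join(f"{k}={v}" for k, v in dep.spec.selector.match_labels.items())
    replica_sets = apps.list_namespaced_replica_set(namespace, label_selector=selector).items
    if not replica_sets:
        return False
    newest = max(rs.metadata.creation_timestamp for rs in replica_sets)
    return datetime.now(timezone.utc) - newest < RECENT_WINDOW
```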
Detect: rate-of-change on restart count
The Prometheus rule fires when kube_pod_container_status_restarts_total increases by more than 3 in a 5-minute window for the same pod.
The rule fires fast because Prometheus scrapes kube-state-metrics on a 30-second interval and the rate calculation lands within 60 seconds of the third restart. End-to-end detect latency is under 90 seconds, which is critical because anything slower means the kubelet has already started its exponential backoff, which tops out at 5 minutes, and the pod is offline for users.
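For reference, that rule is a one-liner on the kube-state-metrics counter; the group name, alert name, and labels below are illustrative.

```yaml
groups:
  - name: crashloop-remediator
    rules:
      - alert: PodCrashLooping
        # More than 3 restarts of the same container within a 5-minute window.
        expr: increase(kube_pod_container_status_restarts_total[5m]) > 3
        labels:
          severity: page
          remediable: "true"
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted >3 times in 5m"
```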
OOMKilled is a side-channel. The Prometheus alert has the pod identity but not the exit reason. The decide step queries the Kubernetes API for pod.status.containerStatuses[].lastState.terminated.reason to read it. This adds 200ms of latency and is the difference between blindly rolling back versus knowing memory is the right fix.
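A minimal sketch of that lookup with the kubernetes Python client, assuming the first container with a recorded termination is the one that crashed:

```python
from typing import Optional
from kubernetes import client, config

def last_exit_reason(namespace: str, pod: str) -> Optional[str]:
    """Read lastState.terminated.reason off the pod's container statuses."""
    config.load_incluster_config()
    core = client.CoreV1Api()
    status = core.read_namespaced_pod(pod, namespace).status
    for cs in status.container_statuses or []:
        terminated = cs.last_state.terminated
        if terminated and terminated.reason:
            return terminated.reason   # e.g. "OOMKilled", "Error"
    return None
```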
Decide: policy that picks the right action
The decide step is a small policy engine, not an LLM. The rules are deterministic.
Three guards prevent the known failure modes:
No double-rollback. Before issuing rollback, check that the most recent deployment was not itself a rollback. If the previous revision is already a rollback in the deployment history, the current state is the pre-rollback broken state and rolling back again puts you in a loop.
No cluster-wide memory pressure. Before raising memory, check that node-level memory pressure is below 80%. If memory pressure is cluster-wide, the OOMKill is a symptom of the cluster running out, not the pod over-allocating. Raising the limit makes it worse and triggers other pods to OOMKill.
No third-party dependency outage. Before restarting an upstream, check that an externally-monitored health probe (PagerDuty status, the upstream’s own health endpoint) shows the upstream as healthy. If the upstream is failing because its third-party API is down, restarting it does nothing.
If any guard fires, the loop falls through to “page human” rather than acting on a partial signal.
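A sketch of the decide step with those guards wired in. PodContext, the Prometheus address, the cascade allowlist, and the 80% threshold are illustrative stand-ins for whatever the detect step and the policy store actually provide.

```python
from dataclasses import dataclass
from typing import Optional
import requests

PROM_URL = "http://prometheus:9090"                                    # illustrative address
KNOWN_CASCADE_UPSTREAMS = {"message-bus", "auth-service", "db-proxy"}  # illustrative allowlist

@dataclass
class PodContext:
    """Facts gathered by the detect step and the API lookups described above."""
    deployed_recently: bool
    previous_revision_was_rollback: bool
    exit_reason: Optional[str]
    failing_upstream: Optional[str]
    upstream_health_probe_ok: bool

def node_memory_pressure_ok(threshold: float = 0.80) -> bool:
    """Guard 2: refuse a memory bump when cluster-wide memory usage is above the threshold."""
    query = "max(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    results = resp.json()["data"]["result"]
    return bool(results) and float(results[0]["value"][1]) < threshold

def decide(ctx: PodContext) -> str:
    """Map a crashlooping pod to one of the three actions, or fall through to a human page."""
    if ctx.deployed_recently:
        # Guard 1: never roll back a rollback -- that loops back into the broken state.
        return "page-human" if ctx.previous_revision_was_rollback else "rollback"
    if ctx.exit_reason == "OOMKilled":
        # Guard 2: cluster-wide memory pressure means a bigger limit only spreads the OOMKills.
        return "raise-memory-limit" if node_memory_pressure_ok() else "page-human"
    if ctx.failing_upstream in KNOWN_CASCADE_UPSTREAMS:
        # Guard 3: if the upstream's own dependency is down, restarting it does nothing.
        return "restart-upstream" if ctx.upstream_health_probe_ok else "page-human"
    return "page-human"
```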
Act: narrow-scope RBAC role
The remediator’s Kubernetes ServiceAccount has a Role with exactly three verbs:
| Verb | Resource | Scope |
|---|---|---|
| patch | deployments/scale | Specified namespaces only |
| patch | deployments (to update container resource limits) | Specified namespaces only |
| create | deployments/rollback | Specified namespaces only |
That is the entire permission scope. The remediator cannot create new resources, cannot delete anything, cannot read secrets, cannot exec into pods. The blast radius of a misfired remediation is bounded by what those three verbs can do on existing deployments. This is the same blast-radius discipline as read-only MCP servers extended to a narrowly-scoped write role.
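A sketch of that Role, mirroring the table; the namespace is illustrative and there is one Role per governed namespace. Depending on how the rollback is issued, the controller may also need read-only access to deployments and replicasets to find the target revision, and on apps/v1 clusters the rollback subresource no longer exists (kubectl rollout undo is a patch on the deployment), so treat the last rule as version-dependent.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: crashloop-remediator
  namespace: production-payments          # illustrative; repeat per allowed namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments", "deployments/scale"]
    verbs: ["patch"]                      # raise memory limits, adjust scale
  - apiGroups: ["apps"]
    resources: ["deployments/rollback"]
    verbs: ["create"]                     # roll back to the previous revision
```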
Verify: the step most teams skip
The verify step is the longest part of the loop. The remediator has to wait long enough for the pod to stabilize, but not so long that escalation is delayed when the action did not work.
The verify rule:
- Wait 30 seconds after the act step (give the pod time to start the new container)
- Sample kube_pod_container_status_restarts_total rate-of-change for 4 minutes
- Sample pod readiness for 4 minutes
- If restart rate stayed below threshold AND readiness reached 1, mark loop complete
- Otherwise, escalate to a human page with the full context (which action was tried, what verify saw)
The verify takes 4 to 5 minutes because Kubernetes restart stabilization is inherently slow. The pod has to start, pass its readiness probe, run a few requests, and settle. Anything faster is unreliable; anything slower burns the on-call’s time.
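A sketch of that loop, assuming a Prometheus query endpoint and an escalate callback; the pod-name prefix match and the 30-second sampling cadence are simplifications.

```python
import time
import requests

PROM_URL = "http://prometheus:9090"   # illustrative address

def prom_value(query: str) -> float:
    """Evaluate an instant PromQL query, returning the first sample's value (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=5)
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0

def verify(namespace: str, pod_prefix: str, action: str, escalate) -> bool:
    """Wait 30s, then sample restart rate and readiness for 4 minutes; escalate on failure."""
    time.sleep(30)                        # give the pod time to start the new container
    deadline = time.time() + 4 * 60
    stabilized = False
    while time.time() < deadline:
        restarts = prom_value(
            f'sum(increase(kube_pod_container_status_restarts_total'
            f'{{namespace="{namespace}",pod=~"{pod_prefix}.*"}}[5m]))'
        )
        ready = prom_value(
            f'min(kube_pod_status_ready{{namespace="{namespace}",pod=~"{pod_prefix}.*",condition="true"}})'
        )
        if restarts > 3:                  # still crashlooping: stop waiting out the window
            stabilized = False
            break
        stabilized = ready >= 1           # restart rate below threshold; require readiness too
        time.sleep(30)
    if stabilized:
        return True                       # loop complete, no page
    escalate(f"auto-remediator tried {action}, verify failed, escalating")
    return False
```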
The escalation case is the safety net that makes the closed-loop production-safe. If the loop picked the wrong action (memory bump on a memory-leak that needed a code rollback), verify catches it because restart rate stays high. The page goes out at 4 minutes 30 seconds with a clear note: “auto-remediator tried memory bump, verify failed, escalating.” The on-call now knows the obvious fix did not work and starts at the second hypothesis.
Total loop time: ~6 to 7 minutes. The on-call sleeps through 60-80% of pages.
Composing with policy-aware governance
The risk-tier policies that decide whether a Terraform plan can apply also govern auto-remediation. The policy-aware governance MCP already knows that production-payments is high-risk and dev-marketing is low-risk; the auto-remediator reads from the same policy graph. No duplicate policy store, no drift between policy storage layers.
For high-risk namespaces, the loop can be configured to require human acknowledgement before act (alert with a “remediate now” button instead of acting autonomously). For low-risk namespaces, full autonomy with verify. The mode is a policy attribute, not a separate code path.
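One hypothetical shape for that attribute in the shared policy graph; the field names are invented for illustration, not a schema the governance layer defines.

```yaml
# Illustrative policy-graph entries: remediation mode is an attribute of the
# namespace's risk tier, not a separate code path in the remediator.
namespaces:
  production-payments:
    risk_tier: high
    remediation_mode: propose       # alert with a "remediate now" button
  dev-marketing:
    risk_tier: low
    remediation_mode: autonomous    # full detect-decide-act-verify loop
```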
This is what makes the closed-loop SRE pattern composable with the broader autonomous-cloud thread. The detect-decide-act-verify shape repeats across cost anomalies, IAM violations, and now runtime incidents. Each loop has its own detect signal and its own act tools, but the policy graph that decides risk tier is shared. The teams that ship one loop ship the others faster because the substrate is already there.
The 3am page does not need a human for the deterministic middle. The closed loop with a strict verify is what lets you stop sending it to one.
