A platform team picks Cloud Custodian in week one. By week six they realize Custodian fires after the resource is created, the wrong shape for blocking misconfigured Terraform plans. They add OPA. By month four they realize OPA admission decisions block the deployment but tell the developer’s Claude Code agent nothing about why. The agent retries the same misconfigured plan with a slightly different shape, and the next admission failure is identical to the previous one. They add an MCP layer. By month six the team is running three policy engines with overlapping rules, no single source of truth, and a debugging story that lives in three different log streams.
This is the 2026 reality. Three engines now own three different surfaces of cloud governance, and most teams need at least two. The question is not which one wins. The question is which surface each one owns and how to keep the rule definitions from forking across engines.
This post is the honest matrix: where Cloud Custodian, OPA / Gatekeeper / Rego, and MCP-enforced policy actually fit, where they overlap, and how to pick a primary without re-platforming in six months. The pattern composes with closed-loop FinOps and closed-loop IAM remediation.
## The three surfaces
Cloud governance has three control points, not one. Most “policy as code” arguments are arguments about which surface matters most. The honest answer is that all three matter and they require different engines.
| Surface | Engine class | When the rule fires | Action shape |
|---|---|---|---|
| Resource state (post-creation) | Custodian, AWS Config, native cloud rules | After resource exists | Tag, stop, terminate, notify |
| Admission (pre-creation) | OPA / Gatekeeper, validating webhooks | Before resource is admitted | Allow, deny, mutate |
| Agent / human intent (pre-decision) | MCP-enforced policy | Before the action is composed | Surface policy graph to the requester, gate the call |
A misconfigured S3 bucket is a state problem. Custodian is the right answer. A Pod spec missing resource limits is an admission problem. OPA Gatekeeper is the right answer. A developer asking their AI agent “remove the deny rule on this IAM role so the deploy unblocks” is an intent problem. MCP-enforced policy is the right answer.
The mistake is using one engine for all three. Custodian rules that try to block creation by terminating-on-detection introduce a race window where the misconfigured resource exists for 30-90 seconds. OPA rules that try to detect drift on existing resources duplicate Custodian’s detection logic in a language that was not designed for state queries. MCP servers that expose write actions without admission and state guardrails are the agentic-AI equivalent of giving a junior engineer a root credential: fast, dangerous, and impossible to audit.
## Cloud Custodian: the state-loop engine
Custodian is the most-adopted open-source policy-as-code engine for resource state. It runs as Lambda or scheduled Kubernetes CronJobs, queries cloud provider APIs, and applies actions to resources that match a YAML policy. AWS actively documents the compliance-as-code pattern around it.
The shape Custodian is good at: a policy named untagged-ec2-stop-after-72h targets the aws.ec2 resource, filters for instances with no Owner tag, state running, and instance-age ≥ 3 days, then runs two actions: stop the instance and notify the #cloud-governance Slack channel via the Slack transport. Roughly twenty lines of YAML.
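A minimal sketch of that policy in Custodian YAML, assuming c7n-mailer is deployed with a Slack token for the notify action (the SQS queue URL below is a placeholder, not a real queue):

```yaml
policies:
  - name: untagged-ec2-stop-after-72h
    resource: aws.ec2
    filters:
      - "tag:Owner": absent          # no Owner tag
      - "State.Name": running        # only running instances
      - type: instance-age
        days: 3                      # instance-age >= 72h
    actions:
      - stop
      - type: notify
        violation_desc: "Running EC2 instance with no Owner tag for 72h+"
        to:
          - "slack://#cloud-governance"   # c7n-mailer Slack destination
        transport:
          type: sqs
          queue: https://sqs.us-east-1.amazonaws.com/111122223333/custodian-mailer  # placeholder
```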
The action surface includes stop, terminate, tag, snapshot, modify-attribute, mark-for-op (delayed action), and notify. Custodian shines on the four loops it was built for: cost (idle resources), security (public buckets, open SGs), compliance (untagged, unencrypted), and lifecycle (orphaned snapshots, expired AMIs).
Custodian’s honest limits in 2026:
| Limit | Consequence |
|---|---|
| Detection runs on a schedule (5-15 min) | Misconfigured resources exist in the gap |
| YAML expressivity ceiling | Complex multi-resource conditions get awkward fast |
| No native admission control | Cannot block creation, only react to it |
| State queries are eventually consistent | Drift can hide between scans |
| No agent / human-in-loop interface | Automation runs invisibly to developers |
The first limit is the load-bearing one. Custodian is fundamentally a state-loop engine: it cannot prevent a resource from being created. For 70% of FinOps and lifecycle problems, that is correct: you want to clean up after the fact because you cannot reasonably gate creation on every team’s deployment pipeline. For the other 30% (security-critical, compliance-critical), state-loop is the wrong shape. A public S3 bucket that exists for 12 minutes before Custodian detects and remediates it is still a public S3 bucket on the wrong side of a SOC 2 control.
Custodian fits when the action is reactive, the blast radius is contained (notify, stop, tag), and the time-to-remediation budget is measured in minutes, not seconds.
## OPA / Gatekeeper / Rego: the admission engine
OPA decouples policy from the application by exposing a uniform decision API. Inside Kubernetes, OPA Gatekeeper translates Rego policies into validating admission webhooks: a request to create a Pod, Deployment, or any custom resource is sent to OPA, OPA evaluates the rules, and the API server allows or denies.
The shape OPA is good at: a Rego module under package kubernetes.admission declares two deny rules. The first inspects the incoming admission request, fires when the kind is Pod and any container is missing resources.limits.memory, and returns the message "Pod containers must declare memory limits". The second iterates the container list, fires when any container lacks resources.requests.cpu, and returns a templated message naming the offending container. Both rules run synchronously at admission time and produce a single deny verdict per offending field.
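The two rules above, sketched in classic (v0) Rego syntax as they would appear in a plain OPA admission deployment (Gatekeeper wraps equivalent logic in ConstraintTemplates):

```rego
package kubernetes.admission

# Deny any Pod container that does not declare a memory limit.
deny[msg] {
    input.request.kind.kind == "Pod"
    some i
    container := input.request.object.spec.containers[i]
    not container.resources.limits.memory
    msg := "Pod containers must declare memory limits"
}

# Deny, naming the offending container, when a CPU request is missing.
deny[msg] {
    input.request.kind.kind == "Pod"
    some i
    container := input.request.object.spec.containers[i]
    not container.resources.requests.cpu
    msg := sprintf("container %q must declare a CPU request", [container.name])
}
```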
The decision is synchronous, sub-second, and lives in the critical path of the resource creation. There is no race window. A pod without limits never enters etcd. The same engine extends beyond Kubernetes via Conftest (Terraform plan validation), Envoy authorization, and the OPA-Gatekeeper Constraint Templates library.
OPA’s honest limits:
| Limit | Consequence |
|---|---|
| Rego learning curve | Two senior engineers go on vacation; nobody else can debug the deny |
| Synchronous decision SLA | A slow OPA bricks the API server admission path |
| No native action verbs | Rego decides; another system must remediate |
| No state introspection | OPA does not know what already exists; it only sees the incoming request |
| Audit-grade decision logs require pipeline | Default OPA logs are debug-shaped, not audit-shaped |
Rego is the largest social cost. Teams that invested in OPA seriously in 2024-2025 are the ones that stayed on it; teams that picked it up casually have policy files that nobody on the current team can edit safely. The state-introspection limit is the architectural one: OPA evaluates the request in front of it, with whatever data was loaded into its bundle. If the rule needs “this team has already provisioned 8 H100s this month, deny the 9th,” OPA cannot answer without an external data source, and that source becomes another system to operate.
OPA fits when the decision is admission-time, the rule is expressible in Rego, and the team has at least one engineer who can debug a deny without copy-pasting from Stack Overflow.
## MCP-enforced policy: the intent engine
MCP (Model Context Protocol) is the surface that did not exist as a policy lever twelve months ago. It is now the third leg of the matrix because every developer in 2026 is talking to a coding agent (Claude Code, Cursor, Codex) and the agent is making infrastructure decisions on their behalf. A policy engine that does not surface to the agent leaves the largest decision-making interface in the company unmanaged.
MCP-enforced policy works on the request side, not the resource side. When a developer asks the agent “this terraform plan is failing admission, fix the bucket so it deploys,” the agent (if connected to a policy-aware MCP server) reads the policy graph covering active policies, applied resources, ownership, drift, exceptions, and audit events, and answers with the correct fix the first time. The agent does not retry the same failed plan with shape variations.
The shape an MCP policy server is good at:
| MCP tool | What it returns | Used by agent for |
|---|---|---|
| list_policies(scope) | Active policies, severity, owners | Pre-flight: which rules apply to this resource |
| check_resource(arn, action) | Allow/deny + reasoning + cited policy | Should this change be made at all |
| resource_ownership(arn) | Team, cost center, on-call | Routing exceptions / approvals |
| drift_events(scope, window) | Recent drift detections | Is this resource in a known-bad state |
| exception_status(policy, resource) | Active exception, expiry, approver | Is the override legitimate or expired |
| violation_history(scope) | Recent denies, by frequency | Pattern detection for the agent’s reasoning |
The architecture pattern is two MCP servers composed together. A read-only cloud MCP exposes resource state (“the bucket has policy X, encryption Y, current ACL Z”). A policy MCP exposes intent (“policy 47 says public buckets are denied unless the bucket is in the public-cdn-prefix exception list, expiring 2026-08-01, owned by web-team”). The agent reads both before taking action, so the action is policy-aware before it is even composed.
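A sketch of the check_resource decision shape, with an in-memory dict standing in for the policy graph. Every name here is illustrative (this is not a real MCP SDK API); the point is the return shape: a verdict, a cited policy, and reasoning the agent can relay:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Decision:
    allow: bool
    policy_id: str   # cited policy, so the agent can explain the verdict
    reasoning: str

# Illustrative policy graph: one policy denying public bucket ACLs,
# with a dated exception for a hypothetical CDN bucket.
POLICIES = {
    "policy-47": {
        "action": "s3:PutBucketAcl:public",
        "effect": "deny",
        "exceptions": {"arn:aws:s3:::public-cdn-assets": date(2026, 8, 1)},
    }
}

def check_resource(arn: str, action: str, today: date) -> Decision:
    """Allow/deny plus reasoning: the shape a check_resource MCP tool returns."""
    for pid, pol in POLICIES.items():
        if pol["action"] != action or pol["effect"] != "deny":
            continue
        expiry = pol["exceptions"].get(arn)
        if expiry and today <= expiry:
            return Decision(True, pid, f"exception active until {expiry}")
        return Decision(False, pid, "public ACLs are denied; no active exception")
    return Decision(True, "none", "no policy matched this action")
```

An agent that calls this before composing a change learns not just that the action is denied, but which policy denied it and whether an exception path exists, which is exactly what stops the blind-retry loop.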
MCP’s honest limits:
| Limit | Consequence |
|---|---|
| Only enforces against MCP-aware actions | Direct console clicks bypass entirely |
| Read-only is the safe default | Write tools are the security risk surface |
| Requires policy graph as system of record | If graph is stale, advice is wrong |
| Audit log must capture agent identity | Otherwise per-user enforcement collapses |
| Caching window matters | A 60s cache is the typical hot path; longer goes stale, shorter gets expensive |
The bypass limit is the structural one. A developer who clicks “Make Public” in the AWS console is not stopped by an MCP policy server, because the console does not call the MCP server. MCP enforces at the agent layer, not the cloud API layer. This is why MCP is the third leg of a matrix, not a replacement for the first two. State-loop (Custodian) catches what bypassed the agent. Admission (OPA) blocks what bypassed the agent and reached the API. MCP guides the largest population of decisions before they reach either gate.
MCP fits when the team has nontrivial AI agent usage, the policy can be expressed as a queryable graph, and the value is in shifting the decision left to the developer’s IDE or the agent’s reasoning loop.
## The matrix: which one for which problem
| Problem | Best primary | Best secondary | Why |
|---|---|---|---|
| Idle EC2, untagged resources, lifecycle cleanup | Custodian | (none) | State-loop, blast radius small, time budget loose |
| Public S3 buckets must never exist | OPA (Terraform) | Custodian (state-loop catch) | Block at admission; backstop with state scan |
| Pods without resource limits | OPA Gatekeeper | (none) | Pure admission, sub-second decision |
| Compliance evidence collection (SOC 2, ISO 27001) | Custodian | OPA for new resource gating | Periodic state attestation is the audit deliverable |
| AI agent generating Terraform | MCP | OPA (Conftest on plan) | Guide intent first, validate plan as backstop |
| Developer asks Claude to “fix this admission failure” | MCP | (none) | Agent must read policy graph or it retries blindly |
| Cost-driven shutdown of non-prod overnight | Custodian or purpose-built scheduler | (none) | State-loop, scheduled, low-blast-radius action |
| IAM drift across federated accounts | OPA + state collector | Custodian for remediation | Detection in OPA on Config events; action via Custodian |
| Claude Code agent making changes in dev | MCP | OPA on the resulting commit/PR | Two-layer: intent (MCP), validation (OPA on PR) |
| Snowflake compute-credit governance | Custodian-style state loop on Snowflake | (none) | Outside K8s/cloud, but same shape |
The general rule of thumb: state problems go to Custodian, admission problems go to OPA, intent problems go to MCP. Most production governance practices need at least two of the three, and the mature ones use all three with clear surface ownership.
This works when the rule definitions stay in one source of truth (typically a policy registry that compiles down to Custodian YAML, Rego, and MCP graph schemas). It breaks when each engine has its own hand-edited rule file, because the same intent (“public buckets denied”) drifts across three rule definitions and the team has no single source to audit.
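One way to picture “compiles down”: a single registry entry in plain data, emitted as both a Custodian-shaped policy and a Rego deny stub. The field names and output schemas here are illustrative sketches, not the engines’ full grammars:

```python
# Illustrative single-source rule; field names are made up for the sketch.
RULE = {
    "id": "public-buckets-denied",
    "resource": "aws.s3",
    "condition": {"attr": "PublicAccessBlock", "equals": False},
    "message": "public buckets are denied",
}

def to_custodian(rule: dict) -> dict:
    """Emit a Custodian-shaped policy dict (serialize to YAML downstream)."""
    return {
        "policies": [{
            "name": rule["id"],
            "resource": rule["resource"],
            "filters": [{"type": "value",
                         "key": rule["condition"]["attr"],
                         "value": rule["condition"]["equals"]}],
            "actions": ["notify"],
        }]
    }

def to_rego(rule: dict) -> str:
    """Emit a Rego deny stub expressing the same intent at admission time."""
    return (
        f'package registry.{rule["id"].replace("-", "_")}\n\n'
        "deny[msg] {\n"
        f'    input.{rule["condition"]["attr"]} == '
        f'{str(rule["condition"]["equals"]).lower()}\n'
        f'    msg := "{rule["message"]}"\n'
        "}\n"
    )
```

The audit story then lives in the registry: one entry, three generated artifacts, and a diff on the entry is a diff on all three surfaces.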
## The migration order
For a team starting in 2026, the order matters. Picking the wrong primary first creates re-platforming work in 6-12 months.
Start with Custodian if your dominant problem is FinOps and lifecycle hygiene (idle resources, untagged sprawl, orphaned snapshots, dev environments running 24/7). Custodian gets you 70% of the visible cost-governance value in two weeks. It is the lowest-cognitive-cost engine: YAML rules, broad cloud coverage, AWS-blessed examples. Add OPA when you start gating new infrastructure (Terraform plans, Kubernetes admission). Add MCP when AI agents become a non-trivial share of infrastructure changes, typically the quarter after the team standardizes on Claude Code or Cursor.
Start with OPA Gatekeeper if your dominant problem is Kubernetes admission (uncontrolled pod specs, missing resource limits, image policy violations) and you have at least one engineer comfortable with Rego. The early investment in Rego pays back across Conftest (Terraform), Envoy authorization, and Gatekeeper constraints. Add Custodian for the state-loop tail (the resources that exist before OPA was deployed, or that OPA cannot reach). Add MCP when agent traffic shows up.
Start with MCP only if the team is already AI-native: every developer uses Claude Code daily, the volume of agent-generated infrastructure changes is meaningful, and the existing manual policy pain is mainly “the agent keeps making the same wrong choice.” This is the leading indicator for 2026, not for 2024 or 2025. Adopting MCP-first without the underlying admission and state engines means the agent is the only line of defense, and the bypass surface (humans clicking in consoles, CI scripts hitting cloud APIs directly) is unmanaged.
| Team profile | Order | Rationale |
|---|---|---|
| Cost-pressured mid-market, low K8s complexity | Custodian → OPA → MCP | Dollar-pain first; admission later |
| K8s-heavy platform engineering | OPA → Custodian → MCP | Admission is the dominant pain |
| AI-native engineering org | MCP + OPA in parallel → Custodian | Intent + admission together; state catch-up |
| Compliance-driven, audit-anchored | Custodian → OPA → MCP | Evidence-first |
## The closing call
The 2024 framing was “policy as code.” The 2026 framing is “policy across surfaces.” A single engine cannot own resource state, admission, and agent intent at once, because the rule fires at three different points in the lifecycle and the action shape differs at each point. The teams that ship clean governance in 2026 are the ones that pick a primary surface, adopt a secondary within two quarters, and write the rules in a shape that compiles down to all three engines from a single source.
Custodian still owns the state loop. OPA still owns the admission gate. MCP owns the new surface that did not exist twelve months ago (the agent reasoning loop) and that surface is the largest decision-making interface in the company that is not yet under policy control.
Pick the primary that matches your dominant pain. Add the second engine before the next compliance audit. Add MCP before the agent fleet outgrows the policy graph that is supposed to govern it. The decision matrix is not theoretical: it is the difference between a governance practice that scales with the engineering org and one that re-platforms every other quarter.