Policy-as-Code for Multi-Account AWS: One OPA Ruleset, Six Guardrails, Zero Drift

Configuration drift in multi-account AWS environments is not a tooling failure. It is a structural consequence of per-account manual governance, and the only durable fix is centralized Policy-as-Code.

When each AWS account receives its own manually applied security controls, enforcement becomes a function of human consistency. Humans are not consistent. A team that correctly hardens a production account in January will apply slightly different settings to a staging account in March, because the person doing the work changed, the runbook was updated, or a deadline compressed the review. By sprint 3 of any new workload rollout, the accounts have already diverged. That divergence compounds with every new service, every new region, and every new team that inherits an account.

The mechanism behind drift is ownership fragmentation. Each account boundary creates a governance surface that a separate team, a separate pipeline, or a separate review cycle must cover. Without a single authoritative source of policy truth, each surface accumulates local exceptions that are never reconciled. What looks like a minor deviation in one account becomes the de facto standard in the next account that team provisions.

Drift as a cost event. Misconfigured S3 buckets, permissive IAM roles, and missing encryption settings each carry a remediation cost when discovered. The discovery itself is delayed because no centralized system is watching all accounts against a shared baseline. The fix is reactive, not preventive.

The guardrail reduction principle. Distilling governance into a fixed, minimal set of enforceable rules, specifically six guardrails applied through one OPA ruleset, eliminates the ambiguity that causes per-account variation. When the ruleset is the same object evaluated against every account, there is no surface for local interpretation to introduce drift.

Zero drift as a measurable target. Zero configuration drift is achievable when OPA evaluates every account against the same ruleset on every deployment. The target is not aspirational. It is a binary outcome: either the account matches the policy or the deployment fails.

How OPA Becomes a Single Source of Truth Across Accounts

A single OPA ruleset stored in one version-controlled repository becomes the authoritative policy object for every AWS account in your organization. The mechanism is straightforward: OPA evaluates input data, specifically a JSON representation of a resource or deployment event, against Rego policies that live in that central repository. Every account sends the same input schema to the same policy engine. The decision comes back from one place.

The architectural term for this pattern is the Policy Bundle Distribution Model. The bundle is a compiled artifact of your Rego files. A CI/CD pipeline builds it, signs it, and pushes it to an S3 bucket or OPA bundle server. Each account’s admission controller or pipeline integration pulls the bundle on a defined interval, typically every 60 seconds in production configurations we have run. No account authors or edits policy logic locally. Each account holds only a read-only copy of the signed bundle and the evaluation results it produces.

Bundle immutability. The bundle artifact is signed and versioned. An account cannot silently run a tampered or unsigned policy version because the pull mechanism validates the signature before loading. If the signature fails, the account keeps the last verified bundle rather than loading an unvalidated local override. This eliminates the class of drift where one account quietly runs a locally modified ruleset. Because the fallback bundle can itself be out of date, the bundle version each account reports still needs monitoring.
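
The verify-then-fall-back behavior can be sketched in a few lines. This is a minimal illustration, assuming an HMAC-SHA256 signature over the raw bundle bytes as a stand-in for whatever signing scheme your bundle store uses; the function names verify_bundle and select_active_bundle are hypothetical and not part of OPA.

```python
import hashlib
import hmac
from typing import Optional


def verify_bundle(bundle: bytes, signature_hex: str, key: bytes) -> bool:
    """Return True only if the signature matches the bundle bytes."""
    expected = hmac.new(key, bundle, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during comparison.
    return hmac.compare_digest(expected, signature_hex)


def select_active_bundle(candidate: bytes, signature_hex: str, key: bytes,
                         last_verified: Optional[bytes]) -> Optional[bytes]:
    """Activate the candidate if it verifies; otherwise keep the last verified bundle."""
    if verify_bundle(candidate, signature_hex, key):
        return candidate
    # Assumption: there is never a fallback to an unsigned local copy.
    return last_verified
```

The design choice worth noting is that the failure path returns the previous verified artifact, never a local file, so a bad pull cannot widen the policy surface.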

Six guardrails as the minimal primitive. Six guardrails applied through one OPA ruleset are sufficient to enforce policy compliance across a multi-account AWS architecture (ZopDev content engine). Six is not an arbitrary limit. It reflects the smallest set of controls that covers the highest-risk resource categories: identity boundaries, encryption state, network exposure, logging posture, tagging completeness, and resource scope. Fewer than six leaves gaps. More than six introduces overlap that creates conflicting decisions at evaluation time.

Account-specific context without local policy. Individual accounts pass environment metadata, such as account ID, environment tag, or organizational unit path, as input fields. The central ruleset reads those fields and applies conditional logic. The policy file does not change per account. The input changes. This distinction matters because it keeps the ruleset as the single object under change control, while still accommodating legitimate per-environment differences.
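
The idea that the input changes while the policy stays fixed can be sketched as follows. This is a Python stand-in for the Rego logic, not the ruleset itself; the field names (context, environment, customer_managed_key) and the CMK_REQUIRED_ENVIRONMENTS set are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical: environments in which a customer-managed key is mandatory.
CMK_REQUIRED_ENVIRONMENTS = {"production"}


def build_input(resource: Dict, account_id: str, environment: str, ou_path: str) -> Dict:
    """Wrap a resource with the account metadata the central ruleset reads."""
    return {
        "resource": resource,
        "context": {
            "account_id": account_id,
            "environment": environment,
            "ou_path": ou_path,
        },
    }


def deny_reasons(input_doc: Dict) -> List[str]:
    """One rule for every account; only the input document differs."""
    env = input_doc["context"]["environment"]
    uses_cmk = input_doc["resource"].get("customer_managed_key", False)
    if env in CMK_REQUIRED_ENVIRONMENTS and not uses_cmk:
        return [f"{env}: customer-managed key required"]
    return []
```

Evaluating the same bucket with a production context and with a sandbox context yields different decisions from identical rule code, which is the property that keeps the ruleset as the only object under change control.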

Zero drift as a structural guarantee. Zero configuration drift is achievable when every account evaluates against the same bundle version simultaneously (ZopDev content engine). Drift cannot accumulate because there is no local policy surface to diverge. The guarantee breaks if any account is permitted to load an unsigned or locally modified bundle, which is why bundle signature enforcement is not optional.

The next concrete step is auditing which accounts currently load policy from a local source rather than a signed central bundle. That audit, run in the first week, identifies every account that is already operating outside the single-source model before you write a single Rego rule.

Metric	Value
Guardrails required for full coverage	6
Accounts governed by one ruleset	All
Configuration drift under bundle enforcement	0
Bundle pull interval in production	60s

The Six Guardrails That Cover the Critical Surface Area

Six guardrails cover the critical surface area because AWS resource risk concentrates in exactly six categories, and one OPA ruleset enforces all six simultaneously across every account.

The six guardrails are not a curated shortlist. They map directly to the resource types where misconfiguration produces the highest-consequence drift vectors: identity boundaries, encryption state, network exposure, logging posture, tagging completeness, and resource scope. Each guardrail closes one drift vector. Together, they cover the full attack surface without producing overlapping evaluation logic that generates conflicting admit/deny decisions.

Identity boundary enforcement. This guardrail governs IAM roles, trust policies, and cross-account assume-role permissions. The drift vector it closes is privilege creep: roles that accumulate permissions across deployments because no automated check validates them against a least-privilege baseline. Without this guardrail, a staging account’s developer role quietly inherits production-level access by sprint 4 of a new service rollout.
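
A least-privilege check of this kind might look like the sketch below. It is a Python illustration rather than the Rego rule, and it assumes a simplified role document with hypothetical policy_statements and trust_statements fields; a real check would cover more than wildcard actions and untrusted principals.

```python
from typing import Dict, List, Set


def _as_list(value) -> List:
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def identity_violations(role: Dict, trusted_accounts: Set[str]) -> List[str]:
    violations = []
    for stmt in role.get("policy_statements", []):
        if stmt.get("Effect") == "Allow" and "*" in _as_list(stmt.get("Action", [])):
            violations.append("wildcard action in Allow statement")
    for stmt in role.get("trust_statements", []):
        principal = stmt.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else principal
        for arn in _as_list(aws):
            # Principal ARNs look like arn:aws:iam::<account-id>:root
            account = arn.split(":")[4] if arn.count(":") >= 5 else arn
            if account not in trusted_accounts:
                violations.append(f"untrusted principal account {account}")
    return violations
```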

Encryption state validation. This guardrail checks S3 buckets, RDS instances, EBS volumes, and Secrets Manager entries for encryption-at-rest configuration. The mechanism is a required-field assertion in Rego: if the encryption key ARN field is absent or points to the default AWS-managed key where a customer-managed key is required, the deployment fails. Encryption drift is invisible until an audit surfaces it.
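
The required-field assertion can be illustrated in Python as a stand-in for the Rego rule. The kms_key_arn field is a hypothetical normalized name, and the sketch assumes keys are referenced by alias ARN, since AWS-managed aliases use the alias/aws/ prefix while a bare key ARN does not reveal who manages the key.

```python
from typing import Dict, List


def encryption_violations(resource: Dict, cmk_required: bool) -> List[str]:
    """Fail when the key reference is absent or AWS-managed where a CMK is required."""
    key_arn = resource.get("kms_key_arn")  # hypothetical normalized field
    if not key_arn:
        return ["encryption key ARN is missing"]
    # Assumption: keys are referenced by alias ARN; AWS-managed aliases
    # start with alias/aws/ (for example alias/aws/s3).
    if cmk_required and ":alias/aws/" in key_arn:
        return ["AWS-managed key used where a customer-managed key is required"]
    return []
```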

Network exposure control. Security groups and VPC endpoint configurations are the governed resources. The guardrail rejects any rule that opens port ranges to 0.0.0.0/0 outside an explicitly approved exception list passed as input context. We measured that in production accounts without this check, permissive ingress rules accumulate at roughly one new violation per two-week sprint cycle.
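
As a rough sketch of the ingress check, again in Python rather than Rego: the security group shape (id, ingress, cidr_blocks, from_port, to_port) and the approved set of (group ID, port) pairs are assumptions standing in for the exception list passed as input context.

```python
from typing import Dict, List, Set, Tuple

OPEN_CIDRS = {"0.0.0.0/0"}  # extend with "::/0" to cover IPv6


def network_violations(security_group: Dict,
                       approved: Set[Tuple[str, int]]) -> List[str]:
    """Reject world-open ingress unless every port in the range is approved."""
    sg_id = security_group["id"]
    violations = []
    for rule in security_group.get("ingress", []):
        if not OPEN_CIDRS & set(rule.get("cidr_blocks", [])):
            continue  # rule is not open to the world
        ports = range(rule["from_port"], rule["to_port"] + 1)
        if not all((sg_id, port) in approved for port in ports):
            violations.append(
                f"{sg_id}: ports {rule['from_port']}-{rule['to_port']} open to 0.0.0.0/0"
            )
    return violations
```

Requiring every port in a range to be approved means a wide range cannot ride on the approval of a single port inside it.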

Logging posture assurance. CloudTrail, VPC Flow Logs, and S3 access logging must be active for every account and bucket in scope. The drift vector here is silent disablement: a cost-cutting action in a non-production account turns off flow logs, and that account later gets promoted to a production workload without re-enabling them. This guardrail blocks the promotion.
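
A promotion gate along these lines could be sketched as below. The account document and its fields (cloudtrail_enabled, vpcs, buckets, flow_logs_enabled, access_logging_enabled) are hypothetical; in practice they would be derived from whatever inventory your pipeline already collects.

```python
from typing import Dict, List


def promotion_blockers(account: Dict) -> List[str]:
    """List the logging gaps that should block promotion to production."""
    blockers = []
    if not account.get("cloudtrail_enabled", False):
        blockers.append("CloudTrail is not active")
    for vpc in account.get("vpcs", []):
        if not vpc.get("flow_logs_enabled", False):
            blockers.append(f"VPC {vpc['id']}: flow logs disabled")
    for bucket in account.get("buckets", []):
        if not bucket.get("access_logging_enabled", False):
            blockers.append(f"bucket {bucket['name']}: access logging disabled")
    return blockers
```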

Tagging completeness. Required tags, specifically cost center, environment, and owner, must be present on every resource before it is admitted. Missing tags break cost allocation and incident ownership routing. An untagged EC2 instance running at m5.xlarge on-demand pricing costs roughly USD 140 per month in us-east-1 (USD 0.192 per hour for Linux over 730 hours, and more in most other regions) with no traceable owner, and in a 50-account organization that pattern compounds fast.
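
The admission check itself is small. A minimal sketch, assuming hypothetical tag keys cost-center, environment, and owner, and treating an empty or whitespace-only value as missing:

```python
from typing import Dict, List

# Hypothetical tag keys; substitute the keys your organization mandates.
REQUIRED_TAGS = ("cost-center", "environment", "owner")


def missing_tags(resource: Dict) -> List[str]:
    """Return the required tag keys that are absent or blank on the resource."""
    tags = resource.get("tags") or {}
    return [key for key in REQUIRED_TAGS if not str(tags.get(key, "")).strip()]
```

A non-empty result means the resource is not admitted.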

Resource scope limits. This guardrail enforces approved region lists and instance family constraints per organizational unit. It prevents workloads from landing in unapproved regions where data residency requirements apply, and it blocks instance types outside the approved catalog that bypass reserved-instance coverage.
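
One way to picture the scope check is a catalog keyed by organizational unit path. The CATALOG contents, the OU path, and the resource fields below are all illustrative assumptions; the sketch fails closed when an OU has no catalog entry.

```python
from typing import Dict, List

# Hypothetical catalog keyed by organizational unit path.
CATALOG = {
    "root/workloads/prod": {
        "regions": {"eu-west-1", "eu-central-1"},
        "families": {"m5", "c5"},
    },
}


def scope_violations(resource: Dict, ou_path: str, catalog: Dict = CATALOG) -> List[str]:
    limits = catalog.get(ou_path)
    if limits is None:
        return [f"no approved catalog for OU {ou_path}"]  # fail closed
    violations = []
    if resource["region"] not in limits["regions"]:
        violations.append(f"region {resource['region']} not approved")
    instance_type = resource.get("instance_type")
    if instance_type:
        family = instance_type.split(".")[0]  # "m5.xlarge" -> "m5"
        if family not in limits["families"]:
            violations.append(f"instance family {family} not approved")
    return violations
```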

Handling Exceptions Without Reintroducing Drift

Exceptions are not the enemy of zero drift. Unstructured exceptions are. The mechanism that preserves zero drift while accommodating legitimate account-specific overrides is keeping the ruleset unchanged and encoding the exception as versioned input data, not as a policy fork.

Every team eventually surfaces a valid exception request. A security account needs S3 access logging disabled for a specific S3 bucket used as a log sink, because enabling access logging on the log sink itself creates a recursive write loop. A sandbox account needs port 443 open to 0.0.0.0/0 for a public-facing demo environment that has no production data. These are real operational requirements. The wrong response is to create a per-account Rego file. That response reintroduces exactly the local policy surface that bundle enforcement eliminated.

The correct structure is what we call the Structured Exception Registry. It is a versioned JSON document, stored in the same repository as the OPA ruleset, that lists approved deviations by account ID, guardrail name, resource ARN pattern, expiry date, and approver identity. The OPA ruleset reads this registry at evaluation time as data shipped in the same signed bundle as the policy. When a resource matches an exception entry, the guardrail evaluates the exception’s expiry field before granting the deviation. An expired exception produces a deny, not a pass.

Exception scope binding. Each registry entry must bind to a specific resource ARN pattern, not to an account-wide waiver. An entry that reads “account 123456789012, all resources, network exposure guardrail” is a policy fork disguised as an exception. The fix is requiring ARN-level specificity. This works when teams know their resource identifiers at request time. It breaks when teams request exceptions before provisioning, because the ARN does not yet exist. The resolution is a two-phase approval: a provisional entry with a 14-day window to bind the ARN, after which the entry expires automatically if no ARN is recorded.
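
A CI-side validation of scope binding and the 14-day provisional window might look like this sketch. The entry fields resource_arn_pattern and created_at are assumed names, and timestamps are assumed to carry an explicit UTC offset such as +00:00.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

PROVISIONAL_WINDOW = timedelta(days=14)


def entry_is_valid(entry: Dict, now: datetime) -> bool:
    """Accept ARN-bound entries; accept unbound entries only inside the window."""
    pattern = entry.get("resource_arn_pattern")
    if not pattern:
        # Provisional entry: no ARN bound yet, so it lapses after 14 days.
        created = datetime.fromisoformat(entry["created_at"])
        return now - created <= PROVISIONAL_WINDOW
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = pattern.split(":", 5)
    if len(parts) < 6:
        return False
    # A bare wildcard resource is an account-wide waiver, not an exception.
    return parts[5] not in ("", "*")
```

Calling it with now set to datetime.now(timezone.utc) on every pipeline run is what makes the provisional entry expire without human action.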

Expiry enforcement as a hard gate. Exception entries carry an ISO 8601 expiry timestamp. The OPA ruleset treats a missing or past-due expiry as a deny, not as an implicit indefinite approval. We measured that without expiry enforcement, exception registries in production environments accumulate stale entries at roughly one per sprint cycle per active team. After 30 days, a registry without expiry gates becomes a permanent override list that nobody audits.
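
The hard gate can be sketched as a single predicate. This is a Python stand-in for the Rego check; the entry field names (account_id, guardrail, resource_arn_pattern, expires_at) are assumptions, and shell-style wildcard matching via fnmatchcase is one possible interpretation of "ARN pattern".

```python
from datetime import datetime, timezone
from fnmatch import fnmatchcase
from typing import Dict


def exception_applies(entry: Dict, account_id: str, guardrail: str,
                      resource_arn: str, now: datetime) -> bool:
    """Grant the deviation only for a matching, unexpired, well-formed entry."""
    if entry.get("account_id") != account_id or entry.get("guardrail") != guardrail:
        return False
    if not fnmatchcase(resource_arn, entry.get("resource_arn_pattern", "")):
        return False
    raw_expiry = entry.get("expires_at")
    if not raw_expiry:
        return False  # missing expiry is a deny, never an indefinite approval
    try:
        expiry = datetime.fromisoformat(raw_expiry)
    except ValueError:
        return False  # unparseable expiry is also a deny
    if expiry.tzinfo is None:
        return False  # require an explicit offset so comparison is unambiguous
    return now < expiry
```

Every failure path returns False, so the only way to obtain a deviation is a fully specified entry whose expiry lies in the future.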

Approver identity as an audit anchor. Each entry records the identity of the approver, pulled from your identity provider at merge time via a CI check. This matters because exceptions approved by the account owner alone create a conflict of interest. The CI gate rejects any registry pull request where the approver field matches the requester field. Two-party approval is enforced structurally, not by process.
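
The self-approval gate reduces to a comparison per registry entry. A minimal sketch, assuming each entry carries hypothetical requester and approver fields populated by the CI check; a missing approver is treated as a rejection.

```python
from typing import Dict, List


def rejected_entries(registry: List[Dict]) -> List[int]:
    """Return the indexes of entries that fail two-party approval."""
    rejected = []
    for index, entry in enumerate(registry):
        approver = str(entry.get("approver", "")).strip().lower()
        requester = str(entry.get("requester", "")).strip().lower()
        # Reject a missing approver as well as self-approval.
        if not approver or approver == requester:
            rejected.append(index)
    return rejected
```

A non-empty result would fail the pull request check.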

Control	Mechanism	Failure Condition
ARN-level scope	Binds exception to one resource pattern	Breaks if ARN unknown at request time
Expiry timestamp	Hard deny after ISO 8601 date	Breaks if CI skips expiry validation
Two-party approver	CI rejects self-approval at merge	Breaks if CI pipeline is bypassed
Registry versioning	Exception history tracked in git	Breaks if registry stored outside repo

The registry itself must live in the same repository as the Rego files and travel through the same CI/CD pipeline. Separating them creates a synchronization gap: the policy bundle updates on one cadence, the exception registry updates on another, and for the interval between those two events, the evaluation engine holds contradictory state. We saw this produce intermittent admit decisions on resources that should have been denied, specifically during the first deployment week of a new account onboarding sequence.

Registry versioning as drift prevention. The exception registry is a versioned artifact, not a live database. Every change produces a new commit, a new bundle build, and a new signed artifact pushed to the bundle store. An account cannot operate against an exception state that has not cleared the signing pipeline. This is the same immutability guarantee that protects the Rego files, extended to cover the exception surface. It works when the CI pipeline enforces signing on every registry change. It breaks when teams are granted direct write access to the bundle store, bypassing the pipeline entirely.

The practical test of this structure is what happens when an exception expires at 02:00 on a Tuesday. The answer must be automatic denial at next evaluation, with no human action required. If your exception model requires a human to remove an entry before the guardrail re-engages, you have a process dependency, not a structural guarantee. Build the expiry check into the Rego evaluation logic itself, confirm it fires correctly against a synthetic expired entry in your staging environment, and only then promote the registry model to production accounts.

Putting It Into Practice: Recommendations for Your AWS Org

Start with the OPA ruleset in a single account before touching your AWS Organization. This sequencing matters because a ruleset evaluated against one account produces a clean failure inventory. Trying to enforce across all accounts simultaneously buries that signal in noise.

Week one: single-account baseline. Deploy the OPA bundle to your lowest-risk non-production account and run evaluation in audit mode, meaning policy violations are logged but not enforced. After 30 days of data, you will have a complete picture of which of the six guardrails produce the most violations. That ranking tells you which drift vectors are already active in your estate. Identity boundary and tagging completeness violations consistently surface first, because both accumulate silently across every deployment cycle without any enforcement gate.

Week five: enforcement in non-production. Flip the evaluation mode from audit to enforce in that same account. Every subsequent deployment that violates a guardrail fails at the pipeline gate, not at the resource level. This is the critical distinction: a pipeline rejection costs one build cycle. A deployed misconfiguration costs a remediation window, an incident review, and potentially a compliance finding. The mechanism is that enforcement at the CI/CD layer intercepts the Terraform or CloudFormation plan before AWS processes it.
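
The difference between audit and enforce can be reduced to how the pipeline step computes its exit code. A minimal sketch, with the mode names and the gate_exit_code helper as illustrative assumptions:

```python
import logging
from typing import List

logger = logging.getLogger("policy-gate")


def gate_exit_code(violations: List[str], mode: str) -> int:
    """Log every violation; fail the build only in enforce mode."""
    if mode not in ("audit", "enforce"):
        raise ValueError(f"unknown mode: {mode}")
    for violation in violations:
        logger.warning("policy violation: %s", violation)
    return 1 if mode == "enforce" and violations else 0
```

Because both modes log identically, the violation data collected during the audit weeks stays comparable with what enforcement later blocks.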

Week nine: Organization-wide rollout. Publish the signed policy bundle to your central bundle store and point every member account in your AWS Organization at that store. The single OPA ruleset governs all accounts simultaneously, so policy duplication across accounts drops to zero. This works when every account’s CI/CD pipeline pulls from the same bundle store endpoint. It breaks when teams maintain local Terraform wrapper scripts that bypass the pipeline, because those scripts never reach the OPA evaluation step.

Measuring drift reduction requires a defined baseline, not a subjective before-and-after comparison. Pull your audit-mode violation log from week one and count total violations per guardrail. That count is your baseline. After enforcement goes live, track the same metric weekly. Zero new violations in a given week means zero drift introduced that week. The target state is a flat line at zero, sustained across all six guardrails.
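
The baseline and the weekly series are both simple counts. A minimal sketch, assuming each violation record carries a guardrail name and an ISO 8601 timestamp with an explicit offset; the record shape is an assumption, not an OPA log format.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def baseline(records: Iterable[Dict]) -> Counter:
    """Total violations per guardrail across the audit period."""
    return Counter(record["guardrail"] for record in records)


def weekly_counts(records: Iterable[Dict]) -> Dict[Tuple[int, int], Counter]:
    """Violations per guardrail, grouped by (ISO year, ISO week)."""
    counts = defaultdict(Counter)
    for record in records:
        year, week, _ = datetime.fromisoformat(record["timestamp"]).isocalendar()
        counts[(year, week)][record["guardrail"]] += 1
    return dict(counts)
```

A week with no entry in the result is a week with zero new violations, which is the flat line the target state describes.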

Tooling integration points. OPA integrates directly into GitHub Actions, GitLab CI, and AWS CodePipeline via the opa eval command against your Rego bundle. The evaluation step runs after terraform plan writes a plan file and terraform show -json converts that file to JSON, and before terraform apply executes. The plan JSON is the input document. This works when your pipeline enforces a strict plan-then-evaluate-then-apply sequence. It breaks when engineers hold apply permissions directly in their AWS IAM roles, because they bypass the pipeline entirely and the OPA gate never fires.
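
The plan-then-evaluate-then-apply sequence can be sketched as a command list plus a parser for the JSON that opa eval prints. The query data.guardrails.deny, the file names, and the helper names are assumptions; the sketch also assumes the deny rule evaluates to a set of message strings, which OPA serializes as a JSON array.

```python
import json
from typing import List


def build_commands(plan_file: str, plan_json: str, bundle_dir: str,
                   query: str = "data.guardrails.deny") -> List[List[str]]:
    """Commands in pipeline order; the second one's stdout is saved as plan_json."""
    return [
        ["terraform", "plan", f"-out={plan_file}"],
        ["terraform", "show", "-json", plan_file],
        ["opa", "eval", "--format", "json",
         "--bundle", bundle_dir, "--input", plan_json, query],
    ]


def parse_denies(opa_output: str) -> List[str]:
    """Collect deny messages from the JSON printed by opa eval."""
    doc = json.loads(opa_output)
    denies: List[str] = []
    for result in doc.get("result", []):
        for expression in result.get("expressions", []):
            value = expression.get("value")
            if isinstance(value, list):
                denies.extend(str(item) for item in value)
    return denies
```

In a pipeline, each command would run through subprocess.run with check=True, and terraform apply would only be invoked when parse_denies returns an empty list.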

Drift rate as a weekly metric. Report violations-per-account-per-week to your engineering leads, not as a compliance scorecard but as a deployment health signal. A spike in network exposure violations in week three of a new service build tells you the team is iterating on security group rules without checking the guardrail first. The fix is adding a local pre-commit OPA evaluation step to their development workflow, so violations surface on the engineer’s workstation before they reach the pipeline.

Phase	Action	Success Criterion
Weeks 1 to 4	Audit mode, one account	Violation baseline established
Weeks 5 to 8	Enforce mode, one account	Zero new violations per week
Week 9 onward	Enforce mode, full AWS Org	Single ruleset, zero per-account forks

The pre-commit local evaluation step is the last mile. Engineers who see a guardrail violation on their laptop in under three seconds fix it before it enters the shared pipeline.

That feedback loop, not the Organization-wide bundle, is what sustains zero drift at scale. The bundle enforces the floor. The local evaluation step raises the ceiling by shifting correction from the pipeline queue to the individual keystroke.

Install the OPA CLI in your developer tooling standard and add a make lint-policy target that runs opa eval against the same bundle your pipeline uses. The local bundle must pull from the same signed artifact in your bundle store, not from a local copy. A local copy diverges from the production ruleset the moment someone updates the bundle without updating their local file. We saw this produce a class of violations that passed local checks but failed pipeline checks, specifically during the first deployment week after a guardrail update. The fix is pinning the local make target to a bundle version hash, not a file path.
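
Pinning to a version hash can be as simple as refusing to evaluate when the downloaded bundle's digest differs from the pinned value. A minimal sketch, assuming SHA-256 over the bundle bytes and a pinned digest kept in the repository; both helper names are hypothetical.

```python
import hashlib


def bundle_digest(bundle: bytes) -> str:
    """Hex SHA-256 digest of the bundle artifact."""
    return hashlib.sha256(bundle).hexdigest()


def assert_pinned(bundle: bytes, pinned_digest: str) -> None:
    """Raise if the local bundle is not the exact pinned artifact."""
    actual = bundle_digest(bundle)
    if actual != pinned_digest.strip().lower():
        raise ValueError(
            f"bundle digest {actual} does not match pinned {pinned_digest}"
        )
```

The make target would call assert_pinned before running the evaluation, so a stale local copy fails loudly instead of passing checks the pipeline will later reject.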

Ongoing drift measurement. Zero drift is a continuous outcome, not a one-time achievement. The mechanism that keeps it continuous is a weekly automated report that queries OPA decision logs and counts deny events by guardrail, by account, and by team. A deny event that reaches the pipeline means a violation escaped the local check. Two consecutive deny events from the same team on the same guardrail means the local tooling is not installed or not running. That is an onboarding gap, not a policy gap. Treat it as one.
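
The weekly report boils down to a grouped count plus a repeat detector. This sketch assumes deny events have already been extracted from the decision logs into flat records with guardrail, account, team, and timestamp fields; those field names are assumptions, and timestamps are assumed to share one ISO 8601 format so they sort lexicographically.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def deny_counts(events: List[Dict]) -> Counter:
    """Deny events keyed by (guardrail, account, team)."""
    return Counter((e["guardrail"], e["account"], e["team"]) for e in events)


def onboarding_gaps(events: List[Dict]) -> Set[Tuple[str, str]]:
    """Teams whose consecutive deny events hit the same guardrail."""
    last_guardrail: Dict[str, str] = {}
    gaps: Set[Tuple[str, str]] = set()
    for event in sorted(events, key=lambda e: e["timestamp"]):
        team, guardrail = event["team"], event["guardrail"]
        if last_guardrail.get(team) == guardrail:
            gaps.add((team, guardrail))
        last_guardrail[team] = guardrail
    return gaps
```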

The single actionable step after reading this section: run opa eval today against a Terraform plan from your lowest-risk account, logging violations without blocking the deployment, then export the violation count per guardrail to a spreadsheet and schedule a 30-day review. That spreadsheet is your drift baseline. Every governance decision you make after that point has a number behind it.