Self-Service Terraform: 8 Modules That Killed 60% of Our Platform Tickets

Platform teams do not fail because they hire the wrong people. They fail because the right people spend most of their time fielding the same eight requests over and over: provision a new service on EKS, add an RDS instance, create an S3 bucket with the right IAM policy. The requests are not hard. They are just constant.

We measured this across four platform teams. The average ticket took 2.3 hours to resolve end-to-end, counting the back-and-forth clarification, the Terraform write, the PR review, and the apply. A developer filing the ticket waited a day or two. The platform engineer context-switched twice. Nobody was happy.

The fix was not hiring more platform engineers. It was building a self-service Terraform module library: eight opinionated modules that developers invoke directly, enforce sane defaults without asking, and apply through a PR workflow that never requires a platform engineer to be in the loop.

Within 90 days of rolling out those eight modules and wiring them to Atlantis, ticket volume dropped 60%.

Why Platform Teams Drown in Repeat Tickets

The slow path for infrastructure provisioning looks like this:

Architecture diagram

Every step in that flow is a handoff. Handoffs create delay. Platform engineers lose half a day to context-switching. Developers lose a day or two waiting. The same resources get provisioned slightly differently each time because Terraform is written fresh from memory rather than from a tested module.

The resource types in those tickets are not random. When we clustered six months of tickets by resource type, eight categories accounted for 80% of total volume: EKS services, RDS instances, S3 buckets, SQS queues, Lambda functions, VPC subnets, IAM roles, and ECR repositories. That clustering is what drove the eight-module number. It is not an arbitrary figure. Most platform teams find that six to ten resource types cover the bulk of their load.

The 8 Modules and What Each One Enforces

Each module exposes a minimal interface to developers: the things they legitimately need to choose, like service name and instance size. Each module hard-codes the things developers routinely skip: encryption, tagging, network placement, deletion protection.

The pattern matters. Modules fail adoption when they are opinionated about things developers care about, like forcing a specific instance type, while staying silent on things developers forget, like enabling encryption at rest. We inverted that: flexible on sizing, strict on compliance.
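As a concrete sketch of that inversion (file layout, resource names, and granted actions are illustrative, not our exact module), an s3-bucket module might expose only the bucket name and allowed principals while hard-coding the compliance settings:

```hcl
# modules/s3-bucket — illustrative sketch of the "flexible on inputs,
# strict on compliance" split.

# The only things a developer chooses.
variable "bucket_name" {
  type = string
}

variable "allowed_principals" {
  type = list(string)
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

# Enforced, not exposed: versioning is always on.
resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Enforced: all public access blocked.
resource "aws_s3_bucket_public_access_block" "this" {
  bucket                  = aws_s3_bucket.this.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Enforced: server-side encryption with AES-256.
# (A 90-day lifecycle rule would be enforced the same way.)
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Access policy scoped to the principals the developer supplied.
data "aws_iam_policy_document" "access" {
  statement {
    principals {
      type        = "AWS"
      identifiers = var.allowed_principals
    }
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.this.arn}/*"]
  }
}

resource "aws_s3_bucket_policy" "this" {
  bucket = aws_s3_bucket.this.id
  policy = data.aws_iam_policy_document.access.json
}
```

Developers never see the enforced resources; they exist only inside the module, so there is nothing to forget and nothing to argue about in review.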

| Module | Resource | Developer Inputs | Enforced Defaults |
|---|---|---|---|
| eks-service | EKS Deployment + Service | name, image, replicas, cpu/memory limits | resource limits required, liveness probe required, namespace, labels schema |
| rds-instance | RDS (Postgres/MySQL) | engine, instance class, db name | encryption at rest, deletion protection on, automated backups 7 days, private subnet only |
| s3-bucket | S3 + Bucket Policy + IAM | bucket name, allowed principals | versioning on, public access blocked, server-side encryption AES-256, lifecycle rule 90 days |
| sqs-queue | SQS Standard or FIFO | queue name, visibility timeout | server-side encryption, dead letter queue wired, max receive count 3 |
| lambda-function | Lambda + IAM Role + Log Group | function name, runtime, handler, memory | least-privilege execution role, CloudWatch log group with 30-day retention, X-Ray tracing on |
| vpc-subnet | Subnet + Route Table | CIDR, availability zone, subnet type | private by default, no public IP auto-assign, VPC flow logs enabled |
| iam-role | IAM Role + Policy | role name, trusted principal, policy statements | no wildcard resources allowed (validation), path /service/, permission boundary attached |
| ecr-repo | ECR Repository | repo name | image scanning on push, tag immutability on, lifecycle policy 30 untagged images |
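The iam-role module's "no wildcard resources" rule is the kind of constraint Terraform can reject at plan time rather than in review. A sketch using a variable validation block (the variable shape is hypothetical, not our exact interface):

```hcl
variable "policy_statements" {
  # Illustrative shape; the real module may carry more fields.
  type = list(object({
    actions   = list(string)
    resources = list(string)
  }))

  validation {
    # Reject any statement whose resource list contains "*".
    condition = alltrue([
      for s in var.policy_statements : !contains(s.resources, "*")
    ])
    error_message = "Wildcard resources are not allowed; list explicit ARNs."
  }
}
```

A developer who tries to pass `resources = ["*"]` gets a clear error from `terraform plan` before the PR ever reaches a human.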

The tagging enforcement deserves specific mention. Every module writes a local block that constructs a standard tag map from required inputs: environment, team, cost-center, and managed-by = terraform. Resources come out of the module tagged correctly by default. Our internal governance audit showed this eliminated 90% of untagged resource drift. The manual alternative: a compliance scan finds untagged resources, creates a ticket, an engineer tracks down the owner, the owner tags the resource. That loop took three to five days per resource. The module collapses it to zero.
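A minimal sketch of that locals block, shown here attached to an SQS queue (variable names are illustrative):

```hcl
# Required tagging inputs every module shares.
variable "environment" { type = string }
variable "team"        { type = string }
variable "cost_center" { type = string }

variable "queue_name" { type = string }

locals {
  # Standard tag map, constructed once and applied to every
  # resource the module creates.
  standard_tags = {
    environment   = var.environment
    team          = var.team
    "cost-center" = var.cost_center
    "managed-by"  = "terraform"
  }
}

resource "aws_sqs_queue" "this" {
  name = var.queue_name
  tags = local.standard_tags
}
```

Because the tagging inputs are required variables, an untagged resource cannot get through `terraform plan` at all.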

How the Registry Makes It Truly Self-Service

Modules alone are not enough. If developers still need to ask a platform engineer to run terraform apply, the ticket volume drops but does not disappear. The platform team becomes the apply button instead of the Terraform author. That is a smaller problem, but it is still a bottleneck.

The registry layer removes the human from the apply step. We use Atlantis, which runs as a pod in the cluster and watches pull requests in the infrastructure repository. The self-service flow looks like this:

Architecture diagram

A developer using this flow goes from PR open to resource live in under 10 minutes, with no platform engineer in the loop. The plan output in the PR comment shows exactly what will be created: which resources, which tags, and, if Infracost is wired in, which costs. Developers can review their own plan before applying.

Spacelift and Terraform Cloud provide equivalent workflows if you prefer a managed control plane over a self-hosted Atlantis pod. The mechanism is the same: PR triggers plan, comment triggers apply, platform engineer is not required.

The infrastructure repository where modules live is version-controlled separately from application repositories. Developers open PRs in the infrastructure repository, reference a shared module from the modules/ directory via a relative path or private registry source, and supply only the variables the module exposes. Platform engineers review module updates, not individual resource provisioning requests.
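In practice, a developer's PR can contain nothing but a module call. A sketch (the org, ref, and values are hypothetical):

```hcl
# A developer-authored PR: one module call, no raw resources.
module "orders_db" {
  source = "git::https://github.com/example-org/terraform-modules.git//rds-instance?ref=v1.2.0"

  engine         = "postgres"
  instance_class = "db.t3.medium"
  db_name        = "orders"

  # Required tagging inputs enforced by the module.
  environment = "production"
  team        = "payments"
  cost_center = "eng-4821"
}
```

Everything else — encryption, deletion protection, backups, subnet placement — comes from the module, so the PR review surface is a dozen lines of values rather than a hundred lines of Terraform.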

What the Ticket Data Showed Before and After

We tracked tickets for 60 days before rollout and 90 days after. The before period establishes the baseline. The after period starts 30 days post-rollout to account for the learning curve as developers adopted the new workflow.

| Ticket Category | Before (per month) | After (per month) | Delta |
|---|---|---|---|
| EKS service provisioning | 47 | 6 | -87% |
| RDS instance requests | 31 | 9 | -71% |
| S3 bucket + policy setup | 28 | 4 | -86% |
| SQS queue creation | 19 | 3 | -84% |
| Lambda function setup | 14 | 5 | -64% |
| VPC subnet requests | 11 | 2 | -82% |
| IAM role requests | 38 | 22 | -42% |
| ECR repository setup | 9 | 1 | -89% |
| Other (one-off, complex) | 41 | 39 | -5% |
| **Total** | **238** | **91** | **-62%** |

IAM role tickets showed the smallest reduction for a specific reason: a significant portion of IAM work involves cross-account trust relationships and fine-grained permission debugging that the module cannot automate. The module handles the common case, not the edge case. Developers who need non-standard IAM configurations still file tickets, and those are genuinely complex.

The “other” category barely moved. Those tickets involve multi-region failover setups, cross-account peering, custom WAF rules, and one-off compliance remediation. Those are not repeat tickets. They are genuinely novel work, which is exactly what platform engineers should be spending time on.

Platform engineers reclaimed roughly 180 person-hours per month from ticket work. That time went into module maintenance, Atlantis operations, and building the next layer of automation.

Where Modules Break: The Failure Modes

Self-service Terraform is not maintenance-free. Three failure modes appear consistently in production.

| Failure Mode | Cause | Mitigation |
|---|---|---|
| State drift | Developer modifies resource outside Terraform (console, CLI) | Scheduled `terraform plan` runs to surface drift; lock console access for module-managed resources |
| Version mismatch | Teams pin old module versions; a security fix in v1.4 never reaches teams on v1.1 | Enforce minimum version in CI; send breaking-change notifications to module users |
| Bad default blast radius | A wrong default (too-permissive security group, wrong retention) propagates to every resource using the module | Test module changes in staging first; use semver and never silently change defaults in a patch release |

The version mismatch problem is the one teams underestimate. Once a module ships, teams pin to a version and stay there. If a security default changes in a later version, every team on the old version keeps the old behavior. We address this with two controls: Atlantis is configured to warn on modules older than two minor versions, and we maintain a CHANGELOG.md in the module repo with explicit callouts for default changes.
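One way to express the minimum-version floor in the module call itself (registry path and version numbers are hypothetical), which a CI check can also inspect and fail on:

```hcl
module "uploads_bucket" {
  # `version` constraints work with registry sources, not git URLs.
  source  = "app.terraform.io/example-org/s3-bucket/aws"
  # Floor guarantees the v1.4 security fix; cap blocks surprise majors.
  version = ">= 1.4.0, < 2.0.0"

  bucket_name        = "example-uploads"
  allowed_principals = ["arn:aws:iam::123456789012:role/uploader"]
}
```

A range constraint like this lets teams pick up patch and minor fixes on their next plan without opting in, while still protecting them from breaking major releases.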

Drift is a related problem. The module enforces defaults at creation time. If an engineer later modifies the resource in the AWS console, the module does not know. Scheduled terraform plan runs catch this: any plan that shows unexpected changes signals drift. We run these on a 24-hour schedule and route non-empty plans to a Slack channel.

The blast radius concern is real but manageable. A bad default in a module is worse than a bad default in a one-off Terraform file because it propagates to every resource ever created with that module. This is also why modules are strictly better than shared snippets: you can fix a bad default in one place and roll the fix forward via version bumps, something you cannot do with copy-pasted Terraform.

How to Ship Your First Module in a Week

The practical starting point is not architecture. It is your ticket backlog.

Architecture diagram

Pull three months of tickets. Cluster them by resource type. The top cluster is your first module. Write that one module, wire it to one pilot team with Atlantis, and measure for 30 days before adding the next module.

The temptation is to build all eight modules at once before shipping any of them. This is how module projects stall: the scope grows, the work drags, and the platform team ships nothing while the ticket queue keeps filling. One module in production teaches you more about your team’s usage patterns than eight modules in a design document.

The module itself should take two to three days for a senior engineer to write, test, and document. The Atlantis setup takes another day. The remaining time in the week goes to piloting with one team, collecting feedback, and hardening the variable validation.

If you want to see how internal developer platforms structure self-service infrastructure alongside cost guardrails, that pattern extends naturally from module-level enforcement to platform-level policy-as-code governance. The module is the unit. The platform is the system.

The goal is not eight modules. The goal is 60% fewer tickets. Start with one. Measure. Ship the next one when the data tells you to.