A new engineer joins on Monday. By Friday they need their first production-grade EKS cluster running so they can deploy the service they were hired to build. They open the company’s Terraform module. They find that the module references three input variables they have never heard of, that the README links to a Confluence page from 2024 that links to another Confluence page from 2023, and that one of the four cluster-creation guides says to use public subnets while another says to use private. By Wednesday they have a half-broken cluster, a stack of new IAM tickets pending platform-team review, and a worry that they are going to miss their first sprint.
This first-cluster experience is the universal complaint. Standing up a production-grade EKS, GKE, or AKS cluster is a 30-step Terraform module plus a 12-step IAM negotiation plus a 5-step VPC layout discussion. The accumulated knowledge is in 4 Confluence pages that disagree with each other. Three to five days of senior-engineer attention per cluster is the going rate, and the variance between teams who do it well and teams who do it badly is on the order of months of debugging time downstream.
ZopNight ships ZopDay, the provisioning wizard that collapses this into a guided 5-to-8 step flow. The output is a production-ready cluster (plus an optional managed datastore: RDS, Cloud SQL, or Azure SQL) with sensible defaults baked in. The senior engineer’s hard-won opinions about private subnets, flow logs, Graviton instances, and scoped IAM are the wizard’s defaults. The new engineer who just joined gets a cluster that looks like the cluster the senior engineer would have built, in 10 minutes instead of 3 days.
This post walks through what ZopDay asks (and what it deliberately does not ask), why the defaults are the actual product, how the handoff to ZopNight’s Live Kubernetes view works, and when the wizard is the wrong tool to reach for.
Why a new engineer takes 3 days to stand up a production EKS cluster
The first-cluster experience is the failure mode that motivates the wizard. Each step of the manual path adds time, expertise required, and risk that something is wrong in a way that does not surface until weeks later.
| Step in the manual path | Time cost | Failure mode if done wrong |
|---|---|---|
| Pick a region | 10 min | Cross-region calls become expensive (covered in the Atlas blog) |
| VPC + subnet layout | 2-4 hours | Nodes get public IPs, NAT gateway misconfigured |
| Node groups + autoscaler | 2-3 hours | Wrong instance family, autoscaler not wired |
| IAM roles for nodes, ALB, ingress | 4-8 hours | Roles too broad (cluster-admin everywhere) |
| Log forwarding + retention | 1-2 hours | Logs in two places, no retention policy |
| Managed datastore + private networking | 3-5 hours | Cross-VPC plumbing breaks, public IP exposed |
| Default add-ons (CSI, ALB controller, KEDA) | 2-4 hours | Wrong versions, no upgrade path |
| Documentation that the next engineer reads | 1-3 hours | Future engineer skips the doc, repeats the mistakes |
Total: 15 to 30 hours of senior-engineer time for the first cluster, less for subsequent clusters but still measured in days. ZopDay collapses each step into either “wizard handles it with the right default” or “wizard asks exactly the question that varies.”
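To make the time costs concrete, here is a minimal sketch of what just the VPC + subnet row of that table looks like when written by hand. The resource names, CIDRs, and availability zone are illustrative, not ZopDay's output, and the omitted plumbing is where the hours go:

```hcl
# Sketch of the manual VPC + subnet step (illustrative names and CIDRs).
resource "aws_vpc" "cluster" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Private subnet for nodes: no public IPs on launch.
resource "aws_subnet" "nodes_a" {
  vpc_id                  = aws_vpc.cluster.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = false
}

# Public subnet reserved for the load balancer only.
resource "aws_subnet" "lb_a" {
  vpc_id                  = aws_vpc.cluster.id
  cidr_block              = "10.0.101.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
}

# ...plus an internet gateway, a NAT gateway per AZ, route tables and
# associations, and the same again for at least one more AZ. Getting
# map_public_ip_on_launch wrong here is exactly the "nodes get public
# IPs" failure mode in the table above.
```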
What ZopDay actually does
The wizard is 5 to 8 steps depending on the cloud. The steps map to the questions where the answer actually varies between deployments. Everywhere else, the wizard applies a default.
| Step | EKS | GKE | AKS | Notes |
|---|---|---|---|---|
| Cluster name | yes | yes | yes | Slug-style identifier |
| Region | yes | yes | yes | Determines VPC region, datastore region |
| Network CIDR | optional | optional | optional | Default is 10.0.0.0/16 if not specified |
| Kubernetes version | yes (default = latest stable) | yes | yes | Pinned versions for predictability |
| Node group config | yes (instance type, min/max) | yes | yes | Graviton default on EKS, Tau on GKE |
| Managed datastore? | yes (RDS optional) | yes (Cloud SQL optional) | yes (Azure SQL optional) | Skip if external datastore |
| Datastore engine + size | conditional | conditional | conditional | Only if datastore selected |
| Confirm + provision | yes | yes | yes | Shows the named-step provisioning job |
The questions the wizard does not ask are the questions where the right answer is the same for 90% of teams: NAT gateway topology, log retention policy, default IAM role for the kubelet, default add-on versions, VPC flow logs on/off. Each of these is a decision the platform team would have argued about for half an hour and then settled on the same answer everyone else lands on. The wizard makes the same call and saves the half-hour.
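One hypothetical way to picture how small the wizard's input surface is: written down as a tfvars file (names and values illustrative, not ZopDay's actual schema), the full set of EKS answers fits in about a dozen lines. Everything not listed is a default the wizard supplies.

```hcl
# Hypothetical tfvars rendering of the wizard's full EKS input surface.
cluster_name       = "payments-prod"
region             = "eu-west-1"
network_cidr       = "10.0.0.0/16"    # optional; this is also the default
kubernetes_version = "1.31"           # default = latest stable, then pinned
node_instance_type = "m7g.large"      # Graviton default on EKS
node_group_min     = 2
node_group_max     = 10
datastore          = "rds"            # or omit if the datastore is external
datastore_engine   = "postgres"       # only asked if a datastore is selected
datastore_size     = "db.r7g.large"
```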
The opinionated defaults that are the actual product
The defaults are where the platform-team value lives. Each one is a decision the wizard makes that the operator does not see and probably should not be making by hand. Together they amount to a substantial part of what “production-grade cluster” actually means.
| Default | What it does | Why it is the default |
|---|---|---|
| Private subnets for nodes, public for load balancer only | Nodes have no inbound internet path; load balancer is the only public entry | This is the right answer for any cluster that does not need direct internet ingress to pods |
| VPC flow logs on, forwarded to centralised log group | Network traffic is auditable post-incident | The cost is 1-3% of cluster spend; the benefit on day 700 is enormous |
| Node instance family: Graviton (EKS) / Tau (GCP) / D-series-v5 (Azure) | Cheaper per core than each cloud's older general-purpose default | 15-20% cost cut by default; user can override if a specific family (e.g. x86) is required |
| Managed datastore on private IP only | No public endpoint, no IAM auth confusion | Eliminates the entire “datastore exposed to internet” risk class |
| IAM roles scoped to namespace, not cluster-admin | A service account in team-a cannot do things outside team-a | Removes a whole vector of blast-radius accidents |
| Default add-ons pinned to versions known to work together | CSI driver, ALB controller, KEDA versions tested as a set | Avoids the “upgraded one add-on and broke ingress” failure mode |
The Graviton default alone is a 15-20% compute cost cut versus equivalent x86 instances on EKS. The flow-logs-on default is the difference between being able to root-cause a network anomaly on day 200 versus being unable to. The private-IP-only datastore default is the difference between a compliant cluster and a finding on the next audit.
None of these defaults are advertised in the wizard’s user-facing copy. They are the work the wizard does on the operator’s behalf. The operator sees “Provision” and gets a cluster that has these properties.
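As a sketch of what those invisible defaults amount to in the generated module, here are two of them in Terraform form. The resource names and log-group path are illustrative rather than ZopDay's actual output, and the supporting IAM roles and cluster resource are elided:

```hcl
# VPC flow logs on by default, forwarded to a centralised log group.
resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/vpc/flow-logs/payments-prod" # illustrative naming
  retention_in_days = 90                             # a retention policy exists by default
}

resource "aws_flow_log" "cluster" {
  vpc_id               = aws_vpc.cluster.id
  traffic_type         = "ALL"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = aws_iam_role.flow_logs.arn # elided for brevity
}

# Graviton node group on private subnets by default; overridable.
resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.cluster.name # elided for brevity
  node_group_name = "default"
  node_role_arn   = aws_iam_role.nodes.arn       # scoped role, elided
  subnet_ids      = [aws_subnet.nodes_a.id]      # private subnets only
  instance_types  = ["m7g.large"]                # Graviton default

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 10
  }
}
```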
Provisioning timing and the Live Kubernetes handoff
Once the wizard’s final step is confirmed, the provisioning job runs. The job’s steps are named (per the step-named provisioning logs), so the operator watching the job sees “Create VPC”, “Create node group”, “Install ALB controller”, “Provision RDS instance”, “Wire datastore connection secret” rather than opaque snake_case identifiers.
| Cloud | Typical provisioning time | What the operator sees on completion |
|---|---|---|
| EKS | 8-14 minutes | Cluster page → Live Kubernetes view, kube-system pods running |
| GKE | 6-12 minutes | Same, with GKE’s default workloads visible |
| AKS | 14-20 minutes | Same, with AKS’s slower node-pool provisioning |
On completion the operator is dropped into the Live Kubernetes view: the 21 typed resource pages, the Crashloop Overview, the per-type drawers. The cluster they just provisioned is fully visible from the moment it boots, with the kube-system pods already enumerated and the warning event stream already running. There is no second product to install, no separate dashboard to configure.
The handoff is the answer to “what do I do now” right when the operator would otherwise be looking for it. The cluster page is the same page they will use to debug a CrashLoopBackOff on day 30, audit a service account on day 100, and right-size a node group on day 200.
ZopDay + ZopNight as Day 1 + Day 2
ZopDay is the speed-to-first-cluster product. ZopNight is the long-term operate-and-optimise product. They are not competitors and they are not separate purchases; they are the two halves of a single lifecycle.
The cluster ZopDay provisions appears in Atlas the moment it is up, with its region and resources plotted on the map. It appears in Cost Reports the moment the first cost record lands (usually within 6 hours). It is eligible for auto-remediation rules immediately, and it can carry visual schedules on its node groups from day one.
This composition is the architectural payoff. Tools that handle Day 1 but not Day 2 hand off to “now configure your monitoring product.” Tools that handle Day 2 but not Day 1 require the customer to have already built the cluster. ZopDay + ZopNight is the same UI, the same primitives, the same cluster model end-to-end.
The exported Terraform: wizard is a starter, not a cage
The wizard generates Terraform under the hood. The operator can view and export the Terraform from the cluster’s settings page. Subsequent reconciliation respects edits the operator makes to the exported module: ZopDay does not enforce that the wizard’s output is preserved.
This is the escape-hatch architecture. The wizard handles the 90% case. The 10% case (a special VPC peering setup, a non-default IAM trust relationship, a custom add-on version) is editable in the Terraform that the wizard wrote. The operator owns the artefact.
This matters because the alternative pattern (wizard that owns the cluster forever) loses customers the moment they hit a customisation the wizard does not expose. ZopDay’s contract is the inverse: the wizard is faster than writing the Terraform by hand, and the Terraform it writes is yours to modify. Customers who outgrow the wizard’s defaults do not have to migrate away; they edit the module and keep the cluster.
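As an illustration of that contract, suppose the team needs a VPC peering connection the wizard does not expose. They export the module and add the resources next to the wizard's own; the peer VPC ID and CIDR below are hypothetical:

```hcl
# Added by hand to the exported module: a peering connection the wizard
# does not expose. ZopDay's reconciliation leaves this edit in place.
resource "aws_vpc_peering_connection" "to_shared_services" {
  vpc_id      = aws_vpc.cluster.id      # the wizard-created VPC
  peer_vpc_id = "vpc-0a1b2c3d4e5f67890" # hypothetical pre-existing VPC
  auto_accept = true
}

resource "aws_route" "to_shared_services" {
  route_table_id            = aws_route_table.nodes.id # wizard-created table
  destination_cidr_block    = "10.50.0.0/16"           # hypothetical peer CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.to_shared_services.id
}
```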
How to use ZopDay day to day
The first-cluster workflow is the canonical case.
| Step | Action | Where |
|---|---|---|
| 1 | Sign in, open ZopDay | Sidebar → ZopDay |
| 2 | Pick cloud (AWS / GCP / Azure) | First wizard screen |
| 3 | Answer the 5-8 questions | Wizard screens |
| 4 | Confirm + provision | Final wizard screen |
| 5 | Watch the named-step job | Provisioning surface |
| 6 | Land on Live Kubernetes view | Auto-redirect on completion |
ZopDay is the right tool for: a fresh non-prod cluster, a fresh team’s first production cluster, a parallel cluster for a new region, a cluster for a proof-of-concept that might become production. The wizard is fast enough that “spin up a throwaway cluster for the day” is feasible, not a half-day commitment.
ZopDay is the wrong tool when the cluster needs to land inside a complex pre-existing VPC topology with specific peering requirements, when the team needs Kubernetes versions or add-on versions the wizard does not yet expose, or when the cluster is part of a multi-region high-availability setup that requires coordinated provisioning. In those cases the team writes Terraform directly, or uses ZopDay to generate a starter and then customises heavily.
The wizard is not the end of platform engineering. It is the end of the platform-engineering chore that nobody enjoys: the first-cluster setup that has been the same 30 steps for the last six years. Skip that chore. Spend the recovered time on the work that is actually unique to your company. That is the work the tooling exists to enable.