A new engineer joins on Monday. By Friday they need their first production-grade EKS cluster running so they can deploy the service they were hired to build. They open the company’s Terraform module. They find that the module references three input variables they have never heard of, that the README links to a Confluence page from 2024 that links to another Confluence page from 2023, and that one of the four cluster-creation guides says to use public subnets while another says to use private. By Wednesday they have a half-broken cluster, a stack of new IAM tickets pending platform-team review, and a worry that they are going to miss their first sprint.
This first-cluster experience is the universal complaint. Standing up a production-grade EKS, GKE, or AKS cluster is a 30-step Terraform module plus a 12-step IAM negotiation plus a 5-step VPC layout discussion. The accumulated knowledge is in 4 Confluence pages that disagree with each other. Three to five days of senior-engineer attention per cluster is the going rate, and the variance between teams who do it well and teams who do it badly is on the order of months of debugging time downstream.
ZopNight ships ZopDay, the provisioning wizard that collapses this into a guided 5-to-8 step flow. The output is a production-ready cluster (plus an optional managed datastore: RDS, Cloud SQL, or Azure SQL) with sensible defaults baked in. The senior engineer’s hard-won opinions about private subnets, flow logs, Graviton instances, and scoped IAM are the wizard’s defaults. The new engineer who just joined gets a cluster that looks like the cluster the senior engineer would have built, in 10 minutes instead of 3 days.
This post walks through what ZopDay asks (and what it deliberately does not ask), why the defaults are the actual product, how the handoff to ZopNight’s Live Kubernetes view works, and when the wizard is the wrong tool to reach for.
Why a new engineer takes 3 days to stand up a production EKS cluster
The first-cluster experience is the failure mode that motivates the wizard. Each step of the manual path adds time, expertise required, and risk that something is wrong in a way that does not surface until weeks later.
| Step in the manual path | Time cost | Failure mode if done wrong |
|---|---|---|
| Pick a region | 10 min | Cross-region calls become expensive (covered in the Atlas blog) |
| VPC + subnet layout | 2-4 hours | Nodes get public IPs, NAT gateway misconfigured |
| Node groups + autoscaler | 2-3 hours | Wrong instance family, autoscaler not wired |
| IAM roles for nodes, ALB, ingress | 4-8 hours | Roles too broad (cluster-admin everywhere) |
| Log forwarding + retention | 1-2 hours | Logs in two places, no retention policy |
| Managed datastore + private networking | 3-5 hours | Cross-VPC plumbing breaks, public IP exposed |
| Default add-ons (CSI, ALB controller, KEDA) | 2-4 hours | Wrong versions, no upgrade path |
| Documentation that the next engineer reads | 1-3 hours | Future engineer skips the doc, repeats the mistakes |
Total: 15 to 30 hours of senior-engineer time for the first cluster, less for subsequent clusters but still measured in days. ZopDay collapses each step into either “wizard handles it with the right default” or “wizard asks exactly the question that varies.”
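To make the time costs concrete, here is a minimal sketch of what just the VPC + subnet row of that table looks like when written by hand. The resource names, CIDRs, and availability zone are illustrative, not ZopDay's output, and the omitted plumbing is where the hours go:

```hcl
# Sketch of the manual VPC + subnet step (illustrative names and CIDRs).
resource "aws_vpc" "cluster" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

# Private subnet for nodes: no public IPs on launch.
resource "aws_subnet" "nodes_a" {
  vpc_id                  = aws_vpc.cluster.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = false
}

# Public subnet reserved for the load balancer only.
resource "aws_subnet" "lb_a" {
  vpc_id                  = aws_vpc.cluster.id
  cidr_block              = "10.0.101.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
}

# ...plus an internet gateway, a NAT gateway per AZ, route tables and
# associations, and the same again for at least one more AZ. Getting
# map_public_ip_on_launch wrong here is exactly the "nodes get public
# IPs" failure mode in the table above.
```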
What ZopDay actually does
The wizard is 5 to 8 steps depending on the cloud. The steps map to the questions where the answer actually varies between deployments. Everywhere else, the wizard applies a default.
| Step | EKS | GKE | AKS | Notes |
|---|---|---|---|---|
| Cluster name | yes | yes | yes | Slug-style identifier |
| Region | yes | yes | yes | Determines VPC region, datastore region |
| Network CIDR | optional | optional | optional | Default is 10.0.0.0/16 if not specified |
| Kubernetes version | yes (default = latest stable) | yes | yes | Pinned versions for predictability |
| Node group config | yes (instance type, min/max) | yes | yes | Graviton default on EKS, Tau on GKE |
| Managed datastore? | yes (RDS optional) | yes (Cloud SQL optional) | yes (Azure SQL optional) | Skip if external datastore |
| Datastore engine + size | conditional | conditional | conditional | Only if datastore selected |
| Confirm + provision | yes | yes | yes | Shows the named-step provisioning job |
The questions the wizard does not ask are the questions where the right answer is the same for 90% of teams: NAT gateway topology, log retention policy, default IAM role for the kubelet, default add-on versions, VPC flow logs on/off. Each of these is a decision the platform team would have argued about for half an hour and then settled on the same answer everyone else lands on. The wizard makes the same call and saves the half-hour.
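One hypothetical way to picture how small the wizard's input surface is: written down as a tfvars file (names and values illustrative, not ZopDay's actual schema), the full set of EKS answers fits in about a dozen lines. Everything not listed is a default the wizard supplies.

```hcl
# Hypothetical tfvars rendering of the wizard's full EKS input surface.
cluster_name       = "payments-prod"
region             = "eu-west-1"
network_cidr       = "10.0.0.0/16"    # optional; this is also the default
kubernetes_version = "1.31"           # default = latest stable, then pinned
node_instance_type = "m7g.large"      # Graviton default on EKS
node_group_min     = 2
node_group_max     = 10
datastore          = "rds"            # or omit if the datastore is external
datastore_engine   = "postgres"       # only asked if a datastore is selected
datastore_size     = "db.r7g.large"
```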
The opinionated defaults that are the actual product
The defaults are where the platform-team value lives. Each one is a decision the wizard makes that the operator does not see and probably should not be making by hand. Together they amount to a substantial part of what “production-grade cluster” actually means.
| Default | What it does | Why it is the default |
|---|---|---|
| Private subnets for nodes, public for load balancer only | Nodes have no inbound internet path; load balancer is the only public entry | This is the right answer for any cluster that does not need direct internet ingress to pods |
| VPC flow logs on, forwarded to centralised log group | Network traffic is auditable post-incident | The cost is 1-3% of cluster spend; the benefit on day 700 is enormous |
| Node instance family: Graviton (EKS) / Tau (GCP) / D-series-v5 (Azure) | Cheaper per core than each cloud's older general-purpose default | 15-20% cost cut by default; user can override if a specific family (e.g. x86) is required |
| Managed datastore on private IP only | No public endpoint, no IAM auth confusion | Eliminates the entire “datastore exposed to internet” risk class |
| IAM roles scoped to namespace, not cluster-admin | A service account in team-a cannot do things outside team-a | Removes a whole vector of blast-radius accidents |
| Default add-ons pinned to versions known to work together | CSI driver, ALB controller, KEDA versions tested as a set | Avoids the “upgraded one add-on and broke ingress” failure mode |
The Graviton default alone is a 15-20% compute cost cut versus equivalent x86 instances on EKS. The flow-logs-on default is the difference between being able to root-cause a network anomaly on day 200 versus being unable to. The private-IP-only datastore default is the difference between a compliant cluster and a finding on the next audit.
None of these defaults are advertised in the wizard’s user-facing copy. They are the work the wizard does on the operator’s behalf. The operator sees “Provision” and gets a cluster that has these properties.
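As a sketch of what those invisible defaults amount to in the generated module, here are two of them in Terraform form. The resource names and log-group path are illustrative rather than ZopDay's actual output, and the supporting IAM roles and cluster resource are elided:

```hcl
# VPC flow logs on by default, forwarded to a centralised log group.
resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/vpc/flow-logs/payments-prod" # illustrative naming
  retention_in_days = 90                             # a retention policy exists by default
}

resource "aws_flow_log" "cluster" {
  vpc_id               = aws_vpc.cluster.id
  traffic_type         = "ALL"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = aws_iam_role.flow_logs.arn # elided for brevity
}

# Graviton node group on private subnets by default; overridable.
resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.cluster.name # elided for brevity
  node_group_name = "default"
  node_role_arn   = aws_iam_role.nodes.arn       # scoped role, elided
  subnet_ids      = [aws_subnet.nodes_a.id]      # private subnets only
  instance_types  = ["m7g.large"]                # Graviton default

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 10
  }
}
```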
Provisioning timing and the Live Kubernetes handoff
Once the wizard’s final step is confirmed, the provisioning job runs. The job’s steps are named (per the step-named provisioning logs), so the operator watching the job sees “Create VPC”, “Create node group”, “Install ALB controller”, “Provision RDS instance”, “Wire datastore connection secret” rather than opaque snake_case identifiers.
| Cloud | Typical provisioning time | What the operator sees on completion |
|---|---|---|
| EKS | 8-14 minutes | Cluster page → Live Kubernetes view, kube-system pods running |
| GKE | 6-12 minutes | Same, with GKE’s default workloads visible |
| AKS | 14-20 minutes | Same, with AKS’s slower node-pool provisioning |
On completion the operator is dropped into the Live Kubernetes view: the 21 typed resource pages, the Crashloop Overview, the per-type drawers. The cluster they just provisioned is fully visible from the moment it boots, with the kube-system pods already enumerated and the warning event stream already running. There is no second product to install, no separate dashboard to configure.
The handoff is the answer to “what do I do now” right when the operator would otherwise be looking for it. The cluster page is the same page they will use to debug a CrashLoopBackOff on day 30, audit a service account on day 100, and right-size a node group on day 200.
ZopDay + ZopNight as Day 1 + Day 2
ZopDay is the speed-to-first-cluster product. ZopNight is the long-term operate-and-optimise product. They are not competitors and they are not separate purchases; they are the two halves of a single lifecycle.
The cluster ZopDay provisions appears in Atlas the moment it is up, with its region and resources plotted on the map. It appears in Cost Reports the moment the first cost record lands (usually within 6 hours). It is eligible for auto-remediation rules immediately, and it can carry visual schedules on its node groups from day one.
This composition is the architectural payoff. Tools that handle Day 1 but not Day 2 hand off to “now configure your monitoring product.” Tools that handle Day 2 but not Day 1 require the customer to have already built the cluster. ZopDay + ZopNight is the same UI, the same primitives, the same cluster model end-to-end.
The exported Terraform: wizard is a starter, not a cage
The wizard generates Terraform under the hood. The operator can view and export the Terraform from the cluster’s settings page. Subsequent reconciliation respects edits the operator makes to the exported module: ZopDay does not enforce that the wizard’s output is preserved.
This is the escape-hatch architecture. The wizard handles the 90% case. The 10% case (a special VPC peering setup, a non-default IAM trust relationship, a custom add-on version) is editable in the Terraform that the wizard wrote. The operator owns the artefact.
This matters because the alternative pattern (wizard that owns the cluster forever) loses customers the moment they hit a customisation the wizard does not expose. ZopDay’s contract is the inverse: the wizard is faster than writing the Terraform by hand, and the Terraform it writes is yours to modify. Customers who outgrow the wizard’s defaults do not have to migrate away; they edit the module and keep the cluster.
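As an illustration of that contract, suppose the team needs a VPC peering connection the wizard does not expose. They export the module and add the resources next to the wizard's own; the peer VPC ID and CIDR below are hypothetical:

```hcl
# Added by hand to the exported module: a peering connection the wizard
# does not expose. ZopDay's reconciliation leaves this edit in place.
resource "aws_vpc_peering_connection" "to_shared_services" {
  vpc_id      = aws_vpc.cluster.id      # the wizard-created VPC
  peer_vpc_id = "vpc-0a1b2c3d4e5f67890" # hypothetical pre-existing VPC
  auto_accept = true
}

resource "aws_route" "to_shared_services" {
  route_table_id            = aws_route_table.nodes.id # wizard-created table
  destination_cidr_block    = "10.50.0.0/16"           # hypothetical peer CIDR
  vpc_peering_connection_id = aws_vpc_peering_connection.to_shared_services.id
}
```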
How to use ZopDay day to day
The first-cluster workflow is the canonical case.
| Step | Action | Where |
|---|---|---|
| 1 | Sign in, open ZopDay | Sidebar → ZopDay |
| 2 | Pick cloud (AWS / GCP / Azure) | First wizard screen |
| 3 | Answer the 5-8 questions | Wizard screens |
| 4 | Confirm + provision | Final wizard screen |
| 5 | Watch the named-step job | Provisioning surface |
| 6 | Land on Live Kubernetes view | Auto-redirect on completion |
ZopDay is the right tool for: a fresh non-prod cluster, a fresh team’s first production cluster, a parallel cluster for a new region, a cluster for a proof-of-concept that might become production. The wizard is fast enough that “spin up a throwaway cluster for the day” is feasible, not a half-day commitment.
ZopDay is the wrong tool when the cluster needs to land inside a complex pre-existing VPC topology with specific peering requirements, when the team needs Kubernetes versions or add-on versions the wizard does not yet expose, or when the cluster is part of a multi-region high-availability setup that requires coordinated provisioning. In those cases the team writes Terraform directly, or uses ZopDay to generate a starter and then customises heavily.
The wizard is not the end of platform engineering. It is the end of the platform-engineering chore that nobody enjoys: the first-cluster setup that has been the same 30 steps for the last six years. Skip that chore. Spend the recovered time on the work that is actually unique to your company. That is the work the tooling exists to enable.