Event Readiness: A Wizard for Pre-Scaling Infrastructure Before a Launch

By Muskan Bandta
Published: May 12, 2026 · 10 min read

A product team announces a new feature on Tuesday at 10:00 AM Pacific. The marketing email goes out at 09:55. By 10:01 the load balancer is seeing 8 times its baseline request rate. The autoscaler is doing what autoscalers do: it has observed the load, computed that more capacity is needed, requested new nodes, and is waiting for those nodes to boot. By the time the nodes are ready (six minutes later), the announcement-driven spike has crested and the new capacity can absorb what remains, but the first five minutes of customer experience were five different shades of degraded. Half the email recipients who clicked through during those five minutes saw a 502 from the overwhelmed gateway. The launch is widely considered a half-success: the feature is good, but the launch hour was rough.

This is the canonical failure mode of reactive autoscaling on planned events. Autoscalers react; they do not anticipate. A 10x traffic spike that arrives in 60 seconds will outpace any reactive scaler, regardless of how well-tuned its target utilisation is. The fix is to pre-scale before the spike, which engineering teams know but rarely do because the pre-scale work is manual, error-prone, and full of “scale up what” questions that vary per resource type.

ZopNight ships Event Readiness, a wizard that automates the pre-scale. The operator picks the event window, the expected traffic multiplier, and the affected workloads. The wizard generates a plan, shows the cost delta, and (once approved) fires the scale-up before the event, validates the new capacity, then scales down after the event window closes. This post walks through what the wizard does, why three resource classes need three different pre-scale primitives, why lead time is the parameter operators most often get wrong, and how scale-down gating prevents the post-event capacity-overhang that has been the second-biggest cost waste on launch days for years.

Why reactive autoscalers fail on planned launches

The fundamental shape of a reactive autoscaler is a feedback loop: observe load, decide scale action, request capacity, wait for capacity, repeat. The cycle time is the floor on how fast the system can adapt. For a Kubernetes HPA + cluster autoscaler running on EKS, the cycle time is typically 4 to 10 minutes from “load spikes” to “new pods are serving traffic.” For RDS read replicas the cycle time is 8 to 15 minutes. For AWS service quotas the cycle time can be hours or days because the request goes to AWS support.
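The cycle-time floor can be made concrete with a toy minute-by-minute simulation. This is an illustrative sketch, not ZopNight code: the delay, capacities, and load curve below are assumptions chosen to match the shapes discussed in this post.

```python
# Toy model: load vs. serving capacity, minute by minute.
# All numbers are illustrative assumptions, not measurements.

PROVISION_DELAY_MIN = 6   # minutes from "scale requested" to "capacity serving"
BASELINE_CAPACITY = 100   # requests/sec the baseline fleet can serve

def simulate(load_by_minute, pre_scaled_to=None):
    """Return capacity per minute. Reactive: capacity tracks load with a
    PROVISION_DELAY_MIN lag. Pre-scaled: capacity is fixed at the target."""
    capacity = []
    for t in range(len(load_by_minute)):
        if pre_scaled_to is not None:
            capacity.append(pre_scaled_to)
        else:
            # A reactive scaler can only act on load it observed
            # PROVISION_DELAY_MIN minutes ago.
            seen = load_by_minute[max(0, t - PROVISION_DELAY_MIN)]
            capacity.append(max(BASELINE_CAPACITY, seen))
    return capacity

# A 10x spike arriving within one minute, then sustained.
load = [100, 1000, 1000, 1000, 1000, 1000, 1000, 1000]

reactive = simulate(load)
prescaled = simulate(load, pre_scaled_to=1000)

reactive_degraded = [t for t, (l, c) in enumerate(zip(load, reactive)) if l > c]
prescaled_degraded = [t for t, (l, c) in enumerate(zip(load, prescaled)) if l > c]
print(f"reactive degraded minutes:   {reactive_degraded}")
print(f"pre-scaled degraded minutes: {prescaled_degraded}")
```

The reactive fleet runs undersized for the full provisioning delay; the pre-scaled fleet never does. That gap is exactly the "first 5-10 minutes degraded" cell in the matrix below.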

| Load curve | Reactive only | Pre-scaled |
| --- | --- | --- |
| 10x spike over 60 seconds | First 5-10 minutes degraded | Full capacity from second 0 |
| 5x spike over 5 minutes | First 1-2 minutes degraded, then fine | Slightly over-provisioned |
| 3x spike over 30 minutes | Reactive handles fine | Modest waste from over-pre-scale |
| 2x spike over an hour | Reactive handles fine | Pre-scale unnecessary |

The matrix shows where pre-scaling pays off. For slow ramps (the last two rows), reactive autoscaling is fine. For sharp spikes (the first two rows), reactive autoscaling produces a degraded customer experience for the duration of one or two cycle times. The cost of pre-scaling 10 minutes early is hundreds of dollars; the cost of being undersized for 5 minutes of a launch is a degraded customer experience that takes weeks to recover from. The economics favour pre-scaling by a wide margin.

Pre-scaling is therefore the right default for any planned launch with a known traffic curve. The wizard’s job is to make pre-scaling a 5-minute setup rather than a multi-hour manual exercise per event.

What Event Readiness does

The wizard is a guided flow. The operator answers four questions, the wizard generates a plan, the operator approves the plan, the wizard executes.

(Diagram 1: the wizard flow — four operator questions → generated plan → approval → execution)

The plan generation is deterministic: given the same inputs, the wizard generates the same plan. This matters for repeated events (weekly demos, monthly newsletters, quarterly launches). The operator can save a plan as a template and reuse it.
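Determinism falls out naturally if plan generation is a pure function of the inputs. The sketch below illustrates that property; the data model, field names, and the 20%-headroom sizing rule are all assumptions for illustration, not ZopNight's actual logic.

```python
import math
from dataclasses import dataclass

# Hypothetical sketch: a plan is a pure function of the event inputs, so
# the same saved template always replays into the same actions.

@dataclass(frozen=True)
class EventInput:
    window_start: str     # ISO timestamp of the event start
    window_hours: float
    multiplier: float     # expected traffic multiple over baseline
    workloads: tuple      # (name, baseline_replicas) pairs

def generate_plan(event: EventInput):
    # Illustrative sizing rule: baseline * multiplier plus 20% headroom.
    return [
        {"workload": name, "from": baseline,
         "to": math.ceil(baseline * event.multiplier * 1.2)}
        for name, baseline in event.workloads
    ]

inputs = EventInput("2026-05-12T10:00:00-07:00", 2.0, 10.0,
                    (("web-app", 20), ("payments-api", 4)))
assert generate_plan(inputs) == generate_plan(inputs)  # same inputs, same plan
print(generate_plan(inputs)[0])  # {'workload': 'web-app', 'from': 20, 'to': 240}
```

Because nothing in `generate_plan` reads the clock, the environment, or live cloud state, a template diff between two events is a diff of their inputs.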

The plan is also dry-runnable. The operator can ask the wizard to “execute against last quarter’s launch data” to validate that the plan would have handled the actual traffic curve. The dry-run produces a confidence band (“the plan would have served the actual peak with 23% headroom”) without firing any real cloud actions.
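The headroom figure in the dry-run can be computed without any cloud calls. A minimal sketch of that arithmetic, with the formula and the traffic numbers assumed (they are chosen to reproduce the 23% example above, not taken from a real launch):

```python
# Sketch of the dry-run headroom check: replay the plan's capacity against
# a historical traffic curve. Formula assumed: (capacity - peak) / peak.

def dry_run_headroom(planned_capacity_rps, historical_rps):
    """Headroom at the historical peak, as a fraction.
    Negative headroom means the plan would have been undersized."""
    peak = max(historical_rps)
    return (planned_capacity_rps - peak) / peak

# Hypothetical per-minute request rates around last quarter's announcement.
last_launch = [120, 140, 980, 1150, 1220, 1100, 900, 600]

headroom = dry_run_headroom(planned_capacity_rps=1500, historical_rps=last_launch)
print(f"plan would have served the actual peak with {headroom:.0%} headroom")
```

A negative result is the valuable case: it tells the operator the plan is undersized while it is still a plan, not an outage.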

Three resource classes, three pre-scale primitives

Not all resources scale the same way. Event Readiness handles three classes, each with its own pre-scale primitive and validation contract.

| Class | Example resources | Pre-scale primitive | Validation |
| --- | --- | --- | --- |
| Compute | EKS node groups, ECS services, Lambda provisioned concurrency, GKE node pools | Increase desired count on the autoscaler / node group | Pods Ready, health checks passing |
| Datastores | RDS / Cloud SQL read replicas, DynamoDB read+write capacity, ElastiCache nodes | Provision new replicas / increase capacity units | Replica in available state, replication lag below threshold |
| Quotas | AWS service quotas (SES, Lambda concurrency, EC2 instance limits), GCP quotas, Azure quotas | Submit quota increase via support API or quota provider | Quota visible in account at new limit |

The compute class is the simplest because the primitives are well-defined and the scale-up is fast. The datastore class is more nuanced because the replica has to catch up on replication before it can serve queries; the validation contract waits for replication lag to drop below threshold before declaring the replica healthy. The quota class is the trickiest because the scale-up is asynchronous and slow (hours to days for some quotas), and the validation contract is simply “is the new quota visible yet.”

The wizard segregates the three classes in the plan view so the operator can see which actions are short, which are slow, and which depend on third-party approval. A quota increase that needs AWS support sign-off shows up as a separate timeline row with a “submit T-48h, ETA T-12h” entry, distinct from the compute and datastore actions that fire automatically.
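The three validation contracts can be sketched as three different readiness predicates. The class and field names below mirror the table above but are otherwise hypothetical; ZopNight's actual data model is not shown here.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three validation contracts. Each answers
# "is the scaled-up resource actually usable yet?" differently.

@dataclass
class ComputeStatus:
    pods_ready: int
    pods_desired: int
    health_checks_passing: bool
    def is_ready(self):
        return self.pods_ready >= self.pods_desired and self.health_checks_passing

@dataclass
class DatastoreStatus:
    replica_state: str          # e.g. "creating" | "available"
    replication_lag_s: float
    max_lag_s: float = 5.0
    def is_ready(self):
        # A replica that exists but is still catching up is NOT healthy yet.
        return (self.replica_state == "available"
                and self.replication_lag_s < self.max_lag_s)

@dataclass
class QuotaStatus:
    visible_limit: int
    requested_limit: int
    def is_ready(self):
        # Quotas are async: the only observable signal is the new limit landing.
        return self.visible_limit >= self.requested_limit

print(ComputeStatus(80, 80, True).is_ready())                # True
print(DatastoreStatus("available", 42.0).is_ready())         # False: lag too high
print(QuotaStatus(1000, requested_limit=5000).is_ready())    # False: not landed yet
```

The datastore case shows why "the resource exists" is the wrong readiness signal: an available replica with 42 seconds of lag would serve stale reads if traffic hit it now.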

Lead time per class: the parameter operators get wrong

Lead time is the single most important parameter in the plan, and the one operators most often underestimate. The wizard pre-fills sensible defaults per resource class and surfaces a blocking warning if the operator’s chosen lead time is shorter than the resource’s typical scale-up time.

| Resource | Typical lead time | Why |
| --- | --- | --- |
| EKS node group (small) | 4-7 minutes | Node boot + kubelet ready + pod scheduling |
| EKS node group (large, 100+ nodes) | 8-15 minutes | AWS capacity availability checks per AZ |
| ECS service desired count up | 2-5 minutes | Task launch + health check pass |
| Lambda provisioned concurrency | 1-3 minutes | Container warm-start |
| RDS read replica | 8-15 minutes | Snapshot, restore, replication catch-up |
| Cloud SQL read replica | 10-20 minutes | Same as RDS plus GCP-specific provisioning |
| DynamoDB capacity (RCU/WCU) | 5-30 minutes | Internal AWS capacity rebalancing |
| AWS service quota (Lambda concurrency) | 1-4 hours | Support team review |
| AWS service quota (SES sandbox exit) | 24-72 hours | Compliance / fraud review |

The “AWS service quota” entries are the ones most teams find out about painfully. A team plans a launch for 10 AM Tuesday, requests a quota increase Monday at 6 PM, and discovers Tuesday morning that the request is still under review. The wizard prevents this by comparing the operator’s chosen event start time against the resource’s typical lead time and rendering a blocking warning if the request is already too late. “Your event starts in 14 hours; SES sandbox exit typically takes 48 to 72 hours” appears as a red banner in the plan view.

The override is supported but explicit. An operator can override the wizard’s lead-time default if they know something the wizard does not (e.g., they have a pre-approved support contact at AWS who can expedite). The override is logged in the audit trail; if the event later goes wrong, the audit trail shows that the operator chose the shorter lead time despite the warning.

Gantt + cost preview before commit

The plan is rendered as a Gantt timeline before any action fires. Each row is an action, with a start time, a duration, an expected end time, and a cost delta.

| Time relative to event | Action | Duration | Cost delta |
| --- | --- | --- | --- |
| T-15 min | Scale EKS node group web-app from 20 to 80 | 8 min | +$28 |
| T-12 min | Scale ECS service payments-api desired count from 4 to 16 | 3 min | +$14 |
| T-20 min | Provision 2x RDS read replicas for customer-db | 12 min | +$45 |
| T-48 hours | Submit AWS quota increase for Lambda concurrency to 5000 | async | $0 (until used) |
| Event window | Higher capacity serves traffic | 2 hours | +$340 (event-hour spend) |
| T+10 min | Validate baseline metrics back to pre-event | (gate) | $0 |
| T+15 min | Scale EKS / ECS / RDS back to baseline | 5 min | $0 (resources removed) |
| **Total marginal cost** | | | **$427** |

The operator sees the total cost before committing. $427 for a 2-hour event that prevents 5 minutes of degradation is an easy yes; $4,000 for a 2-hour event that prevents 5 minutes of degradation is a harder conversation. The cost preview moves the decision to where it belongs: before the event, not as a surprise on the next bill.

The Gantt also surfaces dependencies. A pre-scaled EKS node group depends on the AWS quota being raised first; the wizard renders the dependency edge so the operator can see “if the quota does not land by T-30 min, the EKS scale-up fails.” This is the kind of dependency that ad-hoc pre-scale scripts get wrong every time.
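The dependency check behind that edge is a comparison between the upstream action's expected landing time and the downstream action's start. A sketch under assumptions (the action names mirror the example above; the data model and the 30-minute slack are hypothetical):

```python
# Sketch of the Gantt dependency check. An edge (upstream, downstream) is
# violated if the upstream's ETA lands after the downstream's start minus
# a slack margin. Times are minutes relative to the event (negative = before).

actions = {
    "quota-lambda-5000":  {"start": -2880, "eta": -720},  # submit T-48h, ETA T-12h
    "eks-scale-web-app":  {"start": -15,   "eta": -7},
}
edges = [("quota-lambda-5000", "eks-scale-web-app")]

def violated_edges(actions, edges, slack_min=30):
    """Flag edges whose upstream may not land slack_min before downstream."""
    return [
        (up, down)
        for up, down in edges
        if actions[up]["eta"] > actions[down]["start"] - slack_min
    ]

ok_before = violated_edges(actions, edges)
print(ok_before)   # []: quota ETA of T-12h clears the T-45min cutoff

# If the quota were submitted late (ETA T-20 min), the edge trips:
actions["quota-lambda-5000"]["eta"] = -20
tripped = violated_edges(actions, edges)
print(tripped)     # [('quota-lambda-5000', 'eks-scale-web-app')]
```

Encoding the dependency as data is what lets the wizard render it; ad-hoc scripts typically encode it only as the order of two shell commands, which says nothing about whether the first one has actually landed.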

Scale-down gating: the second hard problem

Forgotten scale-downs are the second-biggest cost waste on launch days. A cluster pre-scaled to 80 nodes for a 2-hour event runs at 80 nodes for the next 7 days because the infrastructure team got pulled into post-launch firefighting and forgot to scale back. The cost difference between “scaled for 2 hours” and “scaled for 7 days” is roughly 84x; the over-scaled week often costs more than the event-day pre-scale itself.

Event Readiness automates the scale-down at the configured event end time, but with two gates that prevent the scale-down from breaking a still-busy system.

| Gate | What it checks | What it protects against |
| --- | --- | --- |
| Queue depth back to baseline | Application queue length, message-broker backlog | Scaling down while traffic is still draining |
| p99 latency back to pre-event | Application p99 across the affected services | Scaling down while user experience is still degraded |
| Manual hold | Operator override | “Traffic is high, hold the scale-down” |
| Hard cap | Maximum extension window (default: 2x event duration) | Indefinite hold if metrics never recover |

The gates run automatically; the operator does not need to babysit the scale-down. If queue depth and p99 are back to baseline at the event end time, the scale-down fires. If not, the scale-down extends the scaled-up state until both gates are satisfied, up to the hard-cap window (after which it fires anyway with a warning).
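The gate logic above reduces to a small decision function. A minimal sketch, with metric names, thresholds, and the return values all illustrative assumptions:

```python
# Sketch of the scale-down gate decision. The scale-down fires when both
# metric gates pass and no manual hold is set, or unconditionally once
# the hard cap (2x the event duration by default) is exceeded.

def scale_down_decision(queue_depth, baseline_queue_depth,
                        p99_ms, pre_event_p99_ms,
                        manual_hold, minutes_past_end, event_duration_min):
    hard_cap_min = 2 * event_duration_min
    if minutes_past_end >= hard_cap_min:
        return "fire-with-warning"   # hard cap: fire even if gates still hold
    gates_ok = (queue_depth <= baseline_queue_depth
                and p99_ms <= pre_event_p99_ms)
    if gates_ok and not manual_hold:
        return "fire"
    return "hold"

# Traffic still draining 10 minutes after a 120-minute event: hold.
print(scale_down_decision(5400, 200, 310, 180, False, 10, 120))
# Metrics back to baseline 25 minutes after: fire.
print(scale_down_decision(180, 200, 170, 180, False, 25, 120))
# Gates never recover and the 2x window is exceeded: fire anyway, warn.
print(scale_down_decision(5400, 200, 310, 180, False, 240, 120))
```

The hard cap is deliberately checked first: a hold that can outlast the cap would recreate exactly the forgotten-scale-down waste the gate exists to prevent.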

The post-event report shows the actual scale-down time and the gates that fired or held. A team that consistently sees “scale-down extended by 23 minutes after the event window” knows to schedule the next event with a longer window, or to investigate why traffic is taking longer to drain than expected.

How to use it day to day

The setup workflow is short.

| Step | Action | Where |
| --- | --- | --- |
| 1 | Open Event Readiness | Sidebar → Event Readiness |
| 2 | Pick a new event | “New event” button |
| 3 | Enter event window, multiplier, affected workloads | Wizard screens |
| 4 | Review the generated plan + cost | Gantt view |
| 5 | Address any blocking warnings (lead times too short) | Plan editor |
| 6 | Save the plan | Save button |
| 7 | Plan fires automatically at the right times | No babysitting needed |
| 8 | Review post-event report | Same page |

For repeated events (weekly demos, monthly newsletters) the operator saves the plan as a template. A subsequent event uses the template; the only inputs that change are the date and (sometimes) the multiplier.

For ad-hoc small events (an internal demo for 20 engineers) the wizard is overkill — manual scaling or just leaving the autoscaler reactive is fine. The wizard pays off for events where the traffic spike is fast (under a few minutes), the customer-facing impact of degradation is high, or both. Most teams find the threshold sits around “any external launch with public marketing tied to a specific time.”

What ZopNight does not yet ship: multi-event scheduling (overlapping events with combined plans), predicted multipliers from historical data (auto-suggest based on past similar events instead of asking the operator), and cross-cloud events that hit both AWS and GCP simultaneously. Each is a future direction; the current deliverable is the single-event, single-cloud pre-scale wizard.

Reactive autoscalers were the right answer for the steady-traffic web of the 2010s. Planned launches with 10x spikes in 60 seconds are a different problem class that needs a different primitive. Pre-scale the right amount of the right resources at the right lead time. Watch the cost delta before you commit. Let the gated scale-down handle the post-event teardown. That is the work the wizard does.

Written by Muskan Bandta, Engineer at Zop.Dev
