A product team announces a new feature on Tuesday at 10:00 AM Pacific. The marketing email goes out at 09:55. By 10:01 the load balancer is seeing 8 times its baseline request rate. The autoscaler is doing what autoscalers do: it has observed the load, computed that more capacity is needed, requested new nodes, and is waiting for those nodes to boot. By the time the nodes are ready (six minutes later), the spike has already peaked at a level the new capacity can handle, but the first five minutes of customer experience were five different shades of degraded. Half the email recipients who clicked through during those five minutes saw a 502 from the overwhelmed gateway. The launch is widely considered a half-success: the feature is good, but the launch hour was rough.
This is the canonical failure mode of reactive autoscaling on planned events. Autoscalers react; they do not anticipate. A 10x traffic spike that arrives in 60 seconds will outpace any reactive scaler, regardless of how well-tuned its target utilisation is. The fix is to pre-scale before the spike, which engineering teams know but rarely do because the pre-scale work is manual, error-prone, and full of “scale up what” questions that vary per resource type.
ZopNight ships Event Readiness, a wizard that automates the pre-scale. The operator picks the event window, the expected traffic multiplier, and the affected workloads. The wizard generates a plan, shows the cost delta, and (once approved) fires the scale-up before the event, validates the new capacity, then scales down after the event window closes. This post walks through what the wizard does, why three resource classes need three different pre-scale primitives, why lead time is the parameter operators most often get wrong, and how scale-down gating prevents the post-event capacity-overhang that has been the second-biggest cost waste on launch days for years.
Why reactive autoscalers fail on planned launches
The fundamental shape of a reactive autoscaler is a feedback loop: observe load, decide scale action, request capacity, wait for capacity, repeat. The cycle time is the floor on how fast the system can adapt. For a Kubernetes HPA + cluster autoscaler running on EKS, the cycle time is typically 4 to 10 minutes from “load spikes” to “new pods are serving traffic.” For RDS read replicas the cycle time is 8 to 15 minutes. For AWS service quotas the cycle time can be hours or days because the request goes to AWS support.
| Load curve | Reactive only | Pre-scaled |
|---|---|---|
| 10x spike over 60 seconds | First 5-10 minutes degraded | Full capacity from second 0 |
| 5x spike over 5 minutes | First 1-2 minutes degraded, then fine | Slightly over-provisioned |
| 3x spike over 30 minutes | Reactive handles fine | Modest waste from over-pre-scale |
| 2x spike over an hour | Reactive handles fine | Pre-scale unnecessary |
The matrix shows where pre-scaling pays off. For slow ramps (the last two rows), reactive autoscaling is fine. For sharp spikes (the first two rows), reactive autoscaling produces a degraded customer experience for the duration of one or two cycle times. The cost of pre-scaling 10 minutes early is hundreds of dollars; the cost of being undersized for 5 minutes of a launch is a degraded customer experience that takes weeks to recover from. The economics favour pre-scaling by a wide margin.
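To see how the cycle time turns into degraded minutes, here is a toy simulation of the first row of the matrix — a reactive scaler whose new capacity arrives one six-minute cycle after the spike is observed. The numbers are illustrative only; this is a back-of-the-envelope sketch, not ZopNight code.

```python
# Illustrative only: a reactive autoscaler whose capacity arrives one full
# cycle time after the load is observed, against a 10x spike over ~60 seconds.

CYCLE_TIME_MIN = 6          # observe -> decide -> request -> boot -> serving
BASELINE_RPS = 1_000
PEAK_RPS = 10_000           # 10x spike
SPIKE_RAMP_MIN = 1          # spike arrives over roughly one minute


def demand(minute: int) -> int:
    """Traffic in requests/sec at a given minute after the announcement."""
    ramp = min(1.0, (minute + 1) / SPIKE_RAMP_MIN)
    return int(BASELINE_RPS + ramp * (PEAK_RPS - BASELINE_RPS))


def reactive_capacity(minute: int) -> int:
    """New capacity only serves traffic one cycle time after the spike."""
    return PEAK_RPS if minute >= CYCLE_TIME_MIN else BASELINE_RPS


degraded = [m for m in range(15) if demand(m) > reactive_capacity(m)]
print(f"reactive: {len(degraded)} degraded minutes ({degraded})")
print("pre-scaled: 0 degraded minutes (capacity raised before the event)")
```

With a six-minute cycle time the reactive scaler leaves the first six minutes undersized, which is exactly the window the launch email drives the most clicks into.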
Pre-scaling is therefore the right default for any planned launch with a known traffic curve. The wizard’s job is to make pre-scaling a 5-minute setup rather than a multi-hour manual exercise per event.
What Event Readiness does
The wizard is a guided flow. The operator answers a few questions (event window, expected traffic multiplier, affected workloads), the wizard generates a plan, the operator approves the plan, the wizard executes.
The plan generation is deterministic: given the same inputs, the wizard generates the same plan. This matters for repeated events (weekly demos, monthly newsletters, quarterly launches). The operator can save a plan as a template and reuse it.
The plan is also dry-runnable. The operator can ask the wizard to “execute against last quarter’s launch data” to validate that the plan would have handled the actual traffic curve. The dry-run produces a confidence band (“the plan would have served the actual peak with 23% headroom”) without firing any real cloud actions.
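A sketch of what deterministic and dry-runnable imply in practice, assuming a hypothetical plan model — the field names, the fixed 15-minute lead time, and the `dry_run` logic below are illustrative, not ZopNight's actual schema. The plan is a pure function of the inputs, so the same inputs always produce the same plan (and the same saved template), and a dry-run simply replays a historical traffic peak against the planned capacity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical plan model -- field names are illustrative, not the product schema.

@dataclass(frozen=True)
class Action:
    resource: str         # e.g. "eks:web-app"
    target: int           # desired capacity after the scale-up
    lead_time: timedelta  # how far before the event to fire


@dataclass(frozen=True)
class Plan:
    event_start: datetime
    event_end: datetime
    actions: tuple[Action, ...]


def generate_plan(event_start: datetime, event_end: datetime,
                  multiplier: float, workloads: dict[str, int]) -> Plan:
    """Pure function of the inputs: same inputs -> same plan (and same template)."""
    actions = tuple(
        Action(resource=name,
               target=int(baseline * multiplier),
               lead_time=timedelta(minutes=15))
        for name, baseline in sorted(workloads.items())
    )
    return Plan(event_start, event_end, actions)


def dry_run(plan: Plan, historical_peak: dict[str, int]) -> dict[str, float]:
    """Replay a past traffic peak against the planned capacity; returns headroom per resource."""
    return {
        a.resource: (a.target - historical_peak.get(a.resource, 0)) / a.target
        for a in plan.actions
    }


plan = generate_plan(datetime(2025, 6, 3, 17, 0), datetime(2025, 6, 3, 19, 0),
                     multiplier=8.0, workloads={"eks:web-app": 20})
print(dry_run(plan, {"eks:web-app": 123}))   # {'eks:web-app': 0.231...} -> ~23% headroom
```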
Three resource classes, three pre-scale primitives
Not all resources scale the same way. Event Readiness handles three classes, each with its own pre-scale primitive and validation contract.
| Class | Example resources | Pre-scale primitive | Validation |
|---|---|---|---|
| Compute | EKS node groups, ECS services, Lambda provisioned concurrency, GKE node pools | Increase desired-count on the autoscaler / node group | Pods Ready, health checks passing |
| Datastores | RDS / Cloud SQL read replicas, DynamoDB read+write capacity, ElastiCache nodes | Provision new replicas / increase capacity units | Replica in available state, replication lag below threshold |
| Quotas | AWS service quotas (SES, Lambda concurrency, EC2 instance limits), GCP quotas, Azure quotas | Submit quota increase via support API or quota provider | Quota visible in account at new limit |
The compute class is the simplest because the primitives are well-defined and the scale-up is fast. The datastore class is more nuanced because the replica has to catch up on replication before it can serve queries; the validation contract waits for replication lag to drop below threshold before declaring the replica healthy. The quota class is the trickiest because the scale-up is asynchronous and slow (hours to days for some quotas), and the validation contract is simply “is the new quota visible yet.”
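For the AWS flavour of each class, the underlying calls look roughly like the following. This is a hedged sketch using boto3, not the wizard's implementation; the cluster, database, and instance-class identifiers are placeholders, and the quota code should be verified for your account and region.

```python
import boto3

eks = boto3.client("eks")
rds = boto3.client("rds")
quotas = boto3.client("service-quotas")

# Compute: raise the desired size of an EKS managed node group.
eks.update_nodegroup_config(
    clusterName="prod",                       # placeholder
    nodegroupName="web-app",                  # placeholder
    scalingConfig={"minSize": 20, "maxSize": 100, "desiredSize": 80},
)

# Datastore: provision an extra RDS read replica...
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="customer-db-launch-replica-1",   # placeholder
    SourceDBInstanceIdentifier="customer-db",
    DBInstanceClass="db.r6g.large",
)
# ...then validate: replica reaches "available", and the ReplicaLag CloudWatch
# metric drops below the configured threshold before it is declared healthy.
status = rds.describe_db_instances(
    DBInstanceIdentifier="customer-db-launch-replica-1"
)["DBInstances"][0]["DBInstanceStatus"]

# Quota: submit the increase early; validation is just "is the new value visible yet".
quotas.request_service_quota_increase(
    ServiceCode="lambda",
    QuotaCode="L-B99A9384",   # Lambda concurrent executions -- verify for your account
    DesiredValue=5000,
)
current = quotas.get_service_quota(
    ServiceCode="lambda", QuotaCode="L-B99A9384"
)["Quota"]["Value"]
```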
The wizard segregates the three classes in the plan view so the operator can see which actions are short, which are slow, and which depend on third-party approval. A quota increase that needs AWS support sign-off shows up as a separate timeline row with a “submit T-48h, ETA T-12h” entry, distinct from the compute and datastore actions that fire automatically.
Lead time per class: the parameter operators get wrong
Lead time is the single most important parameter in the plan, and the one operators most often underestimate. The wizard pre-fills sensible defaults per resource class and surfaces a blocking warning if the operator’s chosen lead time is shorter than the resource’s typical scale-up time.
| Resource | Typical lead time | Why |
|---|---|---|
| EKS node group (small) | 4-7 minutes | Node boot + kubelet ready + pod scheduling |
| EKS node group (large, 100+ nodes) | 8-15 minutes | AWS capacity availability checks per AZ |
| ECS service desired count up | 2-5 minutes | Task launch + healthcheck pass |
| Lambda provisioned concurrency | 1-3 minutes | Container warm-start |
| RDS read replica | 8-15 minutes | Snapshot, restore, replication catch-up |
| Cloud SQL read replica | 10-20 minutes | Same as RDS plus GCP-specific provisioning |
| DynamoDB capacity (RCU/WCU) | 5-30 minutes | Internal AWS capacity rebalancing |
| AWS service quota (Lambda concurrency) | 1-4 hours | Support team review |
| AWS service quota (SES sandbox exit) | 24-72 hours | Compliance / fraud review |
The “AWS service quota” entries are the ones most teams find out about painfully. A team plans a launch for 10 AM Tuesday, requests a quota increase Monday at 6 PM, and discovers Tuesday morning that the request is still under review. The wizard prevents this by comparing the operator’s chosen event start time against the resource’s typical lead time and rendering a blocking warning when the request is coming too late. “Your event starts in 14 hours; SES sandbox exit typically takes 48 to 72 hours” appears as a red banner in the plan view.
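The check itself is simple. A minimal sketch of the rule, with per-resource defaults mirroring the table above (the function name and the default values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Default lead times per resource class (mirrors the table above; illustrative values).
DEFAULT_LEAD_TIME = {
    "eks_nodegroup_small": timedelta(minutes=7),
    "rds_read_replica": timedelta(minutes=15),
    "aws_quota_lambda_concurrency": timedelta(hours=4),
    "aws_quota_ses_sandbox_exit": timedelta(hours=72),
}


def lead_time_warning(resource: str, event_start: datetime,
                      now: datetime | None = None) -> str | None:
    """Return a blocking warning if the event starts sooner than the resource's
    typical scale-up time; None means the lead time is sufficient."""
    now = now or datetime.now(timezone.utc)
    needed = DEFAULT_LEAD_TIME[resource]
    available = event_start - now
    if available < needed:
        return (f"Your event starts in {available}; {resource} typically "
                f"needs {needed} of lead time.")
    return None
```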
The override is supported but explicit. An operator can override the wizard’s lead-time default if they know something the wizard does not (e.g., they have a pre-approved support contact at AWS who can expedite). The override is logged in the audit trail; if the event later goes wrong, the audit trail shows that the operator chose the shorter lead time despite the warning.
Gantt + cost preview before commit
The plan is rendered as a Gantt timeline before any action fires. Each row is an action, with a start time, a duration, an expected end time, and a cost delta.
| Time relative to event | Action | Duration | Cost delta |
|---|---|---|---|
| T-48 hours | Submit AWS quota increase for Lambda concurrency to 5000 | async | $0 (until used) |
| T-20 min | Provision 2x RDS read replicas for customer-db | 12 min | +$45 |
| T-15 min | Scale EKS node group web-app from 20 to 80 | 8 min | +$28 |
| T-12 min | Scale ECS service payments-api desired count from 4 to 16 | 3 min | +$14 |
| Event window | Higher capacity serves traffic | 2 hours | +$340 (event-hour spend) |
| T+10 min | Validate baseline metrics back to pre-event | (gate) | $0 |
| T+15 min | Scale EKS / ECS / RDS back to baseline | 5 min | $0 (resources removed) |
| | Total marginal cost | | +$427 |
The operator sees the total cost before committing. $427 for a 2-hour event that prevents 5 minutes of degradation is an easy yes; $4,000 for a 2-hour event that prevents 5 minutes of degradation is a harder conversation. The cost preview moves the decision to where it belongs: before the event, not as a surprise on the next bill.
The Gantt also surfaces dependencies. A pre-scaled EKS node group depends on the AWS quota being raised first; the wizard renders the dependency edge so the operator can see “if the quota does not land by T-30 min, the EKS scale-up fails.” This is the kind of dependency that ad-hoc pre-scale scripts get wrong every time.
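A minimal sketch of the pre-commit validation, assuming a hypothetical action model with a `depends_on` field and a per-action cost delta (illustrative names, not the wizard's schema): sum the deltas for the cost preview, and flag any action whose dependency is not expected to land before the action fires.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PlannedAction:
    name: str
    starts_at: datetime          # absolute fire time, e.g. event_start - lead time
    duration: timedelta
    cost_delta: float            # marginal dollars for this action
    depends_on: list[str] = field(default_factory=list)

    @property
    def ends_at(self) -> datetime:
        return self.starts_at + self.duration


def total_cost(actions: list[PlannedAction]) -> float:
    """The number shown in the cost preview before the operator commits."""
    return sum(a.cost_delta for a in actions)


def dependency_violations(actions: list[PlannedAction]) -> list[str]:
    """Flag any action whose dependency is not expected to finish before it fires,
    e.g. an EKS scale-up that needs a quota increase to land first."""
    by_name = {a.name: a for a in actions}
    return [
        f"{a.name} fires at {a.starts_at:%H:%M} but {dep} is not expected "
        f"to finish until {by_name[dep].ends_at:%H:%M}"
        for a in actions
        for dep in a.depends_on
        if by_name[dep].ends_at > a.starts_at
    ]
```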
Scale-down gating: the second hard problem
Forgotten scale-downs are the second-biggest cost waste on launch days. A cluster pre-scaled to 80 nodes for a 2-hour event runs at 80 nodes for the next 7 days because the infrastructure team got pulled into post-launch firefighting and forgot to scale back. The cost difference between “scaled for 2 hours” and “scaled for 7 days” is roughly 84x; the over-scaled week often costs more than the event-day pre-scale itself.
Event Readiness automates the scale-down at the configured event end time, but with two gates that prevent the scale-down from breaking a still-busy system.
| Gate | What it checks | What it protects against |
|---|---|---|
| Queue depth back to baseline | Application queue length, message-broker backlog | Scaling down while traffic is still draining |
| p99 latency back to pre-event | Application p99 across the affected services | Scaling down while user experience is still degraded |
| Manual hold | Operator override | “Traffic is high, hold the scale-down” |
| Hard cap | Maximum extension window (default: 2x event duration) | Indefinite hold if metrics never recover |
The gates run automatically; the operator does not need to babysit the scale-down. If queue depth and p99 are back to baseline at the event end time, the scale-down fires. If not, the wizard holds the scaled-up capacity until both gates are satisfied, up to the hard-cap window (after which the scale-down fires anyway with a warning).
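In rough pseudocode the gate loop looks like this. It is a hedged sketch: the `read_queue_depth`, `read_p99`, `scale_down`, and `manual_hold` callables stand in for whatever observability backend and scaling action the gates are wired to.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable

HARD_CAP_MULTIPLIER = 2          # default: extend at most 2x the event duration
CHECK_INTERVAL_SECONDS = 60


def gated_scale_down(event_end: datetime, event_duration: timedelta,
                     baseline_queue_depth: float, baseline_p99_ms: float,
                     read_queue_depth: Callable[[], float],
                     read_p99: Callable[[], float],
                     scale_down: Callable[[], None],
                     manual_hold: Callable[[], bool]) -> None:
    """Fire the scale-down at event end, but only once queue depth and p99 are
    back to baseline (or the hard-cap window expires, whichever comes first)."""
    hard_cap = event_end + HARD_CAP_MULTIPLIER * event_duration
    while datetime.now(timezone.utc) < hard_cap:
        queue_ok = read_queue_depth() <= baseline_queue_depth
        latency_ok = read_p99() <= baseline_p99_ms
        if queue_ok and latency_ok and not manual_hold():
            scale_down()
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
    # Hard cap reached: fire anyway and flag it in the post-event report.
    scale_down()
```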
The post-event report shows the actual scale-down time and the gates that fired or held. A team that consistently sees “scale-down extended by 23 minutes after the event window” knows to schedule the next event with a longer window, or to investigate why traffic is taking longer to drain than expected.
How to use it day to day
The setup workflow is short.
| Step | Action | Where |
|---|---|---|
| 1 | Open Event Readiness | Sidebar → Event Readiness |
| 2 | Pick a new event | “New event” button |
| 3 | Enter event window, multiplier, affected workloads | Wizard screens |
| 4 | Review the generated plan + cost | Gantt view |
| 5 | Address any blocking warnings (lead times too short) | Plan editor |
| 6 | Save the plan | Save button |
| 7 | Plan fires automatically at the right times | No babysitting needed |
| 8 | Review post-event report | Same page |
For repeated events (weekly demos, monthly newsletters) the operator saves the plan as a template. A subsequent event uses the template; the only inputs that change are the date and (sometimes) the multiplier.
For ad-hoc small events (an internal demo for 20 engineers) the wizard is overkill — manual scaling or just leaving the autoscaler reactive is fine. The wizard pays off for events where the traffic spike is fast (under a few minutes), the customer-facing impact of degradation is high, or both. Most teams find the threshold sits around “any external launch with public marketing tied to a specific time.”
What ZopNight does not yet ship: multi-event scheduling (overlapping events with combined plans), predicted multipliers from historical data (auto-suggest based on past similar events instead of asking the operator), and cross-cloud events that hit both AWS and GCP simultaneously. Each is a future direction; the current deliverable is the single-event, single-cloud pre-scale wizard.
Reactive autoscalers were the right answer for the steady-traffic web of the 2010s. Planned launches with 10x spikes in 60 seconds are a different problem class that needs a different primitive. Pre-scale the right amount of the right resources at the right lead time. Review the cost delta before you commit. Let the gated scale-down handle the post-event teardown. That is what the wizard is for.


