A product team announces a new feature on Tuesday at 10:00 AM Pacific. The marketing email goes out at 09:55. By 10:01 the load balancer is seeing 8 times its baseline request rate. The autoscaler is doing what autoscalers do: it has observed the load, computed that more capacity is needed, requested new nodes, and is waiting for those nodes to boot. By the time the nodes are ready (six minutes later), the spike has already peaked at a level the new capacity can handle, but the first five minutes of customer experience were five different shades of degraded. Half the email recipients who clicked through during those five minutes saw a 502 from the overwhelmed gateway. The launch is widely considered a half-success: the feature is good, but the launch hour was rough.
This is the canonical failure mode of reactive autoscaling on planned events. Autoscalers react; they do not anticipate. A 10x traffic spike that arrives in 60 seconds will outpace any reactive scaler, regardless of how well-tuned its target utilisation is. The fix is to pre-scale before the spike, which engineering teams know but rarely do because the pre-scale work is manual, error-prone, and full of “scale up what” questions that vary per resource type.
ZopNight ships Event Readiness, a wizard that automates the pre-scale. The operator picks the event window, the expected traffic multiplier, and the affected workloads. The wizard generates a plan, shows the cost delta, and (once approved) fires the scale-up before the event, validates the new capacity, then scales down after the event window closes. This post walks through what the wizard does, why three resource classes need three different pre-scale primitives, why lead time is the parameter operators most often get wrong, and how scale-down gating prevents the post-event capacity-overhang that has been the second-biggest cost waste on launch days for years.
Why reactive autoscalers fail on planned launches
The fundamental shape of a reactive autoscaler is a feedback loop: observe load, decide scale action, request capacity, wait for capacity, repeat. The cycle time is the floor on how fast the system can adapt. For a Kubernetes HPA + cluster autoscaler running on EKS, the cycle time is typically 4 to 10 minutes from “load spikes” to “new pods are serving traffic.” For RDS read replicas the cycle time is 8 to 15 minutes. For AWS service quotas the cycle time can be hours or days because the request goes to AWS support.
| Load curve | Reactive only | Pre-scaled |
|---|---|---|
| 10x spike over 60 seconds | First 5-10 minutes degraded | Full capacity from second 0 |
| 5x spike over 5 minutes | First 1-2 minutes degraded, then fine | Slightly over-provisioned |
| 3x spike over 30 minutes | Reactive handles fine | Modest waste from over-pre-scale |
| 2x spike over an hour | Reactive handles fine | Pre-scale unnecessary |
The matrix shows where pre-scaling pays off. For slow ramps (the last two rows), reactive autoscaling is fine. For sharp spikes (the first two rows), reactive autoscaling produces a degraded customer experience for the duration of one or two cycle times. The cost of pre-scaling 10 minutes early is hundreds of dollars; the cost of being undersized for 5 minutes of a launch is a degraded customer experience that takes weeks to recover from. The economics favour pre-scaling by a wide margin.
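To see how the cycle time turns into degraded minutes, here is a toy simulation of the first row of the matrix — a reactive scaler whose new capacity arrives one six-minute cycle after the spike is observed. The numbers are illustrative only; this is a back-of-the-envelope sketch, not ZopNight code.

```python
# Illustrative only: a reactive autoscaler whose capacity arrives one full
# cycle time after the load is observed, against a 10x spike over ~60 seconds.

CYCLE_TIME_MIN = 6          # observe -> decide -> request -> boot -> serving
BASELINE_RPS = 1_000
PEAK_RPS = 10_000           # 10x spike
SPIKE_RAMP_MIN = 1          # spike arrives over roughly one minute


def demand(minute: int) -> int:
    """Traffic in requests/sec at a given minute after the announcement."""
    ramp = min(1.0, (minute + 1) / SPIKE_RAMP_MIN)
    return int(BASELINE_RPS + ramp * (PEAK_RPS - BASELINE_RPS))


def reactive_capacity(minute: int) -> int:
    """New capacity only serves traffic one cycle time after the spike."""
    return PEAK_RPS if minute >= CYCLE_TIME_MIN else BASELINE_RPS


degraded = [m for m in range(15) if demand(m) > reactive_capacity(m)]
print(f"reactive: {len(degraded)} degraded minutes ({degraded})")
print("pre-scaled: 0 degraded minutes (capacity raised before the event)")
```

With a six-minute cycle time the reactive scaler leaves the first six minutes undersized, which is exactly the window the launch email drives the most clicks into.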
Pre-scaling is therefore the right default for any planned launch with a known traffic curve. The wizard’s job is to make pre-scaling a 5-minute setup rather than a multi-hour manual exercise per event.
What Event Readiness does
The wizard is a guided flow. The operator answers a few questions (event window, expected traffic multiplier, affected workloads), the wizard generates a plan, the operator approves the plan, the wizard executes.
The plan generation is deterministic: given the same inputs, the wizard generates the same plan. This matters for repeated events (weekly demos, monthly newsletters, quarterly launches). The operator can save a plan as a template and reuse it.
The plan is also dry-runnable. The operator can ask the wizard to “execute against last quarter’s launch data” to validate that the plan would have handled the actual traffic curve. The dry-run produces a confidence band (“the plan would have served the actual peak with 23% headroom”) without firing any real cloud actions.
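A sketch of what deterministic and dry-runnable imply in practice, assuming a hypothetical plan model — the field names, the fixed 15-minute lead time, and the `dry_run` logic below are illustrative, not ZopNight's actual schema. The plan is a pure function of the inputs, so the same inputs always produce the same plan (and the same saved template), and a dry-run simply replays a historical traffic peak against the planned capacity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical plan model -- field names are illustrative, not the product schema.

@dataclass(frozen=True)
class Action:
    resource: str         # e.g. "eks:web-app"
    target: int           # desired capacity after the scale-up
    lead_time: timedelta  # how far before the event to fire


@dataclass(frozen=True)
class Plan:
    event_start: datetime
    event_end: datetime
    actions: tuple[Action, ...]


def generate_plan(event_start: datetime, event_end: datetime,
                  multiplier: float, workloads: dict[str, int]) -> Plan:
    """Pure function of the inputs: same inputs -> same plan (and same template)."""
    actions = tuple(
        Action(resource=name,
               target=int(baseline * multiplier),
               lead_time=timedelta(minutes=15))
        for name, baseline in sorted(workloads.items())
    )
    return Plan(event_start, event_end, actions)


def dry_run(plan: Plan, historical_peak: dict[str, int]) -> dict[str, float]:
    """Replay a past traffic peak against the planned capacity; returns headroom per resource."""
    return {
        a.resource: (a.target - historical_peak.get(a.resource, 0)) / a.target
        for a in plan.actions
    }


plan = generate_plan(datetime(2025, 6, 3, 17, 0), datetime(2025, 6, 3, 19, 0),
                     multiplier=8.0, workloads={"eks:web-app": 20})
print(dry_run(plan, {"eks:web-app": 123}))   # {'eks:web-app': 0.231...} -> ~23% headroom
```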
Three resource classes, three pre-scale primitives
Not all resources scale the same way. Event Readiness handles three classes, each with its own pre-scale primitive and validation contract.
| Class | Example resources | Pre-scale primitive | Validation |
|---|---|---|---|
| Compute | EKS node groups, ECS services, Lambda provisioned concurrency, GKE node pools | Increase desired-count on the autoscaler / node group | Pods Ready, health checks passing |
| Datastores | RDS / Cloud SQL read replicas, DynamoDB read+write capacity, ElastiCache nodes | Provision new replicas / increase capacity units | Replica in available state, replication lag below threshold |
| Quotas | AWS service quotas (SES, Lambda concurrency, EC2 instance limits), GCP quotas, Azure quotas | Submit quota increase via support API or quota provider | Quota visible in account at new limit |
The compute class is the simplest because the primitives are well-defined and the scale-up is fast. The datastore class is more nuanced because the replica has to catch up on replication before it can serve queries; the validation contract waits for replication lag to drop below threshold before declaring the replica healthy. The quota class is the trickiest because the scale-up is asynchronous and slow (hours to days for some quotas), and the validation contract is simply “is the new quota visible yet.”
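For the AWS flavour of each class, the underlying calls look roughly like the following. This is a hedged sketch using boto3, not the wizard's implementation; the cluster, database, and instance-class identifiers are placeholders, and the quota code should be verified for your account and region.

```python
import boto3

eks = boto3.client("eks")
rds = boto3.client("rds")
quotas = boto3.client("service-quotas")

# Compute: raise the desired size of an EKS managed node group.
eks.update_nodegroup_config(
    clusterName="prod",                       # placeholder
    nodegroupName="web-app",                  # placeholder
    scalingConfig={"minSize": 20, "maxSize": 100, "desiredSize": 80},
)

# Datastore: provision an extra RDS read replica...
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="customer-db-launch-replica-1",   # placeholder
    SourceDBInstanceIdentifier="customer-db",
    DBInstanceClass="db.r6g.large",
)
# ...then validate: replica reaches "available", and the ReplicaLag CloudWatch
# metric drops below the configured threshold before it is declared healthy.
status = rds.describe_db_instances(
    DBInstanceIdentifier="customer-db-launch-replica-1"
)["DBInstances"][0]["DBInstanceStatus"]

# Quota: submit the increase early; validation is just "is the new value visible yet".
quotas.request_service_quota_increase(
    ServiceCode="lambda",
    QuotaCode="L-B99A9384",   # Lambda concurrent executions -- verify for your account
    DesiredValue=5000,
)
current = quotas.get_service_quota(
    ServiceCode="lambda", QuotaCode="L-B99A9384"
)["Quota"]["Value"]
```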
The wizard segregates the three classes in the plan view so the operator can see which actions are short, which are slow, and which depend on third-party approval. A quota increase that needs AWS support sign-off shows up as a separate timeline row with a “submit T-48h, ETA T-12h” entry, distinct from the compute and datastore actions that fire automatically.
Lead time per class: the parameter operators get wrong
Lead time is the single most important parameter in the plan, and the one operators most often underestimate. The wizard pre-fills sensible defaults per resource class and surfaces a blocking warning if the operator’s chosen lead time is shorter than the resource’s typical scale-up time.
| Resource | Typical lead time | Why |
|---|---|---|
| EKS node group (small) | 4-7 minutes | Node boot + kubelet ready + pod scheduling |
| EKS node group (large, 100+ nodes) | 8-15 minutes | AWS capacity availability checks per AZ |
| ECS service desired count up | 2-5 minutes | Task launch + healthcheck pass |
| Lambda provisioned concurrency | 1-3 minutes | Container warm-start |
| RDS read replica | 8-15 minutes | Snapshot, restore, replication catch-up |
| Cloud SQL read replica | 10-20 minutes | Same as RDS plus GCP-specific provisioning |
| DynamoDB capacity (RCU/WCU) | 5-30 minutes | Internal AWS capacity rebalancing |
| AWS service quota (Lambda concurrency) | 1-4 hours | Support team review |
| AWS service quota (SES sandbox exit) | 24-72 hours | Compliance / fraud review |
The “AWS service quota” entries are the ones most teams find out about painfully. A team plans a launch for 10 AM Tuesday, requests a quota increase Monday at 6 PM, and discovers Tuesday morning that the request is still under review. The wizard prevents this by comparing the operator’s chosen event start time against the resource’s typical lead time and rendering a blocking warning when the request is coming too late. “Your event starts in 14 hours; SES sandbox exit typically takes 48 to 72 hours” appears as a red banner in the plan view.
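The check itself is simple. A minimal sketch of the rule, with per-resource defaults mirroring the table above (the function name and the default values are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Default lead times per resource class (mirrors the table above; illustrative values).
DEFAULT_LEAD_TIME = {
    "eks_nodegroup_small": timedelta(minutes=7),
    "rds_read_replica": timedelta(minutes=15),
    "aws_quota_lambda_concurrency": timedelta(hours=4),
    "aws_quota_ses_sandbox_exit": timedelta(hours=72),
}


def lead_time_warning(resource: str, event_start: datetime,
                      now: datetime | None = None) -> str | None:
    """Return a blocking warning if the event starts sooner than the resource's
    typical scale-up time; None means the lead time is sufficient."""
    now = now or datetime.now(timezone.utc)
    needed = DEFAULT_LEAD_TIME[resource]
    available = event_start - now
    if available < needed:
        return (f"Your event starts in {available}; {resource} typically "
                f"needs {needed} of lead time.")
    return None
```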
The override is supported but explicit. An operator can override the wizard’s lead-time default if they know something the wizard does not (e.g., they have a pre-approved support contact at AWS who can expedite). The override is logged in the audit trail; if the event later goes wrong, the audit trail shows that the operator chose the shorter lead time despite the warning.
Gantt + cost preview before commit
The plan is rendered as a Gantt timeline before any action fires. Each row is an action, with a start time, a duration, an expected end time, and a cost delta.
| Time relative to event | Action | Duration | Cost delta |
|---|---|---|---|
| T-48 hours | Submit AWS quota increase for Lambda concurrency to 5000 | async | $0 (until used) |
| T-20 min | Provision 2x RDS read replicas for customer-db | 12 min | +$45 |
| T-15 min | Scale EKS node group web-app from 20 to 80 | 8 min | +$28 |
| T-12 min | Scale ECS service payments-api desired count from 4 to 16 | 3 min | +$14 |
| Event window | Higher capacity serves traffic | 2 hours | +$340 (event-hour spend) |
| T+10 min | Validate baseline metrics back to pre-event | (gate) | $0 |
| T+15 min | Scale EKS / ECS / RDS back to baseline | 5 min | $0 (resources removed) |
| | Total marginal cost | | +$427 |
The operator sees the total cost before committing. $427 for a 2-hour event that prevents 5 minutes of degradation is an easy yes; $4,000 for a 2-hour event that prevents 5 minutes of degradation is a harder conversation. The cost preview moves the decision to where it belongs: before the event, not as a surprise on the next bill.
The Gantt also surfaces dependencies. A pre-scaled EKS node group depends on the AWS quota being raised first; the wizard renders the dependency edge so the operator can see “if the quota does not land by T-30 min, the EKS scale-up fails.” This is the kind of dependency that ad-hoc pre-scale scripts get wrong every time.
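A minimal sketch of the pre-commit validation, assuming a hypothetical action model with a `depends_on` field and a per-action cost delta (illustrative names, not the wizard's schema): sum the deltas for the cost preview, and flag any action whose dependency is not expected to land before the action fires.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PlannedAction:
    name: str
    starts_at: datetime          # absolute fire time, e.g. event_start - lead time
    duration: timedelta
    cost_delta: float            # marginal dollars for this action
    depends_on: list[str] = field(default_factory=list)

    @property
    def ends_at(self) -> datetime:
        return self.starts_at + self.duration


def total_cost(actions: list[PlannedAction]) -> float:
    """The number shown in the cost preview before the operator commits."""
    return sum(a.cost_delta for a in actions)


def dependency_violations(actions: list[PlannedAction]) -> list[str]:
    """Flag any action whose dependency is not expected to finish before it fires,
    e.g. an EKS scale-up that needs a quota increase to land first."""
    by_name = {a.name: a for a in actions}
    return [
        f"{a.name} fires at {a.starts_at:%H:%M} but {dep} is not expected "
        f"to finish until {by_name[dep].ends_at:%H:%M}"
        for a in actions
        for dep in a.depends_on
        if by_name[dep].ends_at > a.starts_at
    ]
```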
Scale-down gating: the second hard problem
Forgotten scale-downs are the second-biggest cost waste on launch days. A cluster pre-scaled to 80 nodes for a 2-hour event runs at 80 nodes for the next 7 days because the infrastructure team got pulled into post-launch firefighting and forgot to scale back. The cost difference between “scaled for 2 hours” and “scaled for 7 days” is roughly 84x; the over-scaled week often costs more than the event-day pre-scale itself.
Event Readiness automates the scale-down at the configured event end time, but with two gates that prevent the scale-down from breaking a still-busy system.
| Gate | What it checks | What it protects against |
|---|---|---|
| Queue depth back to baseline | Application queue length, message-broker backlog | Scaling down while traffic is still draining |
| p99 latency back to pre-event | Application p99 across the affected services | Scaling down while user experience is still degraded |
| Manual hold | Operator override | “Traffic is high, hold the scale-down” |
| Hard cap | Maximum extension window (default: 2x event duration) | Indefinite hold if metrics never recover |
The gates run automatically; the operator does not need to babysit the scale-down. If queue depth and p99 are back to baseline at the event end time, the scale-down fires. If not, the wizard holds the scaled-up capacity until both gates are satisfied, up to the hard-cap window (after which the scale-down fires anyway with a warning).
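In rough pseudocode the gate loop looks like this. It is a hedged sketch: the `read_queue_depth`, `read_p99`, `scale_down`, and `manual_hold` callables stand in for whatever observability backend and scaling action the gates are wired to.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable

HARD_CAP_MULTIPLIER = 2          # default: extend at most 2x the event duration
CHECK_INTERVAL_SECONDS = 60


def gated_scale_down(event_end: datetime, event_duration: timedelta,
                     baseline_queue_depth: float, baseline_p99_ms: float,
                     read_queue_depth: Callable[[], float],
                     read_p99: Callable[[], float],
                     scale_down: Callable[[], None],
                     manual_hold: Callable[[], bool]) -> None:
    """Fire the scale-down at event end, but only once queue depth and p99 are
    back to baseline (or the hard-cap window expires, whichever comes first)."""
    hard_cap = event_end + HARD_CAP_MULTIPLIER * event_duration
    while datetime.now(timezone.utc) < hard_cap:
        queue_ok = read_queue_depth() <= baseline_queue_depth
        latency_ok = read_p99() <= baseline_p99_ms
        if queue_ok and latency_ok and not manual_hold():
            scale_down()
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
    # Hard cap reached: fire anyway and flag it in the post-event report.
    scale_down()
```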
The post-event report shows the actual scale-down time and the gates that fired or held. A team that consistently sees “scale-down extended by 23 minutes after the event window” knows to schedule the next event with a longer window, or to investigate why traffic is taking longer to drain than expected.
How to use it day to day
The setup workflow is short.
| Step | Action | Where |
|---|---|---|
| 1 | Open Event Readiness | Sidebar → Event Readiness |
| 2 | Pick a new event | “New event” button |
| 3 | Enter event window, multiplier, affected workloads | Wizard screens |
| 4 | Review the generated plan + cost | Gantt view |
| 5 | Address any blocking warnings (lead times too short) | Plan editor |
| 6 | Save the plan | Save button |
| 7 | Plan fires automatically at the right times | No babysitting needed |
| 8 | Review post-event report | Same page |
For repeated events (weekly demos, monthly newsletters) the operator saves the plan as a template. A subsequent event uses the template; the only inputs that change are the date and (sometimes) the multiplier.
For ad-hoc small events (an internal demo for 20 engineers) the wizard is overkill — manual scaling or just leaving the autoscaler reactive is fine. The wizard pays off for events where the traffic spike is fast (under a few minutes), the customer-facing impact of degradation is high, or both. Most teams find the threshold sits around “any external launch with public marketing tied to a specific time.”
What ZopNight does not yet ship: multi-event scheduling (overlapping events with combined plans), predicted multipliers from historical data (auto-suggest based on past similar events instead of asking the operator), and cross-cloud events that hit both AWS and GCP simultaneously. Each is a future direction; the current deliverable is the single-event, single-cloud pre-scale wizard.
Reactive autoscalers were the right answer for the steady-traffic web of the 2010s. Planned launches with 10x spikes in 60 seconds are a different problem class that needs a different primitive. Pre-scale the right amount of the right resources at the right lead time. Review the cost delta before you commit. Let the gated scale-down handle the post-event teardown. That is what the wizard is for.


