Kubernetes CPU Throttling Is Lying to You: Why Container Limits Bleed Latency
The dashboard says CPU throttling is at 0.5%. The p99 latency on that container says 30% of requests just lost 80 milliseconds to scheduling delay. Both numbers are correct. They are measuring different things, and the one that lands on every Kubernetes cost-and-performance dashboard is not the one that predicts whether your users are angry.
This is the most consequential observability gap in container platforms today. CFS bandwidth control, the kernel mechanism behind cpu.limit, reports throttling as a fraction of total CPU time. But latency damage from throttling is not proportional to total CPU time, it is proportional to the number of 100-millisecond windows that hit the quota at all. A pod can spend 99% of its wall-clock time idle and still throttle 30% of incoming requests if those requests cluster.
The fix is not to remove the metric. The fix is to graph the right one and to understand the four levers that change the throttling behavior of a workload. This post covers what the levers are, when each one applies, and the small set of metrics that actually predict tail latency.
The pattern composes with VPA, HPA, and KEDA scaling decisions and the Karpenter rebalance loop we have been writing about, but the ground truth is the kernel scheduler underneath all of them.
The throttling number that hides the pain
Pick a real production scenario. A latency-sensitive service with cpu.request=200m, cpu.limit=500m. Average CPU consumption is 80m. The dashboard reports 0.5% throttling.
| Dashboard reads | Reality on the request path | What the team believes |
|---|---|---|
| 0.5% throttling | 5% of requests hit a throttle event with 30-80ms added | “Throttling is fine” |
| 2% throttling | 20% of requests, p99 +120ms | “Marginal but ignorable” |
| 5% throttling | 40% of requests, p99 +200ms | “Maybe a problem” |
| 10% throttling | 60-80% of requests, p99 +300ms | “Now we see it” |
The scaling factor between dashboard throttling and tail-latency damage is not 1:1, it is 5x to 20x. By the time the dashboard hits 10% and the team accepts there is a problem, the user-visible latency has been bad for a long time. The metric is not wrong. It is just not the metric anyone needed.
How CFS bandwidth control actually works
The Completely Fair Scheduler enforces cpu.limit through a quota-and-period mechanic. The default period is 100 milliseconds of wall-clock time. The quota is cpu.limit multiplied by the period. A pod with cpu.limit=500m gets 50 milliseconds of CPU per 100-millisecond window.
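As a quick sanity check, here is that arithmetic as a short Python sketch; the parse_millicores helper and the sample values are illustrative, not part of any Kubernetes API.

```python
# CFS bandwidth arithmetic: quota = cpu.limit x period.
# Sketch only; parse_millicores and the sample values are illustrative.

CFS_PERIOD_US = 100_000  # default CFS period: 100ms, in microseconds


def parse_millicores(limit: str) -> float:
    """Convert a Kubernetes CPU quantity like '500m' or '2' to cores."""
    return float(limit[:-1]) / 1000 if limit.endswith("m") else float(limit)


def cfs_quota_us(cpu_limit: str, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) the container may consume per period."""
    return int(parse_millicores(cpu_limit) * period_us)


# A 500m limit yields 50ms of CPU per 100ms window.
assert cfs_quota_us("500m") == 50_000
```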
When a request needs more CPU than the remaining quota in the current window, the container is throttled until the next quota refill. The latency penalty is “however much of the current window is left, plus enough of the next window to finish the work.” For a request that needs 80ms of CPU and arrives 30ms into a window with full quota available, the math runs as follows: the container burns its 50ms of quota by the 80ms mark of the window, stalls for the 20ms left until the refill, then finishes the remaining 30ms of work in the next window. 100ms of wall-clock time for 80ms of CPU.
That 20ms stall is not abstract. It is the actual user-facing tail, and it happens once per arrival that crosses a quota boundary. A request needing 80ms of CPU on a 500m limit always exhausts at least one window's quota, and depending on where in the window it lands it picks up anywhere from 0 to 50ms of stall; under steady load, roughly half of the arrivals land where the stall is 20-50ms.
Aggregate over a minute, and the total throttled time is the per-event stall, 20-50ms here, times the throttle count. Across millions of CPU-microseconds of scheduling, that is a small percentage. Across user-visible request latency, it is half your p99.
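To make the window-boundary effect concrete, here is a minimal simulation of the quota mechanic. It assumes the container is otherwise idle and the quota is untouched at arrival, which matches the example above; real containers share the quota across threads and concurrent requests.

```python
def throttle_stall_ms(work_ms: float, arrival_offset_ms: float,
                      quota_ms: float = 50.0, period_ms: float = 100.0) -> float:
    """Extra wall-clock time a single request loses to CFS throttling.

    Assumes the container is otherwise idle and the quota is full at arrival.
    """
    t = arrival_offset_ms      # wall-clock position inside the current period
    remaining = work_ms        # CPU work still to do
    quota_left = quota_ms
    stall = 0.0
    while remaining > 0:
        runnable = min(remaining, quota_left, period_ms - t)
        remaining -= runnable
        t += runnable
        quota_left -= runnable
        if remaining > 0 and (quota_left <= 0 or t >= period_ms):
            if quota_left <= 0:
                stall += period_ms - t   # throttled until the next refill
            t = 0.0                      # a new period begins
            quota_left = quota_ms        # quota refills (no burst)
    return stall


# 80ms of work arriving 30ms into a fresh window stalls for 20ms;
# arriving right at the window start it stalls for the full 50ms.
print(throttle_stall_ms(80, 30))   # 20.0
print(throttle_stall_ms(80, 0))    # 50.0
```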
Two metrics that disagree
cAdvisor exposes both numbers. Most dashboards graph the wrong one.
| Metric | What it measures | When it lies |
|---|---|---|
container_cpu_cfs_throttled_seconds_total | Total seconds the container was throttled | Always reads small because most windows are not throttled |
container_cpu_cfs_throttled_periods_total | Count of 100ms windows where ANY throttling occurred | The honest tail-latency predictor |
container_cpu_cfs_periods_total | Count of all 100ms windows the container ran in | Denominator for the ratio |
The right alert metric is throttled_periods / periods: what fraction of windows hit the quota. Anything above 5% of windows is a tail-latency problem worth investigating. Above 20% is operational pain that users are reporting. Above 50% is a workload that should not have a cpu.limit at all.
The metric most teams alert on is throttled_seconds / (periods × period_length). This is the right number for capacity planning (“how much CPU did the limit cost me”) but the wrong number for latency (“how often did the limit hurt”). The two are not interchangeable, and switching alerts from the first to the second is the lowest-effort fix in this post.
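The divergence is easy to see if you compute both ratios from the raw counters. A sketch, with made-up counter deltas standing in for a real scrape:

```python
# Two ratios from the same cAdvisor counters (deltas over the alert window).
# The counter values below are invented to illustrate the divergence.

throttled_seconds = 0.9   # delta of container_cpu_cfs_throttled_seconds_total
throttled_periods = 180   # delta of container_cpu_cfs_throttled_periods_total
periods = 600             # delta of container_cpu_cfs_periods_total
period_length = 0.1       # seconds (the default 100ms CFS period)

# Capacity-planning view: fraction of wall-clock time spent throttled.
time_throttled = throttled_seconds / (periods * period_length)   # 0.015 -> "1.5%"

# Latency view: fraction of windows in which the quota was hit at all.
windows_throttled = throttled_periods / periods                  # 0.30 -> "30%"

print(f"throttled time: {time_throttled:.1%}, throttled windows: {windows_throttled:.1%}")
```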
The four fixes
When the right metric tells you a workload is throttling, four levers change the behavior. Each one has a different effort, risk, and applicability.
| Fix | Kernel requirement | Effort | Risk | Typical p99 improvement |
|---|---|---|---|---|
| Set cpu.cfs_burst_us to 25-50% of limit | 5.14+ | 3 lines in pod spec | Bounded by burst cap | 60-90% reduction in throttled periods |
| Remove cpu.limit entirely | Any | 1 line removed | Noisy-neighbor exposure | Eliminates throttling |
| Lower CFS period from 100ms to 10ms | Any | Node kernel parameter | +0.5-1.5% scheduler overhead per node | 5-10x lower tail variance |
| Raise cpu.limit to 2x request | Any | One number changed | Less protection from runaway pods | 50-70% reduction in throttled periods |
The CFS burst lever is the right default for most teams in 2026. Three lines per pod, no infrastructure change, latency improvement is large, and the burst is bounded by configuration rather than open-ended. The kernel requirement is the one gate, but most managed Kubernetes (EKS 1.28+, GKE 1.27+, AKS 1.28+) ships 5.14 or newer.
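One way to confirm whether the burst lever is actually in effect on a given container is to read the cgroup files directly. A sketch, assuming a cgroup v2 unified hierarchy and execution from inside the container; on cgroup v1 the equivalent files are cpu.cfs_quota_us and cpu.cfs_burst_us.

```python
# Read the effective CFS quota and burst from the cgroup v2 files.
# Assumes cgroup v2 and that this runs inside the container; cpu.max.burst
# only exists on kernels 5.14 or newer.
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup")

quota_raw, period_raw = (CGROUP / "cpu.max").read_text().split()
period_us = int(period_raw)
quota_us = None if quota_raw == "max" else int(quota_raw)   # "max" means no limit

burst_file = CGROUP / "cpu.max.burst"
burst_us = int(burst_file.read_text()) if burst_file.exists() else 0

print(f"quota: {quota_us} us per {period_us} us window, burst: {burst_us} us")
if quota_us is not None and burst_us == 0:
    print("limit set but no burst: every spike past the quota becomes a stall")
```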
Removing cpu.limit entirely is the right answer for latency-sensitive services on dedicated nodes. The advice “always set a limit” predates VPA and predates good cgroup v2 fairness. On a node with one tenant per node-pool and VPA-managed requests, the limit is purely a tail-latency injector with no upside. Tim Hockin and most large-scale operators have published the same conclusion. The advice is a decade old and should be retired for the workloads it was never meant to protect.
Lowering the CFS period is the right answer for shared multi-tenant clusters where the limit must stay. A 10ms period instead of 100ms reduces the maximum throttling stall to 10ms, which moves the p99 tail from “user-visible” to “in the noise.” The cost is a small per-node scheduler overhead; the latency benefit is dramatic.
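The arithmetic behind that claim: for a single busy thread, the worst-case stall in one window is the slice of the period the quota does not cover, so shrinking the period shrinks the stall proportionally. A small sketch, carrying over the 500m limit from the earlier example; heavily multi-threaded containers can stall for up to nearly the full period, so treat this as a lower bound on the improvement, not an exact figure.

```python
def worst_case_stall_ms(limit_cores: float, period_ms: float) -> float:
    """Worst single stall for one busy thread: the unfunded share of the period."""
    quota_ms = limit_cores * period_ms
    return max(period_ms - quota_ms, 0.0)


for period in (100, 10):
    print(f"{period}ms period -> up to {worst_case_stall_ms(0.5, period):.0f}ms stall per throttle")
# 100ms period -> up to 50ms stall per throttle
# 10ms period -> up to 5ms stall per throttle
```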
What to actually graph in production
The replacement alerting rules are short. Two new alerts plus a deprecation.
| Legacy alert | Replacement | Threshold |
|---|---|---|
| container_cpu_cfs_throttled_seconds_total > X | DEPRECATE, keep only for capacity planning | n/a |
| (none) | throttled_periods / periods > 0.05 for 5 min | warn |
| (none) | throttled_periods / periods > 0.2 for 1 min | page |
The 5% warning fires before users notice. The 20% page fires when users are actively impacted. Both are computed from the same cAdvisor metric the cluster already exports; the change is what the alert evaluates.
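Here is a sketch of those two rules as PromQL expressions held in Python constants. The metric names are the standard cAdvisor ones quoted above; the 5-minute rate window, the grouping labels, and the rule names are assumptions to adapt to your own conventions.

```python
# Replacement alert expressions, sketched as PromQL strings.
# Metric names come from cAdvisor; window, labels, and rule names are assumptions.

THROTTLED_WINDOW_RATIO = """
sum by (namespace, pod, container) (increase(container_cpu_cfs_throttled_periods_total[5m]))
  /
sum by (namespace, pod, container) (increase(container_cpu_cfs_periods_total[5m]))
""".strip()

ALERTS = {
    "CPUThrottledWindowsWarn": {"expr": f"({THROTTLED_WINDOW_RATIO}) > 0.05",
                                "for": "5m", "severity": "warning"},
    "CPUThrottledWindowsPage": {"expr": f"({THROTTLED_WINDOW_RATIO}) > 0.20",
                                "for": "1m", "severity": "page"},
}
```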
The composing observation is that this is a closed-loop remediation pattern in disguise. Detect (throttled_periods crosses threshold), decide (which fix matches the workload shape), act (apply burst, remove limit, or change period), verify (re-read the metric after the next deploy). The detect signal lives in cAdvisor. The decide step is policy on cpu.limit configuration. The act step is a pod spec edit. The verify step is the same metric ten minutes later.
When cpu.limit still belongs
Be honest about the cases where the limit is correct.
| Workload shape | Use limit? | Why |
|---|---|---|
| Multi-tenant cluster, untrusted code | Yes | Limit is the noisy-neighbor guard; throttling is the cost of safety |
| Batch jobs (training, ETL) | Yes | Throughput-bound; bound resource use to fit cluster budget |
| Latency-sensitive service on dedicated nodes | No | Limit only injects tail latency; no protection benefit |
| Bursty interactive workload, 5.14+ kernel | Limit + burst | CFS burst rebuilds the safety with much less tail damage |
| Untuned legacy workload | Yes (with monitoring) | Default safety until properly characterized |
The decision is workload-shape-dependent. The default of “always set a limit” was correct in 2017 when fairness primitives were weaker. In 2026, the right default is “set a limit for protection cases, remove it for latency cases, and use burst when the kernel allows.” The dashboards you graph need to support that nuance, and the metric you alert on needs to be the one that predicts the latency outcome rather than the one that summarizes the kernel counter.
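The table above reduces to a small piece of policy. A sketch of that decision as code, with the shape categories and the kernel check standing in for whatever facts your platform already tracks:

```python
def cpu_limit_policy(shape: str, kernel_supports_burst: bool) -> str:
    """Map a workload shape to a cpu.limit posture; categories mirror the table above."""
    if shape in ("multi_tenant_untrusted", "batch"):
        return "keep limit"                          # noisy-neighbor / budget protection
    if shape == "latency_sensitive_dedicated_nodes":
        return "remove limit"                        # limit only injects tail latency here
    if shape == "bursty_interactive":
        return ("limit + cfs burst" if kernel_supports_burst
                else "raise limit to 2x request")    # fallback when 5.14+ is unavailable
    return "keep limit, monitor throttled_periods / periods"   # untuned legacy default
```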
The lever is small. The CFS quota mechanic is fifty lines of kernel code. The latency damage from throttling is a half-billion-dollar global problem because every Kubernetes cluster reports the wrong metric on its default dashboard. Switch the metric, pick the right fix per workload, and the tail latency you have been chasing for two years gets fixed in an afternoon.
