The observability bill at a 50-engineer org goes from $8,000/month in year one to $90,000/month by year three. The growth never gets a budget review because each individual instrumentation change looks tiny: a new metric, a new tag on an existing metric, a per-customer dimension someone added to debug a spike. None of those changes show up on the monthly bill as a single line item. They show up as the bill quietly compounding because each one of them multiplied the number of unique series the vendor stores.
The teams that try to fix this usually focus on data volume (samples per second, ingest rate, log lines per day) because that is what the vendor’s dashboard surfaces in big numbers. Data volume is 5-10% of the bill at most vendor pricing tiers in 2026. Cardinality, the count of unique time series your metrics generate, is 50-70% of the bill. Optimizing for ingest rate cuts 5% when 60% is available.
The single knob that actually controls observability cost is cardinality, specifically the count of unique tag-value combinations per metric. A 90-day cardinality-first review at a typical mid-market org cuts $35,000 to $60,000 from the monthly bill with no loss of diagnostic capability and no vendor migration. The work is 2-4 engineer-weeks. The payback is positive in month one and compounds because the cost growth curve flattens, not just the level.
This piece is the operator’s guide to that review. It sits alongside the usual comparisons of vendor pricing models in the observability space, but the angle here is one layer deeper: the property of your own instrumentation that drives the bill regardless of vendor.
The $90k observability bill nobody planned for
Look at the spend trajectory for a typical mid-market SaaS over three years.
| Year | Monthly observability bill | Volume contribution | Cardinality contribution | What changed |
|---|---|---|---|---|
| 1 | $8,000 | $1,800 | $5,200 | Initial instrumentation; ~250k series |
| 2 | $32,000 | $4,000 | $25,000 | Per-customer dimensions added; ~1.4M series |
| 3 | $90,000 | $7,500 | $76,000 | Trace IDs leaked into metric labels; ~4.8M series |
The volume column grows linearly with the system (more requests, more events, more log lines per minute). The cardinality column grows faster than linearly because each new tag multiplies the existing series count. By year three the cardinality cost is 8-12x the volume cost on the same instrumentation surface.
The bill conversation usually starts with the wrong number. A platform team looks at the bill, sees that Datadog charges per ingested log GB and per million metric samples, and starts a project to “reduce ingest volume.” They turn off DEBUG logs, sample non-critical traces, and stretch the metric collection interval from 10s to 60s. The bill drops $3,000/month. The team celebrates, and the cost keeps growing because the cardinality knob is untouched.
The right conversation starts at cardinality: which metrics generate the most unique series, why, what are the tags driving the explosion, and which of those tags actually need to be on a metric versus on a trace or log.
Cardinality math: 4 tags can produce 1.4B series
Cardinality is the product of all tag values on a metric. A counter http_requests_total with no tags is 1 series. Add method (8 values: GET, POST, PUT, DELETE, etc.) and it is 8 series. Add endpoint (300 routes) and it is 8 × 300 = 2,400 series. Add status (12 HTTP codes) and it is 8 × 300 × 12 = 28,800 series. Still cheap. Now add user_id with 50,000 values: http_requests_total × method (8) × endpoint (300) × status (12) × user_id (50,000) = 1,440,000,000 potential series.
A single counter just produced 1.44 billion potential series. In practice the actual count is much lower (most combinations never fire) but the live cardinality typically lands at 30-60% of the potential, which on this example is 430M to 860M series. At Datadog’s $0.05/series/month for the standard tier, that one counter costs $21M to $43M per month.
The vendor does not stop you from creating this cardinality. It just bills you for it. The bill arrives, the platform team sees the spike, the question becomes: which metric did this, and which tag is the culprit?
| Metric example | Tags | Tag value counts | Total series |
|---|---|---|---|
| http_requests_total | method, endpoint, status | 8 × 300 × 12 | 28,800 |
| http_requests_total | method, endpoint, status, region | 8 × 300 × 12 × 5 | 144,000 |
| http_requests_total | method, endpoint, status, region, **user_id** | 8 × 300 × 12 × 5 × 50,000 | 7,200,000,000 |
| db_query_duration | query_name, db_name | 80 × 6 | 480 |
| db_query_duration | query_name, db_name, **customer_id** | 80 × 6 × 400 | 192,000 |
| db_query_duration | query_name, db_name, **customer_id**, **session_id** | 80 × 6 × 400 × 1,000,000 | 192,000,000,000 |
The bolded tags are the cardinality detonators. Each one is a high-uniqueness identifier (per-user, per-customer, per-session) that has no business being a metric dimension. They belong on traces (where each trace is a single event, not a time series) or logs (where each line is a single record). Putting them on a metric multiplies the series count by the cardinality of the identifier.
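The multiplication is easy to check before a tag ships. A minimal sketch in Python; the tag names, value counts, live fraction, and per-series price are the illustrative figures from above, not measured values:

```python
from math import prod

def potential_series(tag_value_counts: dict[str, int]) -> int:
    """Upper bound on unique series for one metric: the product of its tag value counts."""
    return prod(tag_value_counts.values()) if tag_value_counts else 1

def estimated_monthly_cost(tag_value_counts: dict[str, int],
                           live_fraction: float = 0.45,
                           price_per_series: float = 0.05) -> float:
    """Rough monthly cost: live series (a fraction of the potential) times the per-series price.
    The 45% live fraction and $0.05/series/month are assumptions; plug in your vendor's numbers."""
    return potential_series(tag_value_counts) * live_fraction * price_per_series

# Reproduces the http_requests_total math above.
base = {"method": 8, "endpoint": 300, "status": 12}
print(f"{potential_series(base):,}")                                   # 28,800
print(f"{potential_series({**base, 'user_id': 50_000}):,}")            # 1,440,000,000
print(f"${estimated_monthly_cost({**base, 'user_id': 50_000}):,.0f}/month")
```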
The three high-cardinality offenders
Most observability bill overruns reduce to three specific tag classes. Removing or aggregating them recovers 40-65% of the bill with zero diagnostic loss.
| Tag class | Why it lands on metrics | Where it actually belongs | Bill impact when removed |
|---|---|---|---|
| user_id / customer_id | “Per-tenant visibility” demand | Trace span attribute | 30-50% |
| trace_id / span_id | Accidental metric labeling | Already in trace, never metric | 10-25% |
| version / build_id / git_sha | Added for deploy debugging, never pruned | Trace metadata; metric only for last N versions | 5-15% |
user_id / customer_id on shared metrics. A team wants “per-tenant API latency.” Someone adds customer_id to the api_request_duration histogram. Series count multiplies by customer count. The dashboard shows per-customer p99 latency, which the team uses three times in six months. The bill triples. The right answer: keep customer_id on the trace span; query traces for per-tenant analysis; keep the metric to a rollup (api_request_duration × endpoint × status only, no per-customer breakdown).
trace_id promoted to a metric label. A common bug pattern: an OpenTelemetry SDK is misconfigured to copy trace context attributes onto every emitted metric. trace_id is unique per request (effectively infinite cardinality from the metric’s perspective). The vendor bill shows millions of one-sample series. The fix is at the SDK / collector level: explicit allow-list of attributes to copy from trace to metric, blocking trace_id and span_id by default.
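A minimal sketch of that allow-list at the SDK level, using the OpenTelemetry Python SDK's View mechanism; the instrument and attribute names are illustrative, and the exact View parameters can vary by SDK version:

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View

# Only the listed attributes survive onto exported series; anything else copied
# from trace context (trace_id, span_id, request ids) is dropped before aggregation.
latency_view = View(
    instrument_name="http.server.duration",               # illustrative instrument name
    attribute_keys={"http.method", "http.route", "http.status_code"},
)

provider = MeterProvider(views=[latency_view])
meter = provider.get_meter("api-service")
```

The same rule can be enforced centrally in the collector's metric pipeline, which protects every service regardless of which SDK it uses.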
version / build_id not pruned. Deploy lands; instrumentation tags every metric with version=v1.2.3 so the team can compare pre- and post-deploy behavior. Three weeks later there are 40 versions in the tag values, each with its own series. The team only ever queries the last 2-3 versions. The fix: tag with version, but at the collector level prune any version older than 30 days from the metric pipeline (traces and logs can keep the full history because they age out on their own retention curve).
The pattern across all three: high-uniqueness identifiers belong on traces and logs (which the vendor bills very differently and which scale fine with cardinality) rather than on metrics (which compound). The OpenTelemetry three-pillar separation (metrics, traces, logs) exists precisely so that each type of telemetry can handle the data class it is good at. Cardinality goes on traces and logs; metrics stay aggregated.
The cardinality report (one engineer-day, highest-leverage observability work)
The work to make cardinality manageable starts with measurement. Every major observability vendor exposes per-metric cardinality somehow; the surface is just not in the default dashboard.
| Vendor | Cardinality introspection | Where to find it |
|---|---|---|
| Datadog | Metrics Summary page; datadog.estimated_usage.metrics.* usage metrics | Per-metric panel in Metrics Explorer |
| Prometheus | prometheus_tsdb_head_series for the total; count by (__name__) across all series for per-metric counts | Self-monitoring scrape; /api/v1/status/tsdb |
| Honeycomb | Dataset cardinality view | Per-dataset settings → cardinality |
| Grafana Mimir / Cortex | cortex_ingester_active_series | Self-monitoring |
| New Relic | Metric cardinality limit warnings in usage UI | Account → Usage |
The weekly cardinality report is one engineer-day to build and is the single highest-leverage piece of observability work most teams can ship in 2026. It contains:
| Column | Purpose |
|---|---|
| Metric name | Identification |
| Series count (now) | Current cardinality |
| Series count (7d ago) | Growth detection |
| Top 3 tags by value count | Which dimensions are driving it |
| Cap | The configured per-metric limit |
| Action | “Over cap”, “Approaching cap”, “Healthy” |
The report runs weekly, posts to a #observability-cost Slack channel, and surfaces the top 20 metrics by series count plus any metric that grew >50% week-over-week. The platform team reviews the report in 15 minutes. Most weeks there is no action; in the weeks where a new high-cardinality tag lands (often unintentionally), the report catches it before the next billing cycle.
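A trimmed sketch of the report against a Prometheus-compatible backend; the endpoint, the Slack webhook, and the match-everything selector are assumptions to replace with your own (and the selector should be scoped on large installations):

```python
"""Weekly cardinality report: top metrics by series count plus week-over-week growth."""
import requests

PROM_URL = "http://prometheus:9090"                         # assumption: your Prometheus endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY"  # assumption: #observability-cost webhook

def series_counts(promql: str) -> dict[str, float]:
    """Run an instant query and return {metric name: series count}."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=60)
    resp.raise_for_status()
    return {r["metric"].get("__name__", "?"): float(r["value"][1])
            for r in resp.json()["data"]["result"]}

# Series count per metric name, now and a week ago. Matching every series is heavy on
# large installations; scope the selector to a job or namespace if needed.
now = series_counts('count by (__name__) ({__name__=~".+"})')
week_ago = series_counts('count by (__name__) ({__name__=~".+"} offset 7d)')

lines = []
for name, count in sorted(now.items(), key=lambda kv: -kv[1])[:20]:
    prev = week_ago.get(name)
    growth = f"{(count - prev) / prev:+.0%} WoW" if prev else "new"
    flag = "  <-- grew >50% week-over-week" if prev and count > prev * 1.5 else ""
    lines.append(f"{name}: {int(count):,} series ({growth}){flag}")

requests.post(SLACK_WEBHOOK, json={"text": "Weekly cardinality report\n" + "\n".join(lines)},
              timeout=30)
```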
The team that does not have this report has no way to know which metric is driving the bill until the bill arrives. The team that has the report fixes the cardinality issue in the week it appears, not in the quarter after the bill review.
Aggregation: move high-cardinality data to traces and logs
The right place for high-cardinality data is determined by the OpenTelemetry three-pillar separation. Each pillar has a different cost-vs-detail tradeoff and the high-cardinality identifiers go to the pillars that handle them well.
| Data class | Pillar | Why |
|---|---|---|
| Request counts by method/status/endpoint | Metric | Low cardinality, queried as time series |
| Per-customer latency analysis | Trace | High cardinality, queried per-request |
| Per-user error rate | Trace | High cardinality identifier |
| Aggregate error rate by service | Metric | Low cardinality |
| Audit events (who did what, when) | Log | Free-form, often compliance-driven |
| Trace-level diagnostic detail | Trace | Designed for it |
| Deploy markers (version comparison over 24h) | Metric (with TTL on version tag) | Pruned automatically |
The rule of thumb: if a dimension’s value count exceeds 100 unique values across a 7-day window, it does not belong on a metric. It belongs on a trace or a log. The vendor’s trace and log products price differently (per-event, per-byte, with sampling) and high cardinality is normal for them. The metric product prices per series; high cardinality is the cost detonator.
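A quick way to apply the 100-value rule, again assuming a Prometheus-compatible backend; the metric and label names are illustrative:

```python
import requests

PROM_URL = "http://prometheus:9090"   # assumption: your Prometheus endpoint

def distinct_label_values(metric: str, label: str, window: str = "7d") -> int:
    """Number of distinct values a label has taken on a metric over the window."""
    promql = f"count(count by ({label}) (count_over_time({metric}[{window}])))"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=60)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

# More than ~100 distinct values in 7 days: the dimension belongs on a trace or log.
print(distinct_label_values("http_requests_total", "customer_id"))
```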
Cap and alert per metric
The discipline that makes cardinality manageable in the long run is per-metric caps. Without caps, cardinality grows monotonically: every new dimension is a marginal addition that “doesn’t seem that big.” With caps, the team has to make a conscious decision when a metric approaches its limit: do we raise the cap (and accept the cost), remove the dimension (and lose some detail), or aggregate harder (and trade some precision)?
| Metric tier | Cap | Typical use |
|---|---|---|
| Tier 1 (critical, high-value) | 50,000 series | Customer-facing latency, error rates, SLO inputs |
| Tier 2 (standard) | 5,000 series | Internal service health, deploy markers, batch job metrics |
| Tier 3 (debugging only) | 500 series | Ephemeral metrics added during investigation, must be removed after |
The cap alerts fire at 80% of the limit. The platform team gets a ping; the metric’s owner has two weeks to either justify a cap raise (with a budget impact estimate) or reduce the cardinality. If neither happens, the metric is downgraded a tier (which lowers its cap and forces the owner to address it).
The numbers above are illustrative; the right caps depend on your vendor’s pricing tier. The right way to set them is to start from the current cardinality distribution and pick caps that let 95% of metrics fit within Tier 2, with the 5% that legitimately need high cardinality in Tier 1. Tier 3 is the safety valve for debugging metrics, which should always be temporary.
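A sketch of the cap check itself, driven by the same per-metric series counts the weekly report already collects; the tier assignments and thresholds are illustrative:

```python
# Illustrative tier caps and assignments; series counts come from the same query
# the weekly cardinality report already runs.
CAPS = {"tier1": 50_000, "tier2": 5_000, "tier3": 500}
METRIC_TIERS = {                                   # assumption: maintained next to the instrumentation
    "api_request_duration": "tier1",
    "db_query_duration": "tier2",
    "debug_cache_probe": "tier3",
}
ALERT_FRACTION = 0.8                               # ping the owner at 80% of the cap

def cap_status(metric: str, series_count: int) -> str:
    cap = CAPS[METRIC_TIERS.get(metric, "tier2")]  # untiered metrics default to Tier 2
    if series_count > cap:
        return f"over cap ({series_count:,}/{cap:,}): raise, reduce, or downgrade within two weeks"
    if series_count > ALERT_FRACTION * cap:
        return f"approaching cap ({series_count:,}/{cap:,})"
    return "healthy"

print(cap_status("db_query_duration", 4_600))      # approaching cap (4,600/5,000)
```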
The AI-agent special case
AI agent fleets create a cardinality problem that ordinary instrumentation rules do not catch. A fleet that emits per-invocation metrics tagged with agent_id (47 agents) and request_id (5 million requests/day) produces 235 million unique series per day from a single metric. The cardinality compounds across the metric set; even a small agent fleet can outspend the rest of the org’s observability bill in a quarter.
The fix is per-agent metric aggregation: emit one metric per agent per minute instead of one per invocation.
| Approach | Cardinality / day | Diagnostic capability |
|---|---|---|
| Per-invocation metric (agent_id + request_id) | 235,000,000 series | Per-request drilling (impossible to query anyway at this scale) |
| Per-agent per-minute aggregate (counter + histogram) | 47 series × 1,440 min = 67,680 | Per-agent rate + latency distribution |
| Per-request data → traces (sampled at 1%) | (cardinality moves to trace product) | Per-request when needed, sampled |
The per-agent-per-minute aggregate uses a counter (agent_invocations_total{agent_id}) for rate and a histogram (agent_latency_ms{agent_id}) for distribution. Together they answer the questions the per-invocation metric was meant to answer (how often does each agent fire, what is the latency distribution) at roughly 1/3,500th the cardinality cost. The per-request detail that is genuinely needed (which request was slow, what was the failure) lives on traces with sampling, where the cost model handles per-request data natively.
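A sketch of the aggregated instrumentation with the OpenTelemetry Python SDK; the agent name is illustrative and the exporter wiring is omitted:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

metrics.set_meter_provider(MeterProvider())        # exporter/reader wiring omitted for brevity
meter = metrics.get_meter("agent-fleet")

# One counter and one histogram, dimensioned by agent_id only: 47 agents -> 47 series each.
invocations = meter.create_counter("agent_invocations_total")
latency = meter.create_histogram("agent_latency_ms", unit="ms")

def record_invocation(agent_id: str, duration_ms: float) -> None:
    # Called once per invocation; the SDK aggregates, so the series count stays at one per agent.
    # request_id goes on the trace span, never on these metrics.
    attrs = {"agent_id": agent_id}
    invocations.add(1, attributes=attrs)
    latency.record(duration_ms, attributes=attrs)

record_invocation("billing-reconciler", 412.0)     # illustrative agent name
```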
The pattern composes with the per-agent token quotas work: the quota system already knows each agent’s identity and rate; the observability metric can be a side-effect of the quota counter rather than a separate instrumentation. One source of truth, one cardinality.
Why dropping the vendor is the wrong fix
The first-instinct fix when the observability bill shocks finance is to put the vendor up for review: solicit a quote from Honeycomb or Grafana Cloud, or price out a self-hosted Prometheus + Loki + Tempo stack. The numbers look compelling because the alternative vendor’s bill is based on your current usage at their pricing, and migration projections always look optimistic.
The migration math is not optimistic in practice.
| Item | Vendor migration | Cardinality fix |
|---|---|---|
| Time to first cost reduction | 6-12 months (post-migration) | 4-8 weeks |
| Engineer-weeks invested | 24-50 (instrumentation rewrite, dashboard rebuild, alert recreation, runbook updates) | 2-4 |
| Risk of degraded incident response during transition | High (parallel systems, alert gaps, training cost) | None |
| Bill reduction after work complete | 30-50% if cardinality is fixed on new vendor; 0% if not | 40-65% |
| Cardinality problem replicates on new vendor? | Yes (it is a property of your instrumentation) | N/A (problem is removed) |
The migration only pays off if the cardinality problem is fixed in the new system. Otherwise the new vendor’s bill grows the same way the old one did, just from a lower starting base. Teams that migrate without fixing cardinality discover this in year two on the new vendor and are back where they started.
The cardinality fix on the current vendor is faster, cheaper, lower-risk, and reduces the bill by a similar percentage. The vendor switch may still make sense for product reasons (better trace UX, different SLO tooling, vendor-specific features), but it is not the cost fix. The cost fix is cardinality.
A typical mid-market org running the 90-day cardinality review recovers $35,000 to $60,000 per month within the first quarter. The compounding effect is more valuable than the level: the cost growth curve flattens because the cardinality discipline is now in place. By year four, an org that ran the cardinality review is at $40-50k/month observability spend; an org that did not is at $130-180k/month on the same engineering surface.
Set up the weekly cardinality report. Identify the top 5 metrics by series count. Find the user_id, trace_id, or version_id tag driving each one. Move those dimensions to traces or logs. The bill drops the next month and stops growing the way it used to. The one knob that matters is the one most teams never touch.


