The observability bill at a 50-engineer org goes from $8,000/month in year one to $90,000/month by year three. The growth never gets a budget review because each individual instrumentation change looks tiny: a new metric, a new tag on an existing metric, a per-customer dimension someone added to debug a spike. None of those changes show up on the monthly bill as a single line item. They show up as the bill quietly compounding because each one of them multiplied the number of unique series the vendor stores.
The teams that try to fix this usually focus on data volume (samples per second, ingest rate, log lines per day) because that is what the vendor’s dashboard surfaces in big numbers. Data volume is 5-10% of the bill at most vendor pricing tiers in 2026. Cardinality, the count of unique time series your metrics generate, is 50-70% of the bill. Optimizing for ingest rate cuts 5% when 60% is available.
The single knob that actually controls observability cost is cardinality, specifically the count of unique tag-value combinations per metric. A 90-day cardinality-first review at a typical mid-market org cuts $35,000 to $60,000 from the monthly bill with no loss of diagnostic capability and no vendor migration. The work is 2-4 engineer-weeks. The payback is positive in month one and compounds because the cost growth curve flattens, not just the level.
This piece is the operator’s guide to that review. It sits alongside the usual comparisons of vendor pricing models in the observability space, but the angle here is one layer deeper: the property of your own instrumentation that drives the bill regardless of vendor.
The $90k observability bill nobody planned for
Look at the spend trajectory for a typical mid-market SaaS over three years.
| Year | Monthly observability bill | Volume contribution | Cardinality contribution | What changed |
|---|---|---|---|---|
| 1 | $8,000 | $1,800 | $5,200 | Initial instrumentation; ~250k series |
| 2 | $32,000 | $4,000 | $25,000 | Per-customer dimensions added; ~1.4M series |
| 3 | $90,000 | $7,500 | $76,000 | Trace IDs leaked into metric labels; ~4.8M series |
The volume column grows linearly with the system (more requests, more events, more log lines per minute). The cardinality column grows faster than linearly because each new tag multiplies the existing series count. By year three the cardinality cost is 8-12x the volume cost on the same instrumentation surface.
The bill conversation usually starts with the wrong number. A platform team looks at the bill, sees that Datadog charges per ingested log GB and per million metric samples, and starts a project to “reduce ingest volume.” They turn off DEBUG logs, sample non-critical traces, and stretch the metric collection interval from 10s to 60s. The bill drops $3,000/month. The team celebrates, and the cost keeps growing because the cardinality knob is untouched.
The right conversation starts at cardinality: which metrics generate the most unique series, why, what are the tags driving the explosion, and which of those tags actually need to be on a metric versus on a trace or log.
Cardinality math: 4 tags can produce 1.4B series
Cardinality is the product of all tag values on a metric. A counter http_requests_total with no tags is 1 series. Add method (8 values: GET, POST, PUT, DELETE, etc.) and it is 8 series. Add endpoint (300 routes) and it is 8 × 300 = 2,400 series. Add status (12 HTTP codes) and it is 8 × 300 × 12 = 28,800 series. Still cheap. Now add user_id with 50,000 values: http_requests_total × method (8) × endpoint (300) × status (12) × user_id (50,000) = 1,440,000,000 potential series.
A single counter just produced 1.44 billion potential series. In practice the actual count is much lower (most combinations never fire) but the live cardinality typically lands at 30-60% of the potential, which on this example is 430M to 860M series. At Datadog’s $0.05/series/month for the standard tier, that one counter costs $21M to $43M per month.
The vendor does not stop you from creating this cardinality. It just bills you for it. The bill arrives, the platform team sees the spike, the question becomes: which metric did this, and which tag is the culprit?
| Metric example | Tags | Tag value counts | Total series |
|---|---|---|---|
| http_requests_total | method, endpoint, status | 8 × 300 × 12 | 28,800 |
| http_requests_total | method, endpoint, status, region | 8 × 300 × 12 × 5 | 144,000 |
| http_requests_total | method, endpoint, status, region, **user_id** | 8 × 300 × 12 × 5 × 50,000 | 7,200,000,000 |
| db_query_duration | query_name, db_name | 80 × 6 | 480 |
| db_query_duration | query_name, db_name, **customer_id** | 80 × 6 × 400 | 192,000 |
| db_query_duration | query_name, db_name, **customer_id**, **session_id** | 80 × 6 × 400 × 1,000,000 | 192,000,000,000 |
The bolded tags are the cardinality detonators. Each one is a high-uniqueness identifier (per-user, per-customer, per-session) that has no business being a metric dimension. They belong on traces (where each trace is a single event, not a time series) or logs (where each line is a single record). Putting them on a metric multiplies the series count by the cardinality of the identifier.
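The multiplication is easy to check before a tag ships. A minimal sketch in Python; the tag names, value counts, live fraction, and per-series price are the illustrative figures from above, not measured values:

```python
from math import prod

def potential_series(tag_value_counts: dict[str, int]) -> int:
    """Upper bound on unique series for one metric: the product of its tag value counts."""
    return prod(tag_value_counts.values()) if tag_value_counts else 1

def estimated_monthly_cost(tag_value_counts: dict[str, int],
                           live_fraction: float = 0.45,
                           price_per_series: float = 0.05) -> float:
    """Rough monthly cost: live series (a fraction of the potential) times the per-series price.
    The 45% live fraction and $0.05/series/month are assumptions; plug in your vendor's numbers."""
    return potential_series(tag_value_counts) * live_fraction * price_per_series

# Reproduces the http_requests_total math above.
base = {"method": 8, "endpoint": 300, "status": 12}
print(f"{potential_series(base):,}")                                   # 28,800
print(f"{potential_series({**base, 'user_id': 50_000}):,}")            # 1,440,000,000
print(f"${estimated_monthly_cost({**base, 'user_id': 50_000}):,.0f}/month")
```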
The three high-cardinality offenders
Most observability bill overruns reduce to three specific tag classes. Removing or aggregating them recovers 40-65% of the bill with zero diagnostic loss.
| Tag class | Why it lands on metrics | Where it actually belongs | Bill impact when removed |
|---|---|---|---|
| user_id / customer_id | “Per-tenant visibility” demand | Trace span attribute | 30-50% |
| trace_id / span_id | Accidental metric labeling | Already in trace, never metric | 10-25% |
| version / build_id / git_sha | Added for deploy debugging, never pruned | Trace metadata; metric only for last N versions | 5-15% |
user_id / customer_id on shared metrics. A team wants “per-tenant API latency.” Someone adds customer_id to the api_request_duration histogram. Series count multiplies by customer count. The dashboard shows per-customer p99 latency, which the team uses three times in six months. The bill triples. The right answer: keep customer_id on the trace span; query traces for per-tenant analysis; keep the metric to a rollup (api_request_duration × endpoint × status only, no per-customer breakdown).
trace_id promoted to a metric label. A common bug pattern: an OpenTelemetry SDK is misconfigured to copy trace context attributes onto every emitted metric. trace_id is unique per request (effectively infinite cardinality from the metric’s perspective). The vendor bill shows millions of one-sample series. The fix is at the SDK / collector level: explicit allow-list of attributes to copy from trace to metric, blocking trace_id and span_id by default.
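A minimal sketch of that allow-list at the SDK level, using the OpenTelemetry Python SDK's View mechanism; the instrument and attribute names are illustrative, and the exact View parameters can vary by SDK version:

```python
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View

# Only the listed attributes survive onto exported series; anything else copied
# from trace context (trace_id, span_id, request ids) is dropped before aggregation.
latency_view = View(
    instrument_name="http.server.duration",               # illustrative instrument name
    attribute_keys={"http.method", "http.route", "http.status_code"},
)

provider = MeterProvider(views=[latency_view])
meter = provider.get_meter("api-service")
```

The same rule can be enforced centrally in the collector's metric pipeline, which protects every service regardless of which SDK it uses.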
version / build_id not pruned. Deploy lands; instrumentation tags every metric with version=v1.2.3 so the team can compare pre- and post-deploy behavior. Three weeks later there are 40 versions in the tag values, each with its own series. The team only ever queries the last 2-3 versions. The fix: tag with version, but at the collector level prune any version older than 30 days from the metric pipeline (traces and logs can keep the full history because they age out on their own retention curve).
The pattern across all three: high-uniqueness identifiers belong on traces and logs (which the vendor bills very differently and which scale fine with cardinality) rather than on metrics (which compound). The OpenTelemetry three-pillar separation (metrics, traces, logs) exists precisely so that each type of telemetry can handle the data class it is good at. Cardinality goes on traces and logs; metrics stay aggregated.
The cardinality report (one engineer-day, highest-leverage observability work)
The work to make cardinality manageable starts with measurement. Every major observability vendor exposes per-metric cardinality somehow; the surface is just not in the default dashboard.
| Vendor | Cardinality introspection | Where to find it |
|---|---|---|
| Datadog | Metrics Summary page; datadog.estimated_usage.metrics.* usage metrics | Per-metric panel in Metrics Explorer |
| Prometheus | prometheus_tsdb_head_series for the total; count by (__name__) across all series for per-metric counts | Self-monitoring scrape; /api/v1/status/tsdb |
| Honeycomb | Dataset cardinality view | Per-dataset settings → cardinality |
| Grafana Mimir / Cortex | cortex_ingester_active_series | Self-monitoring |
| New Relic | Metric cardinality limit warnings in usage UI | Account → Usage |
The weekly cardinality report is one engineer-day to build and is the single highest-leverage piece of observability work most teams can ship in 2026. It contains:
| Column | Purpose |
|---|---|
| Metric name | Identification |
| Series count (now) | Current cardinality |
| Series count (7d ago) | Growth detection |
| Top 3 tags by value count | Which dimensions are driving it |
| Cap | The configured per-metric limit |
| Action | “Over cap”, “Approaching cap”, “Healthy” |
The report runs weekly, posts to a #observability-cost Slack channel, and surfaces the top 20 metrics by series count plus any metric that grew >50% week-over-week. The platform team reviews the report in 15 minutes. Most weeks there is no action; in the weeks where a new high-cardinality tag lands (often unintentionally), the report catches it before the next billing cycle.
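A trimmed sketch of the report against a Prometheus-compatible backend; the endpoint, the Slack webhook, and the match-everything selector are assumptions to replace with your own (and the selector should be scoped on large installations):

```python
"""Weekly cardinality report: top metrics by series count plus week-over-week growth."""
import requests

PROM_URL = "http://prometheus:9090"                         # assumption: your Prometheus endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY"  # assumption: #observability-cost webhook

def series_counts(promql: str) -> dict[str, float]:
    """Run an instant query and return {metric name: series count}."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=60)
    resp.raise_for_status()
    return {r["metric"].get("__name__", "?"): float(r["value"][1])
            for r in resp.json()["data"]["result"]}

# Series count per metric name, now and a week ago. Matching every series is heavy on
# large installations; scope the selector to a job or namespace if needed.
now = series_counts('count by (__name__) ({__name__=~".+"})')
week_ago = series_counts('count by (__name__) ({__name__=~".+"} offset 7d)')

lines = []
for name, count in sorted(now.items(), key=lambda kv: -kv[1])[:20]:
    prev = week_ago.get(name)
    growth = f"{(count - prev) / prev:+.0%} WoW" if prev else "new"
    flag = "  <-- grew >50% week-over-week" if prev and count > prev * 1.5 else ""
    lines.append(f"{name}: {int(count):,} series ({growth}){flag}")

requests.post(SLACK_WEBHOOK, json={"text": "Weekly cardinality report\n" + "\n".join(lines)},
              timeout=30)
```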
The team that does not have this report has no way to know which metric is driving the bill until the bill arrives. The team that has the report fixes the cardinality issue in the week it appears, not in the quarter after the bill review.
Aggregation: move high-cardinality data to traces and logs
The right place for high-cardinality data is determined by the OpenTelemetry three-pillar separation. Each pillar has a different cost-vs-detail tradeoff and the high-cardinality identifiers go to the pillars that handle them well.
| Data class | Pillar | Why |
|---|---|---|
| Request counts by method/status/endpoint | Metric | Low cardinality, queried as time series |
| Per-customer latency analysis | Trace | High cardinality, queried per-request |
| Per-user error rate | Trace | High cardinality identifier |
| Aggregate error rate by service | Metric | Low cardinality |
| Audit events (who did what, when) | Log | Free-form, often compliance-driven |
| Trace-level diagnostic detail | Trace | Designed for it |
| Deploy markers (version comparison over 24h) | Metric (with TTL on version tag) | Pruned automatically |
The rule of thumb: if a dimension’s value count exceeds 100 unique values across a 7-day window, it does not belong on a metric. It belongs on a trace or a log. The vendor’s trace and log products price differently (per-event, per-byte, with sampling) and high cardinality is normal for them. The metric product prices per series; high cardinality is the cost detonator.
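A quick way to apply the 100-value rule, again assuming a Prometheus-compatible backend; the metric and label names are illustrative:

```python
import requests

PROM_URL = "http://prometheus:9090"   # assumption: your Prometheus endpoint

def distinct_label_values(metric: str, label: str, window: str = "7d") -> int:
    """Number of distinct values a label has taken on a metric over the window."""
    promql = f"count(count by ({label}) (count_over_time({metric}[{window}])))"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=60)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return int(float(result[0]["value"][1])) if result else 0

# More than ~100 distinct values in 7 days: the dimension belongs on a trace or log.
print(distinct_label_values("http_requests_total", "customer_id"))
```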
Cap and alert per metric
The discipline that makes cardinality manageable in the long run is per-metric caps. Without caps, cardinality grows monotonically: every new dimension is a marginal addition that “doesn’t seem that big.” With caps, the team has to make a conscious decision when a metric approaches its limit: do we raise the cap (and accept the cost), remove the dimension (and lose some detail), or aggregate harder (and trade some precision)?
| Metric tier | Cap | Typical use |
|---|---|---|
| Tier 1 (critical, high-value) | 50,000 series | Customer-facing latency, error rates, SLO inputs |
| Tier 2 (standard) | 5,000 series | Internal service health, deploy markers, batch job metrics |
| Tier 3 (debugging only) | 500 series | Ephemeral metrics added during investigation, must be removed after |
The cap alerts fire at 80% of the limit. The platform team gets a ping; the metric’s owner has two weeks to either justify a cap raise (with a budget impact estimate) or reduce the cardinality. If neither happens, the metric is downgraded a tier (which lowers its cap and forces the owner to address it).
The numbers above are illustrative; the right caps depend on your vendor’s pricing tier. The right way to set them is to start from the current cardinality distribution and pick caps that let 95% of metrics fit within Tier 2, with the 5% that legitimately need high cardinality in Tier 1. Tier 3 is the safety valve for debugging metrics, which should always be temporary.
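A sketch of the cap check itself, driven by the same per-metric series counts the weekly report already collects; the tier assignments and thresholds are illustrative:

```python
# Illustrative tier caps and assignments; series counts come from the same query
# the weekly cardinality report already runs.
CAPS = {"tier1": 50_000, "tier2": 5_000, "tier3": 500}
METRIC_TIERS = {                                   # assumption: maintained next to the instrumentation
    "api_request_duration": "tier1",
    "db_query_duration": "tier2",
    "debug_cache_probe": "tier3",
}
ALERT_FRACTION = 0.8                               # ping the owner at 80% of the cap

def cap_status(metric: str, series_count: int) -> str:
    cap = CAPS[METRIC_TIERS.get(metric, "tier2")]  # untiered metrics default to Tier 2
    if series_count > cap:
        return f"over cap ({series_count:,}/{cap:,}): raise, reduce, or downgrade within two weeks"
    if series_count > ALERT_FRACTION * cap:
        return f"approaching cap ({series_count:,}/{cap:,})"
    return "healthy"

print(cap_status("db_query_duration", 4_600))      # approaching cap (4,600/5,000)
```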
The AI-agent special case
AI agent fleets create a cardinality problem that ordinary instrumentation rules do not catch. A fleet that emits per-invocation metrics tagged with agent_id (47 agents) and request_id (5 million requests/day) produces 235 million unique series per day from a single metric. The cardinality compounds across the metric set; even a small agent fleet can outspend the rest of the org’s observability bill in a quarter.
The fix is per-agent metric aggregation: emit one metric per agent per minute instead of one per invocation.
| Approach | Cardinality / day | Diagnostic capability |
|---|---|---|
| Per-invocation metric (agent_id + request_id) | 235,000,000 series | Per-request drilling (impossible to query anyway at this scale) |
| Per-agent per-minute aggregate (counter + histogram) | 47 series × 1,440 min = 67,680 | Per-agent rate + latency distribution |
| Per-request data → traces (sampled at 1%) | (cardinality moves to trace product) | Per-request when needed, sampled |
The per-agent-per-minute aggregate uses a counter (agent_invocations_total{agent_id}) for rate and a histogram (agent_latency_ms{agent_id}) for distribution. Together they answer the questions the per-invocation metric was meant to answer (how often does each agent fire, what is the latency distribution) at roughly 1/3,500th the cardinality cost. The per-request detail that is genuinely needed (which request was slow, what was the failure) lives on traces with sampling, where the cost model handles per-request data natively.
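A sketch of the aggregated instrumentation with the OpenTelemetry Python SDK; the agent name is illustrative and the exporter wiring is omitted:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

metrics.set_meter_provider(MeterProvider())        # exporter/reader wiring omitted for brevity
meter = metrics.get_meter("agent-fleet")

# One counter and one histogram, dimensioned by agent_id only: 47 agents -> 47 series each.
invocations = meter.create_counter("agent_invocations_total")
latency = meter.create_histogram("agent_latency_ms", unit="ms")

def record_invocation(agent_id: str, duration_ms: float) -> None:
    # Called once per invocation; the SDK aggregates, so the series count stays at one per agent.
    # request_id goes on the trace span, never on these metrics.
    attrs = {"agent_id": agent_id}
    invocations.add(1, attributes=attrs)
    latency.record(duration_ms, attributes=attrs)

record_invocation("billing-reconciler", 412.0)     # illustrative agent name
```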
The pattern composes with the per-agent token quotas work: the quota system already knows each agent’s identity and rate; the observability metric can be a side-effect of the quota counter rather than a separate instrumentation. One source of truth, one cardinality.
Why dropping the vendor is the wrong fix
The first-instinct fix when the observability bill shocks finance is to put the vendor up for review: solicit a quote from Honeycomb or Grafana Cloud, or price out a self-hosted Prometheus + Loki + Tempo stack. The numbers look compelling because the alternative vendor’s bill is based on your current usage at their pricing, and migration projections always look optimistic.
The migration math is not optimistic in practice.
| Item | Vendor migration | Cardinality fix |
|---|---|---|
| Time to first cost reduction | 6-12 months (post-migration) | 4-8 weeks |
| Engineer-weeks invested | 24-50 (instrumentation rewrite, dashboard rebuild, alert recreation, runbook updates) | 2-4 |
| Risk of degraded incident response during transition | High (parallel systems, alert gaps, training cost) | None |
| Bill reduction after work complete | 30-50% if cardinality is fixed on new vendor; 0% if not | 40-65% |
| Cardinality problem replicates on new vendor? | Yes (it is a property of your instrumentation) | N/A (problem is removed) |
The migration only pays off if the cardinality problem is fixed in the new system. Otherwise the new vendor’s bill grows the same way the old one did, just from a lower starting base. Teams that migrate without fixing cardinality discover this in year two on the new vendor and are back where they started.
The cardinality fix on the current vendor is faster, cheaper, lower-risk, and reduces the bill by a similar percentage. The vendor switch may still make sense for product reasons (better trace UX, different SLO tooling, vendor-specific features), but it is not the cost fix. The cost fix is cardinality.
A typical mid-market org running the 90-day cardinality review recovers $35,000 to $60,000 per month within the first quarter. The compounding effect is more valuable than the level: the cost growth curve flattens because the cardinality discipline is now in place. By year four, an org that ran the cardinality review is at $40-50k/month observability spend; an org that did not is at $130-180k/month on the same engineering surface.
Set up the weekly cardinality report. Identify the top 5 metrics by series count. Find the user_id, trace_id, or version_id tag driving each one. Move those dimensions to traces or logs. The bill drops the next month and stops growing the way it used to. The one knob that matters is the one most teams never touch.


