A B2B SaaS product team ships its first AI feature in 2024. By 2026, the same team has 12 AI features in production: summarization, classification, extraction, search, an AI assistant, three flavors of auto-complete, two analytics features, and the chatbot product engineering still calls “the demo” eight months after launch. The Anthropic bill is $48,000 per month — the same kind of black-box cloud bill that plagued infrastructure spend before FinOps. Nobody can tell you what each feature costs.
The CFO asks “what’s our AI cost per customer?” The answer that arrives a week later is wrong because nobody had instrumentation in place. The team that shipped the latest feature with a 4,000-token system prompt and 1M monthly requests doesn’t realize until the following month that they alone added $12,000 to the bill.
FinOps is the engineering practice of bringing financial accountability to variable cloud spend by aligning engineering, finance, and product on continuous cost decisions, per the FinOps Foundation. Applied to LLM ops, the practice has four levers: tag every call, count tokens authoritatively, aggregate per feature, enforce per-feature budgets. This piece covers each in implementation order.
Why Your AI Bill Is a Black Box
The model pricing structure makes per-feature accounting essential, not optional. The cost gap between flagship and small models is roughly 18-20x per output token. A feature that runs on Opus when Haiku would suffice costs 18x what it should — but you cannot tell which features those are without per-feature attribution.
| Model | Input ($/MTok) | Output ($/MTok) | Use case |
|---|---|---|---|
| Claude Opus 4.5 | $15 | $75 | Complex reasoning, long-form generation |
| Claude Sonnet 4.6 | $3 | $15 | Production default, balanced quality/cost |
| Claude Haiku 4.5 | $0.80 | $4 | Classification, extraction, structured output |
| GPT-4 Turbo | $10 | $30 | Reasoning, complex agents |
| GPT-3.5 Turbo | $0.50 | $1.50 | Simple chat, classification |
A typical B2B SaaS feature processes 800-2,000 input tokens and produces 200-600 output tokens per request, per Anthropic case studies. At Sonnet rates, that is $0.0054 to $0.0150 per request. A feature handling 100,000 requests per month costs $540 to $1,500. With 12 such features, the bill lands somewhere between $6,500 and $18,000 per month, and "uneven distribution" is the part you cannot see without attribution. The pattern echoes the chargeback / showback frameworks used for cloud cost: same accountability problem, new line item.
Tagging at the Call Site: The One Line That Makes Everything Else Possible
Adding a feature_id tag to every LLM call is the architectural decision that determines whether per-feature accounting is possible at all. Adding it from day one is a single line of code at every call site. Adding it retroactively across a 30-feature codebase is a quarter-long migration through 30 different teams’ code.
Both major providers accept metadata that flows through to their consoles and to your usage logs. Anthropic accepts a metadata.user_id string up to 256 chars; OpenAI accepts a user parameter up to 64 chars. Both end up in the provider's console and in any logs your wrapper writes. The tag should encode three things: the feature, the request ID, and the tenant.
| Field | Example | What it enables |
|---|---|---|
| feature_id | summarize_email_v2 | Per-feature monthly roll-up |
| request_id | req_2k4a8f9... | Trace one request through retries, fallbacks |
| tenant_id | tenant_acme_corp | Per-customer cost (essential for unit economics) |
| model_used | claude-sonnet-4-6 | Detect when a feature accidentally upgraded model |
| cached_tokens | 12000 | Track prompt-cache hit rate per feature |
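A minimal sketch of the tag at the call site, assuming the Anthropic Messages API; the delimiter, model name, and helper names are illustrative. Since OpenAI's user parameter caps at 64 chars, an OpenAI call would carry feature_id alone and keep the full tag in your own logs.

```python
# Hypothetical helpers for packing the tag into Anthropic's single
# metadata.user_id string (256-char cap).
def make_tag(feature_id: str, tenant_id: str, request_id: str) -> str:
    tag = f"{feature_id}|{tenant_id}|{request_id}"
    if len(tag) > 256:
        raise ValueError("tag exceeds Anthropic's 256-char metadata limit")
    return tag

def call_kwargs(prompt: str, feature_id: str, tenant_id: str,
                request_id: str) -> dict:
    # Keyword arguments for anthropic.Anthropic().messages.create(**kwargs)
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 300,
        "metadata": {"user_id": make_tag(feature_id, tenant_id, request_id)},
        "messages": [{"role": "user", "content": prompt}],
    }
```

The length check fails loudly at the call site rather than letting the provider silently truncate or reject the tag.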
This pattern works when the call site is yours to modify. It breaks when LLM calls flow through a third-party SDK that does not expose a metadata pass-through, in which case the wrapper has to be replaced or proxied.
Counting Tokens From Provider Responses, Not Estimates
Estimating tokens with tiktoken or word-count heuristics drifts 5-15% from authoritative billing. The provider response is the truth. Both Anthropic and OpenAI return token counts in every response.
The Anthropic response surfaces response.usage.input_tokens and response.usage.output_tokens. OpenAI returns usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens. Neither charges for tokens you didn’t send or receive. Use these values, not estimates.
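A sketch of turning the provider-returned counts into a cost figure; the price constants come from the pricing table above (Sonnet rates) and the helper names are illustrative:

```python
# Sonnet rates from the pricing table above ($/MTok).
SONNET_INPUT_PER_MTOK = 3.00
SONNET_OUTPUT_PER_MTOK = 15.00

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    # Token counts come from the provider response, never an estimator.
    return (input_tokens * SONNET_INPUT_PER_MTOK
            + output_tokens * SONNET_OUTPUT_PER_MTOK) / 1_000_000

def usage_row(response) -> dict:
    # Anthropic: response.usage.input_tokens / output_tokens.
    # OpenAI equivalent: usage.prompt_tokens / completion_tokens.
    u = response.usage
    return {
        "input_tokens": u.input_tokens,
        "output_tokens": u.output_tokens,
        "cost_usd": cost_usd(u.input_tokens, u.output_tokens),
    }
```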
The usage log table needs the columns to support all the queries you’ll want later:
| Column | Type | Notes |
|---|---|---|
| timestamp | timestamptz | When the call completed |
| feature_id | text | The tag from the call site |
| tenant_id | text | Per-customer attribution |
| request_id | text | Trace through retries / fallback chain |
| provider | text | anthropic / openai / gemini |
| model | text | Specific model used (matters for cost rollup) |
| input_tokens | int | From response.usage |
| output_tokens | int | From response.usage |
| cached_input_tokens | int | If prompt caching is on |
| latency_ms | int | For p50/p95 dashboards |
| error | text | null on success, error class on failure |
This pattern works when every LLM call goes through one wrapper. It breaks when half the codebase calls the SDK directly and half goes through a wrapper, because the direct calls don’t end up in the log. The fix is a lint rule that bans direct SDK imports outside the wrapper module.
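A sketch of the wrapper-side logging, with SQLite standing in for the real warehouse; column names follow the table above, and the flat Sonnet pricing in the roll-up is a simplification (a real roll-up joins a per-model price table):

```python
import sqlite3
import time

DDL = """CREATE TABLE IF NOT EXISTS usage_log (
    ts REAL, feature_id TEXT, tenant_id TEXT, request_id TEXT,
    provider TEXT, model TEXT, input_tokens INT, output_tokens INT,
    cached_input_tokens INT, latency_ms INT, error TEXT)"""

def log_call(db, feature_id, tenant_id, request_id, provider, model,
             usage, latency_ms, error=None):
    # One row per LLM call, with provider-returned token counts.
    db.execute("INSERT INTO usage_log VALUES (?,?,?,?,?,?,?,?,?,?,?)",
               (time.time(), feature_id, tenant_id, request_id, provider,
                model, usage["input_tokens"], usage["output_tokens"],
                usage.get("cached_input_tokens", 0), latency_ms, error))

def spend_by_feature(db, in_per_mtok=3.0, out_per_mtok=15.0):
    # The roll-up the dashboard phase reads; flat Sonnet rates for brevity.
    return db.execute(
        """SELECT feature_id,
                  SUM(input_tokens) * ? / 1e6 + SUM(output_tokens) * ? / 1e6
           FROM usage_log WHERE error IS NULL
           GROUP BY feature_id""",
        (in_per_mtok, out_per_mtok)).fetchall()
```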
Model Routing: The 18x Cost Lever Most Teams Skip
The pricing table above shows an 18-20x cost gap between flagship and small models per output token. Most teams default to flagship for everything because they tested with flagship during prototyping. Auditing each feature against the question “does this need flagship-quality output?” typically shows 60-70% of features tolerate the small model.
The small-model-first pattern routes to Haiku, validates the output, falls back to Sonnet only on low-confidence responses.
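A sketch of the router, with call_model and the validator passed in as stand-ins for your wrapper and the per-feature confidence check:

```python
from typing import Callable

HAIKU = "claude-haiku-4-5"
SONNET = "claude-sonnet-4-6"

def route(prompt: str,
          call_model: Callable[[str, str], str],
          validate: Callable[[str], bool]) -> tuple[str, str]:
    # Try the cheap model first; every request pays this attempt.
    draft = call_model(HAIKU, prompt)
    if validate(draft):
        return draft, HAIKU
    # Low-confidence output: escalate and also pay the Sonnet retry.
    return call_model(SONNET, prompt), SONNET
```

Returning the model actually used feeds the model_used tag from the tagging table, so escalation rates show up in the usage log.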
For a 10:1 success ratio (Haiku handles 10 requests for every 1 that escalates to Sonnet), the blended cost is roughly a third of running Sonnet for everything, because escalated requests pay for the failed Haiku attempt plus the Sonnet retry. The math:
| Routing | Cost per 1M requests (avg 1k in / 300 out) |
|---|---|
| Sonnet only | $7,500 |
| Haiku only | $2,000 |
| Haiku-first, Sonnet fallback (10:1 ratio) | $2,682 |
| Haiku-first, Sonnet fallback (5:1 ratio) | $3,250 |
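The arithmetic behind those rows, under the assumption that escalated requests pay the Haiku attempt plus the full Sonnet retry:

```python
PRICES = {"haiku": (0.80, 4.00), "sonnet": (3.00, 15.00)}  # ($/MTok in, out)

def cost_per_million_requests(model: str, in_tok: int = 1000,
                              out_tok: int = 300) -> float:
    # Per 1M requests, per-request token counts equal MTok totals,
    # so the 1e6 factors cancel.
    pin, pout = PRICES[model]
    return in_tok * pin + out_tok * pout

def blended(success_ratio: int) -> float:
    # success_ratio=10 -> 1 request in 11 escalates to Sonnet.
    escalation_rate = 1 / (success_ratio + 1)
    return (cost_per_million_requests("haiku")
            + escalation_rate * cost_per_million_requests("sonnet"))
```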
Confidence checks are low-cost and feature-specific. For structured extraction, validate that the JSON parses and the required fields are present. For classification, check the predicted class against an allowlist. For summarization, compare output to input token counts to flag pathologically short responses. The validator runs in microseconds; the savings compound.
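Sketches of those three checks; the thresholds and field lists are illustrative:

```python
import json

def valid_extraction(output: str, required: list[str]) -> bool:
    # Structured extraction: must parse and contain every required field.
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(f in obj for f in required)

def valid_classification(output: str, allowlist: set[str]) -> bool:
    # Classification: predicted label must be on the allowlist.
    return output.strip() in allowlist

def valid_summary(output_tokens: int, input_tokens: int) -> bool:
    # Summarization: flag pathologically short outputs (ratio is a guess).
    return output_tokens >= max(20, input_tokens // 100)
```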
| Feature class | Recommended model | Fallback policy |
|---|---|---|
| Structured extraction (JSON, key-value) | Haiku | Sonnet on JSON parse error or missing field |
| Classification (single label) | Haiku | Sonnet on low-confidence (logprobs / consensus check) |
| Summarization | Sonnet | Opus on length > 50k input or “complex source” flag |
| Creative generation | Sonnet | Opus only when explicitly requested |
| Complex reasoning, agents | Sonnet | Opus per feature decision, not per request |
| Free-form chat | Sonnet | No fallback (chat tolerates variance) |
This pattern works when the low-cost model can handle the majority of inputs. It breaks when the inputs are uniformly hard (every request is genuinely complex), in which case the fallback rate climbs above 50% and the routing overhead exceeds the savings.
Prompt Caching and System Prompt Diet
Two related cost levers on the input side. Anthropic prompt caching charges 1.25x the base input rate for the initial cache write ($3.75/MTok on Sonnet) and $0.30/MTok for cached reads, against the standard $3/MTok input rate. For a 50,000-token system prompt re-used 1,000 times per day:
| Setup | Daily input cost | Monthly cost |
|---|---|---|
| No caching | $150 | $4,500 |
| Cache write once + 999 cached reads | $0.19 + $14.99 | $455 |
| Trim system prompt to 12,000 tokens, no cache | $36 | $1,080 |
| Trim to 12,000 tokens + cache | $0.05 + $3.60 | $109 |
The system prompt diet matters independently. Most production system prompts are 2-4x larger than necessary because they accumulate examples and policy text over months without anyone removing the redundant ones. Trimming a 4,000-token system prompt to 1,000 tokens for a feature handling 1M requests/month saves $9,000 monthly at Sonnet rates.
Output token cost dominates for most features. Trimming system prompts matters but capping max_tokens and prompting for terser outputs (“respond in 2 sentences”, “JSON only, no explanation”) usually saves more. A feature averaging 600 output tokens that drops to 300 with a tighter prompt cuts output cost in half — and at $15/MTok output, that is the larger half of the bill.
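The arithmetic behind that claim, for a hypothetical feature at 1M requests/month:

```python
def monthly_output_cost(avg_output_tokens: int,
                        requests: int = 1_000_000,
                        out_per_mtok: float = 15.0) -> float:
    # Output spend scales linearly with average output length.
    return avg_output_tokens * requests * out_per_mtok / 1_000_000

# Halving average output from 600 to 300 tokens halves the output bill.
```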
This pattern works when the system prompt is stable across requests (same examples, same policy text). It breaks when the prompt varies per-request (per-tenant policy injected, retrieved context appended), because cache hits become rare. The fix is to split the prompt into a stable cached prefix and a variable suffix.
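A sketch of that prefix/suffix split using Anthropic's cache_control content-block marker; STABLE_POLICY and the builder function are stand-ins:

```python
# The stable prefix must be byte-identical across requests to hit the
# cache; anything per-tenant or per-request goes after it.
STABLE_POLICY = "You are the account summarizer. [shared policy and examples]"

def build_kwargs(tenant_policy: str, user_text: str) -> dict:
    # Keyword arguments for anthropic.Anthropic().messages.create(**kwargs)
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 300,
        "system": [
            # Cached prefix: billed at the write rate once, read rate after.
            {"type": "text", "text": STABLE_POLICY,
             "cache_control": {"type": "ephemeral"}},
            # Variable suffix: stays outside the cache.
            {"type": "text", "text": tenant_policy},
        ],
        "messages": [{"role": "user", "content": user_text}],
    }
```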
Per-Feature Budgets: From Alerting to Enforcement
Daily aggregation rolls up per-feature spend. Alerts fire at 50%, 80%, and 100% of the monthly budget. Most teams stop there. Most teams also have a story about a runaway feature that burned 10x its budget over a weekend before anyone noticed.
The hard stop is a thin gateway. Track cumulative spend per feature_id in Redis. When a request would push a feature over 100% of its monthly budget, return 429 with a clear error message. The product team controls the budget; the gateway controls the kill switch.
The gateway design has to handle a few real-world wrinkles. Per-tenant carve-outs (an enterprise customer paid for higher limits). Burst tolerance (allow 110% on a single day if the monthly budget is on track). Soft-fail (when in doubt, allow the request and alert; do not block on infrastructure failures of the gateway itself). And a clear out-of-band override path for the on-call to lift the cap during legitimate incidents.
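A sketch of the gateway logic with a dict standing in for Redis (production would use INCRBYFLOAT keyed by feature and month); the budgets and burst factor are illustrative:

```python
class BudgetGateway:
    def __init__(self, budgets: dict[str, float], burst: float = 1.10):
        self.budgets = budgets          # monthly budget per feature_id, USD
        self.burst = burst              # burst tolerance above 100%
        self.spent: dict[str, float] = {}  # Redis INCRBYFLOAT in production

    def check(self, feature_id: str, est_cost: float) -> int:
        """HTTP-style verdict: 200 allows the call, 429 blocks it."""
        budget = self.budgets.get(feature_id)
        if budget is None:
            return 200  # soft-fail: unknown feature -> allow and alert
        if self.spent.get(feature_id, 0.0) + est_cost > budget * self.burst:
            return 429  # hard stop past the burst-adjusted monthly budget
        return 200

    def record(self, feature_id: str, actual_cost: float) -> None:
        self.spent[feature_id] = self.spent.get(feature_id, 0.0) + actual_cost
```

The soft-fail branch implements the "when in doubt, allow and alert" rule: an unconfigured feature should page someone, not drop customer traffic.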
This pattern works when the team owns the call path end-to-end. It breaks when a third-party integration calls the LLM directly without going through the gateway, in which case the budget is enforced only on the routes you control.
A 60-Day LLM FinOps Implementation Plan
The implementation sequences cleanly. Each phase produces measurable savings, and the data from each phase informs the next.
| Phase | Weeks | Action | Effort | Expected saving |
|---|---|---|---|---|
| Tag every call | 1-2 | Add feature_id, request_id, tenant_id, model_used to every LLM call site. Centralize through one wrapper. Lint against direct SDK imports outside the wrapper. | 1 engineer-week | 0 (visibility only) |
| Usage logging | 2-3 | Build the usage_log table. Write one row per LLM call with provider-returned token counts. Daily aggregation by feature_id. | 3 days | 0 (visibility only) |
| Per-feature dashboard | 3 | Surface per-feature daily spend in Slack or BI tool. Identify the top 3 features by spend. | 2 days | Sustains future savings via behavior change |
| Model routing (top 3 features) | 4-6 | Implement Haiku-first with Sonnet fallback for the top 3 features. Confidence check per feature class. | 2 weeks | 50-70% on the routed features |
| Prompt caching | 7 | Enable Anthropic prompt caching on features with large stable system prompts. Measure cache hit rate. | 3 days | 70-85% on input cost for cached features |
| System prompt diet | 8 | Audit system prompts for redundancy. Trim examples that don’t change quality. Cap max_tokens where outputs run long. | 1 week | 30-50% on input + output cost |
| Per-feature budgets | 9-10 | Set monthly budgets per feature based on observed baseline + 20% buffer. Wire alerts at 50/80/100%. Document override path. | 1 week | Bounds runaway costs |
A team starting at $48,000/month in LLM spend typically lands at $18,000-$24,000 after 60 days. The work is implementation discipline, not new architecture. Each phase is testable in isolation; each delivers measurable savings; none requires re-platforming.
To get started, audit your top three AI features. Pull the last 30 days of LLM provider usage from your console, identify which features they map to (this part is already painful without tagging), and decide which two could move from Sonnet to Haiku-first routing. The savings show up in week two. Pair the cost work with autonomous remediation so budget overruns trigger automatic gateway adjustments rather than a Sunday-night Slack thread.