LLM FinOps: Per-Feature Cost Attribution and Token Budgets

By Muskan Sharma
Published: May 4, 2026 · 10 min read

A B2B SaaS product team ships its first AI feature in 2024. By 2026, the same team has 12 AI features in production: summarization, classification, extraction, search, an AI assistant, three flavors of auto-complete, two analytics features, and the chatbot product engineering still calls “the demo” eight months after launch. The Anthropic bill is $48,000 per month — the same kind of black-box cloud bill that plagued infrastructure spend before FinOps. Nobody can tell you what each feature costs.

The CFO asks “what’s our AI cost per customer?” The answer that arrives a week later is wrong because nobody had instrumentation in place. The team that shipped the latest feature with a 4,000-token system prompt and 1M monthly requests doesn’t realize until the following month that they alone added $12,000 to the bill.

FinOps is the engineering practice of bringing financial accountability to variable cloud spend by aligning engineering, finance, and product on continuous cost decisions, per the FinOps Foundation. Applied to LLM ops, the practice has four levers: tag every call, count tokens authoritatively, aggregate per feature, enforce per-feature budgets. This piece covers each in implementation order.

Why Your AI Bill Is a Black Box

The model pricing structure makes per-feature accounting essential, not optional. The cost gap between flagship and small models is roughly 18-20x per output token. A feature that runs on Opus when Haiku would suffice costs 18x what it should — but you cannot tell which features those are without per-feature attribution.

| Model | Input ($/MTok) | Output ($/MTok) | Use case |
|---|---|---|---|
| Claude Opus 4.5 | $15 | $75 | Complex reasoning, long-form generation |
| Claude Sonnet 4.6 | $3 | $15 | Production default, balanced quality/cost |
| Claude Haiku 4.5 | $0.80 | $4 | Classification, extraction, structured output |
| GPT-4 Turbo | $10 | $30 | Reasoning, complex agents |
| GPT-3.5 Turbo | $0.50 | $1.50 | Simple chat, classification |

A typical B2B SaaS feature processes 800-2,000 input tokens and produces 200-600 output tokens per request, per Anthropic case studies. The pattern echoes the chargeback / showback frameworks used for cloud cost: the same accountability problem, a new line item. At Sonnet rates, that is $0.0054 to $0.0150 per request. A feature handling 100,000 requests per month costs $540 to $1,500. With 12 such features and uneven distribution, the bill ranges $5,000 to $25,000 per month, and "uneven distribution" is the part you cannot see without attribution.
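A quick sanity check on those per-request numbers, as a Python sketch using the Sonnet rates from the table above:

```python
# Per-request cost at the Claude Sonnet rates above ($3/MTok in, $15/MTok out).
SONNET_INPUT_PER_MTOK = 3.00
SONNET_OUTPUT_PER_MTOK = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * SONNET_INPUT_PER_MTOK
            + output_tokens * SONNET_OUTPUT_PER_MTOK) / 1_000_000

print(request_cost(800, 200))     # 0.0054 -> low end of the range
print(request_cost(2_000, 600))   # 0.0150 -> high end of the range
```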

Tagging at the Call Site: The One Line That Makes Everything Else Possible

Adding a feature_id tag to every LLM call is the architectural decision that determines whether per-feature accounting is possible at all. Adding it from day one is a single line of code at every call site. Adding it retroactively across a 30-feature codebase is a quarter-long migration through 30 different teams’ code.

Both major providers accept metadata that flows through to their consoles and to your usage logs. The pattern:

Architecture diagram

Anthropic accepts a metadata.user_id string up to 256 chars. OpenAI accepts a user parameter up to 64 chars. Both end up in the provider’s console and in any logs your wrapper writes. The tag should encode at least three things: the feature, the request ID, and the tenant; the wrapper can log the remaining fields below alongside them.
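A minimal sketch of the call-site tag, assuming the official anthropic and openai Python SDKs; packing feature, tenant, and request into one delimited string is this article's convention, not something the providers require:

```python
import anthropic
import openai

def tag(feature_id: str, tenant_id: str, request_id: str) -> str:
    # One delimited string, because each provider exposes a single metadata field.
    return f"{feature_id}|{tenant_id}|{request_id}"

anthropic_client = anthropic.Anthropic()
openai_client = openai.OpenAI()

# Anthropic: metadata.user_id, up to 256 chars.
message = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this email thread: ..."}],
    metadata={"user_id": tag("summarize_email_v2", "tenant_acme_corp", "req_2k4a8f9")},
)

# OpenAI: the user parameter, with a tighter length limit, so keep the tag compact.
completion = openai_client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": "Summarize this email thread: ..."}],
    user=tag("summarize_email_v2", "tenant_acme_corp", "req_2k4a8f9"),
)
```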

| Field | Example | What it enables |
|---|---|---|
| feature_id | summarize_email_v2 | Per-feature monthly roll-up |
| request_id | req_2k4a8f9... | Trace one request through retries, fallbacks |
| tenant_id | tenant_acme_corp | Per-customer cost (essential for unit economics) |
| model_used | claude-sonnet-4-6 | Detect when a feature accidentally upgraded model |
| cached_tokens | 12000 | Track prompt-cache hit rate per feature |

This pattern works when the call site is yours to modify. It breaks when LLM calls flow through a third-party SDK that does not expose a metadata pass-through, in which case the wrapper has to be replaced or proxied.

Counting Tokens From Provider Responses, Not Estimates

Estimating tokens with tiktoken or word-count heuristics drifts 5-15% from authoritative billing. The provider response is the truth. Both Anthropic and OpenAI return token counts in every response.

The Anthropic response surfaces response.usage.input_tokens and response.usage.output_tokens. OpenAI returns usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens. Neither charges for tokens you didn’t send or receive. Use these values, not estimates.

The usage log table needs the columns to support all the queries you’ll want later:

| Column | Type | Notes |
|---|---|---|
| timestamp | timestamptz | When the call completed |
| feature_id | text | The tag from the call site |
| tenant_id | text | Per-customer attribution |
| request_id | text | Trace through retries / fallback chain |
| provider | text | anthropic / openai / gemini |
| model | text | Specific model used (matters for cost rollup) |
| input_tokens | int | From response.usage |
| output_tokens | int | From response.usage |
| cached_input_tokens | int | If prompt caching is on |
| latency_ms | int | For p50/p95 dashboards |
| error | text | null on success, error class on failure |

This pattern works when every LLM call goes through one wrapper. It breaks when half the codebase calls the SDK directly and half goes through a wrapper, because the direct calls don’t end up in the log. The fix is a lint rule that bans direct SDK imports outside the wrapper module.
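A sketch of that single wrapper, showing only the Anthropic path; write_usage_row() is a hypothetical helper that inserts one row into the usage_log table above, and an OpenAI branch would read usage.prompt_tokens / usage.completion_tokens instead:

```python
import time
import uuid
import anthropic

client = anthropic.Anthropic()

def call_llm(feature_id: str, tenant_id: str, model: str, messages: list, **kwargs):
    """Single entry point for every feature; direct SDK imports elsewhere are lint-banned."""
    request_id = f"req_{uuid.uuid4().hex[:12]}"
    started = time.monotonic()
    row = {
        "feature_id": feature_id,
        "tenant_id": tenant_id,
        "request_id": request_id,
        "provider": "anthropic",
        "model": model,
        "input_tokens": 0,
        "output_tokens": 0,
        "error": None,
    }
    try:
        # Callers pass max_tokens, system, etc. through kwargs.
        response = client.messages.create(
            model=model,
            messages=messages,
            metadata={"user_id": f"{feature_id}|{tenant_id}|{request_id}"},
            **kwargs,
        )
        # Token counts come from the provider response, never from estimates.
        row["input_tokens"] = response.usage.input_tokens
        row["output_tokens"] = response.usage.output_tokens
        return response
    except Exception as exc:
        row["error"] = type(exc).__name__
        raise
    finally:
        row["latency_ms"] = int((time.monotonic() - started) * 1000)
        write_usage_row(row)  # hypothetical helper: one INSERT into usage_log
```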

Model Routing: The 18x Cost Lever Most Teams Skip

The pricing table above shows an 18-20x cost gap between flagship and small models per output token. Most teams default to flagship for everything because they tested with flagship during prototyping. Auditing each feature against the question “does this need flagship-quality output?” typically shows 60-70% of features tolerate the small model.

The small-model-first pattern routes to Haiku, validates the output, falls back to Sonnet only on low-confidence responses.

Architecture diagram

For a 10:1 success ratio (Haiku handles 10 requests for every 1 that escalates to Sonnet), the blended cost is roughly a third of running Sonnet for everything, even though escalated requests pay for both calls. The math:

| Routing | Cost per 1M requests (avg 1k in / 300 out) |
|---|---|
| Sonnet only | $7,500 |
| Haiku only | $2,000 |
| Haiku-first, Sonnet fallback (10:1 ratio) | ~$2,680 |
| Haiku-first, Sonnet fallback (5:1 ratio) | ~$3,250 |
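A sketch of where those blended figures come from, under the assumption that an escalated request pays for both the failed Haiku attempt and the Sonnet retry:

```python
# Cost per 1M requests, averaging 1,000 input / 300 output tokens per request.
# Per-request cost divides by 1e6 and 1M requests multiplies by 1e6, so the factors cancel.
def cost_per_million_requests(input_rate: float, output_rate: float,
                              in_tok: int = 1_000, out_tok: int = 300) -> float:
    return in_tok * input_rate + out_tok * output_rate

sonnet = cost_per_million_requests(3.00, 15.00)  # $7,500
haiku = cost_per_million_requests(0.80, 4.00)    # $2,000

def blended(escalation_rate: float) -> float:
    # Every request pays Haiku; escalated requests pay Sonnet on top of that.
    return haiku + escalation_rate * sonnet

print(blended(1 / 11))  # 10:1 ratio -> ~$2,680
print(blended(1 / 6))   # 5:1 ratio  -> ~$3,250
```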

Confidence checks are low-cost and feature-specific. For structured extraction, validate that the JSON parses and the required fields are present. For classification, check the predicted class against an allowlist. For summarization, compare output tokens to input tokens to flag pathologically short responses. The validator runs in microseconds; the savings compound.

| Feature class | Recommended model | Fallback policy |
|---|---|---|
| Structured extraction (JSON, key-value) | Haiku | Sonnet on JSON parse error or missing field |
| Classification (single label) | Haiku | Sonnet on low confidence (logprobs / consensus check) |
| Summarization | Sonnet | Opus on input > 50k tokens or "complex source" flag |
| Creative generation | Sonnet | Opus only when explicitly requested |
| Complex reasoning, agents | Sonnet | Opus per feature decision, not per request |
| Free-form chat | Sonnet | No fallback (chat tolerates variance) |

This pattern works when the low-cost model can handle the majority of inputs. It breaks when the inputs are uniformly hard (every request is genuinely complex), in which case the fallback rate climbs above 50% and the routing overhead exceeds the savings.
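A sketch of the Haiku-first route for the structured-extraction class, reusing the call_llm wrapper from the logging section; the invoice field names and the Haiku model id are illustrative assumptions:

```python
import json

REQUIRED_FIELDS = {"invoice_number", "amount", "due_date"}  # illustrative schema

def valid_extraction(text: str) -> bool:
    # Confidence check for this feature class: JSON parses and required fields exist.
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS <= data.keys()

def extract_invoice(feature_id: str, tenant_id: str, document: str) -> dict:
    prompt = [{"role": "user",
               "content": f"Extract invoice fields as JSON only, no explanation:\n{document}"}]

    # First pass: the small model handles the common case.
    cheap = call_llm(feature_id, tenant_id, "claude-haiku-4-5", prompt, max_tokens=400)
    if valid_extraction(cheap.content[0].text):
        return json.loads(cheap.content[0].text)

    # Fallback: escalate the minority of hard inputs to the stronger model.
    strong = call_llm(feature_id, tenant_id, "claude-sonnet-4-6", prompt, max_tokens=400)
    return json.loads(strong.content[0].text)
```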

Prompt Caching and System Prompt Diet

Two related cost levers sit on the input side. Anthropic prompt caching charges 1.25x the base input rate for the initial cache write ($3.75/MTok on Sonnet) and $0.30/MTok for cached reads, against the standard $3/MTok input rate. For a 50,000-token system prompt re-used 1,000 times per day:

| Setup | Daily input cost | Monthly cost |
|---|---|---|
| No caching | $150 | $4,500 |
| Cache write once + 999 cached reads | $0.19 + $14.99 | ~$455 |
| Trim system prompt to 12,000 tokens, no cache | $36 | $1,080 |
| Trim to 12,000 tokens + cache | $0.05 + $3.60 | ~$109 |

The system prompt diet matters independently. Most production system prompts are 2-4x larger than necessary because they accumulate examples and policy text over months without anyone removing the redundant ones. Trimming a 4,000-token system prompt to 1,000 tokens for a feature handling 1M requests/month saves $9,000 monthly at Sonnet rates.

Output token cost dominates for most features. Trimming system prompts matters, but capping max_tokens and prompting for terser outputs (“respond in 2 sentences”, “JSON only, no explanation”) usually saves more. A feature averaging 600 output tokens that drops to 300 with a tighter prompt cuts output cost in half, and at $15/MTok output that is typically the larger share of the bill.

This pattern works when the system prompt is stable across requests (same examples, same policy text). It breaks when the prompt varies per-request (per-tenant policy injected, retrieved context appended), because cache hits become rare. The fix is to split the prompt into a stable cached prefix and a variable suffix.
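A sketch of that split using Anthropic's cache_control content blocks, with a max_tokens cap thrown in per the output-cost note above; the prompt text is a placeholder:

```python
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = "You summarize support threads. Policy: ..."  # identical on every request

def summarize(tenant_policy: str, thread: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=300,  # cap the expensive side of the bill: output tokens
        system=[
            # Stable prefix: written to the cache once, then billed at the cached-read rate.
            {"type": "text", "text": STABLE_SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            # Variable suffix: per-tenant policy stays outside the cached prefix.
            {"type": "text", "text": tenant_policy},
        ],
        messages=[{"role": "user", "content": f"Summarize in 2 sentences:\n{thread}"}],
    )
```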

Per-Feature Budgets: From Alerting to Enforcement

Daily aggregation rolls up per-feature spend. Alerts fire at 50%, 80%, and 100% of the monthly budget. Most teams stop there. Most teams also have a story about a runaway feature that burned 10x its budget over a weekend before anyone noticed.

The hard stop is a thin gateway. Track cumulative spend per feature_id in Redis. When a request would push a feature over 100% of its monthly budget, return 429 with a clear error message. The product team controls the budget; the gateway controls the kill switch.

Architecture diagram

The gateway design has to handle a few real-world wrinkles. Per-tenant carve-outs (an enterprise customer paid for higher limits). Burst tolerance (allow 110% on a single day if the monthly budget is on track). Soft-fail (when in doubt, allow the request and alert; do not block on infrastructure failures of the gateway itself). And a clear out-of-band override path for the on-call to lift the cap during legitimate incidents.
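A sketch of the gateway check, assuming redis-py and a BUDGETS mapping owned by the product team; the burst tolerance here is a flat multiplier, simpler than the on-track daily check described above:

```python
import datetime
import redis

r = redis.Redis(decode_responses=True)
BUDGETS = {"summarize_email_v2": 1_500.00}  # monthly dollar budgets, set by product
BURST_MULTIPLIER = 1.10                     # flat 110% headroom before the hard stop

class BudgetExceeded(Exception):
    """Mapped to a 429 at the HTTP layer."""

def _month_key(feature_id: str) -> str:
    return f"llm_spend:{feature_id}:{datetime.date.today():%Y-%m}"

def check_budget(feature_id: str, estimated_cost: float) -> None:
    try:
        spent = float(r.get(_month_key(feature_id)) or 0.0)
    except redis.RedisError:
        return  # soft-fail: never block on the gateway's own infrastructure
    budget = BUDGETS.get(feature_id)
    if budget is not None and spent + estimated_cost > budget * BURST_MULTIPLIER:
        raise BudgetExceeded(f"{feature_id} exceeded its monthly LLM budget")

def record_spend(feature_id: str, actual_cost: float) -> None:
    r.incrbyfloat(_month_key(feature_id), actual_cost)
```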

This pattern works when the team owns the call path end-to-end. It breaks when a third-party integration calls the LLM directly without going through the gateway, in which case the budget is enforced only on the routes you control.

A 60-Day LLM FinOps Implementation Plan

The implementation sequences cleanly. Each phase produces measurable savings, and the data from each phase informs the next.

| Phase | Weeks | Action | Effort | Expected saving |
|---|---|---|---|---|
| Tag every call | 1-2 | Add feature_id, request_id, tenant_id, model_used to every LLM call site. Centralize through one wrapper. Lint against direct SDK imports outside the wrapper. | 1 engineer-week | 0 (visibility only) |
| Usage logging | 2-3 | Build the usage_log table. Write one row per LLM call with provider-returned token counts. Daily aggregation by feature_id. | 3 days | 0 (visibility only) |
| Per-feature dashboard | 3 | Surface per-feature daily spend in Slack or BI tool. Identify the top 3 features by spend. | 2 days | Sustains future savings via behavior change |
| Model routing (top 3 features) | 4-6 | Implement Haiku-first with Sonnet fallback for the top 3 features. Confidence check per feature class. | 2 weeks | 50-70% on the routed features |
| Prompt caching | 7 | Enable Anthropic prompt caching on features with large stable system prompts. Measure cache hit rate. | 3 days | 70-85% on input cost for cached features |
| System prompt diet | 8 | Audit system prompts for redundancy. Trim examples that don’t change quality. Cap max_tokens where outputs run long. | 1 week | 30-50% on input + output cost |
| Per-feature budgets | 9-10 | Set monthly budgets per feature based on observed baseline + 20% buffer. Wire alerts at 50/80%. Document override path. | 1 week | Bounds runaway costs |
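For the usage-logging and dashboard phases, a sketch of the daily roll-up and pricing step, assuming the usage_log table lives in Postgres and the rate table mirrors the pricing at the top of the article:

```python
# Daily per-feature roll-up over usage_log, priced from a per-model rate table.
RATES = {  # $/MTok (input, output)
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-haiku-4-5": (0.80, 4.00),
}

DAILY_ROLLUP_SQL = """
    SELECT date_trunc('day', timestamp) AS day,
           feature_id,
           model,
           SUM(input_tokens)  AS input_tokens,
           SUM(output_tokens) AS output_tokens
    FROM usage_log
    GROUP BY 1, 2, 3
    ORDER BY 1 DESC;
"""

def dollars(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES.get(model, (0.0, 0.0))
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
```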

A team starting at $48,000/month in LLM spend typically lands at $18,000-$24,000 after 60 days. The work is implementation discipline, not new architecture. Each phase is testable in isolation; each delivers measurable savings; none requires re-platforming.

To get started, audit your top three AI features. Pull the last 30 days of LLM provider usage from your console, identify which features they map to (this part is already painful without tagging), and decide which two could move from Sonnet to Haiku-first routing. The savings show up in week two. Pair the cost work with autonomous remediation so budget overruns trigger automatic gateway adjustments rather than a Sunday-night Slack thread.
