A team ships a generative-AI summarisation feature. The first month it costs $400 in Bedrock invocations. The second month it costs $1,200 as adoption grows. The third month it costs $9,200 because every user has discovered the feature and every request invokes Claude Sonnet 4. The line item on the AWS bill is clear. What is not clear is whether the cost is justified, what the most expensive call pattern actually is, whether Haiku would have been good enough for half of it, or whether prompt caching would have cut the bill in half.
Bedrock is now the fastest-growing item on any AWS bill that includes a production generative-AI feature. AWS Cost Explorer surfaces the dollar amount. It does not surface the breakdown that would let an operator do anything about it: which model is driving most of the cost, which prompts have cache-hit potential that is not being used, which calls would land just as well on a smaller sibling model.
ZopNight ships Bedrock cost visibility plus 10 recommendation rules. The visibility breaks the spend down four ways. The recommendation rules each flag a specific pattern in the usage data and quantify the savings if the operator acts on it. This post walks through what each rule detects, why prompt caching is the highest-impact lever, when model switching is the right call, and how Bedrock spend rolls up into the per-team and per-feature cost reports that drive unit economics.
Why Bedrock is the fastest-growing line on the AWS bill
Bedrock pricing is per-token. Input tokens are cheap; output tokens are 4 to 5 times more expensive. Cached prompt tokens are around 90% cheaper than uncached input tokens. The per-token rate varies by model: Claude Sonnet 4 is the middle tier, Opus is the high tier, Haiku is the low tier, with a price-per-output-token ratio between Haiku and Opus of roughly 1:15.
A workload’s monthly cost is (requests per month) × (average input tokens per request × input rate + average output tokens per request × output rate). All three factors (request volume, tokens per request, per-token rate) are volatile in a way that traditional cloud workloads are not.
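A back-of-envelope sketch of that arithmetic in Python. The volumes and per-token rates below are illustrative assumptions (Sonnet-class rates, output at five times input), not a price quote:

```python
# Rough monthly cost for a Bedrock workload. Substitute the per-model
# rates from your own bill; these are illustrative placeholders.
REQUESTS_PER_MONTH = 500_000
AVG_INPUT_TOKENS = 4_000         # prompt + context per request
AVG_OUTPUT_TOKENS = 300          # generated tokens per request

INPUT_RATE = 3.00 / 1_000_000    # $ per input token (assumed Sonnet-class)
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token (5x input)

monthly_cost = REQUESTS_PER_MONTH * (
    AVG_INPUT_TOKENS * INPUT_RATE + AVG_OUTPUT_TOKENS * OUTPUT_RATE
)
print(f"${monthly_cost:,.0f}/month")  # -> $8,250/month
```

Double the adoption or double the prompt length and the bill doubles with it; that multiplicative structure is why this line item grows faster than anything else on the bill.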
| Workload | Pre-AI monthly cost (EC2 + RDS) | Post-AI monthly cost | Bedrock share | Driver |
|---|---|---|---|---|
| Customer support reply suggestions | $1,200 | $14,000 | 91% | Sonnet 4, no caching |
| Data extraction from invoices | $400 | $4,800 | 92% | Opus for tasks Haiku could do |
| Daily report summaries | $200 | $1,100 | 82% | Real-time endpoint for batch task |
Each of these workloads is fixable with a recommendation that requires reading the usage pattern, not just the dollar amount. Reply suggestions are fixed by prompt caching; invoice extraction by switching from Opus to Haiku; daily reports by moving from the real-time endpoint to batch. Cost Explorer cannot suggest any of these because the data it presents stops at the dollar line item.
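For the daily-report case, the fix looks like a single batch inference job in place of a stream of real-time calls. A minimal boto3 sketch; the job name, model ID, role ARN, and bucket paths are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock")

# Submit the day's summaries as one batch job instead of real-time calls.
bedrock.create_model_invocation_job(
    jobName="daily-report-summaries-2025-06-01",         # placeholder
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # placeholder
    roleArn="arn:aws:iam::123456789012:role/bedrock-batch",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://reports-in/2025-06-01/"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://reports-out/2025-06-01/"}},
)
```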
What Bedrock visibility surfaces
The visibility breaks the spend down on four dimensions, each independently selectable as the primary axis.
| Dimension | What it answers |
|---|---|
| Model | Which model (Sonnet 4, Haiku, Opus, Titan, Mistral) drives most cost |
| Region | Cross-region inference cost; data-residency mismatches |
| Usage type | Input vs output vs cached tokens; ratio of each |
| Application tag | Which feature, team, or customer is driving the spend |
The first question most operators ask is about the model breakdown. The most actionable first question is about the usage-type breakdown: a workload where output tokens are 80% of the cost needs a different fix than one where input tokens dominate. Cached-token share is the proxy for “is prompt caching enabled and effective”; a workload with 0% cached-token share is a recommendation candidate.
The application-tag breakdown is the one that connects Bedrock spend to the rest of the cost model. A correctly tagged Bedrock invocation rolls up under the team, customer, or feature that owns it. The team owning “AI summarisation” sees its Bedrock spend in its team cost report; the support team sees its reply-suggestion cost in theirs. Without the tag, the spend lands in an Untagged band with the same incentive-creating effect as in Cost Flow.
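One way the tag gets onto the invocation is an application inference profile that carries the cost-allocation tags; everything routed through it inherits the attribution. A boto3 sketch, with the profile name, tag keys, and model ARN as illustrative assumptions:

```python
import boto3

bedrock = boto3.client("bedrock")

# One-time setup: the profile carries the cost-allocation tags.
profile = bedrock.create_inference_profile(
    inferenceProfileName="ai-summarisation-prod",  # placeholder
    modelSource={
        "copyFrom": "arn:aws:bedrock:us-east-1::foundation-model/"
                    "anthropic.claude-3-5-haiku-20241022-v1:0"
    },
    tags=[
        {"key": "team", "value": "ai-summarisation"},
        {"key": "feature", "value": "reply-suggestions"},
    ],
)

# Invoke via the profile ARN instead of the bare model ID.
runtime = boto3.client("bedrock-runtime")
response = runtime.converse(
    modelId=profile["inferenceProfileArn"],
    messages=[{"role": "user", "content": [{"text": "Summarise: ..."}]}],
)
```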
The 10 recommendation rules
Each rule fires on a specific usage pattern in the cost and usage data. Each comes with a quantified savings estimate based on the trailing 30 days.
| # | Rule | Pattern detected | Typical savings |
|---|---|---|---|
| 1 | Switch to cheaper sibling model | Sonnet 4 calls under 200 output tokens for extraction tasks | 5-10x reduction |
| 2 | Enable prompt caching | Stable system prompt longer than 2,000 tokens, no caching | 30-50% of total |
| 3 | Drop streaming | Calls with stream=true where the output is buffered server-side | No cost change, simpler code |
| 4 | Trim prompt bloat | Average prompt tokens above 5,000 with low information density | 20-40% of input cost |
| 5 | Region alignment | Inference region does not match app region | Cross-region latency + transfer cost |
| 6 | Retry storms | Inferences re-tried more than 3 times in 5 minutes | 30-300% over-spend on failed calls |
| 7 | Idle provisioned throughput | Provisioned throughput utilisation under 25% | 75% of provisioned cost |
| 8 | Opus → Sonnet downgrade | Opus calls under 200 output tokens, classification-style task | 3x reduction |
| 9 | Real-time → batch | Daily / hourly use-cases on real-time endpoint | 50% reduction |
| 10 | Haiku → Sonnet upgrade | Tasks where Haiku quality is borderline, Sonnet cost rise <12% | Quality win at small cost |
Rules 1 and 8 are model-switching rules pointing down (cheaper). Rule 10 is the only rule pointing up: when the team has been using Haiku for a task where Sonnet would not increase cost meaningfully and would improve quality, the rule recommends the upgrade. Most rules drive cost down; one drives quality up. The two together prevent the operator from over-optimising on price.
Rules 6 and 7 are operational rather than model-related. Retry storms (rule 6) usually indicate an upstream bug that produces malformed prompts, and the cost is paid 3 to 10 times for the same logically-failed request. Idle provisioned throughput (rule 7) is the Bedrock equivalent of an idle reserved instance: a reservation paid for and not used.
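A sketch of the rule-6 detection over invocation-log records. The record fields (prompt_sha256, timestamp) and the thresholds are illustrative assumptions, not ZopNight’s actual schema:

```python
from collections import defaultdict
from datetime import timedelta

RETRY_THRESHOLD = 3
WINDOW = timedelta(minutes=5)

def find_retry_storms(records):
    """Return prompt fingerprints invoked more than 3 times in any 5-minute span."""
    by_fingerprint = defaultdict(list)
    for r in records:
        by_fingerprint[r["prompt_sha256"]].append(r["timestamp"])

    storms = []
    for fingerprint, stamps in by_fingerprint.items():
        stamps.sort()
        left = 0
        for right in range(len(stamps)):
            # Shrink the window until it spans at most 5 minutes.
            while stamps[right] - stamps[left] > WINDOW:
                left += 1
            if right - left + 1 > RETRY_THRESHOLD:
                storms.append(fingerprint)
                break
    return storms
```

Every fingerprint this returns is money paid several times for one logically failed request; the fix is upstream, in whatever is producing the malformed prompt.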
Prompt caching: the highest-impact lever
Anthropic’s prompt caching pricing is the largest single lever for any workload with a stable system prompt. Cached tokens are roughly 90% cheaper than uncached input tokens on every cache read (cache writes carry a modest premium over the base input rate, so the benefit depends on the prompt actually being reused). A workload whose every request starts with a 4,000-token system prompt that does not change saves 35 to 50% of total cost the moment caching is enabled, because those 4,000 tokens drop from the full input rate to roughly 10% of it.
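The arithmetic, sketched with illustrative Sonnet-class rates (cache reads assumed at 10% of the base input rate):

```python
INPUT_RATE = 3.00 / 1_000_000        # $ per input token (assumed)
OUTPUT_RATE = 15.00 / 1_000_000      # $ per output token (assumed)
CACHE_READ_RATE = 0.10 * INPUT_RATE  # ~90% cheaper on cache hits

SYSTEM_TOKENS, USER_TOKENS, OUTPUT_TOKENS = 4_000, 200, 800

before = (SYSTEM_TOKENS + USER_TOKENS) * INPUT_RATE + OUTPUT_TOKENS * OUTPUT_RATE
after = (SYSTEM_TOKENS * CACHE_READ_RATE + USER_TOKENS * INPUT_RATE
         + OUTPUT_TOKENS * OUTPUT_RATE)
print(f"{1 - after / before:.0%} saved per request")  # -> 44% saved per request
```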
| Workload pattern | Cache-hit potential | Typical reduction |
|---|---|---|
| Long system prompt, short user message | high (system prompt cached) | 35-50% |
| RAG with stable context block | high (context cached) | 25-40% |
| Few-shot with stable examples | medium-high (examples cached) | 20-35% |
| Pure user-generated prompts, no shared prefix | low | 0-5% |
| Conversation with growing history | high (history cached) | 30-45% |
Rule 2 fires when the visibility data shows the first three patterns and no cached-token share. The recommendation copy includes the specific prompts whose caching would yield the most savings, and the change required (typically a single cache_control annotation in the API request). The operator who acts on the recommendation ships a one-line code change and sees the Bedrock bill drop by a third on the next billing cycle.
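What that one-line change looks like on Bedrock’s Anthropic messages API, as a sketch; the model ID and prompt text are placeholders:

```python
import json
import boto3

runtime = boto3.client("bedrock-runtime")

STABLE_SYSTEM_PROMPT = "..."  # the ~4,000-token prompt that never changes

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # <- the added line
        }
    ],
    "messages": [{"role": "user", "content": "Summarise this ticket: ..."}],
}

response = runtime.invoke_model(
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",  # placeholder model ID
    body=json.dumps(body),
)
```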
Model switching: when Haiku beats Sonnet (and when it doesn’t)
Model switching is the highest-leverage cost lever after caching. A Haiku call is roughly one-tenth the cost of a Sonnet call for the same input and output token counts. Deciding when Haiku suffices is what rules 1, 8, and 10 are about.
| Task type | Right model | Cost ratio vs Sonnet |
|---|---|---|
| Classification (5-20 output tokens) | Haiku | 0.08 |
| Short extraction (<200 output tokens, structured) | Haiku | 0.10 |
| Summarisation (200-500 output tokens) | Sonnet | 1.00 (baseline) |
| Reasoning / multi-step (>500 output tokens) | Sonnet or Opus | 1.00-3.00 |
| Creative writing (long output, nuanced) | Opus | 3.00 |
| Code generation (medium-long output) | Sonnet | 1.00 |
Rule 1 flags Sonnet calls in the first two rows (classification, short extraction). Rule 8 flags Opus calls in the same rows. Both rules quantify the savings and link to a sampled set of recent calls so the operator can validate the recommendation against actual prompts before flipping the model.
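A sketch of the rule-1 check over per-invocation usage records; the field names and the flat 10x Haiku ratio (from the table above) are illustrative assumptions:

```python
SHORT_OUTPUT = 200       # the classification / short-extraction cutoff
HAIKU_COST_RATIO = 0.10  # Haiku at roughly one-tenth of Sonnet

def rule_1_candidates(calls):
    """Flag Sonnet calls that look like Haiku-sized work, with estimated savings."""
    flagged = [
        c for c in calls
        if "sonnet" in c["model"] and c["output_tokens"] < SHORT_OUTPUT
    ]
    current_cost = sum(c["cost_usd"] for c in flagged)
    estimated_savings = current_cost * (1 - HAIKU_COST_RATIO)
    return flagged, estimated_savings
```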
Rule 10 goes the other way: Haiku calls where the task is in the third or fourth row, where Sonnet would handle it better at a small cost delta. The signal here is usually output quality complaints in user feedback combined with task-type indicators in the prompt itself.
The recommendation engine does not auto-flip models. The cost savings calculation is automatic; the quality decision is the operator’s. A flag in the recommendation says “validated on N=50 sample prompts at acceptable quality” only when the engine has run a quality check; otherwise the operator runs their own evaluation before switching.
Bedrock + the chargeback schema: per-feature LLM cost
A correctly tagged Bedrock invocation lands in the same cost-attribution graph as every other AWS resource. The team that owns the AI feature sees its LLM bill in its team cost report. The customer that uses the feature appears in the per-customer Cost Flow when the customer tag is applied at invocation time.
This composition is the architectural payoff. Bedrock is not a separate cost-tracking system; it is a resource type that participates in the same chargeback schema as EC2, RDS, and S3. The team owning the “AI summarisation” feature opens its team cost report and sees Bedrock spend alongside its compute and database costs. The CFO opens Cost Flow and sees the AI feature’s spend as one band among many.
The unit economics overlay extends naturally: per-MAU cost of the AI feature, per-1k-API-requests cost, per-transaction cost. The team that ships the AI summarisation feature can answer “is this feature profitable at our pricing” with the same chart they use for any other feature.
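The per-MAU arithmetic, with illustrative numbers:

```python
# Feature-tagged spend rolled up from the attribution graph (assumed figures).
bedrock_spend = 9_200          # $/month, Bedrock invocations for the feature
other_spend = 1_400            # $/month, the feature's EC2/RDS/S3 share
monthly_active_users = 42_000

cost_per_mau = (bedrock_spend + other_spend) / monthly_active_users
print(f"${cost_per_mau:.3f} per MAU")  # -> $0.252 per MAU
```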
How to use Bedrock visibility day to day
The weekly workflow is short and lands the operator on a decision in under 15 minutes.
| Step | Action | Where |
|---|---|---|
| 1 | Open Bedrock visibility | Sidebar → Cost → Bedrock |
| 2 | Scan the model breakdown | Default view |
| 3 | Open Recommendations tab | Sub-tab on the same page |
| 4 | Sort by estimated savings | Default sort |
| 5 | Pick the top 1-3 recommendations | Action items for the week |
| 6 | Ship the fix | Code or config change |
| 7 | Re-check next week | Same workflow |
For most teams, the first month of using Bedrock visibility produces a 30-60% cost reduction without quality degradation. The driver is usually rule 2 (enable prompt caching) plus rule 1 (downgrade Sonnet → Haiku for classification calls) plus rule 9 (move daily batch work off the real-time endpoint). After the first round the marginal recommendations get smaller, and the workflow settles into a 5-minute weekly check.
What ZopNight does not yet ship: cross-provider LLM cost (OpenAI direct, Azure OpenAI, Gemini direct API), per-prompt cost attribution for invocations from background jobs without tags, and automatic A/B testing across models with cost-vs-quality scoring. Each is a future direction; the current deliverable is the Bedrock-specific visibility and the 10 rules that act on it.
Bedrock is on track to be one of the top three line items on most companies’ AWS bills within 18 months. The teams that treat it as a tagged, attributable, recommendation-driven cost from the start will spend a manageable fraction of their AI investment on inference. The teams that wait until the line item is unmanageable will pay a year of inflated bills before they catch up. Pick the rules. Ship the fixes. Watch the bill.


