Underutilized Bedrock Provisioned Throughput on Low-Volume Workloads
Taylor Houck
CER
CER-0317
Service Category
AI
Cloud Provider
AWS
Service Name
AWS Bedrock
Inefficiency Type
Suboptimal Pricing Model
Explanation

Amazon Bedrock Provisioned Throughput allows teams to reserve dedicated inference capacity for foundation models by purchasing model units with hourly billing under a commitment term. This capacity is billed continuously — whether or not any tokens are actually processed — making it a fixed cost that only pays off when sustained, high-volume token consumption justifies the premium over on-demand pricing. In practice, teams frequently purchase Provisioned Throughput to avoid on-demand throttling limits, but actual usage often falls well below the committed capacity, resulting in significant overspend compared to what on-demand pricing would have cost for the same workload.

The waste is compounded by the fact that Provisioned Throughput commitments cannot be canceled before the term expires — billing continues hourly until the commitment period ends. This means a team that overestimates its inference needs at the time of purchase is locked into paying for unused capacity for the full duration. The problem is especially common in early-stage AI deployments where usage patterns are not yet well understood, or in workloads with variable or unpredictable token volumes that are poorly suited to fixed-capacity reservations.

The cost impact can be substantial. A single model unit for even a moderately priced model can cost tens of thousands of dollars per month, and if actual token consumption would have cost only a fraction of that amount under on-demand pricing, the difference represents pure waste. Organizations running multiple Provisioned Throughput reservations across different models or environments can multiply this inefficiency significantly.
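To make the scale of the waste concrete, the comparison above can be sketched as a short calculation. All rates below are made-up placeholders, not actual AWS pricing, since Provisioned Throughput rates for most models are not published; the structure of the math is what matters.

```python
# Hypothetical illustration of Provisioned Throughput waste on a low-volume
# workload. All dollar rates below are placeholders, not real AWS pricing.

HOURS_PER_MONTH = 730

# Assumed fixed cost of one model unit (placeholder hourly rate).
pt_hourly_rate = 30.00                                # $/hour per model unit
pt_monthly_cost = pt_hourly_rate * HOURS_PER_MONTH    # billed whether or not
                                                      # any tokens are processed

# Assumed on-demand token rates (placeholders; input and output differ).
od_input_rate = 3.00 / 1_000_000       # $ per input token
od_output_rate = 15.00 / 1_000_000     # $ per output token

# Actual monthly consumption for a low-volume workload.
input_tokens = 200_000_000
output_tokens = 50_000_000

od_equivalent = input_tokens * od_input_rate + output_tokens * od_output_rate
waste = pt_monthly_cost - od_equivalent

print(f"Provisioned Throughput: ${pt_monthly_cost:,.2f}/month")
print(f"On-demand equivalent:   ${od_equivalent:,.2f}/month")
print(f"Overspend:              ${waste:,.2f}/month")
```

Under these placeholder rates, the reservation costs $21,900/month while the same tokens would have cost $1,350 on demand, so nearly the entire reservation fee is waste.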

Relevant Billing Model

Amazon Bedrock offers two primary inference pricing modes:

  • On-demand pricing — charges per token processed, with separate rates for input tokens and output tokens, and no upfront commitment or minimum usage. This is a pure pay-per-use model where costs scale directly with actual consumption.
  • Provisioned Throughput — billed hourly per model unit regardless of actual usage. The hourly rate depends on the model selected, the number of model units purchased, and the commitment duration (no commitment, 1-month, or 6-month terms). Longer commitments offer lower hourly rates.

Each model unit delivers a specific throughput level measured in input and output tokens per minute, though exact throughput specifications per model unit are not publicly documented — AWS directs customers to contact their account team for these details. Provisioned Throughput pricing for most models is also not listed publicly on the pricing page.

Additionally, batch inference is available at a 50% discount compared to on-demand pricing for select models, offering a middle ground for non-real-time workloads that do not require dedicated capacity.
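The three pricing modes can be compared side by side for a given monthly token volume. The token and hourly rates here are illustrative assumptions; the batch discount is modeled as a flat 50% of on-demand, as described above.

```python
# Compare monthly cost under Bedrock's pricing modes for a given workload.
# Token rates and the hourly model-unit rate are illustrative placeholders.

def on_demand_cost(input_tokens: int, output_tokens: int,
                   in_rate: float, out_rate: float) -> float:
    """Pay-per-use: cost scales directly with tokens processed."""
    return input_tokens * in_rate + output_tokens * out_rate

def batch_cost(input_tokens: int, output_tokens: int,
               in_rate: float, out_rate: float) -> float:
    """Batch inference: 50% discount off on-demand for eligible models."""
    return 0.5 * on_demand_cost(input_tokens, output_tokens, in_rate, out_rate)

def provisioned_cost(hourly_rate: float, model_units: int,
                     hours: float = 730) -> float:
    """Billed per model unit per hour, regardless of tokens processed."""
    return hourly_rate * model_units * hours

# Hypothetical rates ($ per token) and a one-unit reservation.
in_rate, out_rate = 3.00 / 1_000_000, 15.00 / 1_000_000
od = on_demand_cost(300_000_000, 60_000_000, in_rate, out_rate)
print(f"on-demand:   ${od:,.2f}")
print(f"batch:       ${batch_cost(300_000_000, 60_000_000, in_rate, out_rate):,.2f}")
print(f"provisioned: ${provisioned_cost(30.00, 1):,.2f}")
```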

Detection
  • Identify all active Provisioned Throughput reservations across the organization, including the model, number of model units, commitment term, and remaining commitment duration.
  • Review actual token consumption over a representative period for each Provisioned Throughput reservation and compare it against the committed capacity to assess utilization levels.
  • Evaluate whether the equivalent on-demand cost for the actual tokens processed would be lower than the fixed Provisioned Throughput cost over the same period.
  • Assess whether workload patterns are consistent and sustained enough to justify reserved capacity, or whether usage is variable, bursty, or declining over time.
  • Confirm whether any Provisioned Throughput reservations are serving custom fine-tuned models, which require Provisioned Throughput and cannot use on-demand inference.
  • Examine whether non-real-time workloads currently using Provisioned Throughput could be shifted to batch inference to reduce costs without requiring reserved capacity.
  • Review upcoming commitment expiration dates to identify reservations that should not be renewed without a fresh cost-benefit analysis.
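The inventory and utilization steps above can be sketched as a pure analysis function. In practice the reservation list would come from the Bedrock `ListProvisionedModelThroughputs` API and token counts from CloudWatch metrics in the `AWS/Bedrock` namespace; the record fields here are simplified assumptions, and the per-unit throughput figure must come from your AWS account team, since it is not publicly documented.

```python
# Flag underutilized Provisioned Throughput reservations. Input records are
# simplified stand-ins for what ListProvisionedModelThroughputs plus CloudWatch
# token metrics would return; field names and figures are assumptions.

def flag_underutilized(reservations: list[dict], threshold: float = 0.4) -> list[dict]:
    """Return reservations whose observed token throughput is below
    `threshold` (as a fraction of committed capacity) over the window."""
    flagged = []
    for r in reservations:
        committed = r["model_units"] * r["tokens_per_minute_per_unit"]
        observed = r["observed_tokens"] / r["window_minutes"]
        utilization = observed / committed if committed else 0.0
        if utilization < threshold:
            flagged.append({**r, "utilization": round(utilization, 3)})
    return flagged

# Example: a one-unit reservation averaging far below its committed rate.
sample = [{
    "provisioned_model_name": "prod-summarizer",   # hypothetical name
    "model_units": 1,
    "tokens_per_minute_per_unit": 60_000,          # assumed figure from AWS
    "observed_tokens": 43_200_000,                 # tokens over the window
    "window_minutes": 43_200,                      # 30-day window
}]
print(flag_underutilized(sample))                  # ~1.7% utilization: flagged
```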
Remediation
  • For workloads with low, variable, or unpredictable token volumes, switch to on-demand inference when the current Provisioned Throughput commitment expires — do not renew reservations that are consistently underutilized.
  • Conduct a breakeven analysis before purchasing or renewing any Provisioned Throughput commitment by comparing projected on-demand costs against the fixed hourly cost of the reservation over the full commitment term.
  • For non-real-time workloads such as batch processing, document summarization, or offline content generation, consider using batch inference mode, which offers a significant discount compared to on-demand pricing and avoids the need for reserved capacity.
  • Start with shorter commitment terms (1-month) or no-commitment Provisioned Throughput when first adopting reserved capacity, and only move to longer terms (6-month) once sustained high utilization has been demonstrated over multiple billing cycles.
  • Engage your AWS account team to obtain model unit throughput specifications and Provisioned Throughput pricing details before committing, so that breakeven calculations are based on accurate data rather than estimates.
  • Establish a periodic review process to reassess all active Provisioned Throughput reservations against actual usage, ensuring that renewals are justified by current consumption patterns.
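The breakeven analysis recommended above can be sketched as solving for the monthly token volume at which on-demand spend equals the fixed reservation cost; a reservation only makes sense if projected volume is sustained above that level. The rates here are placeholders, so substitute the figures obtained from your AWS account team.

```python
# Breakeven: the monthly token volume at which a Provisioned Throughput
# reservation becomes cheaper than on-demand. Rates are placeholders; use
# the figures from your AWS account team for real decisions.

def breakeven_tokens_per_month(pt_hourly_rate: float,
                               blended_rate_per_token: float,
                               hours_per_month: float = 730) -> float:
    """Token volume where fixed reservation cost equals on-demand cost.
    `blended_rate_per_token` is a weighted average of input and output
    rates reflecting the workload's input:output token mix."""
    return pt_hourly_rate * hours_per_month / blended_rate_per_token

# Example: a $30/hour model unit vs a blended $5 per million tokens.
be = breakeven_tokens_per_month(30.00, 5.00 / 1_000_000)
print(f"breakeven: {be:,.0f} tokens/month")
```

Under these assumed rates the breakeven sits around 4.4 billion tokens per month; a workload projected well below that level should stay on demand when the commitment expires.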