By Sapumal Herath · Owner & Blogger, AI Buzz · Last updated: February 11, 2026 · Difficulty: Beginner
LLM apps don’t fail like normal software.
With most web apps, “bad traffic” can cause downtime. With LLM apps, bad traffic can also cause runaway costs—because inference is expensive and pricing is often usage-based.
OWASP calls this risk category LLM10: Unbounded Consumption. In plain English: your AI app allows excessive and uncontrolled inference, leading to DoS/service degradation, economic losses (Denial-of-Wallet), and even model theft/replication.
Note: This article is for educational purposes only. It is not legal, security, or compliance advice.
🎯 What “Unbounded Consumption” means (plain English)
Unbounded Consumption happens when users (or attackers) can make your AI system do too much work—too many requests, too many tokens, too many tool calls, too many retries—without effective limits.
Think of it as “resource exhaustion,” but for modern LLM systems where the resource is:
- tokens (input + output),
- tool calls (agent steps),
- latency (slow jobs piling up),
- and money (pay-per-use cloud inference).
⚡ Why this is a 2025–2026 “real problem” (not theory)
OWASP explicitly expanded what used to be thought of as “denial of service” into a broader category: Unbounded Consumption—because the harm is not only downtime. It’s also unexpected cost and resource misuse in large-scale LLM deployments.
If you run a customer-facing chatbot, a RAG assistant, or an agent that can use tools, you should assume cost-abuse and runaway usage will happen eventually—accidentally or intentionally.
🧨 The 7 most common real-world failure patterns
Here are the patterns OWASP highlights (plus a practical “agent loop” pattern teams see constantly):
| Failure pattern | What it looks like | Main impact | First guardrail to add |
|---|---|---|---|
| Variable-length input flood | Lots of requests with huge inputs or mixed sizes | Latency spikes, outages, higher cost | Max input size + rate limits |
| Denial of Wallet (DoW) | High-volume usage against metered inference | Billing incident / financial loss | Quotas + budgets + anomaly alerts |
| Context-window overflow loops | Requests exceed limits; retries and re-processing pile up | Throughput collapse | Hard token caps + reject early |
| Resource-intensive queries | Prompts designed to maximize compute | Slowdown + cost | Timeouts + throttling + pricing tiers |
| Model extraction via API | High-volume probing to replicate behavior | IP loss + cost | Rate limits + auth + monitoring |
| Functional model replication | Using outputs to create training data for a “shadow” model | IP loss | Abuse detection + watermarking + access controls |
| Side-channel attempts | Trying to infer model/system details via responses and behavior | Security compromise risk | Sandboxing + least privilege + monitoring |
| Agent/tool loops (very common) | Agent keeps calling tools repeatedly (“just one more step…”) | Cost + unintended actions | Max steps + max tool calls + approvals |
✅ The “Anti-Runaway” controls that actually work (copy/paste)
Use this as a baseline checklist. You don’t need all of it on day one—but you do need a plan.
🔐 A) Authentication and access boundaries
- Require authentication for any non-trivial usage (avoid open anonymous endpoints).
- Role-based access for advanced features (long context, file uploads, tool access).
- Separate dev/staging/prod keys and quotas.
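As a rough illustration, here's what key-based access with role-gated features might look like in Python. The key names, roles, and feature list are made up for the example; your auth store and schema will differ.

```python
# Minimal sketch: per-key authentication with role-based feature gates.
# Key IDs, roles, and feature names are illustrative, not a real schema.
from dataclasses import dataclass, field

@dataclass
class ApiKey:
    key_id: str
    environment: str          # "dev", "staging", or "prod"
    role: str = "standard"    # "standard" or "elevated"
    features: set = field(default_factory=set)

# In practice this lookup would hit your auth store, not an in-memory dict.
KEYS = {
    "key_prod_123": ApiKey("key_prod_123", "prod", "standard", {"chat"}),
    "key_prod_456": ApiKey("key_prod_456", "prod", "elevated", {"chat", "long_context", "tools"}),
}

def authorize(key_id: str, feature: str) -> ApiKey:
    """Reject anonymous traffic and gate advanced features by role."""
    api_key = KEYS.get(key_id)
    if api_key is None:
        raise PermissionError("Unknown or missing API key")  # no anonymous access
    if feature not in api_key.features:
        raise PermissionError(f"Feature '{feature}' is not enabled for this key")
    return api_key

# Usage: authorize("key_prod_123", "tools") raises, because tool access
# is reserved for elevated keys in this example.
```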
🧱 B) Hard limits (the non-negotiables)
- Max input size (characters/tokens) and reject early.
- Max output tokens per request.
- Max context window usage per session (cap conversation history growth).
- Max concurrent requests per user/org.
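Here's a minimal Python sketch of what "reject early" can look like. The limits and the rough characters-per-token estimate are placeholders; in practice you'd use your model's real tokenizer and tune the numbers to your own traffic.

```python
# Minimal sketch of hard limits enforced before any model call.
# All numbers and the chars-per-token heuristic are illustrative.
MAX_INPUT_TOKENS = 4_000
MAX_OUTPUT_TOKENS = 800
MAX_HISTORY_TOKENS = 12_000
MAX_CONCURRENT_PER_USER = 3

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def check_request(prompt: str, history: list[str], active_requests: int) -> dict:
    """Raise before spending any compute if the request would blow past the caps."""
    if active_requests >= MAX_CONCURRENT_PER_USER:
        raise RuntimeError("Too many concurrent requests; try again shortly")
    if estimate_tokens(prompt) > MAX_INPUT_TOKENS:
        raise ValueError("Input too large; shorten the prompt or split it")
    if sum(estimate_tokens(m) for m in history) > MAX_HISTORY_TOKENS:
        # Cap conversation growth instead of resending the whole history.
        history = history[-10:]
    return {"prompt": prompt, "history": history, "max_output_tokens": MAX_OUTPUT_TOKENS}
```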
🧮 C) Budgets and quotas (prevent denial-of-wallet)
- Daily/weekly token budgets per user and per org.
- Spend caps per API key/environment.
- Tiered limits: free trial / standard / elevated (with review).
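A simple way to think about quotas in code: count tokens per org per day and refuse work once the budget is spent. This sketch keeps counters in memory; a real system would persist them (database or cache) and review tier changes, and the numbers are purely illustrative.

```python
# Minimal sketch of a daily token budget per org, with tiered limits.
from collections import defaultdict
from datetime import date

DAILY_TOKEN_BUDGETS = {"free_trial": 50_000, "standard": 500_000, "elevated": 5_000_000}

_usage: dict[tuple[str, date], int] = defaultdict(int)

def charge_tokens(org_id: str, tier: str, tokens: int) -> None:
    """Record usage and refuse further work once today's budget is spent."""
    key = (org_id, date.today())
    budget = DAILY_TOKEN_BUDGETS[tier]
    if _usage[key] + tokens > budget:
        raise RuntimeError(f"Daily token budget ({budget}) exhausted for org {org_id}")
    _usage[key] += tokens
```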
⏱️ D) Timeouts, throttling, and graceful degradation
- Timeout expensive requests (don’t let them run forever).
- Throttle repeated retries and request storms.
- Graceful degradation: partial functionality under load instead of full failure.
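As a sketch (assuming a synchronous inference client wrapped in a thread pool), here's one way to combine a hard timeout, a simple retry throttle, and a degraded response instead of a hard failure. `model_call` is a placeholder for your own client.

```python
# Minimal sketch: per-request timeout + retry throttle + graceful degradation.
import concurrent.futures
import time

REQUEST_TIMEOUT_S = 30
MIN_SECONDS_BETWEEN_RETRIES = 5
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_last_attempt: dict[str, float] = {}

def call_with_guardrails(user_id: str, model_call, *args) -> dict:
    now = time.monotonic()
    if now - _last_attempt.get(user_id, 0.0) < MIN_SECONDS_BETWEEN_RETRIES:
        # Throttle retry storms instead of amplifying them.
        return {"degraded": True, "message": "Please wait a few seconds before retrying."}
    _last_attempt[user_id] = now

    future = _pool.submit(model_call, *args)
    try:
        return {"degraded": False, "result": future.result(timeout=REQUEST_TIMEOUT_S)}
    except concurrent.futures.TimeoutError:
        # Graceful degradation: return something useful instead of hanging.
        # Note: the worker thread keeps running, so also set a client-side
        # timeout on the inference call itself if your SDK supports one.
        return {"degraded": True, "message": "That request took too long; try a shorter prompt."}
```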
🧰 E) Agent/tool controls (stop loops and “rogue” escalation)
- Max steps per run (hard stop).
- Max tool calls per run (hard stop).
- Queued action limits (don’t allow unlimited “pending actions”).
- Read-only by default for tools; write actions require human approval.
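Here's a minimal agent-loop skeleton with the hard stops baked in. `choose_next_action`, `run_tool`, and `approve_write` are stand-ins for your own planner, tool runner, and approval flow; the limits are illustrative.

```python
# Minimal sketch of an agent loop with hard stops and write approvals.
MAX_STEPS = 10
MAX_TOOL_CALLS = 6
WRITE_TOOLS = {"send_email", "update_record"}   # anything that changes state

def run_agent(task: str, choose_next_action, run_tool, approve_write) -> list:
    transcript, tool_calls = [], 0
    for _ in range(MAX_STEPS):                   # hard stop on steps
        action = choose_next_action(task, transcript)
        if action["type"] == "finish":
            return transcript
        tool_calls += 1
        if tool_calls > MAX_TOOL_CALLS:          # hard stop on tool calls
            transcript.append("Stopped: tool-call limit reached.")
            return transcript
        if action["tool"] in WRITE_TOOLS and not approve_write(action):
            transcript.append(f"Blocked write action: {action['tool']} (needs human approval)")
            continue
        transcript.append(run_tool(action))
    transcript.append("Stopped: step limit reached.")
    return transcript
```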
🔍 F) Monitoring and anomaly detection (make it observable)
- Track cost per user/org and alert on spikes.
- Track tokens per request and tokens per session.
- Track tool-call volume and step counts for agents.
- Track latency (p50/p95) and timeouts; cost incidents often show up as performance incidents first.
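One simple pattern: track today's spend per org and compare it to a rolling daily baseline. This sketch uses in-memory counters and a print-based alert; swap in your real storage and alerting channel, and tune the spike threshold to your traffic.

```python
# Minimal sketch: per-org cost tracking with a spike alert vs. a 7-day baseline.
from collections import defaultdict, deque

SPIKE_MULTIPLIER = 3.0  # alert if today's spend exceeds 3x the recent daily average
_recent_daily_spend = defaultdict(lambda: deque(maxlen=7))
_today_spend = defaultdict(float)

def record_request(org_id: str, usd_cost: float) -> None:
    _today_spend[org_id] += usd_cost
    history = _recent_daily_spend[org_id]
    if history:
        baseline = sum(history) / len(history)
        if baseline > 0 and _today_spend[org_id] > SPIKE_MULTIPLIER * baseline:
            alert(org_id, _today_spend[org_id], baseline)

def alert(org_id: str, today: float, baseline: float) -> None:
    # Swap for your real alerting (Slack, PagerDuty, email) in production.
    print(f"[COST ALERT] org={org_id} spent ${today:.2f} today vs ~${baseline:.2f}/day baseline")

def close_out_day() -> None:
    """Run once a day (e.g. via cron) to roll today's totals into the baseline."""
    for org_id, spend in _today_spend.items():
        _recent_daily_spend[org_id].append(spend)
    _today_spend.clear()
```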
🧾 G) Reduce extraction risk (high-level)
- Rate limits and quotas reduce high-volume probing.
- Limit exposure of token-probability details (e.g. logits/logprobs) in API responses; only return what clients actually need.
- Watermarking (where applicable) can help detect unauthorized reuse patterns.
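Rate limiting doesn't have to be fancy to help here. A sliding-window counter like the sketch below (illustrative window and limit) already makes high-volume probing much harder.

```python
# Minimal sketch of a sliding-window rate limit per API key.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30
_request_times = defaultdict(deque)

def allow_request(key_id: str) -> bool:
    now = time.monotonic()
    window = _request_times[key_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                      # drop timestamps outside the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False                          # over the limit: reject or challenge
    window.append(now)
    return True
```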
🧯 “First 30 minutes” playbook for a cost/runaway incident
If you suddenly see a cost spike or throughput collapse, your goal is containment first—then diagnosis.
- Throttle immediately: tighten per-user and per-org rate limits.
- Reduce max tokens: lower input/output caps temporarily.
- Disable high-cost features: long-context mode, file uploads, web browsing, or expensive tools.
- Cap agent loops: reduce max steps/tool calls and force draft-only mode for actions.
- Identify the top spenders: users/keys/IP ranges and temporarily block or require step-up verification.
- Preserve evidence: request metadata, token counts, tool-call logs, timestamps (privacy-safe).
- Post-incident fix: add regression tests, update quotas, and create a permanent alert rule.
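One practical trick is to keep an "emergency brake" config ready before you need it, so containment is a flag flip rather than a redeploy. The field names and values below are examples, not recommendations.

```python
# Minimal sketch: incident-mode limits overlaid on normal limits.
EMERGENCY_LIMITS = {
    "max_input_tokens": 1_000,        # temporarily lower caps
    "max_output_tokens": 300,
    "requests_per_minute": 5,         # tighten throttling
    "agent_max_steps": 3,             # cap loops hard
    "disabled_features": {"long_context", "file_upload", "web_browsing"},
    "write_actions": "draft_only",    # write actions require human approval
}

def effective_limits(normal_limits: dict, incident_mode: bool) -> dict:
    """Overlay the emergency values on top of normal limits while containing an incident."""
    return {**normal_limits, **EMERGENCY_LIMITS} if incident_mode else normal_limits
```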
Pair this with a full incident routine: AI Incident Response (Practical Playbook)
🧪 Mini-labs (quick exercises that prevent surprise bills)
Mini-lab 1: “Top 10 most expensive prompts” review
- Export usage logs for the last 7 days (token counts + cost).
- List the 10 most expensive sessions/prompts.
- Decide: should these be blocked, capped, cached, or moved behind a higher tier?
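If your usage logs export to CSV, a few lines of Python are enough for this review. The column names below (`session_id`, `usd_cost`) and the file name are assumptions; adjust them to whatever your provider or gateway actually exports.

```python
# Minimal sketch for Mini-lab 1: rank sessions by total cost from a usage-log CSV.
import csv
from collections import defaultdict

def top_expensive_sessions(path: str, n: int = 10) -> list[tuple[str, float]]:
    spend = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            spend[row["session_id"]] += float(row["usd_cost"])
    return sorted(spend.items(), key=lambda kv: kv[1], reverse=True)[:n]

for session_id, cost in top_expensive_sessions("usage_last_7_days.csv"):
    print(f"{session_id}\t${cost:.2f}")   # block, cap, cache, or move to a higher tier?
```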
Mini-lab 2: Agent loop test
- Run your most complex agent workflow with step limits set to 10, then 20.
- Verify the agent stops safely (clear message + next steps) when the limit is reached.
- Add a “human approval required” step for any write action.
Mini-lab 3: Denial-of-wallet simulation (safe)
- Create a test API key with strict quotas.
- Simulate bursts of usage (within your test environment) and verify alerts trigger.
- Verify the system degrades gracefully instead of failing hard.
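A safe way to script this in a test environment: fire a burst of requests through a strict test key and count how each one ends. `send_test_request` is a placeholder for your own test client, and the `degraded`/`rate_limited` response fields are assumptions about its shape.

```python
# Minimal sketch for Mini-lab 3: verify graceful rejection under a burst.
def simulate_burst(send_test_request, burst_size: int = 200) -> dict:
    outcomes = {"accepted": 0, "rejected_gracefully": 0, "hard_failures": 0}
    for i in range(burst_size):
        try:
            response = send_test_request(prompt=f"load-test message {i}")
            if response.get("degraded") or response.get("rate_limited"):
                outcomes["rejected_gracefully"] += 1
            else:
                outcomes["accepted"] += 1
        except Exception:
            outcomes["hard_failures"] += 1   # unhandled errors are not graceful
    return outcomes

# Expectation: with a strict test quota you should see mostly graceful rejections,
# zero hard failures, and a cost/volume alert firing during the burst.
```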
🚩 Red flags that should slow down your rollout
- No rate limiting and no quotas (“unlimited by default”).
- No max token caps or step limits (“it will be fine”).
- Agents can call tools repeatedly with no loop controls.
- No cost monitoring per user/org/key.
- Logs exist, but you can’t tie spend to a user/key/workflow.
🏁 Conclusion
Unbounded Consumption is the “LLM billing incident” risk category: too much inference, too many tokens, too many steps, and too little control.
If you want a safe baseline: set hard limits, add quotas, cap agent loops, monitor cost per user/org, and keep an incident playbook ready. That’s how you ship LLM apps without surprise outages—or surprise bills.