By Sapumal Herath · Owner & Blogger, AI Buzz · Last updated: February 11, 2026 · Difficulty: Beginner
LLM apps don’t fail like normal software.
With most web apps, “bad traffic” can cause downtime. With LLM apps, bad traffic can also cause runaway costs—because inference is expensive and pricing is often usage-based.
OWASP calls this risk category LLM10: Unbounded Consumption. In plain English: your AI app allows excessive and uncontrolled inference, leading to DoS/service degradation, economic losses (Denial-of-Wallet), and even model theft/replication.
Note: This article is for educational purposes only. It is not legal, security, or compliance advice.
🎯 What “Unbounded Consumption” means (plain English)
Unbounded Consumption happens when users (or attackers) can make your AI system do too much work—too many requests, too many tokens, too many tool calls, too many retries—without effective limits.
Think of it as “resource exhaustion,” but for modern LLM systems where the resource is:
- tokens (input + output),
- tool calls (agent steps),
- latency (slow jobs piling up),
- and money (pay-per-use cloud inference).
⚡ Why this is a 2025–2026 “real problem” (not theory)
OWASP explicitly expanded what used to be thought of as “denial of service” into a broader category: Unbounded Consumption—because the harm is not only downtime. It’s also unexpected cost and resource misuse in large-scale LLM deployments.
If you run a customer-facing chatbot, a RAG assistant, or an agent that can use tools, you should assume cost-abuse and runaway usage will happen eventually—accidentally or intentionally.
🧨 The 7 most common real-world failure patterns
Here are the patterns OWASP highlights (plus a practical “agent loop” pattern teams see constantly):
| Failure pattern | What it looks like | Main impact | First guardrail to add |
|---|---|---|---|
| Variable-length input flood | Lots of requests with huge inputs or mixed sizes | Latency spikes, outages, higher cost | Max input size + rate limits |
| Denial of Wallet (DoW) | High-volume usage against metered inference | Billing incident / financial loss | Quotas + budgets + anomaly alerts |
| Context-window overflow loops | Requests exceed limits; retries and re-processing pile up | Throughput collapse | Hard token caps + reject early |
| Resource-intensive queries | Prompts designed to maximize compute | Slowdown + cost | Timeouts + throttling + pricing tiers |
| Model extraction via API | High-volume probing to replicate behavior | IP loss + cost | Rate limits + auth + monitoring |
| Functional model replication | Using outputs to create training data for a “shadow” model | IP loss | Abuse detection + watermarking + access controls |
| Side-channel attempts | Trying to infer model/system details via responses and behavior | Security compromise risk | Sandboxing + least privilege + monitoring |
| Agent/tool loops (very common) | Agent keeps calling tools repeatedly (“just one more step…”) | Cost + unintended actions | Max steps + max tool calls + approvals |
✅ The “Anti-Runaway” controls that actually work (copy/paste)
Use this as a baseline checklist. You don’t need all of it on day one—but you do need a plan.
🔐 A) Authentication and access boundaries
- Require authentication for any non-trivial usage (avoid open anonymous endpoints).
- Role-based access for advanced features (long context, file uploads, tool access).
- Separate dev/staging/prod keys and quotas.
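As a rough illustration, here's what key-based access with role-gated features might look like in Python. The key names, roles, and feature list are made up for the example; your auth store and schema will differ.

```python
# Minimal sketch: per-key authentication with role-based feature gates.
# Key IDs, roles, and feature names are illustrative, not a real schema.
from dataclasses import dataclass, field

@dataclass
class ApiKey:
    key_id: str
    environment: str          # "dev", "staging", or "prod"
    role: str = "standard"    # "standard" or "elevated"
    features: set = field(default_factory=set)

# In practice this lookup would hit your auth store, not an in-memory dict.
KEYS = {
    "key_prod_123": ApiKey("key_prod_123", "prod", "standard", {"chat"}),
    "key_prod_456": ApiKey("key_prod_456", "prod", "elevated", {"chat", "long_context", "tools"}),
}

def authorize(key_id: str, feature: str) -> ApiKey:
    """Reject anonymous traffic and gate advanced features by role."""
    api_key = KEYS.get(key_id)
    if api_key is None:
        raise PermissionError("Unknown or missing API key")  # no anonymous access
    if feature not in api_key.features:
        raise PermissionError(f"Feature '{feature}' is not enabled for this key")
    return api_key

# Usage: authorize("key_prod_123", "tools") raises, because tool access
# is reserved for elevated keys in this example.
```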
🧱 B) Hard limits (the non-negotiables)
- Max input size (characters/tokens) and reject early.
- Max output tokens per request.
- Max context window usage per session (cap conversation history growth).
- Max concurrent requests per user/org.
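Here's a minimal Python sketch of what "reject early" can look like. The limits and the rough characters-per-token estimate are placeholders; in practice you'd use your model's real tokenizer and tune the numbers to your own traffic.

```python
# Minimal sketch of hard limits enforced before any model call.
# All numbers and the chars-per-token heuristic are illustrative.
MAX_INPUT_TOKENS = 4_000
MAX_OUTPUT_TOKENS = 800
MAX_HISTORY_TOKENS = 12_000
MAX_CONCURRENT_PER_USER = 3

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def check_request(prompt: str, history: list[str], active_requests: int) -> dict:
    """Raise before spending any compute if the request would blow past the caps."""
    if active_requests >= MAX_CONCURRENT_PER_USER:
        raise RuntimeError("Too many concurrent requests; try again shortly")
    if estimate_tokens(prompt) > MAX_INPUT_TOKENS:
        raise ValueError("Input too large; shorten the prompt or split it")
    if sum(estimate_tokens(m) for m in history) > MAX_HISTORY_TOKENS:
        # Cap conversation growth instead of resending the whole history.
        history = history[-10:]
    return {"prompt": prompt, "history": history, "max_output_tokens": MAX_OUTPUT_TOKENS}
```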
🧮 C) Budgets and quotas (prevent denial-of-wallet)
- Daily/weekly token budgets per user and per org.
- Spend caps per API key/environment.
- Tiered limits: free trial / standard / elevated (with review).
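A simple way to think about quotas in code: count tokens per org per day and refuse work once the budget is spent. This sketch keeps counters in memory; a real system would persist them (database or cache) and review tier changes, and the numbers are purely illustrative.

```python
# Minimal sketch of a daily token budget per org, with tiered limits.
from collections import defaultdict
from datetime import date

DAILY_TOKEN_BUDGETS = {"free_trial": 50_000, "standard": 500_000, "elevated": 5_000_000}

_usage: dict[tuple[str, date], int] = defaultdict(int)

def charge_tokens(org_id: str, tier: str, tokens: int) -> None:
    """Record usage and refuse further work once today's budget is spent."""
    key = (org_id, date.today())
    budget = DAILY_TOKEN_BUDGETS[tier]
    if _usage[key] + tokens > budget:
        raise RuntimeError(f"Daily token budget ({budget}) exhausted for org {org_id}")
    _usage[key] += tokens
```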
⏱️ D) Timeouts, throttling, and graceful degradation
- Timeout expensive requests (don’t let them run forever).
- Throttle repeated retries and request storms.
- Graceful degradation: partial functionality under load instead of full failure.
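As a sketch (assuming a synchronous inference client wrapped in a thread pool), here's one way to combine a hard timeout, a simple retry throttle, and a degraded response instead of a hard failure. `model_call` is a placeholder for your own client.

```python
# Minimal sketch: per-request timeout + retry throttle + graceful degradation.
import concurrent.futures
import time

REQUEST_TIMEOUT_S = 30
MIN_SECONDS_BETWEEN_RETRIES = 5
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_last_attempt: dict[str, float] = {}

def call_with_guardrails(user_id: str, model_call, *args) -> dict:
    now = time.monotonic()
    if now - _last_attempt.get(user_id, 0.0) < MIN_SECONDS_BETWEEN_RETRIES:
        # Throttle retry storms instead of amplifying them.
        return {"degraded": True, "message": "Please wait a few seconds before retrying."}
    _last_attempt[user_id] = now

    future = _pool.submit(model_call, *args)
    try:
        return {"degraded": False, "result": future.result(timeout=REQUEST_TIMEOUT_S)}
    except concurrent.futures.TimeoutError:
        # Graceful degradation: return something useful instead of hanging.
        # Note: the worker thread keeps running, so also set a client-side
        # timeout on the inference call itself if your SDK supports one.
        return {"degraded": True, "message": "That request took too long; try a shorter prompt."}
```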
🧰 E) Agent/tool controls (stop loops and “rogue” escalation)
- Max steps per run (hard stop).
- Max tool calls per run (hard stop).
- Queued action limits (don’t allow unlimited “pending actions”).
- Read-only by default for tools; write actions require human approval.
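Here's a minimal agent-loop skeleton with the hard stops baked in. `choose_next_action`, `run_tool`, and `approve_write` are stand-ins for your own planner, tool runner, and approval flow; the limits are illustrative.

```python
# Minimal sketch of an agent loop with hard stops and write approvals.
MAX_STEPS = 10
MAX_TOOL_CALLS = 6
WRITE_TOOLS = {"send_email", "update_record"}   # anything that changes state

def run_agent(task: str, choose_next_action, run_tool, approve_write) -> list:
    transcript, tool_calls = [], 0
    for _ in range(MAX_STEPS):                   # hard stop on steps
        action = choose_next_action(task, transcript)
        if action["type"] == "finish":
            return transcript
        tool_calls += 1
        if tool_calls > MAX_TOOL_CALLS:          # hard stop on tool calls
            transcript.append("Stopped: tool-call limit reached.")
            return transcript
        if action["tool"] in WRITE_TOOLS and not approve_write(action):
            transcript.append(f"Blocked write action: {action['tool']} (needs human approval)")
            continue
        transcript.append(run_tool(action))
    transcript.append("Stopped: step limit reached.")
    return transcript
```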
🔍 F) Monitoring and anomaly detection (make it observable)
- Track cost per user/org and alert on spikes.
- Track tokens per request and tokens per session.
- Track tool-call volume and step counts for agents.
- Track latency (p50/p95) and timeouts; cost incidents often show up as performance incidents first.
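One simple pattern: track today's spend per org and compare it to a rolling daily baseline. This sketch uses in-memory counters and a print-based alert; swap in your real storage and alerting channel, and tune the spike threshold to your traffic.

```python
# Minimal sketch: per-org cost tracking with a spike alert vs. a 7-day baseline.
from collections import defaultdict, deque

SPIKE_MULTIPLIER = 3.0  # alert if today's spend exceeds 3x the recent daily average
_recent_daily_spend = defaultdict(lambda: deque(maxlen=7))
_today_spend = defaultdict(float)

def record_request(org_id: str, usd_cost: float) -> None:
    _today_spend[org_id] += usd_cost
    history = _recent_daily_spend[org_id]
    if history:
        baseline = sum(history) / len(history)
        if baseline > 0 and _today_spend[org_id] > SPIKE_MULTIPLIER * baseline:
            alert(org_id, _today_spend[org_id], baseline)

def alert(org_id: str, today: float, baseline: float) -> None:
    # Swap for your real alerting (Slack, PagerDuty, email) in production.
    print(f"[COST ALERT] org={org_id} spent ${today:.2f} today vs ~${baseline:.2f}/day baseline")

def close_out_day() -> None:
    """Run once a day (e.g. via cron) to roll today's totals into the baseline."""
    for org_id, spend in _today_spend.items():
        _recent_daily_spend[org_id].append(spend)
    _today_spend.clear()
```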
🧾 G) Reduce extraction risk (high-level)
- Rate limits and quotas reduce high-volume probing.
- Limit exposure of token-probability details (e.g. logits/logprobs) in API responses; only return what clients actually need.
- Watermarking (where applicable) can help detect unauthorized reuse patterns.
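Rate limiting doesn't have to be fancy to help here. A sliding-window counter like the sketch below (illustrative window and limit) already makes high-volume probing much harder.

```python
# Minimal sketch of a sliding-window rate limit per API key.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 30
_request_times = defaultdict(deque)

def allow_request(key_id: str) -> bool:
    now = time.monotonic()
    window = _request_times[key_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()                      # drop timestamps outside the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False                          # over the limit: reject or challenge
    window.append(now)
    return True
```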
🧯 “First 30 minutes” playbook for a cost/runaway incident
If you suddenly see a cost spike or throughput collapse, your goal is containment first—then diagnosis.
- Throttle immediately: tighten per-user and per-org rate limits.
- Reduce max tokens: lower input/output caps temporarily.
- Disable high-cost features: long-context mode, file uploads, web browsing, or expensive tools.
- Cap agent loops: reduce max steps/tool calls and force draft-only mode for actions.
- Identify the top spenders: users/keys/IP ranges and temporarily block or require step-up verification.
- Preserve evidence: request metadata, token counts, tool-call logs, timestamps (privacy-safe).
- Post-incident fix: add regression tests, update quotas, and create a permanent alert rule.
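One practical trick is to keep an "emergency brake" config ready before you need it, so containment is a flag flip rather than a redeploy. The field names and values below are examples, not recommendations.

```python
# Minimal sketch: incident-mode limits overlaid on normal limits.
EMERGENCY_LIMITS = {
    "max_input_tokens": 1_000,        # temporarily lower caps
    "max_output_tokens": 300,
    "requests_per_minute": 5,         # tighten throttling
    "agent_max_steps": 3,             # cap loops hard
    "disabled_features": {"long_context", "file_upload", "web_browsing"},
    "write_actions": "draft_only",    # write actions require human approval
}

def effective_limits(normal_limits: dict, incident_mode: bool) -> dict:
    """Overlay the emergency values on top of normal limits while containing an incident."""
    return {**normal_limits, **EMERGENCY_LIMITS} if incident_mode else normal_limits
```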
Pair this with a full incident routine: AI Incident Response (Practical Playbook)
🧪 Mini-labs (quick exercises that prevent surprise bills)
Mini-lab 1: “Top 10 most expensive prompts” review
- Export usage logs for the last 7 days (token counts + cost).
- List the 10 most expensive sessions/prompts.
- Decide: should these be blocked, capped, cached, or moved behind a higher tier?
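If your usage logs export to CSV, a few lines of Python are enough for this review. The column names below (`session_id`, `usd_cost`) and the file name are assumptions; adjust them to whatever your provider or gateway actually exports.

```python
# Minimal sketch for Mini-lab 1: rank sessions by total cost from a usage-log CSV.
import csv
from collections import defaultdict

def top_expensive_sessions(path: str, n: int = 10) -> list[tuple[str, float]]:
    spend = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            spend[row["session_id"]] += float(row["usd_cost"])
    return sorted(spend.items(), key=lambda kv: kv[1], reverse=True)[:n]

for session_id, cost in top_expensive_sessions("usage_last_7_days.csv"):
    print(f"{session_id}\t${cost:.2f}")   # block, cap, cache, or move to a higher tier?
```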
Mini-lab 2: Agent loop test
- Run your most complex agent workflow with step limits set to 10, then 20.
- Verify the agent stops safely (clear message + next steps) when the limit is reached.
- Add a “human approval required” step for any write action.
Mini-lab 3: Denial-of-wallet simulation (safe)
- Create a test API key with strict quotas.
- Simulate bursts of usage (within your test environment) and verify alerts trigger.
- Verify the system degrades gracefully instead of failing hard.
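A safe way to script this in a test environment: fire a burst of requests through a strict test key and count how each one ends. `send_test_request` is a placeholder for your own test client, and the `degraded`/`rate_limited` response fields are assumptions about its shape.

```python
# Minimal sketch for Mini-lab 3: verify graceful rejection under a burst.
def simulate_burst(send_test_request, burst_size: int = 200) -> dict:
    outcomes = {"accepted": 0, "rejected_gracefully": 0, "hard_failures": 0}
    for i in range(burst_size):
        try:
            response = send_test_request(prompt=f"load-test message {i}")
            if response.get("degraded") or response.get("rate_limited"):
                outcomes["rejected_gracefully"] += 1
            else:
                outcomes["accepted"] += 1
        except Exception:
            outcomes["hard_failures"] += 1   # unhandled errors are not graceful
    return outcomes

# Expectation: with a strict test quota you should see mostly graceful rejections,
# zero hard failures, and a cost/volume alert firing during the burst.
```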
🚩 Red flags that should slow down your rollout
- No rate limiting and no quotas (“unlimited by default”).
- No max token caps or step limits (“it will be fine”).
- Agents can call tools repeatedly with no loop controls.
- No cost monitoring per user/org/key.
- Logs exist, but you can’t tie spend to a user/key/workflow.
🏁 Conclusion
Unbounded Consumption is the “LLM billing incident” risk category: too much inference, too many tokens, too many steps, and too little control.
If you want a safe baseline: set hard limits, add quotas, cap agent loops, monitor cost per user/org, and keep an incident playbook ready. That’s how you ship LLM apps without surprise outages—or surprise bills.