Your Agent Has a Spending Problem

Cost is an architectural property, not a billing surprise. Treat your agent as a metered process, or a few sessions will eat a month of budget in a week.

AI Architect. I design the boring control plane plumbing that keeps your impressive demos from quietly setting themselves on fire in production.

Last updated: June 2026, reflecting current prompt caching, model routing, and orchestrator level budget enforcement patterns.

Picture a familiar moment.

An agent has been running in production for two or three weeks. Adoption is climbing. The dashboards from Part 4 are mostly green. The team is feeling rather good about itself. Then someone opens the cost report and discovers that a handful of sessions, from a handful of users, burned more compute in one week than the entire pilot budget for the month.

Open the traces. It is always one of the same few stories:

  • A planning loop that forgot how to stop.
  • A retrieval step that fired on every single turn instead of once per session.
  • A context window that doubled every turn because nobody summarized the history.
  • A retry policy that cheerfully reran an expensive call eight times before giving up.
  • A pile of tool descriptions that quietly added thousands of tokens to every model call.

The fix is the same in every case: bound the system before the system unbounds your budget.

Cost in an agentic system is not a billing concern. It is an architectural property. Treat it like one from day one, or rewrite the architecture later under pressure, probably at 3 AM, probably during a demo.


Why Cost Is a Reliability Problem

Engineers love to file cost under finance and reliability under engineering. In agentic systems, that tidy separation falls apart.

When an agent's cost is unbounded:

  • Rate limiting becomes inevitable. You throttle to stay within budget, which degrades the latency and throughput your users actually feel.
  • Pricing becomes a fence. If serving one user reliably costs more than they pay you, you are pushed into tiered access, which is a product redesign wearing a billing costume.
  • Fallbacks become coupled. "When cost exceeds X, fall back to a cheaper model" stops being a config value and starts touching every code path.
  • Incidents become economic. A cost spike at 3 AM is an incident, whether or not your runbook has the good manners to admit it.

Cost overruns force the same cascading rework that any reliability boundary would. An agent that costs 100 times its budget on the wrong query is not a billing surprise. It is an availability failure that happens to be denominated in dollars.

The reframe that helps: treat cost as part of your SLO, not as an apology you make after the fact.


The Cost Decomposition

Every agent's cost factors into a small number of pieces. The tidy version looks like this:

session_cost ≈ cost_per_token × tokens_per_call × calls_per_turn × turns_per_session

That is a useful mental model, and it is also lying to you a little. There is no single cost_per_token. Input tokens, output tokens, and cached tokens are priced very differently. Output usually runs three to five times the price of input. One current frontier model, for example, charges five dollars per million input tokens and twenty five dollars per million output tokens, which is exactly five times. Cached input is cheaper still, often a tenth of the normal rate.

So the honest version is closer to this:

session_cost ≈ Σ over calls of ( input_tokens × input_price + output_tokens × output_price )
   where input_price itself splits into cached and uncached rates

Why does this matter? Because three of the tactics later in this article (output caps, prompt caching, and structured output) only make sense once you stop pretending input and output cost the same thing. Keep the simple formula as a map of your levers. Keep the honest one for when the bill arrives.
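
Here is a minimal sketch of the honest formula. The prices are illustrative and mirror the example above (5 and 25 dollars per million tokens, with cached input assumed at a tenth of the input rate). The names Prices, Call, call_cost, and session_cost are invented for this sketch, not taken from any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Dollars per million tokens. Illustrative values, not a vendor quote."""
    input_uncached: float = 5.00
    input_cached: float = 0.50  # assumption: one tenth of the uncached rate
    output: float = 25.00


@dataclass(frozen=True)
class Call:
    """Token counts for a single model call."""
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int


def call_cost(call: Call, prices: Prices) -> float:
    """Cost of one model call in dollars, pricing each token class separately."""
    total_per_million = (
        call.uncached_input_tokens * prices.input_uncached
        + call.cached_input_tokens * prices.input_cached
        + call.output_tokens * prices.output
    )
    return total_per_million / 1_000_000


def session_cost(calls: list[Call], prices: Prices) -> float:
    """Sum over calls, matching the honest version of the formula."""
    return sum(call_cost(c, prices) for c in calls)
```

Under these assumed prices, a call with 2,000 uncached input tokens, 8,000 cached input tokens, and 500 output tokens costs about 2.65 cents, and the 500 output tokens account for nearly half of it.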

Read as a map, the factors point at four levers:

  • cost per token is the model and caching lever
  • tokens per call is the prompt and output lever
  • calls per turn is the reasoning depth lever
  • turns per session is the conversation shape lever

The tempting move is to grab the most visible lever, usually "switch to a cheaper model," and call it a day. That misses the part where these multiply.


The Multiplier Effect

Here is the part that is easy to miss until you write it down.

A 20 percent cut in any single factor cuts total cost by 20 percent. Fine. But the factors compound. Halve the prompt size, halve the calls per turn, and halve the turns per session, and you have cut cost to one eighth of where it started. A factor of eight, out of three ordinary looking changes.

One honest caveat the breathless version always skips: these factors are not fully independent. Summarizing history to shrink tokens per call costs you an extra model call, which nudges calls per turn back up. A routing classifier adds a call. Retrieving everything once instead of every turn moves cost from one factor to another rather than deleting it. So the clean factor of eight is a best case you approach, not a guarantee you collect.
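
The arithmetic is short enough to check by hand, but a sketch makes the caveat concrete. Each factor is normalized so the baseline is 1.0, the price per token is held fixed, and the 0.1 of extra calls per turn for summarization is an assumed figure, not a measurement.

```python
def relative_cost(tokens_per_call: float,
                  calls_per_turn: float,
                  turns_per_session: float) -> float:
    """Session cost relative to a baseline where every factor equals 1.0."""
    return tokens_per_call * calls_per_turn * turns_per_session


# Best case: all three factors halved, no side effects.
best_case = relative_cost(0.5, 0.5, 0.5)

# Assumed overhead: summarizing history adds back 0.1 calls per turn.
with_overhead = relative_cost(0.5, 0.5 + 0.1, 0.5)

print(f"best case:     {best_case:.3f} of baseline")
print(f"with overhead: {with_overhead:.3f} of baseline")
```

The best case lands at one eighth of baseline. With the assumed overhead it lands nearer 15 percent, which is still a very good day at the office.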

The real lesson survives anyway. A cost project that pushes on one lever usually disappoints. A cost project that pushes on all four, while respecting the tradeoffs between them, tends to compound in your favor.


Patterns That Cap Cost per Call

These reduce tokens per call and the effective cost per token.

Prompt caching

All three major closed vendors (OpenAI, Anthropic, and Google) expose some form of prompt caching: a stable prefix (system instructions, tool catalogs, long standing context) gets cached on the server and reused at a fraction of the input price. The discounts are real and large, but they vary by vendor and by model generation: roughly half off at the low end and as much as 90 percent off cache hits at the high end. Check the current pricing page before you bake a number into a forecast.

Three things the marketing pages tend to mumble:

  • Order matters. Put the most stable content first and the least stable content last: system instructions, then tool descriptions, then any session stable documents, then conversation history, then the current user message. Anything that changes per turn belongs after the prefix, not inside it.
  • Retrieved documents are only cacheable if they actually repeat. In ordinary retrieval the documents change every query. Drop them into the prefix and you will not just miss the cache, you will bust it for everything behind them. They count as a stable prefix only under a retrieve once per session pattern.
  • Caching is not free. Some vendors charge a premium to write the cache, more than a normal input token, and discount only the reads. If a prefix is not reused enough times, caching costs you money. There is a break even point, and short low reuse sessions sit on the wrong side of it.
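
The break even point in the last bullet can be sketched in a few lines. Costs are expressed in units of one normally priced read of the prefix. The multipliers used in the note below (1.25 to write, 0.1 to read) are assumptions for illustration, and the sketch ignores cache expiry, which in practice pushes the break even point further out for slow sessions.

```python
def prefix_costs(uses: int, write_multiplier: float,
                 read_multiplier: float) -> tuple[float, float]:
    """Return (cached, uncached) cost of sending the same prefix `uses` times.

    The first call writes the cache at `write_multiplier`; every later call
    reads it at `read_multiplier`. Both are relative to the normal input price.
    """
    if uses < 1:
        raise ValueError("uses must be at least 1")
    cached = write_multiplier + (uses - 1) * read_multiplier
    uncached = float(uses)
    return cached, uncached


def break_even_uses(write_multiplier: float, read_multiplier: float) -> int:
    """Smallest number of calls at which caching is strictly cheaper."""
    if read_multiplier >= 1.0:
        raise ValueError("cached reads must be cheaper than normal reads")
    uses = 1
    while True:
        cached, uncached = prefix_costs(uses, write_multiplier, read_multiplier)
        if cached < uncached:
            return uses
        uses += 1
```

With a 1.25 write multiplier and a 0.1 read multiplier, break_even_uses returns 2: a prefix used once loses money, a prefix used twice already wins. A harsher assumed pairing of 2.0 and 0.5 does not pay off until the fourth call.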

The payoff is biggest exactly where agent prompts are already fat: long system prompts, sprawling tool catalogs, and a generous pile of examples.

Tool description trimming

Every tool description rides along in the system prompt on every call. Forty tools with chatty descriptions can easily be several thousand tokens, paid on every call, in every session, forever, including the calls where the agent was only ever going to use one of them. Two moves:

  • Trim each description to the minimum a capable model needs to choose correctly.
  • Route to a subset of tools per task category, so the agent only ever sees the tools relevant to what it is doing right now.
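
The second move can be sketched as a lookup, assuming the task category has already been decided upstream. The tool names, descriptions, and categories below are hypothetical.

```python
# Hypothetical catalog: tool name -> trimmed description.
TOOL_CATALOG = {
    "search_orders": "Look up an order by ID or customer email.",
    "refund_order": "Issue a refund for a given order ID.",
    "search_docs": "Search product documentation.",
    "create_ticket": "Open a support ticket for human follow-up.",
}

# Hypothetical mapping from task category to the tools that category needs.
TOOLS_BY_CATEGORY = {
    "billing": ["search_orders", "refund_order", "create_ticket"],
    "how_to": ["search_docs"],
}

# Small fallback set for categories the mapping does not know about.
DEFAULT_TOOLS = ["search_docs", "create_ticket"]


def tools_for(category: str) -> dict[str, str]:
    """Return only the tool descriptions relevant to this task category."""
    names = TOOLS_BY_CATEGORY.get(category, DEFAULT_TOOLS)
    return {name: TOOL_CATALOG[name] for name in names}
```

A "how_to" request now carries one tool description instead of four. With forty tools the saving per call is proportionally larger.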

Output token caps

Set an explicit ceiling on output length, both in the API parameter and in the instructions. The classic money fire here is an unconstrained reasoning step that thinks out loud for several thousand tokens before it gets around to answering.

Structured outputs

When the output is meant to be read by a machine, force a schema (JSON schema or a function call format). This deletes the verbose framing prose and reduces both cost and the rate at which your parser falls over.

Smaller models for subtasks

Most agent workloads are not one heroic frontier model call. They are a sequence: classify the request, pick a tool, fill in arguments, summarize a tool response, decide the next step. Plenty of those steps do not need the expensive model, and quietly resent being charged as if they did.


Model Routing, the Biggest Single Lever

Model routing means choosing the cheapest model that can actually do the job, per step, decided at runtime. It is one of the largest cost levers in any agent architecture, and one of the most neglected.

The pattern that holds up:

  1. A small, cheap classifier (or even a fine tuned encoder) sorts the incoming request.
  2. Routine, narrow tasks go to a smaller, cheaper model.
  3. Hard tasks (multistep planning, ambiguous reasoning, novel tool composition) go to the frontier model.
  4. Filling in tool arguments and writing short summaries almost never need the frontier model.

The common failure is sending every step to the frontier model "just to be safe." Safe for whom? That habit compounds with every other multiplier in the decomposition: you end up paying the top rate per token precisely on the tasks that deserve it least.

Architecturally, the routing layer sits between the orchestrator and the model provider, driven by policy, not baked into the agent loop where it will be quietly forgotten.
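
A minimal sketch of a policy driven router, assuming the step type comes from the cheap classifier in step 1. The model names and step labels are placeholders, and sending unknown step types to the small model is one policy choice among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingPolicy:
    """Routing policy kept as data so it can change without touching the agent loop."""
    small_model: str = "small-model"        # placeholder name
    frontier_model: str = "frontier-model"  # placeholder name
    frontier_steps: frozenset = frozenset(
        {"multistep_planning", "ambiguous_reasoning", "novel_tool_composition"}
    )


def choose_model(step_type: str, policy: RoutingPolicy) -> str:
    """Pick the cheapest model the policy allows for this step type."""
    if step_type in policy.frontier_steps:
        return policy.frontier_model
    # Everything else, including argument filling and short summaries,
    # defaults to the small model.
    return policy.small_model
```

Because the policy is a plain value, the orchestrator can swap it per tenant or per experiment, and the agent never needs to know which model answered.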


Patterns That Cap Calls per Turn

Now the second factor: how many model calls happen inside a single user turn.

Single shot tool calling versus planning loops

A loop that calls the model to think, then to plan, then to act, then to reflect, then to decide whether to keep going, with each step as its own call, multiplies cost per turn with great enthusiasm. Often the same result comes from one tool calling prompt that lets the model emit a plan and the first action together.

Early exit conditions

Planning loops need a stopping condition that is not only "the model feels done." Add structural exits: maximum iterations reached, no new information gained since the last iteration, budget below threshold, confidence above target.
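
The four structural exits can be sketched as a single check the runtime runs before each iteration. The thresholds are assumed defaults, and how you measure "new information" or "confidence" is left to your system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopState:
    iteration: int                 # iterations completed so far
    new_facts_last_iteration: int  # how much new information the last pass produced
    remaining_budget_usd: float
    confidence: float              # 0.0 to 1.0, however your system estimates it


def exit_reason(state: LoopState, *,
                max_iterations: int = 6,
                min_budget_usd: float = 0.01,
                target_confidence: float = 0.9) -> Optional[str]:
    """Return why the loop must stop, or None if it may continue."""
    if state.iteration >= max_iterations:
        return "max_iterations"
    if state.remaining_budget_usd < min_budget_usd:
        return "budget_below_threshold"
    if state.iteration > 0 and state.new_facts_last_iteration == 0:
        return "no_new_information"
    if state.confidence >= target_confidence:
        return "confidence_reached"
    return None
```

Returning a reason rather than a bare boolean keeps the traces honest: you can see later which exit actually fired.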

Budget aware reasoning

Drop the remaining budget into the model's context as a soft hint: "You have four cents of compute left for this session. Reach the answer in as few calls as you can." Modern models do respond to this. They do not respond reliably enough to replace hard enforcement, so treat it as a nudge that shifts behavior toward shorter chains, and nothing more. A polite suggestion is not a budget.

Retrieve once versus retrieve every turn

If retrieval can be done up front, do that. Iterative retrieval, where the agent decides to fetch more on every turn, is more flexible, but it multiplies retrieval cost across the whole session. Pay for iteration only when the task genuinely needs it.


Patterns That Cap Turns per Session

The third factor: how long a session runs.

Hard turn limits with a graceful handoff

Set an explicit ceiling on turns per session. When it is hit, the agent either delivers its best answer or hands off to a human. "Let me try harder" is not a valid response to a turn limit.

Loop detection

If the agent calls the same tool with substantially the same arguments three times in a row, pull the plug. This catches the very common case where the model is stuck repeating a step that stopped producing new information two attempts ago.
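
A minimal sketch of that check. "Substantially the same" is approximated here by trimming and lowercasing string arguments before comparing, which is an assumption. Real systems may want something fuzzier.

```python
import json
from collections import deque


def _fingerprint(tool: str, args: dict) -> str:
    """Normalize a tool call so trivially different arguments compare equal."""
    normalized = {
        key: value.strip().lower() if isinstance(value, str) else value
        for key, value in args.items()
    }
    return tool + ":" + json.dumps(normalized, sort_keys=True, default=str)


class LoopDetector:
    """Flags when the last `max_repeats` tool calls were all the same call."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self._recent = deque(maxlen=max_repeats)

    def record(self, tool: str, args: dict) -> bool:
        """Record a call and return True if the agent looks stuck."""
        self._recent.append(_fingerprint(tool, args))
        window_full = len(self._recent) == self.max_repeats
        return window_full and len(set(self._recent)) == 1
```

The runtime calls record before dispatching each tool call and stops the turn when it returns True.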

Session timeouts

Independent of activity, sessions should expire. Long sessions accumulate context, and that context inflates every later call. Treat a suspiciously long session as a smell worth investigating, not as a heartwarming sign of engagement.

Per session cost caps

The bluntest lever: a hard ceiling on dollars per session. As it is approached, the orchestrator shifts into graceful degradation, namely a cheaper model, a shorter context, and an honest "I should hand this to a human" instead of a small fortune in quiet retries.
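
Graceful degradation can be sketched as a function from spend to operating mode. The 70 and 90 percent thresholds and the mode names are assumptions. Tune them to your own cost curve.

```python
def degradation_mode(spent_usd: float, cap_usd: float) -> str:
    """Map how much of the session cap is used to an operating mode."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    used = spent_usd / cap_usd
    if used >= 1.0:
        return "handoff"                      # stop and hand over to a human
    if used >= 0.9:
        return "cheaper_model_short_context"  # cheaper model plus trimmed history
    if used >= 0.7:
        return "cheaper_model"
    return "normal"
```

The orchestrator evaluates this before each call, so the session steps down in stages instead of hitting a wall at 100 percent.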


Hard Budgets at the Architecture Boundary

Soft constraints, like a budget number in the prompt, are useful and insufficient. Probabilistic systems will cheerfully violate soft constraints under the wrong conditions. Hard budgets have to live outside the model.

A workable enforcement model:

  • Per user budgets guard against one runaway account.
  • Per session budgets cap the blast radius of one bad reasoning chain.
  • Per tenant budgets, for B2B products, stop one customer from drinking the shared cost pool.
  • Per task type budgets stop low value tasks from spending like high value ones.

The enforcement layer belongs in the orchestrator, not in the agent. It checks before every model call. When the budget is spent, it returns a structured signal, a special tool result or a system message, that the agent can react to gracefully.

This separation is the whole point. The agent should never decide whether it has budget. The agent should only decide what to do now that it does not.
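
A sketch of what that enforcement layer might look like inside the orchestrator. The scope key format ("session:…", "user:…"), the shape of the signal dictionary, and the class name are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetGuard:
    """Hard budget check that lives in the orchestrator, outside the model."""
    limits_usd: dict                               # scope key -> ceiling in dollars
    spent_usd: dict = field(default_factory=dict)  # scope key -> dollars spent

    def check(self, scopes: list, estimated_cost_usd: float) -> Optional[dict]:
        """Run before every model call. None means the call may proceed."""
        for scope in scopes:
            limit = self.limits_usd.get(scope)
            if limit is None:
                continue  # no budget configured for this scope
            spent = self.spent_usd.get(scope, 0.0)
            if spent + estimated_cost_usd > limit:
                # Structured signal the agent can react to gracefully.
                return {
                    "type": "budget_exhausted",
                    "scope": scope,
                    "limit_usd": limit,
                    "spent_usd": spent,
                }
        return None

    def record(self, scopes: list, actual_cost_usd: float) -> None:
        """Run after every model call with the real cost."""
        for scope in scopes:
            self.spent_usd[scope] = self.spent_usd.get(scope, 0.0) + actual_cost_usd
```

The orchestrator passes every scope a call belongs to (session, user, tenant, task type), and the first exhausted scope wins. The agent only ever sees the signal, never the ledger.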


Cost Observability, the Bridge From Part 4

None of this works without being able to see cost.

The instrumentation from Part 4 (the per span cost field, the per session aggregates, the distribution dashboards) is the foundation that makes cost bounded architecture possible. Three things to keep tight:

  • Per call cost attribution. Every model call's cost is computed when you log it and tagged to a trace, a session, a user, and a task type.
  • Cost per resolved task as your headline metric. Cost per call lies to you. A cheap call retried ten times costs more than one expensive call that simply worked.
  • Drift alerts on cost. A 30 percent rise week over week in cost per resolved task is a signal. Paging someone about one expensive session is noise.
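
The headline metric and the drift alert are both one liners once the attribution exists. This is a sketch with assumed function names, using the 30 percent week over week threshold from the bullet above.

```python
def cost_per_resolved_task(total_cost_usd: float, resolved_tasks: int) -> float:
    """Headline metric: all spend, retries included, divided by tasks resolved."""
    if resolved_tasks <= 0:
        return float("inf")  # money spent, nothing resolved
    return total_cost_usd / resolved_tasks


def drift_alert(previous_week: float, current_week: float,
                threshold: float = 0.30) -> bool:
    """True when cost per resolved task rose by at least `threshold` week over week."""
    if previous_week <= 0:
        return False  # no baseline to compare against
    return (current_week - previous_week) / previous_week >= threshold
```

A two cent call retried ten times resolves one task for twenty cents. A ten cent call that works the first time resolves it for ten. Cost per call would have ranked them the other way around.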

Skip observability and every other pattern in this article becomes vibes based guesswork.


Failure Modes That Show Up as Cost

A short tour of the failure modes that hit the cost dashboard first, roughly in the order systems tend to meet them:

  • Planning loops that never terminate. No iteration cap, no convergence signal, frontier model on every step. A perpetual motion machine that runs on your money.
  • Context bloat. History grows without bound, and every later call pays the accumulated token tax.
  • Tool catalog creep. New tools added, old tools never removed, per call token cost drifting up quietly over months.
  • Retrieval over fetching. The retrieval count is set high "just in case," and every extra document fattens the next call.
  • Frontier model for everything. No routing layer, so every classification, every argument fill, and every two line summary pays premium prices.
  • Aggressive retries. Tool failures retried many times, each retry dragging a full model call along with it.
  • Unbounded parallelism. Multi agent or branching reasoning that fans out with no cap on the breadth times depth product.

These show up on the cost dashboard before they show up anywhere else, which makes that dashboard one of your better early warning systems, assuming anyone reads it.


What to Build First

A pragmatic order of operations:

  1. Per span cost attribution in your observability layer, carried over from Part 4. Non negotiable foundation.
  2. A hard per session cost cap in the orchestrator. The safety net you build before the trapeze.
  3. Output token caps on every model call. Cheap to add, immediate payoff.
  4. Prompt structure rework so the stable prefix comes first and caching can engage.
  5. A model routing layer between orchestrator and provider. High impact once observability exists.
  6. Loop detection and turn limits in the agent runtime.
  7. Per user and per tenant budgets when you go multi tenant.
  8. Drift alerts on cost per resolved task on the dashboards.

Each step leans on the one before it. Resist the urge to skip ahead to model routing before observability exists, unless you enjoy routing while blindfolded.


The Architect's Mental Model

The framing that survives contact with production:

An agent is a metered process. Treat cost as a first class architectural constraint: instrumented like latency, capped like memory, and decomposed like any other compound metric.

Reliability engineering has spent decades bounding resources that compound. Agent cost is just the newest member of that family. The patterns are not new. They are the same bounded resource thinking that gave us rate limiting, circuit breakers, bulkheads, and quotas in classical distributed systems. The vocabulary simply has not finished migrating yet.

What is genuinely new is that reasoning itself is the metered resource, and the system gets to decide, within limits, how much to spend on any given decision. That freedom is what makes agents powerful. It is also what makes them a thrilling way to set money on fire if you operate them without caps.

Bounded cost is bounded behavior. Everything else is hope, and hope does not appear on the pricing page.


This is Part 5 of Architecting Agents, a series on building agentic AI systems that survive production.
