The two causes of your token bill

Applies to: Sieve v1.0.x

If you run an LLM agent for real work, the bill is the part nobody warned you about. It starts small, it grows with use, and the worst of it is invisible — most of what you pay for on any given turn is text the model has already seen, or text you never meant to send.

There's a temptation to treat this as one problem with one fix. It isn't. An agent's token bill has two distinct causes, and they need two genuinely different kinds of tool. This post is about telling them apart — because once you can, the question stops being "which tool wins" and becomes "which of my two problems am I looking at right now."

The bill is mostly things you didn't choose

Start with where the tokens actually go, because it's rarely where people assume.

When your agent calls a tool, you don't just pay for your request; you pay for the machinery of asking. Anthropic's own pricing documentation spells this out: the tools parameter alone adds hundreds of tokens of schema to every request (the docs quote per-model figures in the high hundreds), the bash tool adds a fixed overhead, and a single web fetch pulls the fetched page straight into your context: "Average web page (10 kB): ~2,500 tokens... Research paper PDF (500 kB): ~125,000 tokens". A tool result you glance at once and never need again can cost more than the entire conversation around it.
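Both quoted figures imply the same ratio, roughly 4 bytes per token. A minimal sketch of that arithmetic follows, assuming 1 kB means 1,000 bytes and that the ratio holds for your content (real tokenizers vary with language and markup). The names `BYTES_PER_TOKEN` and `estimate_tokens` are illustrative and not part of any API.

```python
# Ratio implied by the quoted figures: 10 kB -> ~2,500 tokens.
# Assumption: 1 kB = 1,000 bytes; actual tokenization depends on the content.
BYTES_PER_TOKEN = 4


def estimate_tokens(size_bytes: int) -> int:
    """Rough token estimate for a fetched document of the given size."""
    return size_bytes // BYTES_PER_TOKEN


if __name__ == "__main__":
    print("average web page:", estimate_tokens(10_000))
    print("research paper PDF:", estimate_tokens(500_000))
```

This is only a planning estimate; the count your provider reports for a real request is the figure to trust.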

Now add the part that repeats. On every turn, a typical agent re-sends its system prompt, its full tool catalogue, its persona, and the conversation so far. The variable part of the request — what you actually typed — is often the smallest thing in the payload. We walked through the mechanics of that growth in The hidden cost of context; the short version is that the fixed overhead, multiplied across every turn of a long session, is the bill.
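To see how the repeated part adds up, here is a rough model of a session in which every turn re-sends a fixed overhead plus the whole conversation so far. It is a sketch under simple assumptions: each turn adds the same amount of new text, and no caching discount is applied. The function name and the example numbers are hypothetical.

```python
def session_input_tokens(fixed_overhead: int, new_tokens_per_turn: int, turns: int):
    """Return (fixed_total, history_total) input tokens over a whole session.

    Assumes every turn re-sends the fixed overhead (system prompt, tool
    schemas, persona) and the full conversation history up to that turn.
    """
    fixed_total = 0
    history_total = 0
    history = 0
    for _ in range(turns):
        history += new_tokens_per_turn   # this turn's text joins the history
        fixed_total += fixed_overhead    # standing context is sent again
        history_total += history         # and so is everything said so far
    return fixed_total, history_total


if __name__ == "__main__":
    # Hypothetical session: 3,000 tokens of overhead, 200 new tokens per turn.
    fixed, history = session_input_tokens(3000, 200, 20)
    print("fixed overhead:", fixed, "history:", history)
```

With these made-up numbers the fixed overhead outweighs the conversation itself, even though history grows every turn. Plug in your own figures before drawing conclusions.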

So the cost comes in two different shapes:

Verbose machine output: JSON tool results, logs, search dumps, fetched pages, code listings. Big, one-off, and mostly structural noise around a small signal.
Repeated standing context: the system prompt, tool schemas, persona, and history that ride along on every single turn, made worse by the lack of any memory that would let the agent stop re-sending it all.

These call for different interventions, and conflating them is why "just reduce my tokens" never quite works.

Two different jobs

Compressing verbose output is a content problem. You have a 10,000-token JSON blob; you want the model to get its meaning at a fraction of the size without losing the parts that matter. This is hard in an interesting way: it's about understanding the shape of the content (a deeply nested object, an AST, a log stream) and squeezing it while losing so little that the answer doesn't change.
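As a toy illustration of the content problem, the sketch below shrinks a JSON tool result by dropping empty fields and whitespace. This is not how Headroom or any other compression tool works; it is a deliberately simple stand-in, and it assumes that null and empty fields carry no signal for your task, which will not always be true. The helper names `compact` and `shrink` are hypothetical.

```python
import json

# Values treated as noise. Assumption: empty fields do not affect the answer.
EMPTY = (None, "", [], {})


def compact(value):
    """Recursively drop None, empty strings, and empty containers."""
    if isinstance(value, dict):
        cleaned = {key: compact(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in EMPTY}
    if isinstance(value, list):
        cleaned = [compact(item) for item in value]
        return [item for item in cleaned if item not in EMPTY]
    return value  # numbers, booleans, and non-empty strings pass through


def shrink(value) -> str:
    """Serialize the compacted value without any optional whitespace."""
    return json.dumps(compact(value), separators=(",", ":"))


if __name__ == "__main__":
    raw = {"id": 7, "name": "build", "error": None, "tags": [],
           "meta": {"note": "", "retries": 0}}
    print(len(json.dumps(raw, indent=2)), "->", len(shrink(raw)), "characters")
```

Note that `0` and `False` survive, because they are values rather than absences. Deciding what counts as noise is the hard part, and it is specific to each kind of content.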

Reducing repeated context is a traffic problem. The model has already seen your tool schemas and your standing instructions; the fix is to stop re-sending what it's seen, and to remember durable facts so they can be supplied on demand instead of permanently parked in the prompt. This isn't about any single payload's shape — it's about what crosses the wire, turn after turn, and what gets remembered between turns.
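The bookkeeping behind the traffic problem can be sketched in a few lines: remember a digest of each context block that has gone out, and flag which blocks on the next turn are new. This is an illustration of the idea only, not Sieve's implementation, and it leaves out how a real proxy ensures the model can still use context that is no longer re-sent. The class name `SeenBlocks` is hypothetical.

```python
import hashlib


class SeenBlocks:
    """Tracks which context blocks have already been sent in this session."""

    def __init__(self):
        self._seen = set()

    def filter_new(self, blocks):
        """Return only the blocks whose exact content has not been sent before."""
        fresh = []
        for block in blocks:
            digest = hashlib.sha256(block.encode("utf-8")).hexdigest()
            if digest not in self._seen:
                self._seen.add(digest)
                fresh.append(block)
        return fresh


if __name__ == "__main__":
    tracker = SeenBlocks()
    turn_1 = ["SYSTEM PROMPT", "TOOL SCHEMAS", "user: hello"]
    turn_2 = ["SYSTEM PROMPT", "TOOL SCHEMAS", "user: what next?"]
    print(tracker.filter_new(turn_1))
    print(tracker.filter_new(turn_2))
```

Exact-match hashing is the simplest possible choice here. A block that changes by a single character counts as new, so anything that varies per turn (timestamps, counters) would defeat it.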

You can have either problem without the other. An agent that does a lot of web research and tool-calling has a verbose-output problem even in a short session. A long-running personal assistant that mostly chats has a repeated-context problem even though no individual message is large. Most real agents have both, in different proportions — which is exactly why one tool rarely covers the whole bill.
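One way to find your own proportions is to split a request's input tokens into three buckets and compare the shares. The sketch below assumes you already have per-bucket token counts from your own logging; the bucket names and the example numbers are hypothetical.

```python
def cost_shares(tool_output: int, standing_context: int, user_text: int) -> dict:
    """Return each bucket's share of the input tokens as a fraction of 1."""
    buckets = {
        "tool_output": tool_output,
        "standing_context": standing_context,
        "user_text": user_text,
    }
    total = sum(buckets.values())
    if total == 0:
        return {name: 0.0 for name in buckets}
    return {name: count / total for name, count in buckets.items()}


if __name__ == "__main__":
    # Hypothetical research-heavy turn: one large fetched page dominates.
    shares = cost_shares(tool_output=6000, standing_context=3000, user_text=1000)
    largest = max(shares, key=shares.get)
    print(shares, "largest bucket:", largest)
```

If `tool_output` dominates, you are mostly looking at the verbose-output problem. If `standing_context` dominates, it is the repeated-context problem.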

Two tools, two halves

This is where it's worth being concrete, and fair to the projects doing this work.

Headroom is, in its own words, "the context compression layer for AI agents" — it targets the first problem. Its job is taking verbose content and making it smaller while "accuracy [is] preserved on standard benchmarks": JSON, code, logs, the bulky machine output that coding agents generate constantly. It's Apache 2.0, runs locally, and offers library, proxy, agent-wrap, and MCP modes. If your bill is dominated by tool outputs and search results, that's the shape of problem it's built for.

Sieve — the project behind this blog — targets the second. It's a proxy that strips the context the model has already seen from every outbound turn, and backs that with an encrypted local store of durable facts it can inject only when a turn needs them, rather than keeping everything in the prompt forever. It also refuses to invent answers about things it was never told. If your bill is dominated by the same standing apparatus re-sent on every turn, and by an agent that forgets you between sessions, that's its half.

Notice these are different halves. One makes a big payload smaller; the other stops a payload from being re-sent and gives the agent a memory so it doesn't have to be. They're not competing for the same job — they're addressing the two causes named above. In principle they compose: compression handling the verbose one-off content, a reduction-and-memory layer handling the repeated standing content. We think that's the honest picture of the space, and a better mental model than "pick the one tool that fixes tokens."

What this is worth to you

Set the percentages aside for a moment — every tool in this space quotes a big reduction number, and the numbers depend entirely on your workload. The value to you as a user is more concrete than any headline figure, and it's worth naming plainly:

Sessions that don't fall over. The most common real complaint isn't the monthly invoice — it's hitting a limit or a context wall in the middle of work. Spending fewer tokens per turn is, before anything else, more room to keep going.
A bill you can reason about. Both kinds of tool are observable: you can see what was sent before and after. A cost you can inspect is a cost you can manage, instead of a number that arrives at month-end.
Less re-explaining yourself. For the repeated-context half specifically, the payoff isn't only tokens — it's an agent that remembers your preferences and your project across sessions, so you stop re-establishing the same ground every time you open it.
Privacy you don't have to trade for savings. Both Headroom and Sieve run locally; Sieve additionally keeps its memory store encrypted on your own disk with no telemetry. Cutting your token bill shouldn't mean shipping your context to one more third party.

Honest limits

A few things we won't claim, because they aren't ours to claim yet.

We haven't run the two together. The "they compose" argument above is architectural — it follows from what each tool does, not from a tested pipeline we've measured. Treat it as a sound hypothesis, not a benchmarked result. If you stack them, we'd genuinely like to hear how it goes.

Reduction has a warm-up. A memory-and-reduction layer with an empty store can't save you much on day one; the savings arrive as it learns. Compression, by contrast, helps on the very first verbose payload. That difference matters when you're deciding which problem to tackle first.

The numbers are yours, not ours. Whatever either project's headline percentage, the only figure that means anything is the one you measure on your own workload. Sieve emits per-request diagnostics precisely so you can check ours rather than trust it; Headroom exposes its own stats. Use them.
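Measuring your own figure can be as simple as recording input tokens before and after a tool is applied, then computing the overall reduction. The sketch below makes no assumption about the format of Sieve's diagnostics or Headroom's stats; it expects pairs of counts you have collected yourself, and the function name is hypothetical.

```python
def overall_reduction_percent(pairs) -> float:
    """Percent of input tokens saved across all requests.

    `pairs` is an iterable of (tokens_before, tokens_after) per request.
    The result is weighted by size, so one large request counts for more
    than many small ones, which matches how the bill is computed.
    """
    total_before = 0
    total_after = 0
    for before, after in pairs:
        total_before += before
        total_after += after
    if total_before <= 0:
        raise ValueError("need at least one request with tokens_before > 0")
    return 100.0 * (total_before - total_after) / total_before


if __name__ == "__main__":
    # Hypothetical log: one request halved, one barely changed.
    log = [(1000, 500), (3000, 2500)]
    print(overall_reduction_percent(log))
```

Averaging the per-request percentages instead would overstate the saving in this example, because the request that shrank the most is also the smallest.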

The takeaway

The next time the bill jumps, the useful first question isn't "what cuts tokens" — it's "which of my two problems is this." If it's verbose tool output drowning a small signal, you want compression. If it's the same standing context re-sent every turn and an agent with no memory, you want reduction. Most agents need both, and the good news is the tooling for each now exists, is open source, and runs on your own machine. Knowing which half you're looking at is most of the battle.

This post was drafted with AI assistance and reviewed by the Sieve maintainer before publication. Cost figures are quoted from Anthropic's pricing documentation and Headroom's description is quoted from its README, both fetched on 2026-06-15; if we've misrepresented either, open an issue and we'll correct it. Sieve is open source under Apache 2.0.