Why Sieve¶

Applies to: Sieve v1.0.x

When you set out to reduce the cost of running an LLM agent, you have three reasonable architectural choices: compact the prompts, add a memory library, or interpose a proxy. Sieve is the third. This post is about why we picked it — and what you give up by doing so.

The three shapes¶

Compaction is what every agent framework already does. When the conversation gets too long, the framework keeps the last N turns, summarises the rest, and ships the summary plus the recent turns to the model. LangChain has it. The OpenAI Assistants API has it. Claude Code has it. It's the default.

Compaction is lossy by design. The summary throws away detail. If turn 3 contained a fact the model needs on turn 47, and the summariser dropped it, the model can't see it any more. The frame work has no theory of what the model needs next — only what happened before.

A memory library fixes the loss by storing facts in a database and retrieving them on demand. The library imports into your agent code. You call something like memory.recall(query) before each turn and inject the result into the prompt. Letta (formerly MemGPT), mem0, llama-index all sit at this layer. The mental model is "a vector store with smart access patterns."

The cost: you change your agent code. Every place the agent talks to the model now has to ask the library "what should I include this time?" That's fine if you control the agent. It's harder if you're using Cursor, Cline, Claude Code, or any of the increasingly- common closed-binary agent products — you can't import a library into something you don't have the source of.

A proxy sits between the agent and the model. The agent talks to it as if it were the LLM endpoint; it forwards a thinner version of the prompt to the actual LLM. The agent doesn't know the proxy exists. The model doesn't know the proxy exists. The integration is one URL change in the agent's config — that's the whole story.

This is what Sieve is.

What you give up¶

The proxy shape has real costs. We didn't pick it because it's obviously best; we picked it because, for the use case we care about, the trade-offs come out in its favour.

You give up the ability to know what the agent meant¶

A library lives inside the agent and knows that this turn is a follow-up to that earlier exchange. It has access to the agent's internal state, the tool calls it just made, the user's role, the session identity. A proxy sees only what comes over the wire — a prompt and a response. It has to infer the rest.

In practice, this means the proxy has to be smart about reading prompts. Sieve runs a classifier on every inbound request to decide whether it's a fact-share, a recall query, a tool call, or filler. A library wouldn't need this — the agent could just tell it.

The agent has its own scratchpad. The proxy doesn't see it. If the agent is mid-tool-call, the proxy can't help — it sees only the final prompt to the LLM, after the agent has already decided what to put in.

You add a network hop¶

Even at loopback, that's milliseconds of latency. For interactive use, undetectable. For high-throughput batch use, it adds up.

What you get back¶

For each cost, there's a corresponding gain — and they're the gains we cared about for v1.

Your agent code doesn't change¶

This is the load-bearing benefit. If you're using Cursor or Claude Code or any closed-binary agent, you can't add a library to it. You can change the URL it talks to. Sieve was designed for exactly that case: you point your existing tooling at Sieve and it just works.

One Sieve serves many agents¶

A library has to be imported into each agent that wants memory. A proxy serves them all from one place. On a developer workstation this is mostly aesthetic — but if you're running Sieve in front of a shared LLM endpoint for a team, the architectural difference matters. The memory store is one thing, not N things.

The trust boundary is yours¶

A library imports into your agent — its code runs in your process. A proxy is a separate process you control. You can restart it, audit its logs, swap its store backend, run it under a different user, put a firewall between it and the agent. The boundary between "my agent" and "my memory" becomes a real boundary, not a Python import.

For users who care about where their memory store sits — and the "local-first, encrypted-on-disk, no telemetry" framing in Sieve's README is aimed at exactly those users — this is a real win.

Where Sieve fits in the landscape¶

We don't think Sieve replaces compaction or memory libraries. They operate at different layers:

Layer	Owner	Tradeoffs
Compaction	The agent framework	Lossy, no theory of future need, free
Memory library	Your agent code	High control, requires source access
Proxy (Sieve)	Independent process	Transparent, no source access needed, network hop

If you're building your own agent from scratch and you have the source, a library is often the right choice. If you're trying to reduce the cost or improve the recall of an agent you didn't write — or want a memory layer you can audit and replace independently — a proxy is a better fit.

Sieve is opinionated about the second case. That's why we built it.

What this means in practice¶

If you have an agent that:

Hits the same LLM endpoint repeatedly
Sends a lot of context per turn (tool schemas, history, instructions)
Is something you can configure the LLM URL of, but maybe not modify

…then Sieve is worth a serious look. You change one URL, you run sieve-install, and the agent starts paying noticeably less per turn. The model starts seeing leaner prompts. The recall of facts across long sessions stops being something the agent itself has to solve.

If you have an agent that doesn't fit that description — say, you're building it from scratch and you want fine-grained control over what the model sees on each turn — a memory library will probably serve you better. We say this honestly because we'd rather have a smaller user base of correctly-targeted users than a larger one of frustrated ones.

To try Sieve against your own agent:

pipx install llm-sieve
sieve-install

Then point your agent at http://127.0.0.1:11435 instead of the LLM endpoint it currently talks to. Run sieve demo first to see what it does in a controlled setting; run sieve benchmark to measure the token reduction on your own hardware.

This post was drafted with AI assistance and reviewed by the Sieve maintainer before publication. Code examples were verified to run against Sieve v1.0.0.