Skip to content

Blog

Writings from the Sieve team about agent infrastructure, LLM context management, and the design decisions behind Sieve.

Posts are drafted with AI assistance and reviewed by the Sieve maintainer before publication. Every quantitative claim either links to a source or to a measurement we ran ourselves against a specific Sieve version — the version applicability is noted at the top of each post.

The two causes of your token bill

Applies to: Sieve v1.0.x

If you run an LLM agent for real work, the bill is the part nobody warned you about. It starts small, it grows with use, and the worst of it is invisible — most of what you pay for on any given turn is text the model has already seen, or text you never meant to send.

There's a temptation to treat this as one problem with one fix. It isn't. An agent's token bill has two distinct causes, and they need two genuinely different kinds of tool. This post is about telling them apart — because once you can, the question stops being "which tool wins" and becomes "which of my two problems am I looking at right now."

What always-on agents stand to gain from a context proxy

Applies to: Sieve v1.0.x

The most interesting agents of 2026 don't run in your terminal. They live in your chat apps. OpenClaw describes itself as "a self-hosted gateway that connects your favorite chat apps and channel surfaces… to AI coding agents." Hermes, from Nous Research, runs "on a $5 VPS, a GPU cluster, or serverless infrastructure" and lives on "Telegram, Discord, Slack, WhatsApp, Signal, and CLI — all from a single gateway process."

The defining property of this generation isn't a feature. It's that they're always on. And always-on is, by some distance, the heaviest context workload we've seen — which makes it worth working through what a context-reduction proxy would change for them.

The hidden cost of context

Applies to: Sieve v1.0.x

The conventional wisdom about LLM API costs is straightforward: tokens × price per token = bill. If you double the prompt, you double the cost. Predictable, linear, easy to budget.

This is wrong in a way that matters more the longer your agent runs.

Persistent memory for Ollama, in about five minutes

Applies to: Sieve v1.0.x

Ollama gives you a local LLM endpoint that is fast, private, and completely stateless. Close the chat, and everything you told the model is gone. Keep the chat open, and every turn re-sends a growing history until the context window fills up. Ask a local model about something it was never told, and — depending on the model — it may simply make something up.

This guide adds a persistent, encrypted memory to any Ollama setup using Sieve, without changing your client code beyond one URL.

Sieve, mem0, Zep: three shapes of agent memory

Applies to: Sieve v1.0.x

If you're shopping for a memory layer for an LLM agent in 2026, three credible shapes are on the table: an SDK you call from application code (mem0), a managed platform you push conversations into (Zep), and a transparent proxy that sits in the traffic path (Sieve). They get compared as if they were interchangeable. They aren't — and the differences that matter are architectural, not benchmark decimals.

We build Sieve, so read this knowing where our incentives sit. In exchange: every claim about the other two links to their docs and repos, fetched and quoted on 2026-06-10, and we'll be plain about where each of them is the better choice.

Compute is the bottleneck. Tokens are just the price tag.

Applies to: Sieve v1.0.x

There are three forces shaping what AI gets to exist in 2026, and only one of them gets talked about properly.

The first is the model. Which lab has the best one, what it can do, who can use it. This gets most of the press.

The second is the application. Which agent product works, which codebase you can let Cursor loose on, whether your CRM has Copilot yet. This gets most of the funding.

The third is the compute the first two run on. There isn't enough of it. There hasn't been enough of it since GPT-4. Every price you've ever seen for an LLM token is a passthrough of a scarcity that begins in a TSMC fab in Hsinchu and ends on your invoice.

This post is about that third force. It is also about Sieve, which is the small lever it sits behind. We won't claim Sieve fixes the compute bottleneck — nothing currently in the field does, and the honest framing of what Sieve contributes is more interesting than the dishonest one.

Why Sieve

Applies to: Sieve v1.0.x

When you set out to reduce the cost of running an LLM agent, you have three reasonable architectural choices: compact the prompts, add a memory library, or interpose a proxy. Sieve is the third. This post is about why we picked it — and what you give up by doing so.