On Agent Memory Fidelity (Decant)

June 08, 2026

Since Sonnet 3.7 came out right before my last internship, I’ve barely hand-written code. Coding agents are getting insanely good, and I keep trusting these models to do more and more. I think we’re heading toward a world where most (if not all) code is written through and by agents.

That’s a double-edged sword. When you’re not writing code line by line, you stop thinking as hard about every edge case. The intent behind a change no longer lives in the diff; it lives in the chat transcript. Those transcripts matter more than ever, and they can run to millions of tokens. You rarely need all of that loaded into context to find what you’re after. So I’ve been toying with the idea of embedding compression. Most of the interesting work in the space happens through alternative training methods that make embeddings reversibly compressible, but I wanted to see how far a pure harness-engineering approach could go.

What follows is an exploration from a harness perspective. It would need a lot more testing to know if it’s any good, but I thought it was worth putting out there. Hopefully it inspires you to think of cooler solutions to this problem.

Index

0. Background: Points, deconstructed; Proposal
1. Decant: Fidelity Engine; Updated (Git) Blame; User Control
2. Evaluation: Selective Memory; Fanout; Provenance Lookup
3. What I'd Test Next
4. Afterword

Background

Agents are limited in how much they can keep in memory by their context windows, which usually range from 400K to 1M tokens. Once the window is more than about 50% full, model behavior degrades because old, irrelevant, stale, or noisy information remains in the prompt. This is called context rot. There are two main ways for context to rot in coding agents: (1) irrelevant chats and information are included in the prompt, or (2) messages are unnecessarily long and sparse.

To be concrete, there are three major points I’m interested in:

1. Agent chats describe intent better than the code written.
2. Every message, no matter how irrelevant, stays in-context.
3. Messages can be tens of thousands of tokens long, largely due to tool calls and reasoning.

Agentic coding tools periodically “compress” chat history when you’re about to fill up your context window. That helps with huge tool outputs, but only after they’ve been replayed for a while. It shrinks stale work, but does not remove it: the summary still rides through every future prompt as input tokens. And it weakens provenance. The original chat may contain the best explanation of why a line exists, but compaction flattens that history into a lossy blob with no reliable way to zoom back into the exact decision. The two failures are timing and addressability: compaction comes too late, and once it runs, the useful parts are no longer easy to control.

Points, deconstructed

Point #1 can be addressed by building infrastructure that maps commits to agent chat session IDs. This already exists as git-ai, a Git extension. It lets the agent (and the user) trace the user’s intent behind a piece of code, assuming the code was written with Codex, Claude Code, OpenCode, etc.

Point #2 can be fixed by allowing the agent to treat its context as a variable. Old work should not stay in the prompt just because it happened earlier in the chat. In theory, we can make an agent that can read, grep, and bash over its own context to edit it, similar to recursive language models (RLMs). We’ll call this an RGB-agent (credit to Alexis).

Point #3 is more interesting. It could be fixed by “compressing” messages into lower fidelity. There’s no way to reversibly compress embeddings of frontier models yet, though there is research on it. If we consider context as a long string, we can use the same RGB-agent from point #2 to compress and edit context after every message. However, we’d still have to continuously re-ingest long context over and over again, so token consumption piles up.

Proposal

Instead of treating context as one long string managed by a separate RGB-agent, Context can be viewed as an object that holds Message objects. Each Message contains data (timeSent, content, etc.). From this view, I believe that adding a fidelity field to Message lets an agent filter out obviously irrelevant context without having to re-ingest the full Context transcript at full fidelity.

That seems closer to how conversations work for humans. I do not remember every detail of a conversation (even mid-conversation). Some key parts stay vivid, some collapse into a gist, and tangents might disappear.

Agent sessions should have the same adaptive forgetting. Where it differs from humans: when detail matters again, the agent should be able to fetch the message in full fidelity.

The core distinction is between hiding context from the prompt and deleting it from memory. Decant lowers old work into cheap topic/message views, but keeps the exact messages addressable so the agent can reopen them only when needed.

Decant is my prototype of that idea. It gives agents (and users) tools to change message fidelity and explore Git blame-style code provenance.

Decant: Context as Message Objects w/ Fidelities

Decant is an OpenCode plugin prototype. The idea is to stop treating context as one long string. Instead, treat it as message objects grouped into topics/categories, then let the agent decide how much of each topic or message should stay in the next prompt.

There are two control layers:

Topic: multiple messages that describe the same thread of work. A topic can be rendered as full, summary, compressed, placeholder, or hidden.
Message: the building block of a topic. A single message can inherit the topic setting, force itself back to full, force a summary, or be hidden.

The agent does not have to reread the whole transcript. It can start from the topic map, lower the fidelity of stale topics, look at message summaries, then fetch exact messages later if the summary is not enough.

To be concrete, the tables below summarize the important fields of the Topic and Message objects in Decant.

`Topic` Field	What It Stores	Why It Exists
`topicID`	Stable ID for the topic. Currently named `blobID` in the prototype code.	Lets messages, tools, and UI controls refer to the same topic.
`summary`	Short topic-level summary across multiple messages.	Replaces many messages when the topic is rendered cheaply.
`messageIDs[]`	Message IDs assigned to the topic.	Lets Decant reopen exact messages when the topic summary is not enough.
`tokenEstimate`	Rough total size estimate for the topic.	Makes large topics visible before they eat the prompt.
`fidelity`	Topic render setting: `full`, `summary`, `compressed`, `placeholder`, or `hidden`.	Controls how much of the topic appears in the next prompt.

`Message` Field	What It Stores	Why It Exists
`parts[]`^†	Ordered OpenCode message parts: text, reasoning, tool calls/results. Each part has its own type-specific fields.	A message isn’t a continuous string, but a data structure.
`parts[].text`^†	Text or reasoning content when present.	Keeps normal prose and model-visible content.
`topicID`	Topic ID the message belongs to.	Applies topic fidelity through this grouping.
`summary`	Short message-level summary.	Replaces the full message when a topic or message is rendered cheaply.
`tokenEstimate`	Rough size estimate for the raw message.	Makes large messages visible even when provider token accounting is absent.
`fidelityOverride` / `hidden`	Message-level render controls.	Lets one message escape the topic default or disappear from future prompts.

^†: Already existing OpenCode metadata.

To be clear: the raw message data still exists, which is what makes the compression reversible. The prompt gets a cheaper view by default, and the agent can call the message_detail tool at any moment to “zoom in” on a message when needed.

Topic fidelity controls how a whole topic appears: full, summary, compressed, placeholder, or hidden. Message-level controls can force one message back to full, force it to summary, or hide it.
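As a rough sketch of how the two control layers could resolve into a prompt view, the snippet below models Topic and Message as Python dataclasses with fields loosely mirroring the tables above. Everything here is hypothetical rather than Decant's actual code: the helper names (`effective_fidelity`, `render_topic`), the single `content` string standing in for `parts[]`, and in particular the assumption that `compressed` renders as the topic summary and `placeholder` as the short stub.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Message:
    message_id: str
    topic_id: str
    content: str                              # stand-in for the ordered OpenCode parts[]
    summary: str
    fidelity_override: Optional[str] = None   # None means "inherit the topic setting"
    hidden: bool = False

@dataclass
class Topic:
    topic_id: str
    summary: str
    placeholder: str
    fidelity: str = "full"                    # full | summary | compressed | placeholder | hidden
    message_ids: List[str] = field(default_factory=list)

def effective_fidelity(msg: Message, topic: Topic) -> str:
    """Message-level controls win over the topic default."""
    if msg.hidden:
        return "hidden"
    return msg.fidelity_override or topic.fidelity

def render_topic(topic: Topic, messages: Dict[str, Message]) -> List[str]:
    """Return the lines this topic contributes to the next prompt."""
    lines: List[str] = []
    topic_line_emitted = False
    for mid in topic.message_ids:
        msg = messages[mid]
        level = effective_fidelity(msg, topic)
        if level == "hidden":
            continue
        if level == "full":
            lines.append(msg.content)
        elif level == "summary":
            lines.append(f"[{mid}] {msg.summary}")
        elif not topic_line_emitted:
            # Assumption: compressed/placeholder collapse to one topic-level line.
            text = topic.summary if level == "compressed" else topic.placeholder
            lines.append(f"[{topic.topic_id}] {text}")
            topic_line_emitted = True
    return lines

def message_detail(mid: str, messages: Dict[str, Message]) -> str:
    """Zoom in: the raw message is still stored, whatever the prompt shows."""
    return messages[mid].content

# Illustrative data only.
messages = {
    "m1": Message("m1", "auth_refactor", "long tool output ...", "Ran tests; 2 failures in auth."),
    "m2": Message("m2", "auth_refactor", "full patch discussion ...",
                  "Chose token refresh over re-login.", fidelity_override="full"),
}
topic = Topic("auth_refactor", "Refactored auth; tests pass.", "auth refactor (done)",
              fidelity="summary", message_ids=["m1", "m2"])

summary_view = render_topic(topic, messages)       # m1 as summary, m2 forced to full
topic.fidelity = "placeholder"
placeholder_view = render_topic(topic, messages)   # topic stub, m2 still full
```

The point of the sketch is the separation: lowering `topic.fidelity` changes only what is rendered, while `message_detail` can still return the exact original content.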

Fidelity Engine

Decant annotates the conversation as it happens. On each assistant turn, it asks the model to return the normal answer plus a hidden <annotation> tag, something like:

// ...Actual message...
// <annotation>
{
    "topic": "stable snake_case topic label for this assistant response",
    "is_new_topic": "boolean: whether this response starts a new topic",
    "message_summary": "summary of this assistant message only",
    "topic_summary": "running summary of the whole topic so far",
    "placeholder": "short 5-10 word stub for the topic",
    "key_facts": [
        "facts or decisions worth preserving through compression"
    ]
}
// </annotation>

Decant strips that annotation from the visible chat and stores it next to the session.
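The stripping step could look roughly like the sketch below. It assumes the annotation arrives as a JSON object between literal `<annotation>` tags at the end of the response, with real JSON values (for example an actual boolean for `is_new_topic`). The function name `split_annotation` and the fallback behavior on malformed JSON are hypothetical, not taken from the prototype.

```python
import json
import re
from typing import Optional, Tuple

ANNOTATION_RE = re.compile(r"<annotation>(.*?)</annotation>", re.DOTALL)

def split_annotation(raw: str) -> Tuple[str, Optional[dict]]:
    """Split a raw assistant response into (visible_text, annotation)."""
    match = ANNOTATION_RE.search(raw)
    if match is None:
        return raw.strip(), None
    # Remove the tagged span so the user never sees the bookkeeping.
    visible = (raw[:match.start()] + raw[match.end():]).strip()
    try:
        annotation = json.loads(match.group(1))
    except json.JSONDecodeError:
        annotation = None  # malformed annotation: keep the answer, drop the metadata
    return visible, annotation

# Illustrative response only.
raw = """The retry bug is fixed; see utils/retry.py.
<annotation>
{"topic": "retry_backoff_fix", "is_new_topic": true,
 "message_summary": "Fixed off-by-one in retry loop.",
 "topic_summary": "Diagnosed and fixed retry backoff bug.",
 "placeholder": "retry backoff bug fixed",
 "key_facts": ["max_retries is inclusive"]}
</annotation>"""

visible, annotation = split_annotation(raw)
```

Here `visible` is what the chat would display, and `annotation` is what would be stored next to the session.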

That gives the next prompt a menu of renderings. Recent work can stay full. Finished work can become message summaries. Distant work can become topic summaries. Old dead ends can become placeholders or disappear.

The annotation is usually on the order of 300-500 output tokens. You pay a small, consistent “tax” so that the agent can do its own bookkeeping and see the chat as a map. The agent can call view_context and set_fidelity to change that map before the next turn.

During OpenCode compaction, Decant also rewrites the compaction prompt so the summarizer respects those annotations instead of flattening every old turn equally. It can use set_fidelity to hide or shrink unimportant context.

Updated (Git) Blame

As I said before, I think most code will be written by agents. That often makes code less readable for humans, and it is also dangerous because a lot of code gets written without explicit intent. Agents handle edge cases and interpret specifications in ways the user may not have intended.

The data in the user’s messages (their intention) is just as important as the code output. And in large organizations with tons of people pushing code, this is especially the case.

One possible workflow with a blame_lookup tool call:

1. The agent calls blame_lookup on a file:line.
2. Run git blame on the line to get the commit hash.
3. Check the Agent Session <> Commit mapping to get the session ID.
4. Spawn a sub-agent on that session with the original question about intent, plus the topics in the session.
5. View message summaries of the selected topics, zooming in as necessary.
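The first three steps of that workflow can be sketched as below. This is a minimal illustration, not the prototype's code: it shells out to the standard `git blame --porcelain -L` command, and it represents the commit-to-session mapping as a plain dict, which is an assumption (the real mapping would come from git-ai or Decant's local backup, whose formats are not shown here). The function names are hypothetical.

```python
import subprocess
from typing import Dict, Optional, Tuple

def parse_blame_commit(porcelain: str) -> str:
    """The first token of `git blame --porcelain` output is the commit hash."""
    return porcelain.split(maxsplit=1)[0]

def blame_commit(path: str, line: int, repo: str = ".") -> str:
    """Run git blame on a single line and return the commit that last touched it."""
    result = subprocess.run(
        ["git", "blame", "--porcelain", "-L", f"{line},{line}", "--", path],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return parse_blame_commit(result.stdout)

def session_for_commit(commit: str, mapping: Dict[str, str]) -> Optional[str]:
    """Look up the agent session recorded for a commit, if any."""
    return mapping.get(commit)

def blame_lookup(path: str, line: int, mapping: Dict[str, str],
                 repo: str = ".") -> Tuple[str, Optional[str]]:
    """file:line -> commit -> session ID (None if the commit has no recorded session)."""
    commit = blame_commit(path, line, repo)
    return commit, session_for_commit(commit, mapping)

# Made-up porcelain header and mapping, for illustration only.
sample_sha = "3f2a9c1d4e5b6a7f8091a2b3c4d5e6f708192a3b"
sample_porcelain = f"{sample_sha} 12 12 1\nauthor Jane Doe\nsummary Fix retry loop\n"
sample_mapping = {sample_sha: "ses_demo_001"}

commit = parse_blame_commit(sample_porcelain)
session_id = session_for_commit(commit, sample_mapping)
```

A `None` session ID is the case worth handling explicitly: the line predates agent-written commits or was written outside a tracked session, so there is no transcript to hand to the sub-agent.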

Now you’ve (1) allowed agents to make use of other agents’ transcripts, and (2) created an efficient way for them to search for what they need with the fidelity engine. Pretty cool.

For the “Agent Session <> Commit mapping” specifically, Decant checks whether git-ai is installed and whether it already has a mapping. It also keeps a local backup of the mapping in a hidden .opencode folder inside each project.

User Control

Decant exposes the context map to the developer for transparency and manual control. You can see what the model is likely to read, how large each topic is, and what has been collapsed or hidden.

The sidebar and /context show the conversation grouped by topic. Each topic has a token estimate and a fidelity setting. From /context, you can set a topic to full, summary, compressed, placeholder, or hidden. You can also override individual messages’ fidelity.

Figure: Decant’s /context popup, showing topics, token share, and fidelity controls.

/blame is a manual trigger for the blame_lookup tool call described above. You can enter a file:line or ask a natural-language question about past chats. Decant runs blame_lookup, finds the relevant prior session, and explains why a line is present based on the past conversations about it.

Evaluation

I wanted to test whether Decant actually changes the shape of agent memory. Old conversations should not sit in every future prompt just because they happened earlier. But they also cannot disappear, because later tasks may need an exact old fact or the reason behind a line of code.

I tested that in three ways:

The first eval asks whether old facts can be recovered after the old session has been compressed.
The second asks what happens when unrelated future work piles up.
The third starts from code: given a file:line, can Decant route back to the Codex session/message that explains it?

All of this is small and hand-built. The point is to check the memory routes, not to claim Decant makes agents better programmers per se. That would require more rigorous benchmarks with harder problems.

I also use three main methods throughout the evaluations that are worth defining.

Method	What it does
Default compaction	Carries a compacted summary of the old transcript. Default behavior of OpenCode.
RGB-agent	Writes a smaller memory file, then carries that file forward.¹
Decant	Keeps old detail outside the prompt, then opens exact messages only when needed.

Selective Memory Under Future Work

I started with 8 old topics, then asked 4 exact recall questions and 48 unrelated current-work questions. I ran this three times with openai/gpt-5.5.

Default compaction recovered 3 of 12 old facts. RGB-agent and Decant recovered all 12. Decant did it without carrying old context into the unrelated turns, which made it about 22% cheaper than default compaction.

Condition	Runs Passed	Old-Fact Recall	Cost vs Default²	Old Context Carried³
Default compaction	0/3	3/12	baseline	259K chars
RGB-agent	3/3	12/12	+1%	199K chars
Decant	3/3	12/12	-22%	0 chars
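For reference, the "Cost vs Default" column follows the fixed price model given in the cost footnote. The sketch below writes that formula out directly; the token counts in the example are made up to show the arithmetic and are not measurements from these runs, and the function names are hypothetical.

```python
# USD per 1M tokens, from the fixed GPT-5.5 price model in the cost footnote.
PRICE_PER_M = {"input": 5.00, "cache_read": 0.50, "output": 30.00}

def estimated_cost(input_tokens: int, cache_read_tokens: int,
                   output_tokens: int, reasoning_tokens: int) -> float:
    """Normalized cost estimate in USD; tool and runtime costs are not included."""
    total = (
        input_tokens * PRICE_PER_M["input"]
        + cache_read_tokens * PRICE_PER_M["cache_read"]
        + (output_tokens + reasoning_tokens) * PRICE_PER_M["output"]
    )
    return total / 1_000_000

def relative_cost(cost: float, baseline: float) -> float:
    """Signed percent change versus a baseline; negative means cheaper."""
    return (cost - baseline) / baseline * 100

# Hypothetical token counts, chosen only to illustrate the calculation.
example_cost = estimated_cost(
    input_tokens=1_000_000, cache_read_tokens=2_000_000,
    output_tokens=10_000, reasoning_tokens=10_000,
)
```

Because output and reasoning tokens are priced six times higher than fresh input and sixty times higher than cached input, a method can carry less context and still lose on cost if it spends more tokens reasoning about what to fetch.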

Fanout

Fanout is where carrying memory gets dumb. The old-memory demand stays fixed at 4 recall questions. The unrelated work grows from 24 to 96 future turns. RGB rereads its memory file on every one of those turns. Decant only opens old memory for the recall questions.

Unrelated Future Turns	RGB Query Tokens	Decant Query Tokens	Decant Total Cost vs RGB	RGB Old Context Carried
24	380K	158K	-46%	871K chars
48	731K	277K	-44%	1.8M chars
96	1.4M	514K	-30%	3.5M chars
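The "Old Context Carried" column is a simple product, per its footnote: the size of the maintained memory artifact multiplied across unrelated future turns. A small sketch of that metric follows. The 36,000-character artifact size is a hypothetical value picked for illustration, since the post does not report the actual file size.

```python
def carried_context_chars(artifact_chars: int, unrelated_turns: int,
                          carries_memory: bool) -> int:
    """Prompt exposure (not billing): characters of old memory replayed into unrelated turns."""
    if not carries_memory:
        return 0  # memory is opened only for recall questions, never for unrelated work
    return artifact_chars * unrelated_turns

ARTIFACT_CHARS = 36_000  # hypothetical memory-file size, for illustration only

# Exposure grows linearly with unrelated work when the file rides along on every turn.
carrying = {turns: carried_context_chars(ARTIFACT_CHARS, turns, True) for turns in (24, 48, 96)}
on_demand = {turns: carried_context_chars(ARTIFACT_CHARS, turns, False) for turns in (24, 48, 96)}
```

Doubling the unrelated turns doubles the exposure for a method that carries its memory file, while an on-demand method stays at zero regardless of how much unrelated work piles up.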

Updated Blame / Provenance Lookup

The provenance eval starts from code, not from a memory question. Given a file:line, Decant follows blame to a commit, maps that commit to a Codex session, then opens the topic/message that explains the line.

file:line -> git blame commit -> agent session -> topic/message -> rationale

The past chats here are Codex-generated eval fixtures. The repos are tiny, the commits are seeded, and the decoy sessions are known ahead of time. This is not evidence that Decant handles a messy real codebase yet. It checks whether the blame-to-chat route works when the evidence exists.

A post-hoc GPT-5.5 judge scored each answer from 0 to 1.⁴

Condition	Judge Score	Cost vs Default²	Output + Reasoning
Default compaction	0.93	baseline	3.6K
RGB-agent	0.95	+26%	4.5K
Decant	0.93	-18%	1.6K

Scores are basically tied. The useful signal is the route Decant took: the old chat is still inspectable. Decant can find the supporting session/message, cite it, and answer from that evidence instead of relying on one flattened summary.

What I’d Test Next

The next eval should be SWE-recall: real SWE-bench tasks, hints_text as prior issue-discussion memory, and the official grader still used for patch quality. Then compare Decant against Default, RGB-agent, and embedding retrieval. That is where the idea either becomes useful for coding or stays a memory demo.

I also want failure-focused checks: wrong summaries, stale summaries, useful detail hidden too aggressively, bad topic boundaries, and expensive recovery routes. Those are the failures that would actually make this dangerous or pointless.

Afterword

The prototype works well enough to make the eval question sharper. I do not think the hard part is proving that agents can summarize old chats. They can. The hard part is proving that a context map helps during real coding work: fewer stale mistakes, better provenance, lower prompt load, and recoverable detail when another agent picks up the thread later.

I’m glad I was able to get this working without model training. That path still seems worth trying, but the immediate question is more basic: can memory infrastructure around frontier models make long-running agent work less lossy and less sticky?

¹ In these evals, RGB-agent means an editor turn writes rgb-context.md, then the next turn receives only that rewritten file.
² Estimated cost uses a fixed GPT-5.5 standard price model as of 2026-05-17: input_tokens * $5/M + cache_read_tokens * $0.50/M + (output_tokens + reasoning_tokens) * $30/M. Tool/runtime costs are not included. For non-GPT runs, treat this as a normalized estimate, not the provider bill.
³ Old context carried is the size of the maintained memory artifact multiplied across unrelated future turns. It is a prompt-exposure metric, not a billing metric. The memory-infra summary is in artifacts/benchmark-runs/memory-infra-frontier.md.
⁴ Judge scores are post-hoc semantic scores over saved artifacts, not the original benchmark pass/fail bit. I used openai/gpt-5.5 as the judge. For provenance lookup, the judge saw the question, expected rationale facts, forbidden distractor terms, final answer, and expected session/message citation. Full judge output is in artifacts/benchmark-runs/blog-judge/default-compaction-gpt55-judge/.