The Memory Problem

LLMs start from zero every time. That's not just a memory problem; it's a development workflow problem too.

I had a conversation with a coworker recently that spiralled (in the best way) into something I've been half-thinking about for a while. We started on spec-driven development and ended up somewhere more interesting: whether the way we think about LLM memory is fundamentally backwards.

Let me explain what I mean.

Context is not free

There's a tempting approach to giving an LLM working memory: pull in a load of markdown files and dump them into the context. It's easy to understand, easy to inspect, and developers can edit the files directly. I get the appeal. I've done it.

The problem is that humans and LLMs handle information very differently. When you skim a document, you naturally filter; you know what's relevant before you've fully read it. LLMs can't do that. Everything in the context window gets equal weight. Larger models can work around this, but at the smaller end of the scale the signal-to-noise ratio becomes a real problem. You're spending tokens and reasoning capacity on things that just don't matter for the current query.

A vector store or graph database is a rough approximation of what humans do naturally: retrieve the relevant parts, not the whole. It's not perfect (sometimes you do need the full picture), but for the "I need to remember this specific detail" problem, it's a better fit than flooding the context.
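To make "retrieve the relevant parts" concrete, here's a minimal sketch. It stands in a toy bag-of-words similarity for a real embedding model, and all the function names and example memories are illustrative, not any particular library's API:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector.
    # A real system would use a learned embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, memories: list[str], k: int = 2) -> list[str]:
    # Return only the k most relevant memories, not the whole store.
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:k]

memories = [
    "we use Rust for this project",
    "deploys happen every Friday",
    "the auth service talks to Postgres",
]
print(retrieve("which database does the auth service use", memories, k=1))
```

The point isn't the scoring function; it's the shape of the operation. Only the top-k results ever reach the context window, so irrelevant memories cost nothing.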

There's another thing I noticed in my own experiments: when you give a model a tool to manage its own memory, it forgets to use it. Constantly. The fix that actually worked was making retrieval automatic: every conversation turn fetches relevant context from the memory store and injects it into the prompt before the model sees the user's message. The agent can still call the tool explicitly, but the passive retrieval eliminates a lot of silent failures.
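The passive-retrieval pattern looks roughly like this. The `MemoryStore` class and its word-overlap search are placeholders for whatever real store and ranking you'd use; only the wiring matters:

```python
class MemoryStore:
    """Hypothetical store; a real one would rank by embeddings, not word overlap."""

    def __init__(self, memories: list[str]):
        self.memories = memories

    def search(self, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        scored = sorted(self.memories,
                        key=lambda m: len(words & set(m.lower().split())),
                        reverse=True)
        return scored[:k]

def build_prompt(user_message: str, store: MemoryStore) -> str:
    # Passive retrieval: every turn fetches context before the model
    # sees the message, so nothing depends on the agent remembering
    # to call a memory tool.
    relevant = store.search(user_message, k=2)
    context = "\n".join(f"- {m}" for m in relevant)
    return f"Relevant memories:\n{context}\n\nUser: {user_message}"

store = MemoryStore([
    "the user prefers concise answers",
    "we use Rust for this project",
    "staging deploys run on Fridays",
])
print(build_prompt("why did we pick Rust for the project?", store))
```

The agent-facing memory tool can coexist with this; the injection just guarantees a baseline of context even when the model never thinks to ask.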

Spec drift is the same problem with a different name

Software specs and LLM memory have something uncomfortable in common. They both drift.

In traditional development, the spec is written, development starts, edge cases happen, things change, and the spec stops reflecting reality. The code becomes the source of truth. The spec becomes an artefact of original intent. That's fine, mostly, because human developers carry the delta in their heads. They pass it on. They can look at a weird piece of code and infer what probably happened.

An agent can't do any of that. Every session starts from zero. When it looks at your codebase it's reconstructing context from scratch, like working a crime scene. The spec helps, but the evidence of what actually happened is buried in the divergence between what the spec says and what the code does. This matters a lot more when the thing trying to understand the codebase has no continuous memory.

I've been thinking about this in terms of three approaches, none of which are the answer on their own.

The living spec. The spec is never static: it gets updated alongside the code, becoming a kind of detailed changelog. The model always has an accurate picture of the system. The problem is that this requires discipline to maintain and creates its own overhead.

The ADR approach. You have a high-level master spec and capture decisions and changes in separate architecture decision records over time. Some tools do this well. The issue is scalability: over a long-lived project, the records accumulate and you're back to the signal-to-noise problem.

The test spec. Tests are unambiguous, runnable, and they break when reality diverges from intent. Encoding the spec as a test suite the agent has to build against gives it grounded, verifiable truth. But tests only capture the what, not the why. You don't know why a decision was made, only whether it holds.
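The test-spec idea can be sketched in a few lines. The spec clause, the function name, and the behaviour are all invented for illustration; the point is that the requirement is executable:

```python
def normalize_username(name: str) -> str:
    # The implementation under test, trivially satisfying the spec here.
    return name.strip().lower()

def test_usernames_are_normalized():
    # A spec clause as a test: "usernames are lowercased and trimmed
    # before storage". It breaks the moment reality diverges from
    # intent. What it can't tell the agent is *why* normalization
    # was required in the first place.
    assert normalize_username("  Alice ") == "alice"

test_usernames_are_normalized()
```

An agent building against this gets a hard, checkable constraint rather than prose it might misread, but the rationale behind the constraint still lives somewhere else.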

The honest answer is that the truth sits somewhere between all three. More than that, I think agentic development hasn't introduced new problems so much as it's stress-tested what was already fragile. The pressure LLMs put on spec-and-memory workflows might be forcing us to reckon with something that was always broken.

We're thinking about memory backwards

Here's the idea I keep coming back to, and I'm not sure yet if it's brilliant or completely wrong.

Current approaches to LLM memory tend to work like this: take a conversation, summarise it, store the fragments, assume the knowledge is retained. The assumption is that memory is a collection of facts.

But that's not how human memory works. We don't store memories as facts and recall them verbatim. We reconstruct memories, we rebuild them from our current knowledge and worldview at the moment of recall. That's why we forget bad times and misremember good ones. Our memories are beliefs about the state of the world, not recordings of it.

If that's true, then the approach of "summarise, store, retrieve fragments" is solving the wrong problem. Instead of changing what gets stored, we should be changing how we recall. Keep everything. Derive a contextual interpretation on demand, based on what's relevant now. The memories don't change, the interpretation of them does, depending on what we're trying to decide.

One approach I've been thinking about experimenting with is Bayesian analysis. Given a set of memories, derive the most likely "belief" about something based on the prior evidence, and update that belief as new information comes in. The goal isn't to return raw facts, it's to answer the why behind them, to give the model something more like a reasoned opinion than a lookup result.
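A toy version of that update, using the odds form of Bayes' theorem over a binary belief. The belief, the memories, and every likelihood number here are assumptions invented for illustration:

```python
def update_belief(prior: float, evidence_likelihoods) -> float:
    """Sequential Bayesian update on a binary belief.

    prior: P(belief) before consulting any memories.
    evidence_likelihoods: one (p_if_true, p_if_false) pair per memory,
    i.e. how likely that memory is if the belief holds vs. if it doesn't.
    """
    odds = prior / (1 - prior)
    for p_true, p_false in evidence_likelihoods:
        odds *= p_true / p_false  # multiply in each likelihood ratio
    return odds / (1 + odds)

# Belief: "the team prefers Rust over Node for new services".
# Memories as evidence, with made-up likelihoods:
evidence = [
    (0.9, 0.2),  # "rewrote the auth service in Rust"
    (0.8, 0.4),  # "complained about npm audit noise"
    (0.6, 0.5),  # "one new script still written in Node"
]
belief = update_belief(prior=0.5, evidence_likelihoods=evidence)
print(f"P(team prefers Rust) = {belief:.2f}")
```

Even this toy shows the shape of the output: not "we use Rust" as a stored fact, but a weighted belief that shifts as contradicting memories arrive. The hard part, which this sketch waves away, is getting an LLM to assign those likelihoods sensibly.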

The example I used in the conversation: instead of surfacing the memory "we use Rust for this project" in isolation, derive the belief: "we use Rust because Node became increasingly difficult to secure over time and the team grew more comfortable with Rust's performance and safety guarantees." That belief informs future decisions in a way a raw fact doesn't.

I have no idea if this would actually work or just generate expensive, confident hallucinations. But as a direction it feels closer to what we actually want from LLM memory: not a filing cabinet, but something closer to reasoning about the past.


Agentic development keeps surfacing the same uncomfortable truth: the tools and workflows we've built for human developers, with their continuous memory and intuitive context management, don't translate cleanly. We're not just adding AI to existing workflows. We're finding out which parts of those workflows we only got away with because humans are remarkably good at filling in the gaps.

The memory problem isn't solved. But at least it's getting more interesting.