
Memory Is a Curation Problem

Lion Hummer

I rewrote Kit's memory from scratch. The first version was what you'd expect: store messages, retrieve messages, stuff them into context. It worked until it didn't. Too much history made Kit slower and dumber. Too little made it forget things I'd told it yesterday.

The version I landed on is modeled loosely after how the brain works. Not as a metaphor. As an architecture.

Three layers

Working memory is the conversation itself. The messages in the current context window. When this grows past 50 entries, the older ones get summarized and the window keeps only the most recent 20. This is the equivalent of what you're actively thinking about right now.
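In code, the trim is roughly this (a simplified sketch: the Message shape and the summarize helper are stand-ins, the 50-entry trigger and 20-message window are the real numbers):

```typescript
// Simplified sketch of the working-memory trim. Message and summarize are
// stand-ins; the 50-entry trigger and 20-message window are the real numbers.
type Message = { role: "user" | "assistant"; content: string };

async function trimWorkingMemory(
  messages: Message[],
  summarize: (older: Message[]) => Promise<string>
): Promise<Message[]> {
  if (messages.length <= 50) return messages;

  const recent = messages.slice(-20);   // keep the most recent 20
  const older = messages.slice(0, -20); // everything else gets compressed
  const summary = await summarize(older);

  return [
    { role: "assistant", content: `Earlier in this conversation: ${summary}` },
    ...recent,
  ];
}
```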

Short-term memory is daily summaries. At the end of each day's activity, a summary is generated of everything that happened: what we worked on, what decisions were made, what's still open. These are written by Haiku (fast, cheap, good enough for summarization) and stored in SQLite. They're time-stamped and browsable with get_day_history.
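The storage side is small. A sketch of what that table could look like, using better-sqlite3 purely for illustration (not the actual schema):

```typescript
// Simplified sketch, not the actual schema. better-sqlite3 is used here
// only for illustration.
import Database from "better-sqlite3";

const db = new Database("memory.db");

db.exec(`
  CREATE TABLE IF NOT EXISTS daily_summaries (
    day        TEXT PRIMARY KEY,              -- e.g. '2025-06-14'
    summary    TEXT NOT NULL,                 -- Haiku-written recap of the day
    created_at TEXT DEFAULT (datetime('now'))
  )
`);

// Roughly what the get_day_history tool reads from.
function getDayHistory(day: string): string | undefined {
  const row = db
    .prepare("SELECT summary FROM daily_summaries WHERE day = ?")
    .get(day) as { summary: string } | undefined;
  return row?.summary;
}
```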

Long-term memory is facts. Extracted from conversations, organized into seven categories: person, people, preferences, projects, technical, entities, workflows. Each fact has a confidence score and an optional expiry date. A small model (Haiku 4.5) curates these after each conversation, deciding what to insert, update, merge, or deactivate.
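The shape of a fact, roughly (field names are simplified for illustration; the categories and the four curation operations are the real ones):

```typescript
type FactCategory =
  | "person" | "people" | "preferences" | "projects"
  | "technical" | "entities" | "workflows";

interface Fact {
  id: string;
  category: FactCategory;
  content: string;
  confidence: number;       // 0..1, weighted into retrieval
  expiresAt?: string;       // optional expiry date
  lastAccessedAt?: string;  // feeds the recency boost
  accessCount: number;      // feeds the frequency boost
  active: boolean;
}

type CurationOp =
  | { op: "insert"; fact: Omit<Fact, "id"> }
  | { op: "update"; id: string; content: string }
  | { op: "merge"; ids: string[]; replacement: Omit<Fact, "id"> } // the REPLACE case
  | { op: "deactivate"; id: string };
```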

The curation step is where most of the design work went. Each category has its own prompt with specific rules. People facts are one entry per person, dossier-style. Preferences are themed ("all PR review preferences" as one entry, not five separate ones). Technical facts are about my stack, not the agent's own architecture. There's a whole section in the curation prompt telling Haiku to reject self-referential facts, because left to its own devices it will happily store "I have a search_memory tool" as a long-term fact about me.
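To give a flavor of it, the per-category rules boil down to something like this (paraphrased and simplified, not the actual prompt text or structure):

```typescript
// Paraphrased and simplified, not the actual prompt text.
const curationRules: Record<string, string> = {
  people:
    "One dossier-style entry per person. Fold new details into the existing entry.",
  preferences:
    "Group by theme: all PR review preferences are one entry, not five.",
  technical:
    "Facts about the user's stack only. Reject self-referential facts about " +
    "the agent itself (e.g. 'I have a search_memory tool').",
  // person, projects, entities, workflows follow the same pattern
};

const buildCurationPrompt = (category: string, conversation: string) =>
  `You curate long-term memory for the "${category}" category.\n` +
  `Rules: ${curationRules[category] ?? ""}\n` +
  `Conversation:\n${conversation}\n` +
  `Return insert / update / merge / deactivate operations.`;
```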

The consolidation rules matter too. When multiple facts overlap, the curator merges them with a REPLACE operation instead of letting them accumulate. There's a safety check: if a curation pass tries to deactivate more than 50% of existing facts in a category, the whole pass is thrown out. This prevents a bad extraction from wiping memory.
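The safety check itself is simple. A sketch of the idea:

```typescript
// Simplified sketch of the 50% safety check.
type CurationOpLike = { op: "insert" | "update" | "merge" | "deactivate" };

function vetCurationPass(
  existingActiveCount: number,
  ops: CurationOpLike[]
): CurationOpLike[] | null {
  const deactivations = ops.filter(o => o.op === "deactivate").length;

  // A bad extraction that tries to wipe more than half a category's facts
  // invalidates the entire pass instead of being partially applied.
  if (existingActiveCount > 0 && deactivations / existingActiveCount > 0.5) {
    return null; // reject everything, keep memory intact
  }
  return ops;
}
```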

Automatic on both sides

The important thing is that I don't manage any of this. Extraction happens automatically after conversations. Retrieval happens automatically before each turn.

Every time I send a message, the system prompt is assembled with a context sandwich: core facts that are always present, recent daily summaries (last 2 days), and then two semantic layers. The current message gets embedded (@xenova/transformers, runs locally) and matched against both the fact store and the summary archive by cosine similarity. Anything above a 0.3 similarity threshold, capped at the top 15 results, gets injected into the prompt under "Relevant Facts" and "Relevant Past Activity."
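Sketched out, the semantic half of that looks roughly like this (@xenova/transformers is the real dependency; the model name and helper shapes are stand-ins):

```typescript
import { pipeline } from "@xenova/transformers";

// Model choice here is illustrative, not necessarily what Kit uses.
const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function relevantEntries(
  message: string,
  store: { text: string; embedding: Float32Array }[]
): Promise<string[]> {
  const out = await embedder(message, { pooling: "mean", normalize: true });
  const query = out.data as Float32Array;

  return store
    .map(entry => ({ entry, score: cosine(query, entry.embedding) }))
    .filter(({ score }) => score > 0.3)   // similarity threshold
    .sort((a, b) => b.score - a.score)
    .slice(0, 15)                         // top 15
    .map(({ entry }) => entry.text);
}
```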

So before the agent even reads my message, it already has: who I am, what we've been working on recently, and whatever long-term facts are relevant to what I'm about to say. There's also search_memory as a tool for deliberate lookups, but most of the time it's not needed. The relevant context is already there.

The retrieval scoring isn't just raw similarity. Recent facts get a boost (up to +0.1 within the last 7 days). Frequently accessed facts get a small boost too. And everything is weighted by confidence. Facts that have been used before and were recently relevant naturally float to the top.
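Roughly (the +0.1 recency cap is real; the frequency weight here is illustrative):

```typescript
function retrievalScore(
  fact: { confidence: number; lastAccessedAt?: string; accessCount: number },
  similarity: number
): number {
  let score = similarity;

  // Recency boost: up to +0.1 if the fact was touched within the last 7 days.
  if (fact.lastAccessedAt) {
    const days = (Date.now() - Date.parse(fact.lastAccessedAt)) / 86_400_000;
    if (days <= 7) score += 0.1 * (1 - days / 7);
  }

  // Small boost for frequently accessed facts (illustrative weight).
  score += Math.min(0.05, fact.accessCount * 0.005);

  // Everything is weighted by confidence.
  return score * fact.confidence;
}
```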

REM

The brain consolidates memory during sleep. The agent does something similar overnight. A scheduled process reviews the day's detailed summaries and condenses them: 30-50 bullet points become 10-15, keeping key outcomes and decisions, dropping process details. Old facts that haven't been accessed get reviewed. Expired facts are cleaned up.
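A simplified sketch of the nightly pass (condenseWithHaiku stands in for the actual model call, and the fact shape is trimmed to what's needed here):

```typescript
type FactLite = { content: string; expiresAt?: string; lastAccessedAt?: string };

async function remPhase(
  detailedSummary: string,
  facts: FactLite[],
  condenseWithHaiku: (summary: string) => Promise<string>
): Promise<{ condensed: string; keptFacts: FactLite[] }> {
  // 30-50 bullet points become 10-15: keep outcomes and decisions, drop process.
  const condensed = await condenseWithHaiku(detailedSummary);

  // Expired facts are cleaned up; stale, never-accessed facts would be flagged
  // for review rather than deleted outright.
  const now = Date.now();
  const keptFacts = facts.filter(
    f => !f.expiresAt || Date.parse(f.expiresAt) > now
  );

  return { condensed, keptFacts };
}
```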

I call it the REM phase because the analogy is exact. During the day, raw experience accumulates. Overnight, it consolidates: compressing, pruning, strengthening the things that matter. The detailed version of yesterday is still in the database if needed, but the condensed version is what goes into the daily prompt. Just like how you remember the gist of last Tuesday, not the minute-by-minute.

Where it clicked

I asked for a PR review. There's a skill for this, a script attached to it. The script failed. This had happened before, maybe two weeks earlier. Same failure, same script. We'd debugged it at the time, found the issue, fixed the immediate problem, but never updated the skill itself.

What an LLM normally does here is start debugging from scratch. Read the error, inspect the script, try things. What actually happened: the agent pulled up the daily summary from the day it failed last time, found the fix we'd already figured out, and applied it. No debugging loop. It just remembered.

That's the moment the memory system stopped feeling like a feature and started feeling like the thing that makes everything else work.

What I got wrong first

The first version treated memory as a storage problem. It's not. It's a curation problem. Storing everything is easy. Surfacing the right thing at the right time, without noise, is the whole challenge. The brain doesn't remember everything. It remembers what's useful. That's the target.
