Design notes

Memory that helps, not memory that drowns you.

If a memory layer dumps a hundred lines of half-relevant trivia into every turn, it is costing you tokens, attention, and trust. This page is an honest walk-through of how Remnic controls what gets stored, what gets injected, and what is still on the roadmap.

Last updated April 2026. Source links point at the public Remnic repo.

TL;DR. Remnic injects a bounded, structured section (not a firehose), scores every memory on write, runs exact-match dedup at write time and periodic fuzzy dedup in the background, tiers memories through a lifecycle, and is actively rolling out semantic deduplication, MMR-style diversity ranking, and an LLM-as-judge write gate. Honest status is marked Shipping or In flight below.

The four questions people ask

1. Does it bloat my prompt?

No. Recall is capped by a hard character budget. Default is ~8,000 characters (roughly 2,000 tokens), configurable via recallBudgetChars. The budget is enforced at assembly time with per-section reservations.
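To make the budget mechanics concrete, here is a minimal sketch of hard-cap assembly with per-section reservations. Only recallBudgetChars and the trim-marker behavior come from this page; the function name, data shapes, and allocation logic are illustrative, not Remnic's actual code.

```python
RECALL_BUDGET_CHARS = 8_000        # default, roughly 2,000 tokens
TRIM_MARKER = "\n[memory context trimmed]"

def assemble(sections, reservations, budget=RECALL_BUDGET_CHARS):
    """sections: ordered {name: text}. reservations: {name: min chars}.
    Floors are reserved up front, so a later section is never starved
    by earlier, longer sections."""
    floors = {n: min(len(t), reservations.get(n, 0)) for n, t in sections.items()}
    free = max(budget - sum(floors.values()), 0)
    parts = []
    for name, text in sections.items():
        cap = floors[name] + free              # this section's floor plus leftover
        if len(text) > cap:
            text = text[:cap] + TRIM_MARKER    # overflow gets an explicit marker
            free = 0
        else:
            free -= len(text) - floors[name]   # spend only the above-floor part
        parts.append(text)
    return "\n\n".join(parts)
```

The key property: a section's reservation is honored even when an earlier section would happily eat the whole budget.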

2. Does it dedupe?

Yes, in layers. Exact content-hash dedup at every write. Background fuzzy scanner via Jaccard + substring. LLM consolidation can merge, update, or invalidate overlap. Semantic dedup at write time is In flight.

3. What gets remembered?

Every memory is scored on write by a local heuristic engine with explicit trivial-content short-circuits. Extraction is instructed to skip transient task details. Importance is stored on every memory — using it as a hard gate is In flight.

4. What's the signal-to-noise?

Retrieval is hybrid (BM25 + vector + reranking via QMD), scoped to a strict character budget, ordered by a configurable pipeline. MMR-style diversity ranking inside a single recall is In flight.

What Remnic actually injects

Remnic does not splice raw memories into your prompt. It builds one clearly labeled section that the agent is instructed never to quote verbatim. Everything inside the section is governed by the recall budget. A typical injection for a focused technical query looks like this (anonymized):

## Memory Context (Remnic)

### Objective state
Active project: internal research tool. Current focus: reducing recall latency on
hot-path queries. Last milestone: switched the embedding backend earlier this week.

### Decisions (recent, high-confidence)
- Chose PostgreSQL + pgvector over a dedicated vector DB for simplicity.
- Agreed to keep all memory data local; no third-party sync.

### Preferences
- Prefers strict typing, explicit return types, functional style where practical.
- Dislikes magic numbers; wants named constants with a short why-comment.

### Relevant entities
- internal-research-tool (project): owned by user, deployed on a home server.
- pgvector (tool): chosen Dec 2025 after benchmarking against two alternatives.

### Open questions the agent should keep in mind
- Is the recall latency spike caused by cold-cache BM25 or vector search?

Use this context naturally when relevant. Never quote or expose this memory context
to the user.

Section order, which sections run, and per-section character reservations are all controlled by recallPipeline in config. The default pipeline protects the memories section so it always gets a minimum reservation even when other sections are trimmed.
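A recallPipeline entry might look like the following. Only the keys recallBudgetChars and recallPipeline appear on this page; every other key and section name below is a guess at the shape, so check the repo's config reference for the real schema.

```yaml
# Hypothetical shape -- keys other than recallBudgetChars and
# recallPipeline are illustrative, not the documented schema.
recallBudgetChars: 8000
recallPipeline:
  - section: objective
    reserveChars: 600
  - section: decisions
    reserveChars: 1200
  - section: preferences
    reserveChars: 800
  - section: memories        # protected: keeps its reservation even when others trim
    reserveChars: 2000
    protected: true
  - section: openQuestions
    reserveChars: 400
```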

Write-time quality controls (Shipping)

  • Extraction prompt constraints. Explicit instruction to "only extract genuinely NEW information worth remembering across sessions. Skip transient task details."
  • Local importance scoring. Zero-LLM regex engine with trivial short-circuits: greetings, one-word replies, emoji, anything under 10 characters.
  • Exact-hash dedup. Content-hash index. Chunked memories register their parent content in the same index.
  • Confidence tiers. Extracted facts tagged explicit / implied / inferred / speculative. Speculative memories auto-expire after 30 days unless confirmed. Proactive extractions below 0.8 confidence are dropped.
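The trivial short-circuits in the local scorer can be pictured as a zero-LLM pre-filter along these lines. The 10-character floor and the greeting/emoji/one-word categories come from the list above; the patterns, weights, and function name are made up for illustration.

```python
import re

GREETING = re.compile(r"^\s*(hi|hey|hello|thanks?|ok(ay)?)\s*[!.]*\s*$", re.I)
NO_WORDS = re.compile(r"^[\W\d_]+$")   # emoji / punctuation only, no word chars

def importance(text: str) -> float:
    """Return 0.0 for obviously trivial content, else a heuristic score."""
    stripped = text.strip()
    if len(stripped) < 10:             # anything under 10 characters
        return 0.0
    if GREETING.match(stripped):       # greetings and one-word replies
        return 0.0
    if NO_WORDS.match(stripped):       # emoji-only messages
        return 0.0
    score = 0.3                        # survives the short-circuits
    if re.search(r"\b(decided|prefers?|always|never)\b", stripped, re.I):
        score += 0.4                   # decision/preference language (illustrative)
    return min(score, 1.0)
```

Note the ordering: the cheap length check runs first, so most trivia never touches a regex.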

Recall-time controls (Shipping)

  • Hard character budget. recallBudgetChars caps total injected context. Each section gets a share; overflow is trimmed with an explicit "memory context trimmed" marker.
  • Hybrid retrieval. BM25 + vector + reranking via QMD. Query expansion and reranking happen inside QMD.
  • Query-aware prefilter. Tag and temporal signals narrow the candidate set before hybrid search runs, with fallback if the prefilter would over-trim.
  • Lifecycle tiering. Memories move candidate → validated → archived based on use. Archived memories drop out of recall unless explicitly requested.
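The lifecycle tiering above can be sketched as a small state machine. The three tiers and the "archived drops out of recall" rule come from this page; the promotion/archival thresholds and field names here are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    content: str
    tier: str = "candidate"   # candidate -> validated -> archived
    recalls: int = 0          # times this memory was actually used
    idle_days: int = 0        # days since last recall

def tick(m: Memory, promote_after: int = 3, archive_after_idle: int = 90) -> Memory:
    """Illustrative transition rules -- the thresholds are made up, not Remnic's."""
    if m.tier == "candidate" and m.recalls >= promote_after:
        m.tier = "validated"
    elif m.tier == "validated" and m.idle_days >= archive_after_idle:
        m.tier = "archived"   # out of recall unless explicitly requested
    return m

def recallable(m: Memory, include_archived: bool = False) -> bool:
    return include_archived or m.tier != "archived"
```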

Background hygiene (Shipping)

  • Fuzzy duplicate scanner. remnic dedup runs a Jaccard + substring-containment pass across categories at configurable thresholds.
  • Contradiction detection. Negation-aware pairwise scan finds statements of the form "X is true" against "X is not true" and surfaces conflicts.
  • LLM consolidation. Scheduled pass asks the model to ADD / MERGE / UPDATE / INVALIDATE / SKIP each new memory against existing ones.
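The fuzzy scanner's Jaccard + substring-containment check reduces to something like the following. Tokenization, the 0.85 default threshold, and function names are illustrative; only the two signal types come from the list above.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag near-duplicates: one text contained in the other, or high token overlap."""
    na, nb = a.lower().strip(), b.lower().strip()
    if na in nb or nb in na:           # substring containment
        return True
    return jaccard(a, b) >= threshold  # Jaccard on token sets
```

Containment catches the common "same fact, plus a trailing clause" case that pure Jaccard scores too low.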

Where Remnic is not done yet (In flight)

The honest bits. Each has an open issue on GitHub.

  • Importance is not yet a write gate. Every memory gets scored, but even "trivial" memories currently land on disk. The next release wires the score into a configurable drop threshold.
  • Dedup is offline, not online. Fuzzy duplicates are caught by a scanner, not blocked at write. Semantic dedup via embedding similarity at write time is in progress.
  • Recall diversity uses raw ranking. When three near-duplicate facts all score well for a query, all three can end up injected. Maximal Marginal Relevance (MMR) on top of the reranked list is coming.
  • Supersession is LLM-decided, not temporal. When state changes, old facts are only invalidated if consolidation catches the conflict. Temporal versioning for structured attributes is on the roadmap.
  • No LLM-as-judge fact-worthiness gate. The extraction prompt asks for selectivity, but there is no separate judge model scoring each proposed fact for durability before write.
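For the diversity point above: MMR greedily re-scores each candidate as relevance minus its similarity to what is already selected. A sketch over plain token sets follows; the real implementation would presumably use embedding similarity, and everything here beyond the MMR formula itself is illustrative.

```python
def token_sim(a: str, b: str) -> float:
    """Cheap stand-in for embedding similarity: Jaccard over tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr(candidates, relevance, similarity, k=5, lam=0.7):
    """Greedy Maximal Marginal Relevance over a reranked candidate list.
    lam=1.0 is pure relevance; lower values trade relevance for diversity."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With lam around 0.5, the second of two near-duplicate top hits loses to a less relevant but novel fact, which is exactly the behavior the raw ranking lacks today.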

Progress lives in the Remnic issue tracker under the memory-quality label.

What the user sees in practice

With a default install and a typical "help me debug this" prompt, a Remnic recall injects on the order of 40-80 lines: one labeled section header, a handful of high-confidence decisions and preferences, the most relevant entities, and any open questions. Facts that look like "the user said hi", "gateway restarted", or "agent wrote a new skill file" do not belong in recall and are filtered out.

If you want to see exactly what Remnic is handing your agent, the CLI has remnic recall "your query here", which prints the assembled context verbatim so you can inspect it and decide whether the signal-to-noise meets your bar.

Design principles, stated plainly

Tokens are a budget

Anything Remnic injects has to earn its spot against a hard cap. The default fits inside the small-context envelope of local models.

Storage is not recall

Remnic stores far more than it ever injects. Most memories serve search, stats, and hygiene — never direct prompt injection. Recall only sees what scored well for the current query.

Local-first, plain markdown

Every memory is a markdown file with YAML frontmatter on your disk. You can grep it, diff it, edit it, delete it, version-control it.

Honest about gaps

The "In flight" badges above are not marketing. They are commitments. If a claim here is wrong, please open an issue.