Benchmarks

Memory benchmarks that test the hard parts.

Remnic's benchmark harness covers published memory-agent suites and Remnic-specific regression packs. The goal is not to tune for a toy fixture; it is to measure long-horizon recall, preference updates, temporal supersession, structured planning, graph retrieval, QMD, and exact evidence recall in a reproducible way.

Latest AMB work

Provider bridge, verifier, and PersonaMem evidence.

Remnic now ships a native Agent Memory Benchmark (AMB) provider bridge plus a SOTA verifier. The current PersonaMem 128k artifact is useful evidence, but it is intentionally not published as a SOTA claim until the run is repeated with clean git provenance for both Remnic and AMB.

  • Full PersonaMem 128k artifact: 2,727 queries, 1,660 correct, 60.87% accuracy.
  • Target beaten: 52% Gemini-1.5-Flash baseline from the PersonaMem paper.
  • SOTA label withheld: the original verifier manifest recorded dirty or missing provenance.
  • Current verifier fails publishable runs unless result rows, LLM IDs, query count, and git provenance are clean.
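
As a rough illustration of the gate described in the last bullet, the check could look like the sketch below. Every name in it (`RunManifest`, `publishable_problems`, the `remnic` and `amb` provenance keys, the field layout) is hypothetical and chosen for this example; it is not Remnic's actual verifier or manifest schema. The sample counts reuse the PersonaMem 128k figures quoted above, and the model IDs and SHAs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class RunManifest:
    """Hypothetical manifest shape, invented for this sketch."""
    expected_queries: int  # query count the dataset should produce
    result_rows: int       # rows actually written by the run
    correct: int           # rows judged correct
    llm_ids: list          # answering and judge model identifiers
    provenance: dict       # repo name -> {"sha": str or None, "dirty": bool}


def publishable_problems(m: RunManifest) -> list:
    """Return the reasons a run cannot be published; an empty list means clean."""
    problems = []
    if m.result_rows != m.expected_queries:
        problems.append(
            f"query count mismatch: {m.result_rows} != {m.expected_queries}"
        )
    if not m.llm_ids or any(not model_id for model_id in m.llm_ids):
        problems.append("missing LLM ID")
    for repo in ("remnic", "amb"):  # assumed repository keys
        info = m.provenance.get(repo)
        if not info or not info.get("sha"):
            problems.append(f"{repo}: missing git provenance")
        elif info.get("dirty"):
            problems.append(f"{repo}: dirty working tree")
    return problems


def accuracy_pct(m: RunManifest) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * m.correct / m.result_rows, 2)


run = RunManifest(
    expected_queries=2727,
    result_rows=2727,
    correct=1660,
    llm_ids=["example-answer-model", "example-judge-model"],  # placeholder IDs
    provenance={
        "remnic": {"sha": None, "dirty": False},      # placeholder: SHA missing
        "amb": {"sha": "0000000", "dirty": True},     # placeholder: dirty tree
    },
)

print(accuracy_pct(run))          # 60.87
print(publishable_problems(run))  # non-empty, so the SOTA label stays withheld
```

Returning a list of reasons rather than a single boolean keeps the failure auditable: a withheld label can point at the exact provenance field that was dirty or missing.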

Published suite

Coverage across nine public memory benchmarks.

AMA-Bench

Long-horizon action/observation trajectory memory

MemoryArena

Interdependent multi-session planning and dependency recall

AMemGym

Interactive personalization and latest-state recall

LongMemEval

Long-term conversational memory across temporal sessions

LoCoMo

Long conversation QA with dialogue, speaker, and session cues

BEAM

Extreme-scale conversation memory with plan and chat references

PersonaMem-v2

Implicit preference learning and preference updates

MemoryAgentBench

Event, date, keypoint, and conflict-resolution memory

MemBench

Factual and reflective step/time recall

Leaderboard safety

Exact cue recall without answer leakage.

Remnic can retrieve exact evidence from turn numbers, dates, speakers, plan IDs, field names, preference updates, and keypoints when those cues are visible in the user question or were stored in memory. The harness keeps hidden scoring metadata out of answering recall, so high scores represent Remnic's retrieval behavior rather than benchmark leakage.

  • Full runs use isolated benchmark memory stores, not production user memory.
  • Artifacts record dataset versions, seed, model IDs, judge IDs, runtime profile, commit SHA, and manifest data.
  • Hidden gold answers, target IDs, final state, and evidence labels stay out of answering recall.
  • Visible cue anchors are derived from stored memory or user-visible prompts and stripped before answer scoring when needed.
  • Quick mode is treated as smoke testing only; full mode is required for public or leaderboard-style claims.
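
One way to picture the separation between scoring metadata and answering recall is a filter applied before any record reaches the answering path. The sketch below is a minimal illustration under assumed names: the `HIDDEN_FIELDS` keys, `answering_view`, `leaked_fields`, and `claim_allowed` are all hypothetical and do not describe the harness's real data model.

```python
# Hypothetical names for scorer-only metadata; real dataset schemas differ.
HIDDEN_FIELDS = frozenset(
    {"gold_answer", "target_ids", "final_state", "evidence_labels"}
)


def answering_view(record: dict) -> dict:
    """Copy of a benchmark record with scorer-only fields removed."""
    return {key: value for key, value in record.items() if key not in HIDDEN_FIELDS}


def leaked_fields(recall_payload: dict) -> list:
    """Sorted names of hidden fields that reached the answering path."""
    return sorted(HIDDEN_FIELDS.intersection(recall_payload))


def claim_allowed(mode: str) -> bool:
    """Only full-mode runs may back public or leaderboard-style claims."""
    return mode == "full"


record = {
    "question": "What did the user say in turn 12?",  # user-visible cue: turn number
    "session_id": "s-001",
    "gold_answer": "example answer",                  # scorer-only
    "evidence_labels": ["turn-12"],                   # scorer-only
}

visible = answering_view(record)
print(sorted(visible))         # ['question', 'session_id']
print(leaked_fields(visible))  # []
print(leaked_fields(record))   # ['evidence_labels', 'gold_answer']
print(claim_allowed("quick"))  # False
```

The turn number in the question survives the filter because it is part of the user-visible prompt; only fields that exist solely for scoring are dropped.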

How to read results

Quick runs are smoke tests. Full runs are the credible numbers.

A public result should name the dataset version, Remnic commit, model, judge, seed, runtime profile, and artifact manifest. It should also say whether QMD, graph recall, temporal supersession, and explicit cue recall were enabled. Remnic.ai will label smoke fixture results separately from full published-dataset runs.
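
The disclosure checklist above can be expressed as a small completeness check. This is a sketch under assumptions: the key names in `REQUIRED_FIELDS` and `FEATURE_FLAGS`, the `mode` and `features` keys, and both helper functions are invented for illustration and are not a documented Remnic result format. All sample values are placeholders.

```python
# Hypothetical key names mirroring the checklist in the paragraph above.
REQUIRED_FIELDS = (
    "dataset_version", "remnic_commit", "model", "judge",
    "seed", "runtime_profile", "artifact_manifest",
)
FEATURE_FLAGS = (
    "qmd", "graph_recall", "temporal_supersession", "explicit_cue_recall",
)


def missing_disclosures(result: dict) -> list:
    """List checklist items a result fails to state; an empty list means complete."""
    missing = [name for name in REQUIRED_FIELDS if result.get(name) in (None, "")]
    flags = result.get("features", {})
    # Each feature must be stated explicitly as enabled (True) or disabled (False).
    missing += [
        f"features.{flag}"
        for flag in FEATURE_FLAGS
        if not isinstance(flags.get(flag), bool)
    ]
    return missing


def result_label(result: dict) -> str:
    """Label smoke fixtures separately from full published-dataset runs."""
    return "full published-dataset run" if result.get("mode") == "full" else "smoke fixture"


example = {
    "mode": "quick",
    "dataset_version": "example-v1",     # placeholder
    "remnic_commit": "0000000",          # placeholder
    "model": "example-answer-model",     # placeholder
    "judge": "example-judge-model",      # placeholder
    "seed": 0,
    "runtime_profile": "example-profile",
    "features": {"qmd": True, "graph_recall": False},
}

print(result_label(example))
print(missing_disclosures(example))
```

Here the example is labeled a smoke fixture and is reported as missing its artifact manifest and two feature declarations; a seed of `0` counts as stated.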