AMA-Bench
Long-horizon action/observation trajectory memory
Remnic's benchmark harness covers published memory-agent suites and Remnic-specific regression packs. The goal is not to tune for a toy fixture; it is to measure long-horizon recall, preference updates, temporal supersession, structured planning, graph retrieval, QMD, and exact evidence recall in a reproducible way.
Remnic now ships a native Agent Memory Benchmark (AMB) provider bridge plus a SOTA verifier. The current PersonaMem 128k artifact is useful evidence, but it is intentionally not published as a SOTA claim until the run is repeated with clean git provenance for both Remnic and AMB.
Long-horizon action/observation trajectory memory
Interdependent multi-session planning and dependency recall
Interactive personalization and latest-state recall
Long-term conversational memory across temporal sessions
Long conversation QA with dialogue, speaker, and session cues
Extreme-scale conversation memory with plan and chat references
Implicit preference learning and preference updates
Event, date, keypoint, and conflict-resolution memory
Factual and reflective step/time recall
Remnic can retrieve exact evidence from turn numbers, dates, speakers, plan ids, field names, preference updates, and keypoints when those cues are visible in the user question or were stored in memory. The harness keeps hidden scoring metadata out of answering recall, so high scores represent Remnic's retrieval behavior rather than benchmark leakage.
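One way to picture the separation between answering recall and scoring is the minimal sketch below. It is not Remnic's actual harness code: the key names in `HIDDEN_SCORING_KEYS`, the `BenchmarkItem` type, and the `split_item` and `recall_query` helpers are all hypothetical, chosen only to illustrate that scoring metadata is set aside before any recall query is built.

```python
from dataclasses import dataclass

# Hypothetical key names for scoring-only metadata; a real harness may differ.
HIDDEN_SCORING_KEYS = frozenset({"gold_answer", "evidence_turn_ids", "rubric"})


@dataclass(frozen=True)
class BenchmarkItem:
    question: str
    visible: dict  # fields the answering path is allowed to see
    hidden: dict   # fields reserved for the scorer


def split_item(raw: dict) -> BenchmarkItem:
    """Partition a raw benchmark record into visible and hidden parts."""
    visible = {
        key: value
        for key, value in raw.items()
        if key != "question" and key not in HIDDEN_SCORING_KEYS
    }
    hidden = {key: value for key, value in raw.items() if key in HIDDEN_SCORING_KEYS}
    return BenchmarkItem(question=raw["question"], visible=visible, hidden=hidden)


def recall_query(item: BenchmarkItem) -> dict:
    """Build the recall payload from the question and visible fields only."""
    return {"question": item.question, **item.visible}


if __name__ == "__main__":
    raw = {
        "question": "What did Alice say in turn 12?",
        "session_id": "s3",
        "gold_answer": "blue",
        "evidence_turn_ids": [12],
    }
    item = split_item(raw)
    print(recall_query(item))  # contains no scoring-only keys
```

In this sketch the scorer would read `item.hidden` after the answer is produced, so a cue such as "turn 12" reaches recall only because it appears in the question text itself.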
A public result should name the dataset version, Remnic commit, model, judge, seed, runtime profile, and artifact manifest. It should also say whether QMD, graph recall, temporal supersession, and explicit cue recall were enabled. Remnic.ai will label smoke fixture results separately from full published-dataset runs.
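The reporting checklist above could be captured as a small record that refuses to count as publishable while any field is blank. This is a sketch under stated assumptions: the `ResultManifest` class, its field names, and the `missing_fields` and `run_label` helpers are hypothetical and do not describe Remnic's real manifest format.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ResultManifest:
    # Identification fields named in the text above.
    dataset_version: str
    remnic_commit: str
    model: str
    judge: str
    seed: int
    runtime_profile: str
    artifact_manifest: str
    # Feature toggles that a public result should state explicitly.
    qmd_enabled: bool
    graph_recall_enabled: bool
    temporal_supersession_enabled: bool
    explicit_cue_recall_enabled: bool
    # True for smoke fixture runs, False for full published-dataset runs.
    smoke_fixture: bool


def missing_fields(manifest: ResultManifest) -> list[str]:
    """Return the names of fields left empty (empty string or None)."""
    return [
        f.name
        for f in fields(manifest)
        if getattr(manifest, f.name) is None or getattr(manifest, f.name) == ""
    ]


def run_label(manifest: ResultManifest) -> str:
    """Label smoke fixture results separately from full dataset runs."""
    return "smoke-fixture" if manifest.smoke_fixture else "full-dataset"


if __name__ == "__main__":
    # Placeholder values for illustration only; not real results.
    example = ResultManifest(
        dataset_version="example-v1",
        remnic_commit="0000000",
        model="example-model",
        judge="example-judge",
        seed=0,
        runtime_profile="example-profile",
        artifact_manifest="manifest.json",
        qmd_enabled=True,
        graph_recall_enabled=False,
        temporal_supersession_enabled=True,
        explicit_cue_recall_enabled=True,
        smoke_fixture=True,
    )
    print(run_label(example), missing_fields(example))
    print(missing_fields(replace(example, judge="")))
```

Keeping the feature toggles as required booleans, rather than optional fields, means a result cannot be recorded without stating whether each capability was on or off.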