# Remnic on LongMemEval + LoCoMo
We publish reproducible numbers against the two benchmarks the 2026 agent-memory landscape converges on. Every row below links to the raw artifact JSON so you can inspect the exact seed, commit SHA, dataset version, and per-task scores that produced it.
## LongMemEval-S
LongMemEval from Wu et al. (ICLR 2025) evaluates long-term memory across information extraction, multi-session reasoning, temporal reasoning, and knowledge updates. We run the "S" split (≈500 items) — the same split every comparable system publishes on.
| Run date | Model | Seed | F1 | Contains-answer | LLM judge | Artifact |
|---|---|---|---|---|---|---|
| 2026-04-20 | gpt-4o-mini | 42 | — | — | — | LongMemEval mock000 (placeholder) |
## LoCoMo-10
LoCoMo from Maharana et al. (ACL 2024) evaluates very long conversational memory: multi-session dialogue transcripts and QA probes across single-hop, multi-hop, temporal, open-domain, and adversarial categories. We run the 10-conversation split (LoCoMo-10).
| Run date | Model | Seed | F1 | Contains-answer | ROUGE-L | LLM judge | Artifact |
|---|---|---|---|---|---|---|---|
| 2026-04-20 | gpt-4o-mini | 42 | — | — | — | — | LoCoMo mock000 (placeholder) |
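The F1 and contains-answer columns follow the token-overlap definitions standard for QA benchmarks. A minimal sketch of both metrics, as our own illustration (the harness's exact normalization rules may differ):

```typescript
// Token-level F1 and contains-answer for QA scoring.
// Illustrative only; the real harness may normalize text differently.
function tokenize(s: string): string[] {
  return s.toLowerCase().replace(/[^\w\s]/g, " ").split(/\s+/).filter(Boolean);
}

function tokenF1(prediction: string, gold: string): number {
  const pred = tokenize(prediction);
  const ref = tokenize(gold);
  if (pred.length === 0 || ref.length === 0) return pred.length === ref.length ? 1 : 0;
  // Count gold tokens, then consume them as prediction tokens match.
  const refCounts = new Map<string, number>();
  for (const t of ref) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of pred) {
    const c = refCounts.get(t) ?? 0;
    if (c > 0) { overlap++; refCounts.set(t, c - 1); }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// True if the normalized gold answer appears as a substring of the
// normalized prediction.
function containsAnswer(prediction: string, gold: string): boolean {
  return tokenize(prediction).join(" ").includes(tokenize(gold).join(" "));
}
```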
## How we run these
The full runbook lives at `docs/benchmarks/runbook.md`. Short version:

- Download the datasets via `scripts/bench/fetch-datasets.sh --help` (prints HuggingFace commands; never auto-fetches).
- Run `pnpm exec remnic bench published --name longmemeval --dataset ./bench-datasets/longmemeval --model gpt-4o-mini --seed 42`.
- Verify with `pnpm exec tsx scripts/bench/verify-artifact.ts docs/benchmarks/results/*.json`.
- Commit the artifacts and tag the release.
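Put together, the steps above amount to the following session (commands taken verbatim from the list; assumes the repo root as the working directory):

```shell
# Print the HuggingFace fetch commands; nothing is downloaded automatically.
scripts/bench/fetch-datasets.sh --help

# Run the published LongMemEval benchmark with a pinned model and seed.
pnpm exec remnic bench published \
  --name longmemeval \
  --dataset ./bench-datasets/longmemeval \
  --model gpt-4o-mini \
  --seed 42

# Verify every result artifact before committing and tagging.
pnpm exec tsx scripts/bench/verify-artifact.ts docs/benchmarks/results/*.json
```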
Every artifact records `schemaVersion`, `benchmarkId`, `datasetVersion`, `system.name`, `system.version`, `system.gitSha`, `model`, `seed`, aggregate metrics, per-task scores, and timestamps. Any breaking schema change bumps `BENCHMARK_ARTIFACT_SCHEMA_VERSION`.
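The field list above implies roughly the following artifact shape. The field names come from the list; the value types and the check below are our own assumptions, not the canonical schema (see `docs/benchmarks/runbook.md` for that):

```typescript
// Sketch of the benchmark-artifact shape implied by the documented fields.
// Value types are assumptions, not the canonical schema.
interface BenchmarkArtifact {
  schemaVersion: number; // bumped via BENCHMARK_ARTIFACT_SCHEMA_VERSION
  benchmarkId: string;   // e.g. "longmemeval" or "locomo-10"
  datasetVersion: string;
  system: { name: string; version: string; gitSha: string };
  model: string;         // e.g. "gpt-4o-mini"
  seed: number;          // e.g. 42
  aggregate: Record<string, number>; // aggregate metrics
  perTask: Array<{ taskId: string; scores: Record<string, number> }>;
  timestamps: { startedAt: string; finishedAt: string };
}

// Minimal structural check in the spirit of a verifier (illustrative;
// not the actual logic of scripts/bench/verify-artifact.ts).
function looksLikeArtifact(a: unknown): a is BenchmarkArtifact {
  const o = a as Partial<BenchmarkArtifact> | null;
  return !!o &&
    typeof o.schemaVersion === "number" &&
    typeof o.benchmarkId === "string" &&
    typeof o.model === "string" &&
    typeof o.seed === "number";
}
```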
## Why only two benchmarks?
Buyers shortlist on LongMemEval + LoCoMo. MemPalace, Mastra, Supermemory, OMEGA, Zep, Mem0, Letta all publish on one or both of these. Adding a third benchmark before these two are reproducible is premature. After the first release-tagged numbers, the roadmap is LongMemEval-M / LongMemEval-L, LoCoMo-50, and Mastra-style observational-memory ablations — tracked under the optional-follow-ups section of issue #566.