Benchmarks

Remnic on LongMemEval + LoCoMo.

We publish reproducible numbers against the two benchmarks the 2026 agent-memory landscape converges on. Every row below links to the raw artifact JSON so you can inspect the exact seed, commit SHA, dataset version, and per-task scores that produced it.

Status: pipeline live, first public numbers pending. The dataset loaders, deterministic harness, artifact schema, CI regression gate, and runbook all landed in issue #566. The `docs/benchmarks/results/` directory currently contains only placeholder (mock) artifacts so the pipeline can be verified end-to-end on a fresh clone. Real numbers will replace them in a tagged release.

LongMemEval-S

LongMemEval from Wu et al. (ICLR 2025) evaluates long-term memory across information extraction, multi-session reasoning, temporal reasoning, and knowledge updates. We run the "S" split (≈500 items), the same split that comparable systems publish on.

| Run date | Model | Seed | F1 | Contains-answer | LLM judge | Artifact |
| --- | --- | --- | --- | --- | --- | --- |
| 2026-04-20 | gpt-4o-mini | 42 | | | | LongMemEval mock000 (placeholder) |
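
For readers unfamiliar with the F1 and Contains-answer columns, the sketch below shows one common way such metrics are computed for QA-style probes: token-level F1 over normalized words, and a whole-phrase containment check. The function names and the normalization rules (lowercasing, stripping non-alphanumeric ASCII) are assumptions made for illustration; the Remnic harness may normalize and score differently.

```typescript
// Hypothetical helpers for illustration; not taken from the Remnic harness.

// Lowercase, replace anything that is not an ASCII letter, digit, or
// whitespace with a space, then split into tokens.
function normalize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Token-level F1: harmonic mean of precision and recall over the
// multiset overlap between prediction tokens and reference tokens.
function tokenF1(prediction: string, reference: string): number {
  const pred = normalize(prediction);
  const ref = normalize(reference);
  if (pred.length === 0 || ref.length === 0) {
    return pred.length === ref.length ? 1 : 0;
  }
  const remaining = new Map<string, number>();
  for (const token of ref) {
    remaining.set(token, (remaining.get(token) ?? 0) + 1);
  }
  let overlap = 0;
  for (const token of pred) {
    const left = remaining.get(token) ?? 0;
    if (left > 0) {
      overlap += 1;
      remaining.set(token, left - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / pred.length;
  const recall = overlap / ref.length;
  return (2 * precision * recall) / (precision + recall);
}

// Contains-answer: true when the normalized reference appears in the
// normalized prediction as a run of whole tokens.
function containsAnswer(prediction: string, reference: string): boolean {
  const pred = normalize(prediction).join(" ");
  const ref = normalize(reference).join(" ");
  return ref.length > 0 && ` ${pred} `.includes(` ${ref} `);
}
```

Under these assumptions, a verbose but correct answer scores below 1.0 on F1 while still passing the containment check, which is one reason to report both columns side by side.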

LoCoMo-10

LoCoMo from Maharana et al. (ACL 2024) evaluates very long conversational memory: multi-session dialogue transcripts and QA probes across single-hop, multi-hop, temporal, open-domain, and adversarial categories. We run the 10-conversation split (LoCoMo-10).

| Run date | Model | Seed | F1 | Contains-answer | ROUGE-L | LLM judge | Artifact |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2026-04-20 | gpt-4o-mini | 42 | | | | | LoCoMo mock000 (placeholder) |
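
The ROUGE-L column is based on the longest common subsequence (LCS) between the predicted and reference answers. The following is a minimal sketch of the equal-weight F-measure variant, assuming simple whitespace tokenization after lowercasing; `tokenizeWords`, `lcsLength`, and `rougeL` are hypothetical names, and the harness may use a different tokenizer or weighting.

```typescript
// Hypothetical sketch of ROUGE-L; not taken from the Remnic harness.

// Assumed tokenization: lowercase, then split on whitespace.
function tokenizeWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

// Length of the longest common subsequence, using a rolling
// dynamic-programming row so memory stays proportional to b.length.
function lcsLength(a: string[], b: string[]): number {
  let prev = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const curr = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      curr[j] =
        a[i - 1] === b[j - 1]
          ? prev[j - 1] + 1
          : Math.max(prev[j], curr[j - 1]);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ROUGE-L as the harmonic mean of LCS-based precision and recall.
function rougeL(prediction: string, reference: string): number {
  const pred = tokenizeWords(prediction);
  const ref = tokenizeWords(reference);
  if (pred.length === 0 || ref.length === 0) return 0;
  const lcs = lcsLength(pred, ref);
  if (lcs === 0) return 0;
  const precision = lcs / pred.length;
  const recall = lcs / ref.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Unlike bag-of-words overlap, the LCS rewards tokens that appear in the same order as the reference, so a shuffled answer scores lower here even when every word matches.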

How we run these

The full runbook lives at docs/benchmarks/runbook.md. Short version:

  1. Run `scripts/bench/fetch-datasets.sh --help` to print the Hugging Face download commands, then run them yourself to download the datasets; the script never auto-fetches.
  2. Run `pnpm exec remnic bench published --name longmemeval --dataset ./bench-datasets/longmemeval --model gpt-4o-mini --seed 42`.
  3. Verify with `pnpm exec tsx scripts/bench/verify-artifact.ts docs/benchmarks/results/*.json`.
  4. Commit the artifacts and tag the release.

Every artifact records schemaVersion, benchmarkId, datasetVersion, system.name, system.version, system.gitSha, model, seed, aggregate metrics, per-task scores, and timestamps. Any breaking schema change bumps BENCHMARK_ARTIFACT_SCHEMA_VERSION.
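
To make the field list above concrete, here is a rough TypeScript sketch of what such an artifact could look like, plus a small structural check. Only the field names listed above come from this page. The property names `metrics`, `tasks`, `startedAt`, and `finishedAt`, every type, the constant's value, and the `listArtifactProblems` helper are assumptions for illustration; the authoritative schema is whatever `scripts/bench/verify-artifact.ts` enforces.

```typescript
// Assumed value; this page names the constant but not its current number.
const BENCHMARK_ARTIFACT_SCHEMA_VERSION = 1;

interface BenchmarkArtifact {
  schemaVersion: number;
  benchmarkId: string;
  datasetVersion: string;
  system: { name: string; version: string; gitSha: string };
  model: string;
  seed: number;
  // Hypothetical names for "aggregate metrics", "per-task scores",
  // and "timestamps"; the real keys may differ.
  metrics: Record<string, number>;
  tasks: Array<{ taskId: string; scores: Record<string, number> }>;
  startedAt: string;
  finishedAt: string;
}

// Returns human-readable problems; an empty array means the basic
// shape looks right. This is not the project's verifier.
function listArtifactProblems(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["artifact is not an object"];
  }
  const artifact = value as Record<string, unknown>;
  const problems: string[] = [];
  if (artifact["schemaVersion"] !== BENCHMARK_ARTIFACT_SCHEMA_VERSION) {
    problems.push("schemaVersion does not match the expected version");
  }
  for (const key of ["benchmarkId", "datasetVersion", "model"]) {
    if (typeof artifact[key] !== "string") {
      problems.push(`${key} must be a string`);
    }
  }
  if (!Number.isInteger(artifact["seed"])) {
    problems.push("seed must be an integer");
  }
  const rawSystem = artifact["system"];
  const system =
    typeof rawSystem === "object" && rawSystem !== null
      ? (rawSystem as Record<string, unknown>)
      : {};
  for (const key of ["name", "version", "gitSha"]) {
    if (typeof system[key] !== "string") {
      problems.push(`system.${key} must be a string`);
    }
  }
  return problems;
}

// Placeholder values only; this does not describe a real run.
const placeholderArtifact: BenchmarkArtifact = {
  schemaVersion: BENCHMARK_ARTIFACT_SCHEMA_VERSION,
  benchmarkId: "longmemeval",
  datasetVersion: "placeholder",
  system: { name: "remnic", version: "0.0.0", gitSha: "0000000" },
  model: "gpt-4o-mini",
  seed: 42,
  metrics: {},
  tasks: [],
  startedAt: "1970-01-01T00:00:00Z",
  finishedAt: "1970-01-01T00:00:00Z",
};
```

Comparing `schemaVersion` by strict equality means an artifact written under an older schema is flagged rather than silently misread, which matches the intent of bumping the version on any breaking change.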

Why only two benchmarks?

Buyers shortlist on LongMemEval + LoCoMo. MemPalace, Mastra, Supermemory, OMEGA, Zep, Mem0, and Letta all publish on one or both of these. Adding a third benchmark before these two are reproducible would be premature. After the first release-tagged numbers, the roadmap is LongMemEval-M / LongMemEval-L, LoCoMo-50, and Mastra-style observational-memory ablations, all tracked under the optional-follow-ups section of issue #566.