ACC control-quality benchmark¶
Retrieval benchmarks (see ../benchmarks/README.md) measure finding
the right records. This benchmark measures what a memory controller must additionally get
right: keeping the committed context small, on-topic, and free of drift and
hallucination. It lives in the gauge crate (crates/gauge/src/bench.rs) and runs over a
labeled set of recall candidates.
Metrics¶
Each candidate fact carries a ground-truth FactLabel:
| label | meaning | admitting it is… |
|---|---|---|
gold |
relevant and true | correct |
distractor |
irrelevant noise | wasted footprint |
contradiction |
contradicts a gold fact | drift |
fabrication |
unsupported / invented | hallucination |
One ACC cycle is run and the committed state is scored:
- footprint_tokens — tokens in the committed context (cl100k_base).
- footprint_ratio —
footprint_tokens / raw_recall_tokens; the share of the raw recall dump that actually gets committed (lower is better). This is the token-efficiency moat number, directly comparable to a system's reported tokens/query. - precision / recall — over the gold facts.
- drift_rate — fraction of admitted facts labeled
contradiction. - hallucination_rate — fraction of admitted facts labeled
fabrication.
Footprint and the label-based rates are deterministic, so the default-gate arm runs in CI.
Arms¶
- default-gate — the deterministic gate (relevance threshold + redundancy rejection). It earns the footprint saving and catches near-duplicate contradictions, but cannot detect a plausible fabrication.
- judge-gate (feature
llm) — the LLM judge-eval gate scoring relevance / novelty / drift. It is what drivesdrift_rateandhallucination_ratetoward zero.
Run the deterministic report:
cargo run -p gauge --bin gauge-bench # markdown
cargo run -p gauge --bin gauge-bench -- --json # JSON
Competitor-comparable evaluation (LoCoMo / LongMemEval, vs mem0)¶
The metric shapes line up with the agent-memory literature so results are comparable:
footprint_tokens↔ mem0's reported tokens/query (mem0 ≈ 6.7 k/query; Artesian commits a bounded budget, default 2048, and reports the actual committed footprint).precision/recall↔ LoCoMo / LongMemEval answer scoring.drift_rate/hallucination_ratehave no analogue in retrieval-only benchmarks — they are the memory-control axis.
To evaluate a real dataset, load each conversation turn's candidate facts and their
gold/contradiction labels into a BenchCase (the shape demo_case() returns), run both arms,
and aggregate BenchResult across cases. A full head-to-head with mem0 additionally runs mem0's
own pipeline on the same corpus and compares tokens/query and precision; that harness is external
to this repo (it needs the mem0 runtime) and is tracked separately.