Skip to content

ACC control-quality benchmark

Retrieval benchmarks (see ../benchmarks/README.md) measure finding the right records. This benchmark measures what a memory controller must additionally get right: keeping the committed context small, on-topic, and free of drift and hallucination. It lives in the gauge crate (crates/gauge/src/bench.rs) and runs over a labeled set of recall candidates.

Metrics

Each candidate fact carries a ground-truth FactLabel:

label meaning admitting it is…
gold relevant and true correct
distractor irrelevant noise wasted footprint
contradiction contradicts a gold fact drift
fabrication unsupported / invented hallucination

One ACC cycle is run and the committed state is scored:

  • footprint_tokens — tokens in the committed context (cl100k_base).
  • footprint_ratiofootprint_tokens / raw_recall_tokens; the share of the raw recall dump that actually gets committed (lower is better). This is the token-efficiency moat number, directly comparable to a system's reported tokens/query.
  • precision / recall — over the gold facts.
  • drift_rate — fraction of admitted facts labeled contradiction.
  • hallucination_rate — fraction of admitted facts labeled fabrication.

Footprint and the label-based rates are deterministic, so the default-gate arm runs in CI.

Arms

  • default-gate — the deterministic gate (relevance threshold + redundancy rejection). It earns the footprint saving and catches near-duplicate contradictions, but cannot detect a plausible fabrication.
  • judge-gate (feature llm) — the LLM judge-eval gate scoring relevance / novelty / drift. It is what drives drift_rate and hallucination_rate toward zero.

Run the deterministic report:

cargo run -p gauge --bin gauge-bench            # markdown
cargo run -p gauge --bin gauge-bench -- --json  # JSON

Competitor-comparable evaluation (LoCoMo / LongMemEval, vs mem0)

The metric shapes line up with the agent-memory literature so results are comparable:

  • footprint_tokens ↔ mem0's reported tokens/query (mem0 ≈ 6.7 k/query; Artesian commits a bounded budget, default 2048, and reports the actual committed footprint).
  • precision / recall ↔ LoCoMo / LongMemEval answer scoring.
  • drift_rate / hallucination_rate have no analogue in retrieval-only benchmarks — they are the memory-control axis.

To evaluate a real dataset, load each conversation turn's candidate facts and their gold/contradiction labels into a BenchCase (the shape demo_case() returns), run both arms, and aggregate BenchResult across cases. A full head-to-head with mem0 additionally runs mem0's own pipeline on the same corpus and compares tokens/query and precision; that harness is external to this repo (it needs the mem0 runtime) and is tracked separately.