Open benchmark suite for AI agent memory systems. Tests whether a memory system can handle the temporal, adversarial, and provenance demands that production AI agents actually face.
| Dimension | Weight | Fixtures | What it measures |
|---|---|---|---|
| Temporal | 25% | 10 | Current-state retrieval after updates, point-in-time queries, fact expiration, recency preference |
| Conflict | 25% | 10 | Contradiction detection, supersession, trust-level resolution, challenged facts, exclusive slots |
| Multi-hop | 20% | 10 | 2-hop and 3-hop reasoning chains, reverse lookups, cycle detection |
| Abstention | 20% | 10 | "I don't know" on unknown topics, retracted facts, equal-confidence conflicts, low-confidence data, future states |
| Provenance | 10% | 10 | Source attribution, span citation, derived-fact chains, multi-source corroboration |
50 total fixtures. Each fixture stores realistic assertions (employee records, financial data, infrastructure state, policy documents) and queries the system for correctness.
```bash
# Install
npm install agent-memory-eval

# Set credentials
export VERIFIEDSTATE_API_KEY="your-api-key"
export VERIFIEDSTATE_NAMESPACE_ID="your-namespace-id"

# Run all benchmarks
npx agent-memory-eval --verbose

# Run specific categories
npx agent-memory-eval --category temporal,conflict --output markdown

# Save results
npx agent-memory-eval --save results.json
```

Implement the `MemorySystemAdapter` interface and pass it as a module path:
```typescript
import type { MemorySystemAdapter } from 'agent-memory-eval';

export class MyMemoryAdapter implements MemorySystemAdapter {
  name = 'my-memory-system';
  version = '1.0.0';

  async store(params) { /* ... */ }
  async query(params) { /* ... */ }
  async queryAt(params) { /* ... */ }
  async getConflicts(factId) { /* ... */ }
  async reset() { /* ... */ }
}

export default MyMemoryAdapter;
```

Then run:
```bash
npx agent-memory-eval --adapter ./path/to/my-adapter.js
```

```typescript
import { runBenchmark, temporalFixtures } from 'agent-memory-eval';
import { MyMemoryAdapter } from './my-adapter';

const result = await runBenchmark(new MyMemoryAdapter(), {
  categories: ['temporal', 'conflict'],
  verbose: true,
});

console.log(`Composite score: ${(result.composite_score * 100).toFixed(1)}%`);
```

| Flag | Description | Default |
|---|---|---|
| `--adapter <name>` | Built-in adapter name or path to custom module | `verifiedstate` |
| `--api-key <key>` | API key (or set via env var) | - |
| `--namespace <id>` | Namespace ID (or set via env var) | - |
| `--base-url <url>` | Override the base URL | - |
| `--category <list>` | Comma-separated: temporal, conflict, multihop, abstention, provenance | all |
| `--output <format>` | table, json, or markdown | table |
| `--verbose` | Print per-fixture pass/fail during run | false |
| `--save <path>` | Write JSON results to file | - |
Temporal reasoning is fundamental. AI agents that cannot distinguish "the customer was on the Free plan" from "the customer is on the Enterprise plan" will give dangerously stale answers. Most vector-only memory systems fail here.
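To make the Free-versus-Enterprise distinction concrete, here is a minimal sketch of a point-in-time lookup. The `Assertion` shape, the `validFrom` field, and the `valueAt` helper are hypothetical names chosen for illustration; they are not part of the `agent-memory-eval` API, and a real memory system may model validity intervals differently.

```typescript
// Hypothetical record shape: one timestamped assertion about a subject.
type Assertion = {
  subject: string;
  attribute: string;
  value: string;
  validFrom: number; // epoch milliseconds at which the value took effect
};

const history: Assertion[] = [
  { subject: 'customer-1', attribute: 'plan', value: 'Free', validFrom: Date.UTC(2024, 0, 1) },
  { subject: 'customer-1', attribute: 'plan', value: 'Enterprise', validFrom: Date.UTC(2024, 5, 1) },
];

// Return the value in effect at time `at`, or null if nothing was known yet.
function valueAt(
  subject: string,
  attribute: string,
  at: number,
  store: Assertion[],
): string | null {
  let best: Assertion | null = null;
  for (const a of store) {
    if (a.subject !== subject || a.attribute !== attribute) continue;
    if (a.validFrom > at) continue; // not yet in effect at the query time
    if (best === null || a.validFrom > best.validFrom) best = a;
  }
  return best ? best.value : null;
}

// "Was on" versus "is on": the same query at two different times.
const planInMarch = valueAt('customer-1', 'plan', Date.UTC(2024, 2, 1), history);
const planInSeptember = valueAt('customer-1', 'plan', Date.UTC(2024, 8, 1), history);
console.log(planInMarch, planInSeptember);
```

A system that ranks only by embedding similarity has no equivalent of the `validFrom` comparison, which is why both assertions can look equally relevant to it.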
Conflict detection matters because real-world data is messy. Two sources will disagree. The memory system must detect this, not silently pick one.
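As a rough illustration of detecting a disagreement rather than silently picking a side, the sketch below groups facts by slot and reports any slot holding more than one distinct value. The `SourcedFact` type and `findConflicts` function are assumptions made for this example and do not describe how any particular adapter implements `getConflicts`.

```typescript
// Hypothetical fact shape: a value for a named slot, tagged with its source.
type SourcedFact = { slot: string; value: string; source: string };

// Group facts by slot and return only the slots whose values disagree.
function findConflicts(facts: SourcedFact[]): Map<string, SourcedFact[]> {
  const bySlot = new Map<string, SourcedFact[]>();
  for (const fact of facts) {
    const group = bySlot.get(fact.slot) || [];
    group.push(fact);
    bySlot.set(fact.slot, group);
  }

  const conflicts = new Map<string, SourcedFact[]>();
  bySlot.forEach((group, slot) => {
    const distinctValues = new Set(group.map((f) => f.value));
    if (distinctValues.size > 1) conflicts.set(slot, group);
  });
  return conflicts;
}

// Illustrative data only: two sources disagree on headcount, and agree on region.
const sampleFacts: SourcedFact[] = [
  { slot: 'acme.headcount', value: '120', source: 'hr-system' },
  { slot: 'acme.headcount', value: '135', source: 'sales-deck' },
  { slot: 'acme.region', value: 'EMEA', source: 'hr-system' },
  { slot: 'acme.region', value: 'EMEA', source: 'crm' },
];

const detected = findConflicts(sampleFacts);
console.log(Array.from(detected.keys()));
```

Returning the full group, sources included, leaves the resolution policy (trust level, recency, or abstention) to the caller.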
Multi-hop retrieval tests whether the system can chain facts together. "Alice is on team X, team X is in office Y, therefore Alice is in office Y." Flat retrieval cannot do this.
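The Alice example can be sketched as a chain walk over stored facts. This is a toy illustration under assumed names (`Fact`, `followChain`, and the `member_of` and `located_in` relation labels are all hypothetical); it shows the shape of 2-hop reasoning with a simple cycle guard, not how the benchmark or any adapter implements it.

```typescript
// Hypothetical triple: subject --relation--> object.
type Fact = { subject: string; relation: string; object: string };

const facts: Fact[] = [
  { subject: 'Alice', relation: 'member_of', object: 'team X' },
  { subject: 'team X', relation: 'located_in', object: 'office Y' },
];

// Follow one relation per hop from `start`. Returns null when a hop has
// no matching fact or when the chain would revisit an entity (a cycle).
function followChain(start: string, relations: string[], store: Fact[]): string | null {
  const visited = new Set<string>([start]);
  let current = start;
  for (const relation of relations) {
    const next = store.find((f) => f.subject === current && f.relation === relation);
    if (!next) return null; // missing link: the chain cannot be completed
    if (visited.has(next.object)) return null; // cycle detected
    visited.add(next.object);
    current = next.object;
  }
  return current;
}

// Two hops: Alice -> team X -> office Y.
console.log(followChain('Alice', ['member_of', 'located_in'], facts));
```

A single similarity search for "Where is Alice?" would need one stored fact that mentions both Alice and office Y; the chain walk needs only the two facts that actually exist.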
Abstention is the most underrated capability. A memory system that hallucinates when it does not know something is worse than one that says "I don't know." We test retracted facts, unresolvable conflicts, and topics never stored.
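One way to picture the three abstention cases (retracted facts, unresolvable conflicts, nothing stored) is a resolver that returns `null` instead of guessing. The `Candidate` type, the `answerOrAbstain` function, and the 0.5 confidence threshold are assumptions for this sketch only.

```typescript
// Hypothetical candidate answer retrieved for a query.
type Candidate = { value: string; confidence: number; retracted: boolean };

// Return the best answer, or null to signal "I don't know".
// The 0.5 default threshold is an arbitrary illustrative choice.
function answerOrAbstain(candidates: Candidate[], minConfidence = 0.5): string | null {
  // Drop retracted facts and anything below the confidence threshold.
  const live = candidates.filter((c) => !c.retracted && c.confidence >= minConfidence);
  if (live.length === 0) return null; // nothing stored, or nothing trustworthy

  const sorted = [...live].sort((a, b) => b.confidence - a.confidence);
  const top = sorted[0];
  const runnerUp = sorted[1];

  // Two different values tied at the top cannot be resolved: abstain.
  if (runnerUp && runnerUp.confidence === top.confidence && runnerUp.value !== top.value) {
    return null;
  }
  return top.value;
}

console.log(answerOrAbstain([])); // topic never stored
console.log(answerOrAbstain([{ value: 'Paris', confidence: 0.9, retracted: true }]));
```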
Provenance closes the trust loop. If the system says the churn rate is 4.2%, the user needs to know that came from the analytics dashboard, not a Slack message.
Each fixture is scored 0.0 to 1.0 based on the checks it defines (content containment, abstention correctness, conflict detection, source citation). Category scores are the mean of their fixture scores. The composite score is the weighted sum:
```
composite = temporal * 0.25 + conflict * 0.25 + multihop * 0.20 + abstention * 0.20 + provenance * 0.10
```
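The scoring rule can be worked through in a few lines. The weights below are copied from the dimension table; the function names, the handling of an empty category, and the sample scores are illustrative assumptions rather than the package's actual implementation.

```typescript
type Category = 'temporal' | 'conflict' | 'multihop' | 'abstention' | 'provenance';

// Weights as listed in the dimension table (they sum to 1.0).
const WEIGHTS: Record<Category, number> = {
  temporal: 0.25,
  conflict: 0.25,
  multihop: 0.2,
  abstention: 0.2,
  provenance: 0.1,
};

// Category score: the mean of its fixture scores (each 0.0 to 1.0).
// Returning 0 for an empty list is an assumption made for this sketch.
function categoryScore(fixtureScores: number[]): number {
  if (fixtureScores.length === 0) return 0;
  const total = fixtureScores.reduce((sum, s) => sum + s, 0);
  return total / fixtureScores.length;
}

// Composite score: the weighted sum of the five category scores.
function compositeScore(scores: Record<Category, number>): number {
  const categories = Object.keys(WEIGHTS) as Category[];
  return categories.reduce((sum, c) => sum + WEIGHTS[c] * scores[c], 0);
}

// Made-up category scores, purely to show the arithmetic:
// 0.9*0.25 + 0.8*0.25 + 0.7*0.20 + 1.0*0.20 + 0.5*0.10 = 0.815
const exampleScores: Record<Category, number> = {
  temporal: 0.9,
  conflict: 0.8,
  multihop: 0.7,
  abstention: 1.0,
  provenance: 0.5,
};

console.log(`Composite score: ${(compositeScore(exampleScores) * 100).toFixed(1)}%`);
```

Because Temporal and Conflict together carry half the weight, a system that scores perfectly on the other three dimensions but zero on those two cannot exceed a composite of 50%.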
See the Verified Memory Protocol Specification for the full protocol that VerifiedState implements.
Apache 2.0