> **Note:** This repository was archived by the owner on Apr 21, 2026. It is now read-only.

# agent-memory-eval

Open benchmark suite for AI agent memory systems. Tests whether a memory system can handle the temporal, adversarial, and provenance demands that production AI agents actually face.

## What it tests

| Dimension | Weight | Fixtures | What it measures |
| --- | --- | --- | --- |
| Temporal | 25% | 10 | Current-state retrieval after updates, point-in-time queries, fact expiration, recency preference |
| Conflict | 25% | 10 | Contradiction detection, supersession, trust-level resolution, challenged facts, exclusive slots |
| Multi-hop | 20% | 10 | 2-hop and 3-hop reasoning chains, reverse lookups, cycle detection |
| Abstention | 20% | 10 | "I don't know" on unknown topics, retracted facts, equal-confidence conflicts, low-confidence data, future states |
| Provenance | 10% | 10 | Source attribution, span citation, derived-fact chains, multi-source corroboration |

50 fixtures in total. Each fixture stores realistic assertions (employee records, financial data, infrastructure state, policy documents), then queries the system and checks the answers for correctness.
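Purely as an illustration of this store-then-query structure, a temporal fixture might look like the sketch below. The field names are hypothetical, not the package's actual schema:

```typescript
// Hypothetical shape of a temporal fixture. Field names here are
// illustrative only, not the package's actual schema.
const planUpdateFixture = {
  id: 'temporal-plan-update',
  category: 'temporal',
  assertions: [
    { key: 'customer:acme:plan', value: 'Free', timestamp: '2024-01-01T00:00:00Z' },
    { key: 'customer:acme:plan', value: 'Enterprise', timestamp: '2024-06-01T00:00:00Z' },
  ],
  query: 'What plan is Acme on?',
  // Current-state retrieval must prefer the newer fact.
  expected: 'Enterprise',
};
```

The point of the shape is that the harness controls both what is stored and when, so it can score whether the system prefers the newer fact.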

## Quick start

### Against VerifiedState

```bash
# Install
npm install agent-memory-eval

# Set credentials
export VERIFIEDSTATE_API_KEY="your-api-key"
export VERIFIEDSTATE_NAMESPACE_ID="your-namespace-id"

# Run all benchmarks
npx agent-memory-eval --verbose

# Run specific categories
npx agent-memory-eval --category temporal,conflict --output markdown

# Save results
npx agent-memory-eval --save results.json
```

### Against a custom memory system

Implement the `MemorySystemAdapter` interface and pass it as a module path:

```typescript
import type { MemorySystemAdapter } from 'agent-memory-eval';

export class MyMemoryAdapter implements MemorySystemAdapter {
  name = 'my-memory-system';
  version = '1.0.0';

  async store(params) { /* ... */ }
  async query(params) { /* ... */ }
  async queryAt(params) { /* ... */ }
  async getConflicts(factId) { /* ... */ }
  async reset() { /* ... */ }
}

export default MyMemoryAdapter;
```

Then run:

```bash
npx agent-memory-eval --adapter ./path/to/my-adapter.js
```
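For orientation, here is a minimal in-memory sketch of what an adapter could look like. The parameter shapes and return types below are assumptions for illustration; the authoritative signatures are the ones on the exported `MemorySystemAdapter` type:

```typescript
// Illustrative fact record; the real adapter contract may differ.
interface Fact {
  id: string;
  key: string;
  value: string;
  timestamp: number;
}

// Minimal in-memory adapter sketch. Deliberately not declared as
// `implements MemorySystemAdapter`, since the parameter shapes here
// are assumed rather than taken from the package's actual types.
class InMemoryAdapter {
  name = 'in-memory-sketch';
  version = '0.0.1';
  private facts: Fact[] = [];

  async store(params: { id: string; key: string; value: string; timestamp?: number }): Promise<void> {
    this.facts.push({ ...params, timestamp: params.timestamp ?? Date.now() });
  }

  // Return the newest value for a key, or null to abstain.
  async query(params: { key: string }): Promise<string | null> {
    const matches = this.facts
      .filter((f) => f.key === params.key)
      .sort((a, b) => b.timestamp - a.timestamp);
    return matches.length > 0 ? matches[0].value : null;
  }

  // Point-in-time query: newest fact at or before `at`.
  async queryAt(params: { key: string; at: number }): Promise<string | null> {
    const matches = this.facts
      .filter((f) => f.key === params.key && f.timestamp <= params.at)
      .sort((a, b) => b.timestamp - a.timestamp);
    return matches.length > 0 ? matches[0].value : null;
  }

  // Treat facts on the same key with a different value as conflicts.
  async getConflicts(factId: string): Promise<Fact[]> {
    const fact = this.facts.find((f) => f.id === factId);
    if (!fact) return [];
    return this.facts.filter((f) => f.id !== factId && f.key === fact.key && f.value !== fact.value);
  }

  async reset(): Promise<void> {
    this.facts = [];
  }
}
```

A real adapter would delegate to its backing store; this sketch only shows the shape of the contract the harness exercises, including returning `null` to abstain.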

## Programmatic usage

```typescript
import { runBenchmark, temporalFixtures } from 'agent-memory-eval';
import { MyAdapter } from './my-adapter';

const result = await runBenchmark(new MyAdapter(), {
  categories: ['temporal', 'conflict'],
  verbose: true,
});

console.log(`Composite score: ${(result.composite_score * 100).toFixed(1)}%`);
```

## CLI options

| Flag | Description | Default |
| --- | --- | --- |
| `--adapter <name>` | Built-in adapter name or path to a custom module | `verifiedstate` |
| `--api-key <key>` | API key (or set via env var) | - |
| `--namespace <id>` | Namespace ID (or set via env var) | - |
| `--base-url <url>` | Override the base URL | - |
| `--category <list>` | Comma-separated: `temporal`, `conflict`, `multihop`, `abstention`, `provenance` | all |
| `--output <format>` | `table`, `json`, or `markdown` | `table` |
| `--verbose` | Print per-fixture pass/fail during the run | `false` |
| `--save <path>` | Write JSON results to a file | - |

## Why these dimensions matter

**Temporal reasoning** is fundamental. AI agents that cannot distinguish "the customer was on the Free plan" from "the customer is on the Enterprise plan" will give dangerously stale answers. Most vector-only memory systems fail here.

**Conflict detection** matters because real-world data is messy. Two sources will disagree. The memory system must detect this, not silently pick one.

**Multi-hop retrieval** tests whether the system can chain facts together: "Alice is on team X, team X is in office Y, therefore Alice is in office Y." Flat retrieval cannot do this.
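The chaining these fixtures exercise can be sketched as a walk over a relation graph. The `edges` data and `chain` helper below are hypothetical illustrations, not part of the benchmark:

```typescript
// Hypothetical fact store: subject -> { relation: object }.
const edges: Record<string, Record<string, string>> = {
  alice: { team: 'team-x' },
  'team-x': { office: 'office-y' },
};

// Follow a chain of relations from a starting entity, abstaining
// (returning null) if any hop is missing rather than guessing.
function chain(start: string, relations: string[]): string | null {
  let current = start;
  for (const rel of relations) {
    const next = edges[current]?.[rel];
    if (next === undefined) return null; // abstain on a broken chain
    current = next;
  }
  return current;
}

// chain('alice', ['team', 'office']) resolves the 2-hop query
// "Which office is Alice in?" via team-x.
```

Note how a broken chain degrades into abstention, which is exactly the interaction between the multi-hop and abstention dimensions that the fixtures probe.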

**Abstention** is the most underrated capability. A memory system that hallucinates when it does not know something is worse than one that says "I don't know." We test retracted facts, unresolvable conflicts, and topics never stored.

**Provenance** closes the trust loop. If the system says the churn rate is 4.2%, the user needs to know that came from the analytics dashboard, not a Slack message.

## Scoring

Each fixture is scored 0.0 to 1.0 based on the checks it defines (content containment, abstention correctness, conflict detection, source citation). Category scores are the mean of their fixture scores. The composite score is the weighted sum:

```
composite = temporal * 0.25 + conflict * 0.25 + multihop * 0.20 + abstention * 0.20 + provenance * 0.10
```
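The same weighted sum can be written in code. The `compositeScore` helper below is a hypothetical sketch, not a package export; category names are assumed to match the table above:

```typescript
// Category weights from the scoring formula above.
const WEIGHTS = {
  temporal: 0.25,
  conflict: 0.25,
  multihop: 0.2,
  abstention: 0.2,
  provenance: 0.1,
} as const;

type CategoryScores = { [K in keyof typeof WEIGHTS]: number };

// Weighted sum of per-category scores, each in [0.0, 1.0].
function compositeScore(scores: CategoryScores): number {
  return (Object.keys(WEIGHTS) as Array<keyof typeof WEIGHTS>).reduce(
    (sum, category) => sum + scores[category] * WEIGHTS[category],
    0,
  );
}
```

Because the weights sum to 1.0, a perfect score in every category yields a composite of 1.0 (up to floating-point rounding).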

## Protocol specification

See the Verified Memory Protocol Specification for the full protocol that VerifiedState implements.

## License

Apache 2.0
