> **Note:** This repository was archived by the owner on Apr 21, 2026. It is now read-only.

# agent-memory-eval

Open benchmark suite for AI agent memory systems. Tests whether a memory system can handle the temporal, adversarial, and provenance demands that production AI agents actually face.

## What it tests

| Dimension | Weight | Fixtures | What it measures |
| --- | --- | --- | --- |
| Temporal | 25% | 10 | Current-state retrieval after updates, point-in-time queries, fact expiration, recency preference |
| Conflict | 25% | 10 | Contradiction detection, supersession, trust-level resolution, challenged facts, exclusive slots |
| Multi-hop | 20% | 10 | 2-hop and 3-hop reasoning chains, reverse lookups, cycle detection |
| Abstention | 20% | 10 | "I don't know" on unknown topics, retracted facts, equal-confidence conflicts, low-confidence data, future states |
| Provenance | 10% | 10 | Source attribution, span citation, derived-fact chains, multi-source corroboration |

50 fixtures in total. Each fixture stores realistic assertions (employee records, financial data, infrastructure state, policy documents), then queries the system and checks the answers for correctness.
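Purely as an illustration of this store-then-query structure, a temporal fixture might look like the sketch below. The field names are hypothetical, not the package's actual schema:

```typescript
// Hypothetical shape of a temporal fixture. Field names here are
// illustrative only, not the package's actual schema.
const planUpdateFixture = {
  id: 'temporal-plan-update',
  category: 'temporal',
  assertions: [
    { key: 'customer:acme:plan', value: 'Free', timestamp: '2024-01-01T00:00:00Z' },
    { key: 'customer:acme:plan', value: 'Enterprise', timestamp: '2024-06-01T00:00:00Z' },
  ],
  query: 'What plan is Acme on?',
  // Current-state retrieval must prefer the newer fact.
  expected: 'Enterprise',
};
```

The point of the shape is that the harness controls both what is stored and when, so it can score whether the system prefers the newer fact.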

## Quick start

### Against VerifiedState

```bash
# Install
npm install agent-memory-eval

# Set credentials
export VERIFIEDSTATE_API_KEY="your-api-key"
export VERIFIEDSTATE_NAMESPACE_ID="your-namespace-id"

# Run all benchmarks
npx agent-memory-eval --verbose

# Run specific categories
npx agent-memory-eval --category temporal,conflict --output markdown

# Save results
npx agent-memory-eval --save results.json
```

### Against a custom memory system

Implement the `MemorySystemAdapter` interface and pass it as a module path:

```typescript
import type { MemorySystemAdapter } from 'agent-memory-eval';

export class MyMemoryAdapter implements MemorySystemAdapter {
  name = 'my-memory-system';
  version = '1.0.0';

  async store(params) { /* ... */ }
  async query(params) { /* ... */ }
  async queryAt(params) { /* ... */ }
  async getConflicts(factId) { /* ... */ }
  async reset() { /* ... */ }
}

export default MyMemoryAdapter;
```

Then run:

```bash
npx agent-memory-eval --adapter ./path/to/my-adapter.js
```
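For orientation, here is a minimal in-memory sketch of what an adapter could look like. The parameter shapes and return types below are assumptions for illustration; the authoritative signatures are the ones on the exported `MemorySystemAdapter` type:

```typescript
// Illustrative fact record; the real adapter contract may differ.
interface Fact {
  id: string;
  key: string;
  value: string;
  timestamp: number;
}

// Minimal in-memory adapter sketch. Deliberately not declared as
// `implements MemorySystemAdapter`, since the parameter shapes here
// are assumed rather than taken from the package's actual types.
class InMemoryAdapter {
  name = 'in-memory-sketch';
  version = '0.0.1';
  private facts: Fact[] = [];

  async store(params: { id: string; key: string; value: string; timestamp?: number }): Promise<void> {
    this.facts.push({ ...params, timestamp: params.timestamp ?? Date.now() });
  }

  // Return the newest value for a key, or null to abstain.
  async query(params: { key: string }): Promise<string | null> {
    const matches = this.facts
      .filter((f) => f.key === params.key)
      .sort((a, b) => b.timestamp - a.timestamp);
    return matches.length > 0 ? matches[0].value : null;
  }

  // Point-in-time query: newest fact at or before `at`.
  async queryAt(params: { key: string; at: number }): Promise<string | null> {
    const matches = this.facts
      .filter((f) => f.key === params.key && f.timestamp <= params.at)
      .sort((a, b) => b.timestamp - a.timestamp);
    return matches.length > 0 ? matches[0].value : null;
  }

  // Treat facts on the same key with a different value as conflicts.
  async getConflicts(factId: string): Promise<Fact[]> {
    const fact = this.facts.find((f) => f.id === factId);
    if (!fact) return [];
    return this.facts.filter((f) => f.id !== factId && f.key === fact.key && f.value !== fact.value);
  }

  async reset(): Promise<void> {
    this.facts = [];
  }
}
```

A real adapter would delegate to its backing store; this sketch only shows the shape of the contract the harness exercises, including returning `null` to abstain.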

## Programmatic usage

```typescript
import { runBenchmark, temporalFixtures } from 'agent-memory-eval';
import { MyAdapter } from './my-adapter';

const result = await runBenchmark(new MyAdapter(), {
  categories: ['temporal', 'conflict'],
  verbose: true,
});

console.log(`Composite score: ${(result.composite_score * 100).toFixed(1)}%`);
```

## CLI options

| Flag | Description | Default |
| --- | --- | --- |
| `--adapter <name>` | Built-in adapter name or path to a custom module | `verifiedstate` |
| `--api-key <key>` | API key (or set via env var) | - |
| `--namespace <id>` | Namespace ID (or set via env var) | - |
| `--base-url <url>` | Override the base URL | - |
| `--category <list>` | Comma-separated: `temporal`, `conflict`, `multihop`, `abstention`, `provenance` | all |
| `--output <format>` | `table`, `json`, or `markdown` | `table` |
| `--verbose` | Print per-fixture pass/fail during the run | `false` |
| `--save <path>` | Write JSON results to a file | - |

## Why these dimensions matter

**Temporal reasoning** is fundamental. AI agents that cannot distinguish "the customer was on the Free plan" from "the customer is on the Enterprise plan" will give dangerously stale answers. Most vector-only memory systems fail here.

**Conflict detection** matters because real-world data is messy. Two sources will disagree. The memory system must detect this, not silently pick one.

**Multi-hop retrieval** tests whether the system can chain facts together: "Alice is on team X, team X is in office Y, therefore Alice is in office Y." Flat retrieval cannot do this.
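The chaining these fixtures exercise can be sketched as a walk over a relation graph. The `edges` data and `chain` helper below are hypothetical illustrations, not part of the benchmark:

```typescript
// Hypothetical fact store: subject -> { relation: object }.
const edges: Record<string, Record<string, string>> = {
  alice: { team: 'team-x' },
  'team-x': { office: 'office-y' },
};

// Follow a chain of relations from a starting entity, abstaining
// (returning null) if any hop is missing rather than guessing.
function chain(start: string, relations: string[]): string | null {
  let current = start;
  for (const rel of relations) {
    const next = edges[current]?.[rel];
    if (next === undefined) return null; // abstain on a broken chain
    current = next;
  }
  return current;
}

// chain('alice', ['team', 'office']) resolves the 2-hop query
// "Which office is Alice in?" via team-x.
```

Note how a broken chain degrades into abstention, which is exactly the interaction between the multi-hop and abstention dimensions that the fixtures probe.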

**Abstention** is the most underrated capability. A memory system that hallucinates when it does not know something is worse than one that says "I don't know." We test retracted facts, unresolvable conflicts, and topics never stored.

**Provenance** closes the trust loop. If the system says the churn rate is 4.2%, the user needs to know that came from the analytics dashboard, not a Slack message.

## Scoring

Each fixture is scored 0.0 to 1.0 based on the checks it defines (content containment, abstention correctness, conflict detection, source citation). Category scores are the mean of their fixture scores. The composite score is the weighted sum:

```
composite = temporal * 0.25 + conflict * 0.25 + multihop * 0.20 + abstention * 0.20 + provenance * 0.10
```
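The same weighted sum can be written in code. The `compositeScore` helper below is a hypothetical sketch, not a package export; category names are assumed to match the table above:

```typescript
// Category weights from the scoring formula above.
const WEIGHTS = {
  temporal: 0.25,
  conflict: 0.25,
  multihop: 0.2,
  abstention: 0.2,
  provenance: 0.1,
} as const;

type CategoryScores = { [K in keyof typeof WEIGHTS]: number };

// Weighted sum of per-category scores, each in [0.0, 1.0].
function compositeScore(scores: CategoryScores): number {
  return (Object.keys(WEIGHTS) as Array<keyof typeof WEIGHTS>).reduce(
    (sum, category) => sum + scores[category] * WEIGHTS[category],
    0,
  );
}
```

Because the weights sum to 1.0, a perfect score in every category yields a composite of 1.0 (up to floating-point rounding).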

## Protocol specification

See the Verified Memory Protocol Specification for the full protocol that VerifiedState implements.

## License

Apache 2.0
