Agent evaluation framework for Glean agents using LLM-as-judge methodology
Seer evaluates AI agents built in Glean's Agent Builder. It runs agents, scores their responses across multiple dimensions using a research-backed judge architecture, and tracks results over time.
```bash
bun install
cp .env.example .env
# Add your GLEAN_API_KEY (needs chat + search + agents scopes)
bun run db:push
```

Start the web UI:

```bash
cd web && bun install && bun run dev
```

Generate an eval set:

```bash
bun run src/cli.ts generate <agent-id> --count 5
```

Generation uses Glean's ADVANCED agent with company search to find real input values from your CRM/documents and generate grounded evaluation guidance.
```bash
# Quick mode (coverage + faithfulness, 2 judge calls/case)
bun run src/cli.ts run <set-id>

# Deep mode (+ factuality verification via company search)
bun run src/cli.ts run <set-id> --deep

# Multi-judge (Opus 4.6 + GPT-5)
bun run src/cli.ts run <set-id> --multi-judge
```

View results:

```bash
bun run src/cli.ts results <run-id>
```

Or use the Web UI for formatted results with markdown rendering and research-backed tooltips.
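How multi-judge verdicts are reconciled isn't documented here; one conservative option, shown purely as an illustrative sketch (the `Verdict` ordering and `combine()` helper are assumptions, not Seer's actual API), is to keep the more pessimistic of the two judges' categorical verdicts per dimension:

```typescript
// Hypothetical sketch: combining verdicts from two judges (e.g. Opus 4.6
// and GPT-5) by taking the more pessimistic category. The ordering and
// combine() helper are illustrative assumptions, not Seer's real code.
const ORDER = ["failure", "minimal", "partial", "substantial", "full"] as const;
type Verdict = (typeof ORDER)[number];

function combine(a: Verdict, b: Verdict): Verdict {
  // Lower index = worse outcome; keep the worse of the two verdicts.
  return ORDER.indexOf(a) <= ORDER.indexOf(b) ? a : b;
}
```

A "worst-of" rule keeps a single hallucinating judge from being averaged away; taking the majority or the mean are softer alternatives.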
Three judge calls, each measuring something different:
| Call | Dimensions | What it checks against | Needs expected answer? |
|---|---|---|---|
| Coverage | Topical Coverage, Response Quality | Eval guidance (themes to cover) | Yes |
| Faithfulness | Groundedness, Hallucination Risk | Agent's own retrieved documents | No |
| Factuality | Factual Accuracy | Live company data (judge searches independently) | No |
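The table above can be sketched as a typed configuration. The interface and field names below are illustrative assumptions for clarity, not Seer's actual types:

```typescript
// Illustrative TypeScript model of the three judge calls described in the
// table; the shape and field names are assumptions, not Seer's real code.
interface JudgeCall {
  name: string;
  dimensions: string[];
  checksAgainst: string;
  needsExpectedAnswer: boolean;
}

const JUDGE_CALLS: JudgeCall[] = [
  {
    name: "coverage",
    dimensions: ["Topical Coverage", "Response Quality"],
    checksAgainst: "eval guidance (themes to cover)",
    needsExpectedAnswer: true,
  },
  {
    name: "faithfulness",
    dimensions: ["Groundedness", "Hallucination Risk"],
    checksAgainst: "agent's own retrieved documents",
    needsExpectedAnswer: false,
  },
  {
    name: "factuality",
    dimensions: ["Factual Accuracy"],
    checksAgainst: "live company data (judge searches independently)",
    needsExpectedAnswer: false,
  },
];
```

Keeping each call's reference source separate is the point of the split: a response can be faithful to its retrieved documents yet still factually wrong against live data, and vice versa.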
Categorical scale (not 1-10): full → substantial → partial → minimal → failure
Categories are 15% more reliable than continuous scales (SJT research). The judge commits to a defined bucket instead of picking an arbitrary number.
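Categorical verdicts still need numbers for tracking over time. How Seer aggregates them isn't stated here; a minimal sketch, assuming an evenly spaced mapping (the numeric values and `aggregate()` helper are illustrative, only the category names come from this README):

```typescript
// Hypothetical mapping from categorical verdicts to numbers for run-level
// aggregation. Category names come from the README; the numeric values
// and the aggregate() helper are illustrative assumptions.
type Verdict = "full" | "substantial" | "partial" | "minimal" | "failure";

const VERDICT_SCORE: Record<Verdict, number> = {
  full: 1.0,
  substantial: 0.75,
  partial: 0.5,
  minimal: 0.25,
  failure: 0.0,
};

// Mean of mapped scores gives one number per dimension per run.
function aggregate(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 0;
  return verdicts.reduce((sum, v) => sum + VERDICT_SCORE[v], 0) / verdicts.length;
}
```

The judge only ever emits one of the five buckets; the numeric mapping is applied after the fact, so scale reliability isn't affected by it.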
Open /settings in the web UI. Saves to data/settings.json.
```
GLEAN_API_KEY=your_key_here
GLEAN_BACKEND=https://scio-prod-be.glean.com
GLEAN_INSTANCE=scio-prod
```

```bash
# Eval sets
bun run src/cli.ts set create --name <name> --agent-id <id>
bun run src/cli.ts set add-case <set-id> --query <query>
bun run src/cli.ts set view <set-id>
bun run src/cli.ts list sets

# Generate
bun run src/cli.ts generate <agent-id> --count <n>

# Run & results
bun run src/cli.ts run <set-id> [--deep] [--multi-judge]
bun run src/cli.ts results <run-id>
bun run src/cli.ts list runs
```

```
CLI ←→ Shared SQLite ←→ Web UI
              ↓
         Eval Engine
         ├── Agent Runner (runworkflow API)
         ├── Smart Generator (ADVANCED agent + company tools)
         ├── Judge (4-call architecture, Opus 4.6)
         └── Metrics (latency, tool calls)
```
See docs/evaluation-framework.md for the full evaluation philosophy and docs/architecture.md for system design.
MIT