agent-release-gate

Ship AI agents with a seatbelt on.

agent-release-gate is a lightweight CI gate for AI systems. It checks agent outputs against expected behavior, catches regressions against a baseline, and returns a hard pass/fail signal so bad runs don't sneak into production.

Why this exists

Most AI failures don't look like crashes. They look like:

Confident nonsense
Slower responses and higher cost over time
A prompt tweak that quietly made outcomes worse

This tool makes those problems visible before release.

What it does

Scores each test case using expected + forbidden phrases
Calculates pass rate, average latency, and average cost
Fails if quality drops below your threshold
Optionally compares against a baseline report and blocks regressions
Optionally enforces baseline latency/cost drift caps so slower or pricier runs fail fast
Optionally records run summaries and detects sustained pass-rate drift across releases
Supports per-case latency/cost limits to catch outliers hidden by averages
Enforces telemetry presence when global average latency/cost limits are configured
Outputs both JSON (for machines) and Markdown (for humans)

Install

pip install -e .

Quick start

argate evaluate \
  --spec examples/spec.yaml \
  --results examples/results.json \
  --output gate-report.json \
  --markdown gate-report.md

The command exits with:

0 when the gate passes
1 when the gate fails

Perfect for CI pipelines.

Track trend drift across runs

Single-baseline checks catch point-in-time regressions. Trend analysis catches slow degradation over many runs.

# Record this run in a history folder
argate evaluate \
  --spec examples/spec.yaml \
  --results examples/results.json \
  --record-history .gate-history

# Analyze the latest 10 runs and print trend JSON
argate trend --history .gate-history --window 10

# CI mode: fail when pass-rate trend is declining
argate trend --history .gate-history --fail-on-regression

Spec format

global:
  minimum_pass_rate: 0.8
  allowed_regression: 0.02
  max_avg_latency_ms: 1500
  max_avg_cost_usd: 0.03
  max_avg_latency_regression_pct: 0.15
  max_avg_cost_regression_pct: 0.10

cases:
  - id: refund_status
    expected_all: ["refund", "3-5 business days"]
    expected_any: ["order", "transaction"]
    forbidden: ["cannot help", "policy not found"]
    min_score: 0.72
    max_latency_ms: 1200
    max_cost_usd: 0.02

max_latency_ms and max_cost_usd are optional per-case guardrails. If set, that case fails when telemetry is missing or exceeds the limit.

When max_avg_latency_ms or max_avg_cost_usd is configured globally, the gate also fails if the corresponding telemetry is missing across the run.

When --baseline is provided, you can also set max_avg_latency_regression_pct and/or max_avg_cost_regression_pct to fail if average latency or cost regresses beyond the allowed percentage increase vs baseline.

Repo layout

src/agent_release_gate/: scoring + gate logic
examples/: runnable demo spec and results
tests/: unit tests
.github/workflows/ci.yml: CI example

Development

pip install -e . pytest
pytest

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.github/workflows		.github/workflows
examples		examples
src		src
tests		tests
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

agent-release-gate

Why this exists

What it does

Install

Quick start

Track trend drift across runs

Spec format

Repo layout

Development

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

agent-release-gate

Why this exists

What it does

Install

Quick start

Track trend drift across runs

Spec format

Repo layout

Development

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages