Ship AI agents with a seatbelt on.
agent-release-gate is a lightweight CI gate for AI systems. It checks agent outputs against expected behavior, catches regressions against a baseline, and returns a hard pass/fail signal so bad runs don't sneak into production.
Most AI failures don't look like crashes. They look like:
- Confident nonsense
- Slower responses and higher cost over time
- A prompt tweak that quietly made outcomes worse
This tool makes those problems visible before release. It:
- Scores each test case using expected + forbidden phrases
- Calculates pass rate, average latency, and average cost
- Fails if quality drops below your threshold
- Optionally compares against a baseline report and blocks regressions
- Optionally enforces baseline latency/cost drift caps so slower or pricier runs fail fast
- Optionally records run summaries and detects sustained pass-rate drift across releases
- Supports per-case latency/cost limits to catch outliers hidden by averages
- Enforces telemetry presence when global average latency/cost limits are configured
- Outputs both JSON (for machines) and Markdown (for humans)
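
The scoring and threshold steps listed above can be sketched roughly as follows. This is an illustration, not the tool's actual implementation: the function names `score_case` and `gate_passes`, the equal weighting of the two expected-phrase checks, and the rule that any forbidden phrase zeroes the score are all assumptions.

```python
from typing import Sequence


def score_case(
    output: str,
    expected_all: Sequence[str] = (),
    expected_any: Sequence[str] = (),
    forbidden: Sequence[str] = (),
) -> float:
    """Return a score in [0, 1] for one agent output (hypothetical formula)."""
    text = output.lower()

    # Assumption: a single forbidden phrase zeroes the score.
    if any(phrase.lower() in text for phrase in forbidden):
        return 0.0

    parts = []
    if expected_all:
        # Fraction of the required phrases that appear in the output.
        hits = sum(phrase.lower() in text for phrase in expected_all)
        parts.append(hits / len(expected_all))
    if expected_any:
        # Full credit if at least one of the alternatives appears.
        found = any(phrase.lower() in text for phrase in expected_any)
        parts.append(1.0 if found else 0.0)

    # Assumption: with no expected phrases configured, a clean output scores 1.0.
    return sum(parts) / len(parts) if parts else 1.0


def gate_passes(
    scores: Sequence[float], min_score: float, minimum_pass_rate: float
) -> bool:
    """Pass when the share of cases scoring at least min_score meets the threshold."""
    if not scores:
        return False  # Assumption: an empty run should not pass.
    passed = sum(score >= min_score for score in scores)
    return passed / len(scores) >= minimum_pass_rate
```

Under these assumptions, an output that mentions "refund" but neither the delivery window nor an order reference would score 0.25 and fail a `min_score` of 0.72.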
Install the package, then run the gate:

    pip install -e .

    argate evaluate \
      --spec examples/spec.yaml \
      --results examples/results.json \
      --output gate-report.json \
      --markdown gate-report.md

The command exits with:
- 0 when the gate passes
- 1 when the gate fails
Perfect for CI pipelines.
Single-baseline checks catch point-in-time regressions. Trend analysis catches slow degradation over many runs.
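
One way to picture the difference is a least-squares slope over the most recent pass rates. The sketch below is an assumption about how such a check could work, not a description of what `argate trend` does internally; `pass_rate_slope`, `is_declining`, and the `tolerance` parameter are hypothetical names.

```python
from typing import Sequence


def pass_rate_slope(pass_rates: Sequence[float], window: int = 10) -> float:
    """Least-squares slope of pass rate per run over the last `window` runs."""
    recent = list(pass_rates)[-window:]
    n = len(recent)
    if n < 2:
        return 0.0  # Not enough history to estimate a trend.
    mean_x = (n - 1) / 2
    mean_y = sum(recent) / n
    covariance = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(recent))
    variance = sum((i - mean_x) ** 2 for i in range(n))
    return covariance / variance


def is_declining(
    pass_rates: Sequence[float], window: int = 10, tolerance: float = 0.005
) -> bool:
    """Flag a sustained decline: the slope is more negative than the tolerance."""
    return pass_rate_slope(pass_rates, window) < -tolerance
```

In a history such as 0.95, 0.94, 0.93, 0.92, 0.91, each run drops by only 0.01 compared with the one before it, which a single-baseline check with a 0.02 allowance would accept (if that allowance is read as an absolute pass-rate drop), yet the slope over the window is clearly negative.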
    # Record this run in a history folder
    argate evaluate \
      --spec examples/spec.yaml \
      --results examples/results.json \
      --record-history .gate-history

    # Analyze the latest 10 runs and print trend JSON
    argate trend --history .gate-history --window 10

    # CI mode: fail when pass-rate trend is declining
    argate trend --history .gate-history --fail-on-regression

A spec file looks like this:

    global:
      minimum_pass_rate: 0.8
      allowed_regression: 0.02
      max_avg_latency_ms: 1500
      max_avg_cost_usd: 0.03
      max_avg_latency_regression_pct: 0.15
      max_avg_cost_regression_pct: 0.10

    cases:
      - id: refund_status
        expected_all: ["refund", "3-5 business days"]
        expected_any: ["order", "transaction"]
        forbidden: ["cannot help", "policy not found"]
        min_score: 0.72
        max_latency_ms: 1200
        max_cost_usd: 0.02

max_latency_ms and max_cost_usd are optional per-case guardrails. If set, that case fails when telemetry is missing or exceeds the limit.
When max_avg_latency_ms or max_avg_cost_usd is configured globally, the gate also fails if the corresponding telemetry is missing across the run.
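
The telemetry rules above can be illustrated with a small sketch. It is not the tool's code: `within_case_limit` and `within_average_limit` are hypothetical helpers, and treating a run as missing telemetry when any single case lacks a value is only one possible reading of "missing across the run".

```python
from typing import Optional, Sequence


def within_case_limit(value: Optional[float], limit: Optional[float]) -> bool:
    """Per-case guardrail: no limit means no check; a set limit requires telemetry."""
    if limit is None:
        return True
    if value is None:
        return False  # A limit is configured but the telemetry is missing.
    return value <= limit


def within_average_limit(
    values: Sequence[Optional[float]], limit: Optional[float]
) -> bool:
    """Global guardrail on the run average, failing when telemetry is absent."""
    if limit is None:
        return True
    # Assumption: any missing value counts as missing telemetry for the run.
    if not values or any(value is None for value in values):
        return False
    return sum(values) / len(values) <= limit
```

With latencies of 1000, 1400, and 2000 ms, the average of about 1467 ms stays under a global limit of 1500, while the 2000 ms case fails a per-case limit of 1200. That is the kind of outlier an average alone would hide.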
When --baseline is provided, you can also set max_avg_latency_regression_pct and/or max_avg_cost_regression_pct to fail if average latency or cost regresses beyond the allowed percentage increase vs baseline.
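
As a worked example of the drift caps, the sketch below assumes the configured value is a fraction (so 0.15 allows a 15% increase) and that the increase is measured relative to the baseline average. `exceeds_regression_cap` is a hypothetical helper, not part of the tool's API.

```python
def exceeds_regression_cap(
    current_avg: float, baseline_avg: float, max_regression: float
) -> bool:
    """True when current_avg rose by more than max_regression relative to baseline_avg."""
    if baseline_avg <= 0:
        # Assumption: without a usable baseline the relative increase is
        # undefined, so this sketch skips the check instead of failing.
        return False
    increase = (current_avg - baseline_avg) / baseline_avg
    return increase > max_regression
```

With a baseline average latency of 1000 ms and a cap of 0.15, a run averaging 1100 ms (a 10% increase) passes this check, and a run averaging 1200 ms (a 20% increase) fails it.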
- src/agent_release_gate/: scoring + gate logic
- examples/: runnable demo spec and results
- tests/: unit tests
- .github/workflows/ci.yml: CI example
To run the tests:

    pip install -e . pytest
    pytest

License: MIT