feat: alerts with webhook delivery + outcome integration #20

Open
govindkavaturi-art wants to merge 1 commit into main from feat/alerts-with-webhook-delivery

Summary

Ports the alerts feature to OSS and ships a new webhook-based delivery path. Deliberately excludes SendGrid/email — OSS self-hosters configure their own alert_webhook_url and forward to Slack / Discord / ntfy / SMTP relay / whatever they run. Hosted cueapi.ai retains managed email delivery via SendGrid (see HOSTED_ONLY.md).

What's new

Alert model + migrations

  • app/models/alert.py — id, user_id, cue_id (nullable), execution_id (nullable), alert_type, severity, message, alert_metadata (DB column metadata), acknowledged, created_at
  • CHECK: alert_type IN ('outcome_timeout', 'verification_failed', 'consecutive_failures')
  • CHECK: severity IN ('info', 'warning', 'critical')
  • Indexes: user_id, (user_id, created_at), execution_id
  • Migration 018 creates the alerts table. Migration 019 adds alert_webhook_url (String 2048) + alert_webhook_secret (String 64) to users.
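The two CHECK constraints can be tried out in isolation. The sketch below is a simplified stand-in that uses the standard library's sqlite3 rather than the SQLAlchemy model and Alembic migrations in this PR; the column types, NOT NULL choices, index names, and the helpers `make_db` and `insert_alert` are assumptions for illustration only.

```python
import sqlite3

# Hypothetical, simplified DDL. The PR defines this table through SQLAlchemy
# and Alembic; column types, nullability and index names here are assumptions.
DDL = """
CREATE TABLE alerts (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    cue_id TEXT,
    execution_id TEXT,
    alert_type TEXT NOT NULL CHECK (
        alert_type IN ('outcome_timeout', 'verification_failed', 'consecutive_failures')
    ),
    severity TEXT NOT NULL CHECK (severity IN ('info', 'warning', 'critical')),
    message TEXT NOT NULL,
    metadata TEXT,
    acknowledged INTEGER NOT NULL DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX ix_alerts_user_id ON alerts (user_id);
CREATE INDEX ix_alerts_user_created ON alerts (user_id, created_at);
CREATE INDEX ix_alerts_execution_id ON alerts (execution_id);
"""


def make_db():
    """Create an in-memory database containing the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    return conn


def insert_alert(conn, alert_id, alert_type, severity):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO alerts (id, user_id, alert_type, severity, message) "
            "VALUES (?, ?, ?, ?, ?)",
            (alert_id, "u1", alert_type, severity, "example"),
        )
        return True
    except sqlite3.IntegrityError:
        return False
```

An unknown alert_type or severity is rejected by the database itself, so a bug in application code cannot persist an alert that the router would later refuse to filter on.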

Services

  • alert_service.create_alert — persists the row, then fire-and-forget schedules deliver_alert. Dedup window 5 minutes on (user_id, alert_type, execution_id|cue_id) so flapping executions don't flood user inboxes.
  • alert_service.count_consecutive_failures — walks execution history backwards, stops at the first non-failed row. Threshold 3.
  • alert_webhook.deliver_alert — HMAC-SHA256 signing over {timestamp}.{sorted_payload_json} (same scheme as the existing webhook signer), 10s timeout, SSRF re-resolve at delivery time (DNS rebind protection), never raises. Best-effort — a user's slow endpoint must not block outcome reporting.
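The dedup rule in create_alert can be sketched with an in-memory store. This is not the PR's implementation: `InMemoryAlertStore` is hypothetical, the real service queries the alerts table, and the choice to measure the window from the last alert that was actually created (not from the last suppressed attempt) is an assumption.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW_SECONDS = 300  # 5 minutes, as stated in the PR description


class InMemoryAlertStore:
    """Hypothetical stand-in for the alerts table, used only to show dedup."""

    def __init__(self):
        self._last_created = {}

    def create_alert(self, user_id, alert_type, execution_id=None, cue_id=None, now=None):
        """Return the new alert dict, or None when suppressed as a duplicate."""
        now = now or datetime.now(timezone.utc)
        # Dedup key: (user_id, alert_type, execution_id|cue_id).
        key = (user_id, alert_type, execution_id or cue_id)
        last = self._last_created.get(key)
        if last is not None and now - last < timedelta(seconds=DEDUP_WINDOW_SECONDS):
            return None
        self._last_created[key] = now
        return {
            "user_id": user_id,
            "alert_type": alert_type,
            "execution_id": execution_id,
            "cue_id": cue_id,
            "created_at": now,
        }
```

Because alert_type is part of the key, a verification_failed alert is not suppressed by a recent consecutive_failures alert on the same execution.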

Headers

Alert webhook POSTs carry:

  • X-CueAPI-Signature: v1=<hex>
  • X-CueAPI-Timestamp: <unix>
  • X-CueAPI-Alert-Id: <uuid>
  • X-CueAPI-Alert-Type: <type>
  • User-Agent: CueAPI/1.0
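A minimal sketch of how these headers could be produced on the sending side. The function name `build_alert_headers` is hypothetical, and the exact JSON serialization (compact separators with sorted keys) is an assumption; the authoritative scheme is whatever app.utils.signing.sign_payload does.

```python
import hashlib
import hmac
import json
import time


def build_alert_headers(secret, payload, alert_id, alert_type, timestamp=None):
    """Build the alert webhook headers for one delivery (illustrative only)."""
    ts = str(int(time.time()) if timestamp is None else timestamp)
    # Assumption: "sorted_payload_json" means sort_keys=True with compact separators.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    signed = f"{ts}.{body}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    return {
        "X-CueAPI-Signature": f"v1={digest}",
        "X-CueAPI-Timestamp": ts,
        "X-CueAPI-Alert-Id": alert_id,
        "X-CueAPI-Alert-Type": alert_type,
        "User-Agent": "CueAPI/1.0",
    }
```

Sorting the keys before signing means the signature does not depend on dict insertion order, so sender and receiver can rebuild the same byte string from the parsed payload.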

Endpoints

  • GET /v1/alerts — list with alert_type / since / limit / offset filters, 400 invalid_filter for unknown types, auth-scoped
  • PATCH /v1/auth/me — accepts alert_webhook_url; empty string clears; SSRF-validated at set time (400 invalid_alert_webhook_url)
  • GET /v1/auth/alert-webhook-secret — lazily generates a 64-char hex secret on first call; returns same value on subsequent calls
  • POST /v1/auth/alert-webhook-secret/regenerate — rotates; requires X-Confirm-Destructive: true
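The lazy-generation and rotation behaviour of the two secret endpoints can be sketched as plain functions over a user record. The dict-based user, the function names, and the use of PermissionError for a missing confirmation header are all assumptions for illustration; the actual routes and their error responses live in app/routers/auth_routes.py.

```python
import secrets


def get_alert_webhook_secret(user):
    """Return the user's secret, generating a 64-char hex value on first use."""
    if not user.get("alert_webhook_secret"):
        # token_hex(32) yields 32 random bytes encoded as 64 hex characters.
        user["alert_webhook_secret"] = secrets.token_hex(32)
    return user["alert_webhook_secret"]


def regenerate_alert_webhook_secret(user, headers):
    """Rotate the secret; refuse unless the confirmation header is present."""
    # Simplification: real HTTP header lookup is case-insensitive.
    if headers.get("X-Confirm-Destructive") != "true":
        raise PermissionError("X-Confirm-Destructive: true is required")
    user["alert_webhook_secret"] = secrets.token_hex(32)
    return user["alert_webhook_secret"]
```

A 64-character hex secret fits exactly in the String(64) column added by migration 019.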

Outcome integration (hooks wired into record_outcome, post-commit)

  1. verification_failed: fires when execution.outcome_state == 'verification_failed'. This state is set by the rule engine from PR #18 (feat: verification modes + evidence fields + transport combo rejection (hosted parity)); on current main the hook is dormant because no caller sets that state during record_outcome. Once PR #18 merges, the hook activates automatically with no further code change; only the migration chain needs rebasing.
  2. consecutive_failures: on any success=false, calls count_consecutive_failures and fires if the streak is ≥ 3. Works on current main independently of PR #18.
  3. outcome_timeout: deferred. Requires a deadline-checking poller that cueapi-core doesn't have yet. The CHECK constraint and router accept the type already, so wiring is drop-in when that poller lands.

All three branches are wrapped in try/except — alert firing can never break outcome reporting.
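The consecutive_failures branch and its try/except guard can be sketched as follows. This is a simplification under stated assumptions: execution history is modelled as a list of success booleans (oldest first) instead of database rows, and `fire_alerts_safely` and its `create_alert` callback are hypothetical names, not the PR's functions.

```python
import logging

logger = logging.getLogger("alerts_sketch")

CONSECUTIVE_FAILURE_THRESHOLD = 3  # constant name and value from the PR description


def count_consecutive_failures(history):
    """Walk the history backwards and stop at the first non-failed entry."""
    streak = 0
    for success in reversed(history):
        if success:
            break
        streak += 1
    return streak


def fire_alerts_safely(history, create_alert):
    """Post-commit hook sketch: returns True if an alert fired, never raises."""
    try:
        latest_failed = bool(history) and not history[-1]
        if latest_failed and count_consecutive_failures(history) >= CONSECUTIVE_FAILURE_THRESHOLD:
            create_alert("consecutive_failures")
            return True
    except Exception:
        # Alert firing must not break outcome reporting, so log and move on.
        logger.exception("alert hook failed")
    return False
```

An isolated failure after a success yields a streak of 1 and does not fire, which matches the isolated-failure case covered in test_outcome_triggers_alert.py.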

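Merge-order dependencies

  • Migration 018 sets down_revision = '016' intentionally: PR #18 introduces migration 017 but isn't merged yet. If PR #18 merges first, rebase this PR so the chain runs 017 → 018. This is documented in the migration docstring.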

Tests — 36 new, all passing

| Suite | Tests | Covers |
|---|---|---|
| test_alert_model.py | 6 | CRUD, CHECK rejection, parametrized valid types, index existence |
| test_alert_service.py | 7 | create, dedup window, dedup-doesn't-cross-types, consecutive-failures streak logic + threshold |
| test_alert_webhook_delivery.py | 7 | no-URL short-circuit, URL-without-secret skip, SSRF block, HMAC recompute, timeout/non-2xx/RuntimeError all swallowed |
| test_alerts_api.py | 8 | empty list, own alerts, type filter, invalid type rejected, pagination, cross-user scoping, auth required |
| test_alert_webhook_config.py | 6 | set URL, empty clears, SSRF rejection, lazy secret gen, confirmation required, rotation |
| test_outcome_triggers_alert.py | 3 | end-to-end verification_failed + consecutive_failures + isolated-failure-does-not-fire |

Full-suite delta: +36 passing, 0 new failures. The 7 pre-existing SDK-integration failures (ModuleNotFoundError: cueapi) are unchanged; they are environment-dependent and CI handles them.

Migration chain applied cleanly on a blank DB: 016 → 018 → 019 (alembic upgrade head).

Notable decisions

  1. Alert model shape mirrors the private codebase's (severity / message / acknowledged) rather than the minimal spec shape. Severity is useful for receivers (info vs warning vs critical); message + metadata separate the human summary from the structured context. The spec's required CHECK on alert_type is added.
  2. alert_metadata Python attr, metadata DB column. The name metadata is reserved on SQLAlchemy's declarative Base class, so the attribute is mapped via Column("metadata", JSONB). This matches the private codebase's layout so a future ORM sync is frictionless.
  3. Dedup window 5 min; threshold 3. Magic numbers chosen to match the spec. Both are module-level constants (DEDUP_WINDOW_SECONDS, CONSECUTIVE_FAILURE_THRESHOLD) — easy to flag for env-var-ification later.
  4. Fire-and-forget via asyncio.create_task (not a queue). Keeps OSS dependency-free — no arq/celery/dramatiq for alert delivery. Failures log + return. For hosted scale, the SendGrid path bypasses this.
  5. Lazy-generated signing secret. Registration doesn't pre-generate (keeps the user row narrow for users who never touch alerts). First GET populates; rotation invalidates immediately.
  6. URL-without-secret skips with a warning. Edge case: a user sets a URL but never calls GET /alert-webhook-secret. Rather than POST an unsigned payload, we log and skip — receivers can trust that any POST they get is signed.
  7. outcome_timeout deferred. No deadline poller in OSS core (the docker-compose.yml poller handles next_run, not outcome_deadline_at). Surfaced in the CHECK constraint and router already so the wiring is trivial when a deadline-checker lands.
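The fire-and-forget choice in decision 4 can be illustrated with a small asyncio sketch. The names `schedule_delivery` and `send` are hypothetical, and keeping strong references to tasks in a module-level set follows the general asyncio recommendation; whether the PR does the same is not stated.

```python
import asyncio
import logging

logger = logging.getLogger("alert_delivery_sketch")

# Keep strong references so pending tasks are not garbage-collected mid-flight.
_background_tasks = set()


async def deliver_alert(alert, send):
    """Best-effort delivery: 10s timeout, log on failure, never raise."""
    try:
        await asyncio.wait_for(send(alert), timeout=10)
    except Exception as exc:
        logger.warning("alert delivery failed for %s: %r", alert.get("id"), exc)


def schedule_delivery(alert, send):
    """Schedule delivery on the running event loop without awaiting it."""
    task = asyncio.create_task(deliver_alert(alert, send))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return task
```

The caller returns as soon as the task is scheduled, so a slow or failing receiver cannot delay outcome reporting. The trade-off is that deliveries still pending when the process exits are lost, which a queue-based design would avoid.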

Documentation

  • New README "Alerts" section with alert-type table, query examples, webhook setup walkthrough
  • examples/alert_webhook_receiver.py — 30-line Flask receiver demonstrating signature verification
  • CHANGELOG [Unreleased] entry
  • Explicit callout in README that email delivery is hosted-only and points at HOSTED_ONLY.md
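For readers who want the receiver-side check without Flask, a framework-free sketch is shown here. It is not the contents of examples/alert_webhook_receiver.py: `verify_alert_signature` and the `MAX_SKEW_SECONDS` replay tolerance are hypothetical, and the compact sorted-key JSON serialization is an assumption that must match whatever the sender actually signs.

```python
import hashlib
import hmac
import json
import time

MAX_SKEW_SECONDS = 300  # hypothetical replay tolerance, not taken from the PR


def verify_alert_signature(secret, headers, payload, now=None):
    """Return True only if the v1 signature matches and the timestamp is fresh."""
    signature = headers.get("X-CueAPI-Signature", "")
    timestamp = headers.get("X-CueAPI-Timestamp", "")
    if not signature.startswith("v1=") or not timestamp.isdigit():
        return False
    now = int(time.time()) if now is None else now
    if abs(now - int(timestamp)) > MAX_SKEW_SECONDS:
        return False
    # Assumption: the sender signs "{timestamp}.{json with sorted keys}".
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    expected = hmac.new(
        secret.encode("utf-8"), f"{timestamp}.{body}".encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison on bytes avoids timing leaks and non-ASCII errors.
    return hmac.compare_digest(signature[3:].encode("utf-8"), expected.encode("utf-8"))
```

A receiver would call this with the secret returned by GET /v1/auth/alert-webhook-secret and reject the request when it returns False.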

Test plan

  • 36 new tests pass locally
  • Full pytest tests/ — no new failures
  • Migrations 018 + 019 apply cleanly on blank DB
  • Alerts table + CHECK constraints + indexes verified via \d alerts
  • HMAC signature scheme matches app.utils.signing.sign_payload (receivers use the same verification as regular webhook callers)

🤖 Generated with Claude Code

Ports the alerts feature to OSS. Deliberately excludes SendGrid/email
— self-hosters configure alert_webhook_url and forward to their own
Slack/Discord/ntfy/SMTP relay. Hosted cueapi.ai keeps managed email.

Model + migrations:
- app/models/alert.py: id/user_id/cue_id/execution_id/alert_type/
  severity/message/alert_metadata (column 'metadata')/acknowledged/
  created_at. CHECK on alert_type IN ('outcome_timeout',
  'verification_failed', 'consecutive_failures'). CHECK on severity.
  Indexes: user_id, (user_id, created_at), execution_id.
- alembic 018: alerts table.
- alembic 019: users.alert_webhook_url (String 2048) +
  alert_webhook_secret (String 64), both nullable.
- 018.down_revision = '016' intentionally — PR #18 introduces 017 but
  isn't merged yet. When PR #18 merges first, rebase this PR to chain
  017 -> 018. Documented in the migration docstring.

Services:
- app/services/alert_service.py: create_alert with 5-min dedup on
  (user_id, alert_type, execution_id|cue_id). count_consecutive_failures
  walks execution history backwards, stops at first non-failed.
  Threshold = 3. Webhook delivery is fire-and-forget via
  asyncio.create_task.
- app/services/alert_webhook.py: deliver_alert with HMAC-SHA256 over
  '{timestamp}.{sorted_payload_json}', 10s timeout, SSRF re-resolve at
  delivery, never raises. No-URL short-circuits silently. URL-without-
  secret logs a warning and skips.

Router + auth:
- app/routers/alerts.py: GET /v1/alerts with alert_type/since/limit/
  offset filters, 400 on invalid type, auth-scoped.
- app/routers/auth_routes.py: PATCH /me accepts alert_webhook_url
  (empty string clears; SSRF-validated). GET /alert-webhook-secret
  lazy-generates on first call. POST /alert-webhook-secret/regenerate
  requires X-Confirm-Destructive.

Integration into outcome_service.record_outcome (post-commit):
- verification_failed alert fires when execution.outcome_state ==
  'verification_failed'. Dormant on current main (the rule engine that
  sets this state lives in PR #18); activates automatically once #18
  merges. No rebase of integration code required — only the migration
  chain needs updating.
- consecutive_failures alert fires when the streak reaches 3 on a
  failed outcome. Independent of PR #18 — works on current main.
- outcome_timeout alert firing deferred — requires a deadline-checking
  poller that cueapi-core doesn't have yet. CHECK constraint and
  router already accept the type so the wiring is drop-in when that
  poller lands.
- Alert firing is wrapped in try/except — must never break outcome
  reporting.

Tests (36 new, all passing):
- test_alert_model.py (6): CRUD, CHECK rejection for invalid
  type/severity, parametrized valid types, index existence.
- test_alert_service.py (7): create persists, dedup within window,
  dedup doesn't cross alert types, consecutive_failures counter +
  streak-breaking + threshold constant.
- test_alert_webhook_delivery.py (7): no-URL short-circuit, URL-
  without-secret skip, SSRF block, HMAC signature recomputation,
  timeout/non-2xx/RuntimeError all swallowed.
- test_alerts_api.py (8): empty list, own alerts, type filter, invalid
  type rejected, pagination, cross-user scoping, auth required.
- test_alert_webhook_config.py (6): set valid URL, empty string clears,
  SSRF rejection at config, lazy secret generation, confirmation
  required, rotation.
- test_outcome_triggers_alert.py (3): verification_failed end-to-end
  (seeds outcome_state to exercise the integration path), consecutive
  failures end-to-end, isolated failure does NOT fire.

Full-suite delta: +36 passing, 0 new failures. Pre-existing SDK-
integration failures (cueapi Python package not installed locally)
unchanged.

Docs:
- README 'Alerts' section with alert types, querying, webhook setup.
- examples/alert_webhook_receiver.py: 30-line Flask receiver with
  signature verification.
- CHANGELOG [Unreleased] entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
govindkavaturi-art enabled auto-merge (squash) on April 17, 2026 at 02:45
argus-qa-ai left a comment:

All CI checks passing. Approved by Argus.
