Run Failure/Timeout Alerting #78

Open
opened 2026-02-23 10:07:08 +00:00 by ottomata · 0 comments

Tasks

Migration 000007_create_run_alerts.up.sql:

CREATE TABLE run_alerts (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    job_id      UUID NOT NULL REFERENCES jobs(id) ON DELETE CASCADE,
    alert_type  TEXT NOT NULL CHECK (alert_type IN ('webhook', 'email')),
    target      TEXT NOT NULL,
    on_failure  BOOLEAN NOT NULL DEFAULT TRUE,
    on_timeout  BOOLEAN NOT NULL DEFAULT TRUE,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Alert Service internal/services/alert_service.go:

  • DispatchAlert(ctx, run *domain.Run) — called by run service when a run ends in failed or timed_out
  • Query all alerts for run.JobID whose flag matches the run's final status (on_failure for failed, on_timeout for timed_out)
  • For webhook: POST to target URL with JSON payload (run ID, job name, status, exit code)
  • For email: send via SMTP (configured in Config) with a simple text template
  • Dispatch is async (goroutine) — never blocks the run lifecycle
  • Log success/failure of each dispatch at INFO level
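
A minimal sketch of the webhook half of the dispatch flow above, to make the matching rule and the non-blocking behaviour concrete. The `Run` and `Alert` shapes, the JSON field names, the helper names (`matches`, `buildPayload`, `postWebhook`, `dispatchAsync`) and the 5-second timeout are all assumptions for illustration, not the project's actual `domain` types or the final `DispatchAlert` signature. Email delivery is left out.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
	"net/http"
	"time"
)

// Run and Alert are simplified stand-ins for the real domain types (assumed shapes).
type Run struct {
	ID       string
	JobID    string
	JobName  string
	Status   string // "failed" or "timed_out" when alerts are relevant
	ExitCode int
}

type Alert struct {
	ID        string
	AlertType string // "webhook" or "email"
	Target    string
	OnFailure bool
	OnTimeout bool
}

// matches reports whether this alert should fire for the given final run status.
func (a Alert) matches(status string) bool {
	switch status {
	case "failed":
		return a.OnFailure
	case "timed_out":
		return a.OnTimeout
	}
	return false
}

// webhookPayload is the JSON body sent to webhook targets (field names are assumed).
type webhookPayload struct {
	RunID    string `json:"run_id"`
	JobName  string `json:"job_name"`
	Status   string `json:"status"`
	ExitCode int    `json:"exit_code"`
}

func buildPayload(run Run) ([]byte, error) {
	return json.Marshal(webhookPayload{
		RunID: run.ID, JobName: run.JobName, Status: run.Status, ExitCode: run.ExitCode,
	})
}

// postWebhook sends the payload and treats any non-2xx response as an error.
func postWebhook(ctx context.Context, client *http.Client, url string, body []byte) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

// dispatchAsync fans out matching webhook alerts in a goroutine and returns immediately.
// Errors are only logged, so a failed dispatch cannot change the run's status.
func dispatchAsync(run Run, alerts []Alert) {
	go func() {
		// Detached context: the run's own context may already be cancelled.
		// The 5s budget is shared by all alerts in this sketch.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		client := &http.Client{}
		for _, a := range alerts {
			if a.AlertType != "webhook" || !a.matches(run.Status) {
				continue // email delivery is omitted from this sketch
			}
			body, err := buildPayload(run)
			if err == nil {
				err = postWebhook(ctx, client, a.Target, body)
			}
			if err != nil {
				slog.Info("alert dispatch failed", "alert_id", a.ID, "error", err)
				continue
			}
			slog.Info("alert dispatched", "alert_id", a.ID)
		}
	}()
}
```

The goroutine uses its own context rather than the caller's, on the assumption that the run's context may already be cancelled by the time a timeout alert needs to go out.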

Alert API (Admin/Operator):

  • GET /jobs/:id/alerts
  • POST /jobs/:id/alerts — Body: { alert_type, target, on_failure, on_timeout }
  • DELETE /alerts/:id
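
A possible shape for validating the `POST /jobs/:id/alerts` body, sketched under assumptions: the type and function names (`createAlertRequest`, `alertConfig`, `parseCreateAlert`) are hypothetical, and the validation rules for `target` (an http(s) URL for webhooks, a parseable address for email) are not specified in this issue. Omitted `on_failure` / `on_timeout` fall back to `true`, mirroring the column defaults in the migration.

```go
package main

import (
	"encoding/json"
	"errors"
	"net/mail"
	"net/url"
	"strings"
)

// createAlertRequest mirrors the request body. Pointer booleans let the
// handler tell an omitted field apart from an explicit false.
type createAlertRequest struct {
	AlertType string `json:"alert_type"`
	Target    string `json:"target"`
	OnFailure *bool  `json:"on_failure"`
	OnTimeout *bool  `json:"on_timeout"`
}

// alertConfig is the validated result with defaults applied.
type alertConfig struct {
	AlertType string
	Target    string
	OnFailure bool
	OnTimeout bool
}

// parseCreateAlert decodes and validates a request body.
func parseCreateAlert(body string) (alertConfig, error) {
	var req createAlertRequest
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields() // reject misspelled field names early
	if err := dec.Decode(&req); err != nil {
		return alertConfig{}, err
	}

	// Same defaults as the run_alerts table: both flags TRUE unless stated.
	cfg := alertConfig{AlertType: req.AlertType, Target: req.Target, OnFailure: true, OnTimeout: true}
	if req.OnFailure != nil {
		cfg.OnFailure = *req.OnFailure
	}
	if req.OnTimeout != nil {
		cfg.OnTimeout = *req.OnTimeout
	}

	switch req.AlertType {
	case "webhook":
		u, err := url.Parse(req.Target)
		if err != nil || (u.Scheme != "http" && u.Scheme != "https") || u.Host == "" {
			return alertConfig{}, errors.New("target must be an http(s) URL")
		}
	case "email":
		if _, err := mail.ParseAddress(req.Target); err != nil {
			return alertConfig{}, errors.New("target must be an email address")
		}
	default:
		return alertConfig{}, errors.New("alert_type must be 'webhook' or 'email'")
	}
	return cfg, nil
}
```

Rejecting an unknown `alert_type` in the handler gives the caller a clear 4xx-style error instead of surfacing the table's `CHECK` constraint violation.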

Alert UI on job detail page:

  • List alerts, add/delete form

Acceptance Criteria

  • When a run fails, the configured webhook receives a POST within 5 seconds of the run ending
  • On job timeout, configured email is sent
  • Alert dispatch failure does NOT affect run status
  • Alert configuration is manageable via API and UI
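
One way the first criterion could be checked in a test, sketched with `net/http/httptest`. The helper names (`receivedWithin`, `fakeFailedRun`, `webhookDeadline`) are hypothetical; `fakeFailedRun` only simulates the dispatch and would be replaced by whatever actually drives a run to `failed` in the real test suite.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// webhookDeadline is the delivery budget taken from the acceptance criteria.
const webhookDeadline = 5 * time.Second

// receivedWithin starts a throwaway HTTP server, hands its URL to trigger,
// and reports whether a POST arrived before the deadline elapsed.
func receivedWithin(deadline time.Duration, trigger func(targetURL string)) bool {
	got := make(chan struct{}, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodPost {
			select {
			case got <- struct{}{}:
			default: // already signalled; ignore extra deliveries
			}
		}
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	trigger(srv.URL)

	select {
	case <-got:
		return true
	case <-time.After(deadline):
		return false
	}
}

// fakeFailedRun stands in for "a run ends in failed": it posts to the target
// from a goroutine, the way an async dispatcher would.
func fakeFailedRun(targetURL string) {
	go func() {
		resp, err := http.Post(targetURL, "application/json", strings.NewReader(`{"status":"failed"}`))
		if err == nil {
			resp.Body.Close()
		}
	}()
}
```

In a real test, `receivedWithin(webhookDeadline, ...)` would be given a trigger that registers the server URL as a webhook alert and then fails a run.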
ottomata added this to the Phase 8 project 2026-02-23 10:09:19 +00:00
Reference
ottomata/acsm#78