Cost Anomaly Detection
ZopNight detects unusual cost spikes across five dimensions using a dual-method algorithm (percentage deviation + z-score). Each anomaly includes a severity level, root cause analysis, and affected resource breakdown. Detection runs as a scheduled cron job.
Detection dimensions
Anomalies are detected independently across five dimensions. Each dimension has its own detector and can be subscribed to separately via Notifications.
| Dimension | anomalyType | What it catches |
|---|---|---|
| Organization | org | Total org spend spike — triggers dimension-level detection for the org |
| Cloud Account | cloud_account | Spend across all resources in a single cloud account growing beyond baseline |
| Resource Group | resource_group | A resource group's aggregated cost spiking — detects membership changes, group-level schedule failures |
| Resource | resource | A specific resource resized, restarted, usage spike, or reservation expiry |
| Team | team | Cost attributed to a team out of trend — filters out cross-team redistribution (retagging) |
Subscription-driven detection
Detection is subscription-driven: only dimensions with at least one active subscription to an anomaly event type (cost.anomaly.*) are evaluated. Subscriptions can be scoped to specific entities — e.g. a single cloud account or resource group — so you only get alerts for what you care about.
Detection algorithm
ZopNight uses a dual-method approach on a 7-day rolling average baseline:
- Primary — percentage deviation: compares today's cost against the 7-day rolling average. Requires at least 4 days of data.
- Secondary — z-score: measures how many standard deviations today's cost is above the mean (positive spikes only — cost drops are not flagged). Requires at least 14 days of data and only upgrades severity (never downgrades). Only applied when the standard deviation exceeds 10% of the mean to avoid false escalation on stable costs.
Dimensions averaging less than $1/day are skipped to reduce noise.
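The two methods above can be sketched as follows. The thresholds (4/14-day minimums, 10%-of-mean gate, $1/day floor) are the documented ones; the function shape and field names are illustrative, not ZopNight's actual implementation:

```python
from statistics import mean, stdev

def detect(costs):
    """Dual-method spike check over a date-ordered daily cost series.

    costs[-1] is today's cost; costs[:-1] is history.
    Returns a detection dict or None.
    """
    history = costs[:-1][-7:]          # 7-day rolling window
    if len(history) < 4:               # primary method needs >= 4 days of data
        return None
    baseline = mean(history)
    if baseline < 1.0:                 # skip dimensions averaging under $1/day
        return None

    today = costs[-1]
    deviation_pct = (today - baseline) / baseline * 100
    if deviation_pct < 30:             # below the lowest (warning) threshold
        return None

    result = {"expected": baseline, "actual": today,
              "deviation_pct": deviation_pct, "method": "percent"}

    # Secondary method: z-score, positive spikes only. Needs >= 14 days
    # of history, and the std dev must exceed 10% of the mean to avoid
    # escalating noise on very stable costs.
    full = costs[:-1]
    if len(full) >= 14:
        mu, sigma = mean(full), stdev(full)
        if sigma > 0.10 * mu:
            result["z_score"] = (today - mu) / sigma
            result["method"] = "percent+zscore"
    return result
```

A flat $10/day series jumping to $20 triggers the percent method only (zero variance fails the 10% gate), while a noisy series that spikes gets the z-score attached as well.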
Severity levels
| Severity | % threshold | Z-score threshold |
|---|---|---|
| warning | Exceeds baseline by 30% | ≥2.0σ |
| critical | Exceeds baseline by 100% | ≥3.0σ |
| emergency | Exceeds baseline by 500% | ≥5.0σ |
info | Not a detection threshold — mapped from suppressed records | — |
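The table's two threshold columns combine per the upgrade-only rule: the percent method decides whether an anomaly exists, and the z-score can only raise the severity. A minimal sketch (function names are illustrative):

```python
def percent_severity(deviation_pct):
    """Severity from percentage deviation (primary method)."""
    if deviation_pct >= 500:
        return "emergency"
    if deviation_pct >= 100:
        return "critical"
    if deviation_pct >= 30:
        return "warning"
    return None

def zscore_severity(z):
    """Severity from z-score (secondary method)."""
    if z >= 5.0:
        return "emergency"
    if z >= 3.0:
        return "critical"
    if z >= 2.0:
        return "warning"
    return None

RANK = {None: 0, "warning": 1, "critical": 2, "emergency": 3}

def combine(deviation_pct, z=None):
    """Z-score may only upgrade the percent-based severity, never
    create an anomaly on its own and never downgrade one."""
    base = percent_severity(deviation_pct)
    if base is None or z is None:
        return base
    upgraded = zscore_severity(z)
    return upgraded if RANK[upgraded] > RANK[base] else base
```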
info is a derived state
info is not a detection severity — no anomaly is ever detected as info. When an anomaly is suppressed (e.g. caused by new resources first seen within 3 days), the GET /reports/anomalies endpoint remaps its severity to info so the frontend renders it as a muted blue dot rather than an alert.
Root cause analysis
Each anomaly is enriched with a root cause derived from a 14-day lookback over cost records and activity events. Root cause checks run against an in-memory index (no per-resource database queries) and include:
| Check | Signal | Example |
|---|---|---|
| Instance resize | Hourly rate changed by >50% | "Hourly rate increased from $0.83 to $1.67" |
| Usage spike | On-time increased by >50% | "Usage hours increased from 12h to 24h/day" |
| Reservation expiry | Purchase type flipped Reserved → OnDemand | "Reservation expired — switched to On-Demand pricing" |
| New resource | First seen within 3 days | "New resource — added 2 days ago" |
| Schedule failure | Resources kept running past stop time | "Schedule 'WorkHours' may have failed" |
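The per-resource checks in the table above reduce to simple comparisons over the 14-day lookback. A sketch, assuming a date-ordered list of per-day records per resource (the record shape is illustrative, not ZopNight's actual schema):

```python
def root_causes(daily_records):
    """Run resize / usage-spike / expiry / new-resource checks for one
    resource over a 14-day lookback."""
    causes = []
    window = daily_records[-14:]
    first, last = window[0], window[-1]

    # Instance resize: hourly rate changed by more than 50%
    if first["hourly_rate"] > 0 and \
            last["hourly_rate"] / first["hourly_rate"] > 1.5:
        causes.append("Hourly rate increased from $%.2f to $%.2f"
                      % (first["hourly_rate"], last["hourly_rate"]))

    # Usage spike: on-time increased by more than 50%
    if first["on_hours"] > 0 and \
            last["on_hours"] / first["on_hours"] > 1.5:
        causes.append("Usage hours increased from %dh to %dh/day"
                      % (first["on_hours"], last["on_hours"]))

    # Reservation expiry: purchase type flipped Reserved -> OnDemand
    if first["purchase_type"] == "Reserved" and \
            last["purchase_type"] == "OnDemand":
        causes.append("Reservation expired — switched to On-Demand pricing")

    # New resource: first seen within the last 3 days
    if len(daily_records) <= 3:
        causes.append("New resource — added %d days ago"
                      % len(daily_records))
    return causes
```

Because all records live in an in-memory index, these checks run in one pass with no per-resource database queries.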
For org-level anomalies, additional checks detect new account onboarding (60%+ of the spike from new resources) and resource type spikes (e.g. "50 EC2 instances each +$5/day").
Resource group and team dimensions use a cascading root cause strategy: membership diff first, then group-level drill-down, then per-resource fallback — with early exit when a step explains the spike.
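The cascade with early exit can be sketched generically; the three step names in the comment are placeholders for the membership-diff, drill-down, and per-resource stages described above:

```python
def cascading_root_cause(steps):
    """Run root-cause steps in order and stop at the first one that
    explains the spike. Each step is a callable returning an
    explanation string or None; the step list is illustrative."""
    for step in steps:
        explanation = step()
        if explanation:          # early exit once a step explains the spike
            return explanation
    return "No clear root cause identified"

# For a resource group or team anomaly the cascade might look like:
#   cascading_root_cause([membership_diff, group_drill_down, per_resource_scan])
```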
Notification cooldown & dedup
- 7-day notification cooldown — once an anomaly key (org / type / key) is notified, re-notification is suppressed for 7 days. The rolling-average baseline absorbs the new cost level during that window so a single resize event doesn't page you for a week straight.
- Duplicate-record dedup — when the same key produces another anomaly within 7 days at similar cost (<20% delta), the new anomaly is dropped before storage. The first occurrence is kept; subsequent rows are the same event repeating. A cost increase of 20% or more is treated as a genuinely new event.
- New-resource suppression — anomalies triggered primarily by resources first seen within 3 days are marked suppressed and not notified. They still appear in the UI as info severity dots.
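The cooldown and dedup rules boil down to two checks per anomaly key. A minimal sketch, assuming a lookup of the last stored record and last notification time for the (org / type / key) tuple (function and field names are illustrative):

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(days=7)

def should_store(prev, new_cost, now):
    """Duplicate-record dedup: drop a repeat anomaly for the same key
    within 7 days unless cost moved by 20% or more. prev is the last
    stored record for the key, or None."""
    if prev is None or now - prev["created_at"] > COOLDOWN:
        return True
    delta = abs(new_cost - prev["cost"]) / prev["cost"]
    return delta >= 0.20            # >=20% change counts as a new event

def should_notify(last_notified_at, now):
    """7-day notification cooldown per anomaly key."""
    return last_notified_at is None or now - last_notified_at > COOLDOWN
```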
Notification event types
Anomaly notifications use 15 event types (3 severities × 5 dimensions), each independently subscribable via Notifications:
```
cost.anomaly.warning                    # org-level
cost.anomaly.critical
cost.anomaly.emergency
cost.anomaly.cloud_account.warning      # per cloud account
cost.anomaly.cloud_account.critical
cost.anomaly.cloud_account.emergency
cost.anomaly.resource_group.warning     # per resource group
cost.anomaly.resource_group.critical
cost.anomaly.resource_group.emergency
cost.anomaly.resource.warning           # per resource
cost.anomaly.resource.critical
cost.anomaly.resource.emergency
cost.anomaly.team.warning               # per team
cost.anomaly.team.critical
cost.anomaly.team.emergency
```

Subscriptions support scope filtering — subscribe org-wide or to specific entities (a particular cloud account, resource group, or team).
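The naming scheme is regular apart from the org dimension, which omits its segment. A sketch of how the 15 names compose:

```python
SEVERITIES = ("warning", "critical", "emergency")
DIMENSIONS = (None, "cloud_account", "resource_group", "resource", "team")

def event_type(severity, dimension=None):
    """Build an anomaly event type name. Org-level types omit the
    dimension segment: cost.anomaly.warning, not cost.anomaly.org.warning."""
    if dimension is None:
        return "cost.anomaly.%s" % severity
    return "cost.anomaly.%s.%s" % (dimension, severity)

ALL_EVENT_TYPES = [event_type(s, d) for d in DIMENSIONS for s in SEVERITIES]
```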
Two-level anomaly drawer
Anomalies appear on cost trend charts as pulsing severity-coloured dots (daily view only). Clicking a dot or the dismissible banner opens a two-level drawer:
- Level 1 — list: anomaly cards with severity badge, date, summary text, expected → actual cost change, and filter tabs (all / emergency / critical / warning / info).
- Level 2 — detail: stat cards (Actual / Expected / Deviation %), parsed insight bullets, a resource-type breakdown table, and an affected-resources table with expandable rows showing per-resource root cause. Resource UIDs are copiable.
Detail opens stacked over the list — closing detail reveals the list, so you can walk a triage workflow without losing context. For a single anomaly, the detail panel opens directly.
API
GET /reports/anomalies
List anomaly records for a date window. Suppressed records' severity is mapped to info in the response.
Query parameters: from, to (YYYY-MM-DD, required), page, limit (optional, default page=1, limit=20).
```json
{
  "data": {
    "items": [
      {
        "id": "a1b2c3d4-...",
        "orgId": "org_001",
        "date": "2026-04-25",
        "anomalyType": "resource",
        "anomalyKey": "i-0abc123def456",
        "severity": "warning",
        "method": "percent",
        "actualCostUsd": 28.90,
        "expectedCostUsd": 12.40,
        "deviationPercent": 133.06,
        "rootCause": "Insights:\n- Hourly rate increased from $0.42 to $0.97\n...",
        "suppressed": false,
        "notified": true,
        "notifiedAt": "2026-04-25T04:35:00Z",
        "createdAt": "2026-04-25T04:30:12Z"
      }
    ],
    "total": 12,
    "hasMore": true,
    "page": 1,
    "limit": 20
  }
}
```

Anomalies also appear inline on trend data points via GET /reports/trends — each trend item may include an anomaly object with severity, actualCostUsd, expectedCostUsd, deviationPercent, topResources (array of contributing resources), and rootCause (enriched insight text).
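A client walks the paginated response with the hasMore flag. A sketch, where fetch_page stands in for an HTTP GET against /reports/anomalies (auth and base URL omitted; the helper is hypothetical):

```python
def fetch_all_anomalies(fetch_page, date_from, date_to):
    """Collect every anomaly in the window by following pagination.
    fetch_page takes the query params dict and returns the parsed
    JSON body shown above."""
    page, items = 1, []
    while True:
        body = fetch_page({"from": date_from, "to": date_to,
                           "page": page, "limit": 20})
        data = body["data"]
        items.extend(data["items"])
        if not data["hasMore"]:
            return items
        page += 1
```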
Detection schedule
Anomaly detection runs as a cron job at 04:30 UTC daily. It analyses the most recent day with complete cost rollups (typically yesterday).
See Reports for the surrounding cost views and Notifications for how to subscribe to anomaly alerts.