
Cost Anomaly Detection

ZopNight detects unusual cost spikes across five dimensions using a dual-method algorithm (percentage deviation + z-score). Each anomaly includes a severity level, root cause analysis, and affected resource breakdown. Detection runs as a scheduled cron job.

Detection dimensions

Anomalies are detected independently across five dimensions. Each dimension has its own detector and can be subscribed to separately via Notifications.

| Dimension | `anomalyType` | What it catches |
| --- | --- | --- |
| Organization | `org` | Total org spend spike — triggers dimension-level detection for the org |
| Cloud Account | `cloud_account` | Spend across all resources in a single cloud account growing beyond baseline |
| Resource Group | `resource_group` | A resource group's aggregated cost spiking — detects membership changes and group-level schedule failures |
| Resource | `resource` | A specific resource being resized or restarted, a usage spike, or a reservation expiry |
| Team | `team` | Cost attributed to a team out of trend — filters out cross-team redistribution (retagging) |

Subscription-driven detection

Only dimensions with active notification subscriptions (prefixed cost.anomaly.*) are evaluated. Subscriptions can be scoped to specific entities — e.g. a single cloud account or resource group — so you only get alerts for what you care about.

Detection algorithm

ZopNight uses a dual-method approach on a 7-day rolling average baseline:

  1. Primary — percentage deviation: compares today's cost against the 7-day rolling average. Requires at least 4 days of data.
  2. Secondary — z-score: measures how many standard deviations today's cost is above the mean (positive spikes only — cost drops are not flagged). Requires at least 14 days of data and only upgrades severity (never downgrades). Only applied when the standard deviation exceeds 10% of the mean to avoid false escalation on stable costs.

Dimensions averaging less than $1/day are skipped to reduce noise.
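The two methods can be sketched roughly as follows. This is an illustrative sketch only — the function name, constants, and return shape are assumptions; the thresholds are taken from the severity table below:

```python
from statistics import mean, stdev

# Thresholds from the severity table: (level, min % deviation, min z-score).
LEVELS = [("emergency", 500.0, 5.0), ("critical", 100.0, 3.0), ("warning", 30.0, 2.0)]
RANK = {"warning": 0, "critical": 1, "emergency": 2}

def detect(history, today):
    """history: trailing per-day costs in USD, oldest first; today: today's cost."""
    if len(history) < 4 or mean(history) < 1.0:   # min 4 days of data / $1-a-day noise floor
        return None
    baseline = mean(history[-7:])                 # 7-day rolling average
    deviation = (today - baseline) / baseline * 100.0

    # Primary: percentage deviation against the rolling average.
    severity = next((lvl for lvl, pct, _ in LEVELS if deviation >= pct), None)
    if severity is None:
        return None                               # no spike; cost drops are never flagged

    # Secondary: z-score may upgrade severity, never downgrade it.
    if len(history) >= 14:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0.10 * mu:                     # skip near-constant cost series
            z = (today - mu) / sigma
            z_sev = next((lvl for lvl, _, zmin in LEVELS if z >= zmin), None)
            if z_sev is not None and RANK[z_sev] > RANK[severity]:
                severity = z_sev
    return {"severity": severity, "expectedCostUsd": baseline,
            "deviationPercent": round(deviation, 2)}
```

With a flat $10/day baseline, a $25 day deviates by 150% and lands in `critical`; a $12 day (20%) is below the `warning` threshold and is not flagged.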

Severity levels

| Severity | % threshold | Z-score threshold |
| --- | --- | --- |
| `warning` | Exceeds baseline by 30% | ≥ 2.0σ |
| `critical` | Exceeds baseline by 100% | ≥ 3.0σ |
| `emergency` | Exceeds baseline by 500% | ≥ 5.0σ |
| `info` | Not a detection threshold — mapped from suppressed records | n/a |

info is a derived state

info is not a detection severity — no anomaly is ever detected as info. When an anomaly is suppressed (e.g. caused by new resources first seen within 3 days), the GET /reports/anomalies endpoint remaps its severity to info so the frontend renders it as a muted blue dot rather than an alert.
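The remapping amounts to a one-line read-time transform (a minimal sketch; the helper name is an assumption, and field names mirror the API response shown below):

```python
def response_severity(record):
    """Severity as rendered by GET /reports/anomalies: suppressed records are
    remapped to "info"; the stored detection severity is otherwise passed through."""
    return "info" if record.get("suppressed") else record["severity"]
```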

Root cause analysis

Each anomaly is enriched with a root cause derived from a 14-day lookback over cost records and activity events. Root cause checks run against an in-memory index (no per-resource database queries) and include:

| Check | Signal | Example |
| --- | --- | --- |
| Instance resize | Hourly rate changed by >50% | "Hourly rate increased from $0.83 to $1.67" |
| Usage spike | On-time increased by >50% | "Usage hours increased from 12h to 24h/day" |
| Reservation expiry | Purchase type flipped Reserved → OnDemand | "Reservation expired — switched to On-Demand pricing" |
| New resource | First seen within 3 days | "New resource — added 2 days ago" |
| Schedule failure | Resources kept running past stop time | "Schedule 'WorkHours' may have failed" |

For org-level anomalies, additional checks detect new account onboarding (60%+ of the spike from new resources) and resource type spikes (e.g. "50 EC2 instances each +$5/day").

Resource group and team dimensions use a cascading root cause strategy: membership diff first, then group-level drill-down, then per-resource fallback — with early exit when a step explains the spike.
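The cascade reduces to running checks in priority order with early exit. A generic sketch — ZopNight's real step functions are internal, so here each step is any callable returning a cause string or `None`:

```python
def cascade_root_cause(steps, anomaly):
    """Run root-cause checks in priority order, stopping at the first one
    that explains the spike (illustrative sketch only)."""
    for step in steps:
        cause = step(anomaly)
        if cause is not None:   # early exit once a step explains the spike
            return cause
    return None

# Priority order for resource_group / team anomalies:
#   1. membership diff   2. group-level drill-down   3. per-resource fallback
```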

Notification cooldown & dedup

  • 7-day notification cooldown — once an anomaly key (org / type / key) is notified, re-notification is suppressed for 7 days. The rolling-average baseline absorbs the new cost level during that window so a single resize event doesn't page you for a week straight.
  • Duplicate-record dedup — when the same key produces another anomaly within 7 days at similar cost (<20% delta), the new anomaly is dropped before storage. The first occurrence is kept; subsequent rows are the same event repeating. A cost increase of 20% or more is treated as a genuinely new event.
  • New-resource suppression — anomalies triggered primarily by resources first seen within 3 days are marked suppressed and not notified. They still appear in the UI as info severity dots.
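Taken together, the three rules decide two independent outcomes per anomaly: whether to store it and whether to notify. A sketch under stated assumptions — field names mirror the API response, and `newResourceDriven` is a hypothetical flag standing in for the new-resource suppression signal:

```python
from datetime import date, timedelta

COOLDOWN = timedelta(days=7)

def dedup_decision(new, previous):
    """Decide (store, notify) for a new anomaly on the same
    (org, anomalyType, anomalyKey). `previous` is the last stored record
    for that key, or None. Illustrative sketch only."""
    store, notify = True, True
    if previous is not None:
        age = new["date"] - previous["date"]
        delta = abs(new["actualCostUsd"] - previous["actualCostUsd"])
        if age < COOLDOWN and delta < 0.20 * previous["actualCostUsd"]:
            store = False    # same event repeating at similar cost: drop before storage
        if age < COOLDOWN and previous.get("notified"):
            notify = False   # 7-day notification cooldown
    if new.get("newResourceDriven"):
        notify = False       # new-resource suppression; shown in the UI as an info dot
    return store, notify
```

A repeat five days later at +5% cost is dropped entirely; a repeat at +50% is stored as a new event but still silenced by the cooldown.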

Notification event types

Anomaly notifications use 15 event types (3 severities × 5 dimensions), each independently subscribable via Notifications:

cost.anomaly.warning              # org-level
cost.anomaly.critical
cost.anomaly.emergency

cost.anomaly.cloud_account.warning # per cloud account
cost.anomaly.cloud_account.critical
cost.anomaly.cloud_account.emergency

cost.anomaly.resource_group.warning # per resource group
cost.anomaly.resource_group.critical
cost.anomaly.resource_group.emergency

cost.anomaly.resource.warning      # per resource
cost.anomaly.resource.critical
cost.anomaly.resource.emergency

cost.anomaly.team.warning          # per team
cost.anomaly.team.critical
cost.anomaly.team.emergency

Subscriptions support scope filtering — subscribe org-wide or to specific entities (a particular cloud account, resource group, or team).
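The naming scheme above is regular except for the org dimension, which omits its segment. A small illustrative helper (not part of the ZopNight API) that builds event type strings from it:

```python
VALID_SEVERITIES = ("warning", "critical", "emergency")

def anomaly_event_type(severity, dimension="org"):
    """Build a notification event type. Org-level events omit the dimension
    segment: cost.anomaly.warning, not cost.anomaly.org.warning."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if dimension == "org":
        return f"cost.anomaly.{severity}"
    return f"cost.anomaly.{dimension}.{severity}"
```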

Two-level anomaly drawer

Anomalies appear on cost trend charts as pulsing severity-coloured dots (daily view only). Clicking a dot or the dismissible banner opens a two-level drawer:

  • Level 1 — list: anomaly cards with severity badge, date, summary text, expected → actual cost change, and filter tabs (all / emergency / critical / warning / info).
  • Level 2 — detail: stat cards (Actual / Expected / Deviation %), parsed insight bullets, a resource-type breakdown table, and an affected-resources table with expandable rows showing per-resource root cause. Resource UIDs are copiable.

Detail opens stacked over the list — closing detail reveals the list, so you can walk a triage workflow without losing context. For a single anomaly, the detail panel opens directly.

API

GET /reports/anomalies

List anomaly records for a date window. Suppressed records' severity is mapped to info in the response.

Query parameters: `from`, `to` (YYYY-MM-DD, required); `page`, `limit` (optional, default `page=1`, `limit=20`).
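Assembling the request URL from those parameters (a minimal sketch; the base URL is an assumption — substitute your deployment's API origin):

```python
from urllib.parse import urlencode

def anomalies_url(base, start, end, page=1, limit=20):
    """Build the GET /reports/anomalies request URL for a date window."""
    query = urlencode({"from": start, "to": end, "page": page, "limit": limit})
    return f"{base}/reports/anomalies?{query}"
```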

Response (JSON):
{
  "data": {
    "items": [
      {
        "id": "a1b2c3d4-...",
        "orgId": "org_001",
        "date": "2026-04-25",
        "anomalyType": "resource",
        "anomalyKey": "i-0abc123def456",
        "severity": "warning",
        "method": "percent",
        "actualCostUsd": 28.90,
        "expectedCostUsd": 12.40,
        "deviationPercent": 133.06,
        "rootCause": "Insights:\n- Hourly rate increased from $0.42 to $0.97\n...",
        "suppressed": false,
        "notified": true,
        "notifiedAt": "2026-04-25T04:35:00Z",
        "createdAt": "2026-04-25T04:30:12Z"
      }
    ],
    "total": 12,
    "hasMore": true,
    "page": 1,
    "limit": 20
  }
}

Anomalies also appear inline on trend data points via GET /reports/trends — each trend item may include an anomaly object with severity, actualCostUsd, expectedCostUsd, deviationPercent, topResources (array of contributing resources), and rootCause (enriched insight text).

Detection schedule

Anomaly detection runs as a cron job at 04:30 UTC daily. It analyses the most recent day with complete cost rollups (typically yesterday).

See Reports for the surrounding cost views and Notifications for how to subscribe to anomaly alerts.