
Cost Anomaly Detection

ZopNight detects unusual cost spikes across five dimensions using a dual-method algorithm (percentage deviation + z-score). Each anomaly includes a severity level, root cause analysis, and affected resource breakdown. Detection runs as a scheduled cron job.

Detection dimensions

Anomalies are detected independently across five dimensions. Each dimension has its own detector and can be subscribed to separately via Notifications.

| Dimension | `anomalyType` | What it catches |
| --- | --- | --- |
| Organization | `org` | Total org spend spike — triggers dimension-level detection for the org |
| Cloud Account | `cloud_account` | Spend across all resources in a single cloud account growing beyond baseline |
| Resource Group | `resource_group` | A resource group's aggregated cost spiking — detects membership changes and group-level schedule failures |
| Resource | `resource` | A specific resource being resized or restarted, a usage spike, or a reservation expiry |
| Team | `team` | Cost attributed to a team out of trend — filters out cross-team redistribution (retagging) |

Subscription-driven detection

Only dimensions with active notification subscriptions (prefixed cost.anomaly.*) are evaluated. Subscriptions can be scoped to specific entities — e.g. a single cloud account or resource group — so you only get alerts for what you care about.

Detection algorithm

ZopNight uses a dual-method approach on a 7-day rolling average baseline:

  1. Primary — percentage deviation: compares today's cost against the 7-day rolling average. Requires at least 4 days of data.
  2. Secondary — z-score: measures how many standard deviations today's cost is above the mean (positive spikes only — cost drops are not flagged). Requires at least 14 days of data and only upgrades severity (never downgrades). Only applied when the standard deviation exceeds 10% of the mean to avoid false escalation on stable costs.

Dimensions averaging less than $1/day are skipped to reduce noise.
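The two methods can be sketched roughly as follows. This is an illustrative sketch only — the function name, constants, and return shape are assumptions; the thresholds are taken from the severity table below:

```python
from statistics import mean, stdev

# Thresholds from the severity table: (level, min % deviation, min z-score).
LEVELS = [("emergency", 500.0, 5.0), ("critical", 100.0, 3.0), ("warning", 30.0, 2.0)]
RANK = {"warning": 0, "critical": 1, "emergency": 2}

def detect(history, today):
    """history: trailing per-day costs in USD, oldest first; today: today's cost."""
    if len(history) < 4 or mean(history) < 1.0:   # min 4 days of data / $1-a-day noise floor
        return None
    baseline = mean(history[-7:])                 # 7-day rolling average
    deviation = (today - baseline) / baseline * 100.0

    # Primary: percentage deviation against the rolling average.
    severity = next((lvl for lvl, pct, _ in LEVELS if deviation >= pct), None)
    if severity is None:
        return None                               # no spike; cost drops are never flagged

    # Secondary: z-score may upgrade severity, never downgrade it.
    if len(history) >= 14:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0.10 * mu:                     # skip near-constant cost series
            z = (today - mu) / sigma
            z_sev = next((lvl for lvl, _, zmin in LEVELS if z >= zmin), None)
            if z_sev is not None and RANK[z_sev] > RANK[severity]:
                severity = z_sev
    return {"severity": severity, "expectedCostUsd": baseline,
            "deviationPercent": round(deviation, 2)}
```

With a flat $10/day baseline, a $25 day deviates by 150% and lands in `critical`; a $12 day (20%) is below the `warning` threshold and is not flagged.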

Severity levels

| Severity | % threshold | Z-score threshold |
| --- | --- | --- |
| `warning` | Exceeds baseline by 30% | ≥ 2.0σ |
| `critical` | Exceeds baseline by 100% | ≥ 3.0σ |
| `emergency` | Exceeds baseline by 500% | ≥ 5.0σ |
| `info` | Not a detection threshold — mapped from suppressed records | n/a |

info is a derived state

info is not a detection severity — no anomaly is ever detected as info. When an anomaly is suppressed (e.g. caused by new resources first seen within 3 days), the GET /reports/anomalies endpoint remaps its severity to info so the frontend renders it as a muted blue dot rather than an alert.
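The remapping amounts to a one-line read-time transform (a minimal sketch; the helper name is an assumption, and field names mirror the API response shown below):

```python
def response_severity(record):
    """Severity as rendered by GET /reports/anomalies: suppressed records are
    remapped to "info"; the stored detection severity is otherwise passed through."""
    return "info" if record.get("suppressed") else record["severity"]
```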

Root cause analysis

Each anomaly is enriched with a root cause derived from a 14-day lookback over cost records and activity events. Root cause checks run against an in-memory index (no per-resource database queries) and include:

| Check | Signal | Example |
| --- | --- | --- |
| Instance resize | Hourly rate changed by >50% | "Hourly rate increased from $0.83 to $1.67" |
| Usage spike | On-time increased by >50% | "Usage hours increased from 12h to 24h/day" |
| Reservation expiry | Purchase type flipped Reserved → OnDemand | "Reservation expired — switched to On-Demand pricing" |
| New resource | First seen within 3 days | "New resource — added 2 days ago" |
| Schedule failure | Resources kept running past stop time | "Schedule 'WorkHours' may have failed" |

For org-level anomalies, additional checks detect new account onboarding (60%+ of the spike from new resources) and resource type spikes (e.g. "50 EC2 instances each +$5/day").

Resource group and team dimensions use a cascading root cause strategy: membership diff first, then group-level drill-down, then per-resource fallback — with early exit when a step explains the spike.
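The cascade reduces to running checks in priority order with early exit. A generic sketch — ZopNight's real step functions are internal, so here each step is any callable returning a cause string or `None`:

```python
def cascade_root_cause(steps, anomaly):
    """Run root-cause checks in priority order, stopping at the first one
    that explains the spike (illustrative sketch only)."""
    for step in steps:
        cause = step(anomaly)
        if cause is not None:   # early exit once a step explains the spike
            return cause
    return None

# Priority order for resource_group / team anomalies:
#   1. membership diff   2. group-level drill-down   3. per-resource fallback
```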

Notification cooldown & dedup

  • 7-day notification cooldown — once an anomaly key (org / type / key) is notified, re-notification is suppressed for 7 days. The rolling-average baseline absorbs the new cost level during that window so a single resize event doesn't page you for a week straight.
  • Duplicate-record dedup — when the same key produces another anomaly within 7 days at similar cost (<20% delta), the new anomaly is dropped before storage. The first occurrence is kept; subsequent rows are the same event repeating. A cost increase of 20% or more is treated as a genuinely new event.
  • New-resource suppression — anomalies triggered primarily by resources first seen within 3 days are marked suppressed and not notified. They still appear in the UI as info severity dots.
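Taken together, the three rules decide two independent outcomes per anomaly: whether to store it and whether to notify. A sketch under stated assumptions — field names mirror the API response, and `newResourceDriven` is a hypothetical flag standing in for the new-resource suppression signal:

```python
from datetime import date, timedelta

COOLDOWN = timedelta(days=7)

def dedup_decision(new, previous):
    """Decide (store, notify) for a new anomaly on the same
    (org, anomalyType, anomalyKey). `previous` is the last stored record
    for that key, or None. Illustrative sketch only."""
    store, notify = True, True
    if previous is not None:
        age = new["date"] - previous["date"]
        delta = abs(new["actualCostUsd"] - previous["actualCostUsd"])
        if age < COOLDOWN and delta < 0.20 * previous["actualCostUsd"]:
            store = False    # same event repeating at similar cost: drop before storage
        if age < COOLDOWN and previous.get("notified"):
            notify = False   # 7-day notification cooldown
    if new.get("newResourceDriven"):
        notify = False       # new-resource suppression; shown in the UI as an info dot
    return store, notify
```

A repeat five days later at +5% cost is dropped entirely; a repeat at +50% is stored as a new event but still silenced by the cooldown.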

Notification event types

Anomaly notifications use 15 event types (3 severities × 5 dimensions), each independently subscribable via Notifications:

cost.anomaly.warning              # org-level
cost.anomaly.critical
cost.anomaly.emergency

cost.anomaly.cloud_account.warning # per cloud account
cost.anomaly.cloud_account.critical
cost.anomaly.cloud_account.emergency

cost.anomaly.resource_group.warning # per resource group
cost.anomaly.resource_group.critical
cost.anomaly.resource_group.emergency

cost.anomaly.resource.warning      # per resource
cost.anomaly.resource.critical
cost.anomaly.resource.emergency

cost.anomaly.team.warning          # per team
cost.anomaly.team.critical
cost.anomaly.team.emergency

Subscriptions support scope filtering — subscribe org-wide or to specific entities (a particular cloud account, resource group, or team).
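The naming scheme above is regular except for the org dimension, which omits its segment. A small illustrative helper (not part of the ZopNight API) that builds event type strings from it:

```python
VALID_SEVERITIES = ("warning", "critical", "emergency")

def anomaly_event_type(severity, dimension="org"):
    """Build a notification event type. Org-level events omit the dimension
    segment: cost.anomaly.warning, not cost.anomaly.org.warning."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if dimension == "org":
        return f"cost.anomaly.{severity}"
    return f"cost.anomaly.{dimension}.{severity}"
```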

Two-level anomaly drawer

Anomalies appear on cost trend charts as pulsing severity-coloured dots (daily view only). Clicking a dot or the dismissible banner opens a two-level drawer:

  • Level 1 — list: anomaly cards with severity badge, date, summary text, expected → actual cost change, and filter tabs (all / emergency / critical / warning / info).
  • Level 2 — detail: stat cards (Actual / Expected / Deviation %), parsed insight bullets, a resource-type breakdown table, and an affected-resources table with expandable rows showing per-resource root cause. Resource UIDs are copiable.

Detail opens stacked over the list — closing detail reveals the list, so you can walk a triage workflow without losing context. For a single anomaly, the detail panel opens directly.

API

GET /reports/anomalies

List anomaly records for a date window. Suppressed records' severity is mapped to info in the response.

Query parameters: `from`, `to` (YYYY-MM-DD, required); `page`, `limit` (optional, default `page=1`, `limit=20`).
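Assembling the request URL from those parameters (a minimal sketch; the base URL is an assumption — substitute your deployment's API origin):

```python
from urllib.parse import urlencode

def anomalies_url(base, start, end, page=1, limit=20):
    """Build the GET /reports/anomalies request URL for a date window."""
    query = urlencode({"from": start, "to": end, "page": page, "limit": limit})
    return f"{base}/reports/anomalies?{query}"
```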

Response (JSON):
{
  "data": {
    "items": [
      {
        "id": "a1b2c3d4-...",
        "orgId": "org_001",
        "date": "2026-04-25",
        "anomalyType": "resource",
        "anomalyKey": "i-0abc123def456",
        "severity": "warning",
        "method": "percent",
        "actualCostUsd": 28.90,
        "expectedCostUsd": 12.40,
        "deviationPercent": 133.06,
        "rootCause": "Insights:\n- Hourly rate increased from $0.42 to $0.97\n...",
        "suppressed": false,
        "notified": true,
        "notifiedAt": "2026-04-25T04:35:00Z",
        "createdAt": "2026-04-25T04:30:12Z"
      }
    ],
    "total": 12,
    "hasMore": true,
    "page": 1,
    "limit": 20
  }
}

Anomalies also appear inline on trend data points via GET /reports/trends — each trend item may include an anomaly object with severity, actualCostUsd, expectedCostUsd, deviationPercent, topResources (array of contributing resources), and rootCause (enriched insight text).

Detection schedule

Anomaly detection runs as a cron job at 04:30 UTC daily. It analyses the most recent day with complete cost rollups (typically yesterday).

See Reports for the surrounding cost views and Notifications for how to subscribe to anomaly alerts.