Provisioning

Stand up cloud infrastructure end-to-end from a single wizard. ZopNight's provisioner service creates VPCs, Kubernetes clusters, VMs, datastores, registries, and load balancers across AWS, GCP, and Azure — using the same credentials you connected for discovery and scheduling. It also runs auto-remediation workflows triggered from a recommendation's Remediate button.

Where these routes live

Every endpoint on this page is served by the provisioner service (port 5087 internally, behind the gateway on the public API). Spaces themselves (cluster registration) live on the config service — see Deployment Spaces.

Surfaces

Provisioning lives in the ZopDay app at /zopday. It shares authentication, RBAC, teams, and audit logs with the main ZopNight app — same login, same org, same role model.

Two job kinds, one state machine

Internally the provisioner runs two job kinds on the same provisioning_jobs table:

  • provisioning — creates and tears down clusters, VMs, datastores, networks, registries, load balancers, and Helm components.
  • remediation — orchestrates auto-remediation workflows kicked off from a recommendation. See Auto-Remediation Jobs below.

No OpenTofu, no remote agents — the service talks to cloud SDKs (AWS, GCP, Azure) directly, and to the target cluster via the Helm and Kubernetes Go SDKs. After a fresh cluster is up, the provisioner calls config's RegisterSpace gRPC so the cluster appears as a Deployment Space automatically, then calls the discoverer to kick off resource sync.

What you can provision

| Category | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Kubernetes | EKS | GKE | AKS |
| VMs / compute | EC2 + ASG + ALB / NLB | Compute Engine + MIG + GCLB | VM Scale Sets + Azure LB |
| Datastores | RDS (MySQL / Postgres), ElastiCache (Redis) | Cloud SQL (MySQL / Postgres), Memorystore (Redis) | Database for MySQL / Postgres, Azure Cache for Redis |
| Registry | ECR | Artifact Registry | ACR |
| Networking | VPC, Subnets, VPC Peering | VPC, Subnets, DNS Zones | VNet, Subnets |
| In-cluster components | Helm-based installer for ingress, observability, cert-manager, and platform add-ons (all three clouds) | | |

How it works

  1. Pick a space — choose a connected cloud account and a region. ZopNight discovers existing VPCs, subnets, and DNS zones in the background so the wizard can offer real options instead of free-text fields.
  2. Pick a resource type — cluster, VM, datastore, or component.
  3. Configure — instance size, version, networking, backup retention, etc. Sensible defaults are pre-filled.
  4. Review — ZopNight shows the IAM operations it will perform and the estimated monthly cost before any action runs.
  5. Provision — the provisioner applies the changes through cloud SDKs; progress streams to the UI; an audit log entry is recorded under the provisioning product scope. On success, freshly created clusters auto-register as spaces and resources flow into Reports + Recommendations.

Credentials never leave the cloud

Datastore credentials (root passwords, connection strings) are written to the cloud provider's secret manager — AWS Secrets Manager, GCP Secret Manager, or Azure Key Vault. ZopNight stores only the secret reference, never the secret value.

RBAC

Provisioning ships with four policy categories so platform teams can delegate provisioning without granting full tenancy admin:

| Policy | What it grants |
| --- | --- |
| space:view / create / update / delete | Manage Deployment Spaces (cloud account + region pairs targeted for provisioning) |
| provisioning:view / create / update / delete | Run, monitor, modify, and tear down provisioning jobs |
| service:view / create / update / delete | Manage in-cluster Helm components attached to a space |
| deployment:view / create / update / delete | Manage application deployments orchestrated through config + deployer |

Default System roles inherit these policies; custom roles need explicit grants. See Roles & Permissions for how to assemble them.

Provisioning jobs

POST /provisioning-jobs

Submit a new provisioning job (cluster, VM, datastore, network, or component) into a space.

Cluster (EKS) (bash)
curl -X POST https://zopnight.com/api/provisioning-jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceID": "spc_abc123",
    "kind": "kubernetes",
    "config": {
      "name": "platform-prod",
      "version": "1.30",
      "nodeGroups": [
        { "name": "default", "instanceType": "m5.large", "min": 2, "max": 6 }
      ],
      "vpcID": "vpc-0abc",
      "subnetIDs": ["subnet-0a", "subnet-0b"]
    }
  }'
Datastore (RDS Postgres) (bash)
curl -X POST https://zopnight.com/api/provisioning-jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceID": "spc_abc123",
    "kind": "datastore",
    "config": {
      "engine": "postgres",
      "version": "16.3",
      "instanceClass": "db.t3.medium",
      "allocatedStorageGB": 50,
      "multiAZ": true,
      "backupRetentionDays": 7
    }
  }'
VM pool (EC2 + ALB) (bash)
curl -X POST https://zopnight.com/api/provisioning-jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceID": "spc_abc123",
    "kind": "vm_pool",
    "config": {
      "name": "checkout-edge",
      "instanceType": "m5.large",
      "min": 2,
      "max": 6,
      "createLoadBalancer": true,
      "listenerPorts": [80, 443]
    }
  }'
GET /provisioning-jobs

List provisioning jobs in a space, with status.

GET /provisioning-jobs/{id}

Get a job's full state — phase, per-step output, and any errors.
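As a rough sketch, a job mid-flight might serialize like this. The field names below are assumptions chosen for illustration, not the documented schema:

```json
{
  "data": {
    "id": "job_abc",
    "kind": "kubernetes",
    "phase": "running",
    "steps": [
      { "name": "create_network", "status": "succeeded" },
      { "name": "create_cluster", "status": "running" }
    ],
    "errors": []
  }
}
```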

PATCH /provisioning-jobs/{id}

Update lightweight job fields (name, tags) without re-running the pipeline.

DELETE /provisioning-jobs/{id}

Tear down a previously created cluster, VM pool, datastore, or component.

POST /provisioning-jobs/{id}/retry

Retry a failed job from the last successful step. Useful for transient failures (rate-limited cloud API, throttled subnet creation).
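For transient failures, a client typically retries and then polls the job until it reaches a terminal phase. A minimal sketch of that loop, with the API call stubbed out so the control flow is visible; `poll_until_terminal` and `fetch_phase` are hypothetical helpers, and a real script would curl GET /provisioning-jobs/{id} and read the job's phase instead of replaying canned values:

```shell
#!/bin/sh
# Sketch: after POST /provisioning-jobs/{id}/retry, poll until a terminal phase.
# fetch_phase stands in for:
#   curl -s -H "Authorization: Bearer $TOKEN" \
#        https://zopnight.com/api/provisioning-jobs/$1
# plus extracting the phase field; here it replays canned phases instead.
PHASES="running running succeeded"
i=0
fetch_phase() {            # sets $phase; no subshell, so $i persists across calls
  i=$((i + 1))
  phase=$(printf '%s\n' "$PHASES" | cut -d' ' -f"$i")
}

poll_until_terminal() {
  while :; do
    fetch_phase "$1"
    [ -n "$phase" ] || { echo "unknown"; return 1; }   # ran out of canned phases
    case "$phase" in
      succeeded|failed) echo "$phase"; return 0 ;;
    esac
    # sleep 5              # back off between polls in a real script
  done
}

poll_until_terminal job_abc   # prints "succeeded" with the canned phases above
```

The same loop applies unchanged to freshly submitted jobs; only the stubbed fetch changes.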

Databases under a job

A datastore job is the parent of one or more databases — logical schemas the application code will connect to. Each database has its own credentials and is provisioned by issuing CREATE DATABASE and CREATE USER against the parent server.

POST /provisioning-jobs/{jobID}/databases

Create a database on the parent datastore.

Request (bash)
curl -X POST "https://zopnight.com/api/provisioning-jobs/job_abc/databases" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{ "name": "checkout", "owner": "checkout_app" }'
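A plausible response shape, shown as a hedged sketch (field names and values are assumptions, not the documented schema):

```json
{
  "data": {
    "id": "db_xyz",
    "name": "checkout",
    "owner": "checkout_app",
    "status": "creating"
  }
}
```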
GET /provisioning-jobs/{jobID}/databases

List databases under a job.

GET /provisioning-jobs/{jobID}/databases/{dbID}

Get a database with its connection metadata (server, port, secret reference).
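Consistent with the secrets model described above, the connection metadata carries only a secret reference, never the credential value. An illustrative response (shape and values are assumptions):

```json
{
  "data": {
    "id": "db_xyz",
    "name": "checkout",
    "server": "platform-prod.abc123.us-east-1.rds.amazonaws.com",
    "port": 5432,
    "secretRef": "arn:aws:secretsmanager:us-east-1:123456789012:secret:checkout-creds"
  }
}
```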

DELETE /provisioning-jobs/{jobID}/databases/{dbID}

Drop the database and revoke its user.

Connections

A connection is a stored set of credentials (host, port, user, secret reference) an application fetches at deploy time. Connections live either at job scope (one set of credentials shared across all databases on the server) or at database scope (one set per database).

Job-scoped

POST /provisioning-jobs/{jobID}/connections

Create a job-scoped connection.
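The request body for this endpoint is not shown above; a hypothetical minimal payload, assuming a connection carries a display name and the database user it maps to:

```json
{ "name": "shared-rw", "user": "platform_app" }
```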

GET /provisioning-jobs/{jobID}/connections

List job-scoped connections.

GET /provisioning-jobs/{jobID}/connections/{connID}

Get a job-scoped connection by ID.

DELETE /provisioning-jobs/{jobID}/connections/{connID}

Delete a job-scoped connection (does not drop the database user).

Database-scoped

POST /provisioning-jobs/{jobID}/databases/{dbID}/connections

Create a connection scoped to a specific database.

GET /provisioning-jobs/{jobID}/databases/{dbID}/connections

List connections scoped to a database.

GET /provisioning-jobs/{jobID}/databases/{dbID}/connections/{connID}

Get a database-scoped connection.

DELETE /provisioning-jobs/{jobID}/databases/{dbID}/connections/{connID}

Delete a database-scoped connection.

Component Catalog

The catalog is the curated list of installable Helm components (ingress controllers, observability stacks, cert-manager, etc.). The provisioner installs them into the space's cluster via the Helm Go SDK.

GET /catalog/components

List available catalog components and their default values.

Response (json)
{
  "data": [
    {
      "id": "ingress-nginx",
      "name": "NGINX Ingress Controller",
      "category": "ingress",
      "version": "4.10.0",
      "description": "Production-ready ingress controller backed by NGINX."
    },
    {
      "id": "cert-manager",
      "name": "cert-manager",
      "category": "tls",
      "version": "1.15.0",
      "description": "Automated TLS certificate issuance via ACME."
    }
  ]
}
POST /provisioning-jobs/install-component

Install a catalog component into a space's cluster.

Request (bash)
curl -X POST https://zopnight.com/api/provisioning-jobs/install-component \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "spaceID": "spc_abc123",
    "componentID": "ingress-nginx",
    "values": {
      "controller.service.type": "LoadBalancer"
    }
  }'

Auto-Remediation Jobs

The provisioner also drives the remediation workflows triggered from a recommendation's Remediate button. Workflows are stored in the same provisioning_jobs table (with kind = remediation) and execute through a step-kind dispatcher. The full audit trail is kept in a dedicated remediation_audit table, of which the reconciler is the sole writer.

Step kinds

  • precondition — verifies the cloud resource is in the expected state before any change is made.
  • approval — parks the workflow in awaiting_user for destructive ops; admins approve or reject from the wizard or via the API.
  • execute — delegates Start/Stop to the executor.
  • pause_service — provisioner-owned pause for resources that don't have a Stop primitive (Synapse Pool, ML Compute, Data Factory, Cognitive Services, Databricks Cluster on Azure; Dataproc on GCP — terminal, not reversible).
  • validate — final state check after the cloud action.
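Put together, a workflow parked at an approval gate might read like the following sketch; the response shape and field names are assumptions for illustration:

```json
{
  "data": {
    "id": "rem_123",
    "kind": "remediation",
    "status": "awaiting_user",
    "steps": [
      { "name": "precondition", "kind": "precondition", "status": "succeeded" },
      { "name": "approval", "kind": "approval", "status": "awaiting_user" },
      { "name": "execute", "kind": "execute", "status": "pending" },
      { "name": "validate", "kind": "validate", "status": "pending" }
    ]
  }
}
```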

Executor boundary

The executor only handles Start/Stop. Pause and delete are provisioner-owned because they can be irreversible and need approval gates.

POST /remediation-jobs

Start a remediation workflow for a specific recommendation.

Request (bash)
curl -X POST https://zopnight.com/api/remediation-jobs \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "recommendationID": "rec_abc123",
    "subjectType": "resource",
    "subjectID": "i-0123456789abcdef0"
  }'
POST /remediation-jobs/{id}/cancel

Cancel a running or paused remediation workflow.

Step approval / rejection

Two equivalent endpoint shapes: address a step by its stable UUID (recommended — survives template revisions), or by the step's name in the template.

POST /remediation-jobs/{jobID}/steps-by-id/{stepID}/approve

Approve an approval-gated step by stable step ID.
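The audit timeline records a reason alongside approvals, so the approve call presumably accepts one; a hypothetical request body (the field name is an assumption):

```json
{ "reason": "Confirmed off-hours window" }
```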

POST /remediation-jobs/{jobID}/steps-by-id/{stepID}/reject

Reject an approval-gated step by stable step ID. The workflow transitions to terminal cancelled.

POST /remediation-jobs/{id}/steps/{stepName}/approve

Approve by step name (legacy form, kept for backward compatibility).

POST /remediation-jobs/{id}/steps/{stepName}/reject

Reject by step name (legacy form).

GET /remediation-jobs/{id}/audit

Step-level audit timeline (created / approved / rejected / completed / failed) with actor + reason. Backed by the dedicated remediation_audit table.

Response (json)
{
  "data": [
    { "stepName": "precondition", "eventType": "completed", "actor": "system",         "createdAt": "2026-04-29T11:00:00Z" },
    { "stepName": "approval",     "eventType": "approved",  "actor": "alice@acme.com", "reason": "Confirmed off-hours window", "createdAt": "2026-04-29T11:02:30Z" },
    { "stepName": "execute",      "eventType": "completed", "actor": "system",         "createdAt": "2026-04-29T11:03:15Z" }
  ]
}

Failure model

When a step fails, the provisioner classifies the error into one of three buckets so the wizard can render the right CTA:

  • user_action — customer must fix (permission, quota, billing). The response carries a fixHint and a consoleUrl deep-link. Not retryable.
  • transient — cloud-side throttling or eventual consistency. Retryable; the wizard offers a "Remediate again" button.
  • system — our code or an unsupported API. Not retryable; recorded for follow-up.
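A failed step's error payload might look like the following sketch. The fixHint and consoleUrl fields come from the description above; everything else (field names, the example message) is an assumption:

```json
{
  "classification": "user_action",
  "message": "iam:PassRole denied for the provisioner role",
  "fixHint": "Grant iam:PassRole on the provisioner role, then remediate again.",
  "consoleUrl": "https://console.aws.amazon.com/iam/home#/roles",
  "retryable": false
}
```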

Approval-required notifications are routed to org admins via the notification service (remediation.approval_required template). Terminal success/failure emails are gated behind the REMEDIATION_NOTIFY_TERMINAL environment variable to avoid noise during high-volume remediation runs.

VPC Peering on Architecture

VPC peering connections are first-class on the Architecture canvas alongside transit gateways and interconnects. When you provision a cluster or datastore in a peered VPC, the link is rendered automatically, and clicking through lands on the discovered peering connection's detail view.

See Resources for the full discovered-resource model and Deployment Spaces for cluster registration.