ZopNight Ships VM Autoscaling, MCP Server, and Tag-Level Cost Attribution

Most cloud platforms tell you what happened. They do not fix it. This release moves ZopNight from a visibility layer into an execution layer — VM autoscaling across 3 clouds, 43 read-only AI tools, tag-level cost attribution, and more.

By Riya Mittal
Published: April 20, 2026 · 8 min read

Most cloud platforms tell you what happened. They do not fix it. This release moves ZopNight from a visibility layer into an execution layer. You can now autoscale VMs across AWS, Azure, and GCP from a single policy. You can query your live infrastructure through AI assistants without write risk. You can attribute costs to any cloud tag in real time. And you can see exactly how large your S3 and GCS buckets actually are, inline, without opening a separate report.

Six features. One direction: from observation to action.

VM Autoscaling V1: Three Modes, One Policy Lifecycle

VM autoscaling is the feature most teams ask for and most teams break on their first attempt. The usual failure mode: a threshold set too aggressively fires during a one-hour spike, scales out 12 instances, and the scale-in cooldown is configured wrong, so those instances stay running for 3 days. That 3-day overage on 12 instances in us-east-1 costs roughly $420 at on-demand rates before anyone notices. Azure VMSS autoscale shows the inverse failure: a misconfigured minimum instance count acts as a floor that blocks savings permanently.

We built V1 around a graduated trust model. Three operating modes let you move at your own pace.

Monitor mode collects CPU metrics and computes running statistics with Welford's online algorithm, alongside streaming P90, P95, and P99 percentile estimates. Neither requires storing raw time-series data, which keeps memory overhead near zero regardless of how many VMs you are tracking.
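To make the memory claim concrete, here is a minimal sketch of Welford's single-pass update. The class name and CPU-percentage framing are illustrative, not ZopNight's actual implementation, and the percentile estimation mentioned above would need a separate streaming estimator on top of this.

```python
class RunningStats:
    """Single-pass mean/variance via Welford's online algorithm.

    Memory is O(1) per tracked VM: no raw samples are retained.
    (Illustrative sketch, not ZopNight's actual implementation.)
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, cpu_pct: float) -> None:
        self.n += 1
        delta = cpu_pct - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (cpu_pct - self.mean)

    @property
    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for sample in [40.0, 55.0, 61.0, 47.0, 90.0]:
    stats.update(sample)
print(round(stats.mean, 2))  # 58.6
```

Each update touches three floats, so tracking ten VMs or ten thousand costs the same per-VM memory.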

Recommend mode surfaces the six new recommendation rules (RC-ASC-001 through RC-ASC-006) as actionable cards. The rules cover scale-out triggers, scale-in candidates, cooldown misconfiguration, threshold adjustment, policy conflict detection, and anomaly suppression. You read the recommendation, decide whether to apply it, and ZopNight does nothing until you confirm.

Autopilot mode executes the approved policy without human input. The system applies, pauses, resumes, and removes policies based on the same rules. Autopilot only activates after you have run the policy in Recommend mode and explicitly promoted it.

Mode | What It Does | Human Input Required
Monitor | Collects CPU metrics, computes P90/P95/P99 percentiles | None
Recommend | Surfaces 6 rules as actionable cards (RC-ASC-001 to RC-ASC-006) | Confirm each recommendation
Autopilot | Executes policy automatically | None; only after Recommend phase approval

Figure: VM autoscaling lifecycle across the Monitor, Recommend, and Autopilot modes.

This works on AWS Auto Scaling Groups, Azure VMSS, and GCP Managed Instance Groups. The same policy object, the same mode transitions, three clouds.

The design choice to require Monitor before Autopilot is intentional. Teams that skip the observation phase do not understand why a policy fires, and that makes rollbacks slower. Kubernetes autoscaler comparisons show the same pattern: VPA in auto mode without an observation period causes more incidents than it prevents.

MCP Server: 43 Read-Only Tools, Zero Write Risk

The MCP Server exposes 43 tools to AI assistants: Claude Desktop, Cursor, Codex, and Claude Code. Every tool is read-only. There are no write operations, no mutation endpoints, no way to delete or modify a resource through the MCP interface.

This is a deliberate boundary. AI assistants that can write to production infrastructure create a new class of incident. A misread prompt, a hallucinated resource name, an ambiguous confirmation: any of these can trigger an action you cannot easily undo. AI in DevOps is genuinely useful, but only when the blast radius of a mistake is bounded.

Component | Role
AI Assistants (Claude, Cursor, Codex) | Send queries via MCP protocol
MCP Server | Routes to 43 read-only tools, logs every request
ZopNight Data | Resources, costs, tags, schedules, recommendations
Audit Log | Records assistant identity, tool, query, and response

Figure: MCP server architecture linking AI assistants, read-only tools, ZopNight data, and the audit log.

An org-level toggle controls whether the MCP server is active. Disabling it cuts all AI assistant access in one operation, with no changes needed at the individual tool level. Every request is audit-logged: which assistant, which tool, which query, what was returned.
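The combination of a read-only boundary, an org-level kill switch, and per-request audit logging can be sketched as a small tool registry. Everything below is hypothetical: the class, the `writes` flag, and the example tool are invented to illustrate the pattern, not ZopNight's real MCP server code.

```python
import datetime


class ReadOnlyToolRegistry:
    """Hypothetical sketch of the MCP boundary: registration rejects
    anything flagged as a write, an org-level toggle gates all access,
    and every call is appended to an audit log."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled      # org-level toggle
        self.tools = {}
        self.audit_log = []

    def register(self, name, fn, writes: bool = False):
        if writes:
            raise ValueError(f"{name}: write tools are not permitted")
        self.tools[name] = fn

    def call(self, assistant: str, name: str, **query):
        if not self.enabled:
            raise PermissionError("MCP server is disabled for this org")
        result = self.tools[name](**query)
        self.audit_log.append({
            "assistant": assistant,
            "tool": name,
            "query": query,
            "response": result,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return result


registry = ReadOnlyToolRegistry()
registry.register("list_running_clusters", lambda env: [f"{env}-dbx-1"])
print(registry.call("claude-desktop", "list_running_clusters", env="prod"))
```

Because write tools cannot even be registered, the blast radius of a hallucinated resource name is a wrong answer, never a wrong action.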

The 43 tools cover resources, costs, tags, schedules, recommendations, and policy state. An engineer can ask “which Databricks clusters are running outside business hours” and get a live answer grounded in ZopNight’s current state, not a snapshot from a report that was generated last Tuesday.

Showback Tags Dimension: Cost Breakdown by Cloud Tag

Cost reports with no tag dimension are visibility theater. You can see total spend. You cannot see which team, product, or environment is driving it. That is the problem the Tags dimension solves.

The new Tags tab in Reports breaks cost down by cloud tag: AWS resource tags, GCP labels, and Azure tags. Select a tag key, and the report pivots to show spend per tag value. Select environment, and you see production vs staging vs dev. Select team, and you see which squad is running the most expensive workloads.
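The pivot itself is a simple group-by over cost records. The sketch below uses invented tag keys, values, and amounts to show the shape of the operation, including where untagged resources land; it is not ZopNight's query engine.

```python
from collections import defaultdict

# Cost records as a billing sync might surface them.
# Tag keys and amounts are invented for illustration.
records = [
    {"cost_usd": 1200.0, "tags": {"environment": "production", "team": "search"}},
    {"cost_usd": 300.0,  "tags": {"environment": "staging", "team": "search"}},
    {"cost_usd": 450.0,  "tags": {"environment": "production", "team": "ads"}},
    {"cost_usd": 90.0,   "tags": {}},  # untagged resource
]


def pivot_by_tag(records, tag_key):
    """Spend per tag value; untagged resources land in '(untagged)'."""
    spend = defaultdict(float)
    for r in records:
        spend[r["tags"].get(tag_key, "(untagged)")] += r["cost_usd"]
    return dict(spend)


print(pivot_by_tag(records, "environment"))
# {'production': 1650.0, 'staging': 300.0, '(untagged)': 90.0}
```

The "(untagged)" bucket is the important design choice: hiding untagged spend would make coverage gaps invisible in exactly the report meant to expose them.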

Approach | Frequency | Actionability | Ownership
Monthly tag audit | Monthly | Low: stale data | FinOps team
Tag Coverage widget | Real-time | High: live percentage | Every team
Tags cost dimension | On-demand | High: drill into spend | Engineers + FinOps

The Tag Coverage donut widget on the Dashboard is the companion piece. It shows what percentage of your resources carry billing tags right now. In a typical fleet we see tag coverage start below 40% before governance is enforced. This converts tag compliance from a periodic audit result into a live metric. Teams that can see their coverage percentage in real time fix gaps faster, because the feedback loop is immediate rather than monthly. Showback and chargeback only work when the underlying tag data is complete, consistent, and current.
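The coverage metric behind a widget like this reduces to one fraction. In the sketch below, the required tag keys and the fleet are invented for illustration; whatever governance policy defines "properly tagged" in practice would replace them.

```python
def tag_coverage(resources, required_keys=("team", "environment")):
    """Percentage of resources carrying all required billing tags.

    required_keys is an illustrative governance policy, not a
    ZopNight default.
    """
    if not resources:
        return 0.0
    tagged = sum(
        1 for r in resources
        if all(k in r.get("tags", {}) for k in required_keys)
    )
    return round(100.0 * tagged / len(resources), 1)


fleet = [
    {"id": "i-1", "tags": {"team": "ads", "environment": "prod"}},
    {"id": "i-2", "tags": {"team": "ads"}},  # missing environment
    {"id": "i-3", "tags": {}},               # untagged
]
print(tag_coverage(fleet))  # 33.3
```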

Billing Cost vs Rack Rate: Two Sources, One Source of Truth

ZopNight has always stored cost data. The platform now makes the cost source explicit. Two values exist for every resource.

actual_cost_usd comes from billing sync: the amount your cloud provider actually charged after applying committed use discounts, savings plans, reserved instance rates, and negotiated credits. cost_usd is the rack rate fallback, used when billing sync has not yet completed or the provider API is delayed.

Cost Source | What It Reflects | When It Applies
actual_cost_usd | Billed after discounts + credits | Billing sync complete
cost_usd | Rack rate, no discounts | Billing sync pending
Azure amortized | Reserved instance cost spread daily | Azure resources only

The difference matters. A team running on 3-year reserved instances may see a rack rate of $185,000 per year and an actual billed cost of $67,000. If your FinOps reporting uses rack rate, you overestimate spend by 176%. If you use actual cost without amortization, a reserved instance purchase looks like a one-day spike.

Azure handles this with amortized cost: the full reserved instance payment is spread across the reservation period day by day. ZopNight uses amortized cost for all Azure resources automatically. The frontend now labels which source is active for each resource so you always know what you are looking at.
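The source-selection precedence the article describes can be sketched as a single resolution function: Azure amortization first, then `actual_cost_usd` when billing sync has landed, then the `cost_usd` rack rate. The function name, dataclass, and amortization arithmetic below are illustrative assumptions, not ZopNight's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceCost:
    cost_usd: float                   # rack rate fallback
    actual_cost_usd: Optional[float]  # billed cost; None until billing sync lands
    provider: str


def effective_daily_cost(rc: ResourceCost,
                         ri_payment: float = 0.0,
                         reservation_days: int = 0) -> tuple[float, str]:
    """Pick the cost source and label it, mirroring the precedence
    described in the article. Amortization arithmetic is illustrative."""
    if rc.provider == "azure" and reservation_days:
        # Spread the full reserved-instance payment evenly across the term.
        return ri_payment / reservation_days, "azure_amortized"
    if rc.actual_cost_usd is not None:
        return rc.actual_cost_usd, "actual_cost_usd"
    return rc.cost_usd, "cost_usd"


# A 3-year Azure reservation ($67,000/year billed) spread per day:
cost, source = effective_daily_cost(
    ResourceCost(cost_usd=506.85, actual_cost_usd=None, provider="azure"),
    ri_payment=67_000.0 * 3, reservation_days=3 * 365,
)
print(source, round(cost, 2))  # azure_amortized 183.56
```

Returning the label alongside the number is what lets the frontend show which source is active for each resource.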

Resource Discovery and Storage Sizing: Seeing What You Could Not See Before

You cannot govern what you cannot see. That is the limit that expanded resource discovery addresses.

ZopNight now surfaces EKS and GKE Deployments, StatefulSets, and CronJobs as first-class resources. Previously, these Kubernetes workload types were not discoverable through the platform. You could see the node groups and clusters, but not the workloads running on them. The All-Resources page now includes these workload types with filter support scoped to your current view.

Azure Databricks clusters, pools, and warehouses are now discoverable as well.

Level | Resource Types Discoverable
Cluster | EKS Cluster, GKE Cluster
Node Group | Managed Node Groups, Node Pools
Workload | Deployments, StatefulSets, CronJobs

Figure: EKS and GKE resource hierarchy across the cluster, node group, and workload levels.

The parent-child hierarchy goes up to 3 levels deep. This means a cost filter applied at the cluster level scopes down through node groups to individual workloads. A recommendation surfaced for a Deployment shows the cluster and node group context without requiring a separate lookup.
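Scoping a cluster-level filter down through the hierarchy is a straightforward traversal. The resource names and dictionary shape below are invented for illustration; the point is that one cluster key resolves to every workload beneath it, with node-group context attached.

```python
# Three-level hierarchy: cluster -> node group -> workload.
# Resource names are invented for illustration.
hierarchy = {
    "eks-prod": {
        "ng-general": ["deploy/api", "statefulset/kafka"],
        "ng-batch": ["cronjob/nightly-etl"],
    },
}


def workloads_in_scope(hierarchy, cluster):
    """A cluster-level filter scopes down through node groups to workloads."""
    out = []
    for node_group, workloads in hierarchy.get(cluster, {}).items():
        for w in workloads:
            out.append({
                "cluster": cluster,
                "node_group": node_group,
                "workload": w,
            })
    return out


for row in workloads_in_scope(hierarchy, "eks-prod"):
    print(row["workload"], "<-", row["node_group"], "<-", row["cluster"])
```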

Storage size visibility closes the remaining gap for data teams. S3 bucket size and object count are now pulled via CloudWatch metrics. GCS bucket size comes from Cloud Monitoring. GCP Artifact Registry exposes image count and total size per repository.

Before this change, a team could see that an S3 bucket existed but not how large it was. Storage costs are easy to ignore precisely because they accumulate slowly. At $0.023 per GB-month in standard tier, a bucket that was 50 GB in January and grew to 800 GB in April adds $17.25 per month. Across dozens of buckets, that compounds to thousands of dollars in unreviewed storage spend. Size visibility inside ZopNight puts that number next to the monthly cost, creating the pressure to act that the raw billing line item rarely generates on its own.
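The arithmetic from the bucket example above generalizes to a fleet in a few lines. The rate is the standard-tier figure cited in the article; the fleet sizes in the second example are invented to show how small deltas compound.

```python
STANDARD_TIER_USD_PER_GB_MONTH = 0.023  # S3 standard-tier rate cited above


def monthly_storage_delta(start_gb: float, end_gb: float,
                          rate: float = STANDARD_TIER_USD_PER_GB_MONTH) -> float:
    """Added monthly cost, in USD, from a bucket growing start_gb -> end_gb."""
    return round((end_gb - start_gb) * rate, 2)


# The bucket from the example: 50 GB in January, 800 GB in April.
print(monthly_storage_delta(50, 800))  # 17.25

# Compounded across a fleet of buckets (sizes invented for illustration):
fleet_growth = [(50, 800), (120, 2_000), (10, 450)]
total = sum(monthly_storage_delta(a, b) for a, b in fleet_growth)
print(round(total, 2))
```

Three quietly growing buckets already add roughly $70 a month; dozens of them are how storage spend reaches thousands of dollars unreviewed.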

Written by Riya Mittal, Engineer at Zop.Dev
