AWS Databricks FinOps: Flagging Idle Isn't Stopping It

Most cost tools can tell you a Databricks cluster is idle, and Flexera’s 2025 State of the Cloud report puts idle and over-provisioned resources at 60% of all cloud waste. None of that flagging saves a cent on its own. A flagged cluster keeps billing exactly the same until something turns it off, and “something” is usually a human who never gets to it. The detection was never the hard part; the acting is, and that is the part most cloud cost tools quietly leave to you.

ZopNight already governs Databricks on Azure. With ZopNight v2.0 it now does the same on AWS, as a standalone workspace connection, and it closes the part most tools leave open: it can stop the idle compute, not just point at it.

Flagging Idle Is Not Saving

There is a quiet lie in most FinOps dashboards: that identifying waste is the same as removing it. It is not, and Databricks is a textbook case. An all-purpose cluster left running overnight, an instance pool holding warm nodes nobody claims, a SQL warehouse that never auto-stops: each is easy to detect and expensive to ignore. They show up on the dashboard in red, get a polite nod in the cost review, and keep running, because the one step that would actually save money, turning them off, lives in a different tool nobody opens during the meeting.

The detection is not the hard part. The hard part is that the saving only lands at the moment the resource is stopped, and a recommendation in a tab does not stop anything. Every hour between “flagged idle” and “actually stopped” bills at full rate. A tool that ends at the recommendation leaves all of that on the table.

What ZopNight Now Does on AWS

AWS Databricks connects as its own workspace connection from the Cloud Accounts page. Once connected, ZopNight manages it end to end, the same loop it runs everywhere else: discover, cost, recommend, and act.

Capability	On AWS Databricks
Discover	Workspace, clusters, instance pools, SQL warehouses, jobs, model-serving endpoints
Cost	Per resource, priced by EC2 instance type and AWS region from real pricing data
Recommend	15 AWS rules, including idle instance pools and oversized clusters
Act	Start and stop clusters, instance pools, and SQL warehouses

The discovery and pricing are AWS-native: workers are EC2 instances, priced by instance type and region, so the numbers reflect the actual bill rather than a generic estimate.

The Bill Is EC2 Plus DBUs

Databricks cost on AWS has the same two-layer shape it has everywhere: the EC2 instances that run the work, and the DBUs Databricks charges on top. A tool that meters only EC2 misses the DBU layer of the bill, and a tool that meters only DBUs misses the EC2 layer. Both layers move when a cluster is the wrong size or the wrong type.
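To make the two layers concrete, here is a minimal sketch of the arithmetic. Every rate in it is a made-up placeholder rather than a real AWS or Databricks price, and the names `ClusterRates` and `hourly_cost` are hypothetical, not part of ZopNight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterRates:
    """Hypothetical per-hour rates; none of these figures are real prices."""
    ec2_usd_per_node_hour: float  # EC2 price for the instance type and region
    dbu_per_node_hour: float      # DBUs one node of this type consumes per hour
    usd_per_dbu: float            # DBU price for the workload type


def hourly_cost(rates: ClusterRates, nodes: int) -> dict[str, float]:
    """Split a cluster's hourly cost into its EC2 layer and its DBU layer."""
    ec2 = rates.ec2_usd_per_node_hour * nodes
    dbu = rates.dbu_per_node_hour * rates.usd_per_dbu * nodes
    return {"ec2": ec2, "dbu": dbu, "total": ec2 + dbu}


# Placeholder numbers chosen only to make the arithmetic easy to follow.
rates = ClusterRates(ec2_usd_per_node_hour=0.50, dbu_per_node_hour=2.0, usd_per_dbu=0.25)
cost = hourly_cost(rates, nodes=4)

# A cluster left idle for a 12-hour night bills both layers the whole time.
overnight_idle_cost = cost["total"] * 12
```

With these placeholder rates the two layers happen to be equal, so metering only one of them would report half of the total. The real split depends on the instance type and workload.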

This is the same structure we covered for the Azure Databricks cost surfaces: five compute types, each billed on its own clock, sitting on two cost layers. What changes on AWS is the underlying compute meter, EC2 instead of Azure VMs, and that is exactly what ZopNight prices natively now.

Figure: ZopNight closes the loop on AWS Databricks: discover and cost the workspace, recommend the idle and oversized resources, then stop them.

Stopping Is the Part That Saves

The capability that matters most is the last one. ZopNight can start and stop AWS Databricks clusters, instance pools, and SQL warehouses directly, so the recommendation becomes an action instead of a suggestion.

Stopping is also where safety lives, because an action that half-works is worse than no action at all. The stop and start operations wait for the resource to actually reach the target state rather than reporting a premature “done”, so an action is finished when the cluster is finished, not before, and you never get a green checkmark on a cluster that is still spinning up. And when the grant is missing, the failure is explicit rather than silent: a missing CAN_MANAGE permission is caught at the provider boundary and turned into a message that names the exact grant to add, so the first stop you attempt either works or tells you precisely why it cannot, instead of failing into a quiet no-op you discover on next month’s bill.

💡 Start and stop need the Databricks CAN_MANAGE permission on the target. ZopNight detects when it is missing and surfaces the exact grant to add, so the first stop you try does not fail silently.
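The two safety behaviors described above, waiting for the target state and turning a permission failure into a named grant, can be sketched roughly as follows. This is an illustration under assumptions, not ZopNight's implementation: `stop_and_wait`, `PermissionDenied`, and `MissingGrantError` are hypothetical names, the state strings are illustrative, and the provider calls are passed in as plain callables so the sketch runs without a real workspace.

```python
import time
from typing import Callable


class PermissionDenied(Exception):
    """Stand-in for whatever error a real provider client raises on a denied call."""


class MissingGrantError(Exception):
    """Raised with a message that names the grant the operator needs to add."""


def stop_and_wait(
    request_stop: Callable[[], None],
    get_state: Callable[[], str],
    target_state: str = "TERMINATED",
    timeout_s: float = 600.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Request a stop, then report success only once the target state is observed."""
    try:
        request_stop()
    except PermissionDenied as exc:
        # Translate the raw denial into an actionable message at the boundary.
        raise MissingGrantError(
            "Stop was rejected: grant CAN_MANAGE on this resource to the connected principal."
        ) from exc

    waited = 0.0
    while True:
        state = get_state()
        if state == target_state:
            return state  # finished only when the resource is finished
        if waited >= timeout_s:
            raise TimeoutError(f"still {state!r} after {waited:.0f}s, wanted {target_state!r}")
        sleep(poll_s)
        waited += poll_s


# Fake resource that needs two polls before it reaches the target state.
states = iter(["TERMINATING", "TERMINATING", "TERMINATED"])
result = stop_and_wait(lambda: None, lambda: next(states), sleep=lambda _s: None)
```

Injecting `sleep` keeps the example fast to run and test. A real caller would leave the default and tune `timeout_s` to how long the resource type takes to stop.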

One Governance Model Across Clouds

The point of adding AWS is not AWS in isolation. It is that Databricks is rarely single-cloud, and the cost problem does not respect the boundary. A team running workspaces on both Azure and AWS used to need two tools, two mental models, and two reconciliations.

Now it is one. The same discovery, the same cost attribution, the same idle and right-size rules, and the same start/stop control, across both clouds, in one place. A cluster idling on AWS and a warehouse left on in Azure show up in the same view, ranked by the same impact, fixable by the same action, so nobody has to remember which tool covers which cloud or reconcile two bills by hand at month end. Multi-cloud Databricks spend stops being two separate problems managed by two separate habits and becomes one governed surface. That is the only way closed-loop remediation scales past a single provider instead of fragmenting every time you add a cloud.

What Right-Sizing Looks Like Here

Not every Databricks fix is a stop. An oversized cluster should be resized, not killed; a pool holding too many idle nodes should drop its floor. ZopNight’s AWS rules flag these against real EC2 pricing, so a right-size carries a real dollar figure rather than a vague “consider downsizing”. That grounding is what separates a useful recommendation from noise. It is the same discipline that avoids the right-sizing trap of acting on thin data.
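As a rough illustration of what a dollar-backed right-size means, the sketch below turns a node-count change into a monthly figure. The function name, the 730-hour month, and every rate are assumptions made for the example, not ZopNight's rule logic or real prices.

```python
def monthly_rightsize_savings(
    current_nodes: int,
    proposed_nodes: int,
    usd_per_node_hour: float,
    hours_per_month: float = 730.0,  # assumed average month length in hours
) -> float:
    """Estimate monthly savings from running fewer nodes at a known hourly rate."""
    if proposed_nodes > current_nodes:
        raise ValueError("proposed size is larger than current size; nothing to save")
    return (current_nodes - proposed_nodes) * usd_per_node_hour * hours_per_month


# Oversized cluster: 8 workers reduced to 5, at a placeholder $1.00 per node-hour.
cluster_savings = monthly_rightsize_savings(8, 5, usd_per_node_hour=1.00)

# Instance pool: idle floor lowered from 6 to 2, at a placeholder $0.50 per node-hour.
pool_savings = monthly_rightsize_savings(6, 2, usd_per_node_hour=0.50)
```

The estimate is only as good as the hourly rate fed into it, which is why pricing by actual instance type and region matters more than the formula.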

For the resources that are genuinely idle on a rhythm rather than dead, scheduling beats stopping by hand. It is the same pattern as automated scheduling for non-prod.
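A schedule of that kind reduces to one question asked on a timer: should this resource be running right now? The sketch below assumes a weekday working-hours window; the function name and the default hours are hypothetical, not ZopNight settings.

```python
import datetime as dt


def should_be_running(
    now: dt.datetime,
    start: dt.time = dt.time(8, 0),   # assumed start of the working window
    end: dt.time = dt.time(19, 0),    # assumed end of the working window
    weekdays_only: bool = True,
) -> bool:
    """Return True when `now` falls inside the allowed running window."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    return start <= now.time() < end


monday_morning = should_be_running(dt.datetime(2025, 1, 6, 9, 0))
monday_night = should_be_running(dt.datetime(2025, 1, 6, 22, 0))
saturday_morning = should_be_running(dt.datetime(2025, 1, 4, 9, 0))
```

A scheduler would compare this answer with the resource's current state and issue a start or stop only when the two disagree.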

When It Works, and When It Doesn’t

The honest caveat is access. AWS Databricks connects over account OAuth, and the credentials are validated and stored at connection time; discovery and cost need that connection healthy. Stopping needs the CAN_MANAGE grant described above. When the workspace is connected and the grant is in place, you get the full loop: discovery, EC2-priced cost, dollar-backed recommendations, and one-click stop. When the connection is partial or the grant is missing, the loop breaks, and in those cases ZopNight tells you what to fix rather than failing quietly.

Within that boundary, the loop finally closes on AWS Databricks: you see the idle cluster, you see what it costs, and you stop it, in the same place, the same way you already do on Azure.