A developer asks Claude Code at 2 AM: “this terraform plan is failing admission, fix the bucket so it deploys.” Claude reads the error, generates a slightly different bucket config, runs the plan again, hits the same admission rule, reads the new error, generates another shape, runs again. Three to five rounds of this and either the agent stumbles into a config that passes (often by accident) or gives up and tells the developer to look at the rule manually. Every round burns LLM tokens. None of the rounds taught the agent what the underlying policy actually was.
This is the loop every cloud-aware agent in 2026 runs without integration. The agent sees the failure but cannot see the rule. It sees the resource but cannot see the policy graph. It sees the recommendation but cannot see the drift events that produced it. The fix is not a better agent. The fix is exposing the cloud’s governance state to the agent through typed tools so the agent gets context before it acts, instead of after it fails.
ZopNight v2.0 ships an MCP server endpoint that does exactly this. Claude Code, Cursor, Codex, and any other MCP-aware client can read the live policy graph, resource state, ownership, drift events, exceptions, and audit history through seven typed tools. The agent composes those tools into answers. The human reads the answer.
The piece sits next to the existing work on read-only MCP servers (the foundational pattern), policy-aware MCP governance (how policy data composes with the MCP shape), and read-write MCP failure modes (why writes are gated by capability tier). This is the product post: what shipped, what each tool does, and how to use it.
Cloud agents in 2026 work blind
Compare the same triage task with and without an MCP-aware governance surface.
| Step | Agent without MCP | Agent with ZopNight MCP |
|---|---|---|
| Operator asks “why is this resource broken” | Agent reads the alert / error text | Agent reads the alert + reads policy graph via list_policies |
| Agent tries to fix | Generates a plausible config, retries | Calls check_resource to see what rule applies, generates the right config first time |
| Failure happens | Reads error, retries with shape variation | Reads violation_history to see why this fails and what passed historically |
| Operator gets the answer | After 5-15 minutes of agent back-and-forth | After one round-trip with 3-4 parallel tool calls |
| Cost | High token use, low signal-to-noise | Low token use, structured signal |
The token cost gap is real. An agent burning through 5 to 15 rounds of trial-and-error on a single admission failure spends 8 to 20 thousand tokens on orchestration alone, not counting the final correct config. With MCP integration, the same task lands in 1 to 3 thousand tokens because the agent does not waste rounds on misconfigs the policy graph would have ruled out.
The latency gap is bigger. Agent back-and-forth at 2 AM is the worst possible UX for incident triage; the on-call engineer is waiting on the LLM and the LLM is waiting on retry timeouts. With MCP, the agent’s first response includes the policy context, the ownership, the recent drift, and a candidate fix.
What ZopNight’s MCP server exposes
The MCP server ships with seven typed tools covering the four kinds of cloud-governance question the agent needs to answer: which rules apply, whether a change should be made, who owns the resource, and what changed or failed recently.
| Tool | Returns | Used for |
|---|---|---|
| list_policies(scope) | Active policies on the scope, severity, owners | “Which rules apply to this resource” |
| check_resource(arn, action) | Allow / deny + reasoning + cited policy | “Should this change be made at all” |
| resource_ownership(arn) | Team, cost center, on-call, escalation | Routing questions, paging, ownership-driven approvals |
| drift_events(scope, window) | Recent drift detections (deploy, IAM, config) | “What changed lately around this resource” |
| exception_status(policy, resource) | Active exception, expiry, approver | “Is the override legitimate or expired” |
| violation_history(scope) | Recent denies, by frequency | Pattern detection for the agent’s reasoning |
| resource_topology(arn) | Dependency graph (upstream + downstream) | “What depends on this resource” |
Each tool has a JSON schema input that the MCP client validates at call time. Each tool returns a structured object the agent can quote, summarise, or feed into its next reasoning step. Each call writes one audit-log line on the server side: agent identity, tool name, parameters, response status, latency.
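For concreteness, here is roughly what one tool’s input and return shapes might look like in TypeScript. The field names below are illustrative, not ZopNight’s published schema; the point is that the agent gets a typed decision it can quote rather than free text it has to parse.

```typescript
// Illustrative shapes only; field names are assumptions, not ZopNight's actual schema.

interface CheckResourceInput {
  arn: string;     // resource being evaluated
  action: string;  // e.g. "s3:PutBucketPolicy"
}

interface CheckResourceResult {
  decision: "allow" | "deny";
  reasoning: string;                    // explanation the agent can quote verbatim
  citedPolicy: {
    id: string;                         // e.g. "POL-RES-007"
    severity: "low" | "medium" | "high";
    owners: string[];                   // owning teams
  };
}
```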
The seven tools were not chosen by guessing. Each maps to a question that came up repeatedly in customer incident postmortems and developer-workflow research. The list is small on purpose; the surface should be learnable in an afternoon and explainable on a single slide. Adding tools is easy; removing them once they ship is hard.
How an agent uses the MCP server
A typical incident triage flow with the MCP server wired in:
The operator types a natural-language question into Claude Code at 2:47 AM: “What is the context for the payments-prod EKS cluster, and is anything misconfigured or recently changed?”
The agent makes four MCP calls in parallel. Each takes 80 to 250 ms; the wall-clock latency is that of the slowest call. The agent receives four structured objects and composes them into a one-paragraph synthesis:
The payments-prod EKS cluster is owned by team payments-platform, on-call is jen.li. There were three changes in the last 24 hours: a deploy at 01:23 UTC by jen.li (PR d7f3e2), an IAM role update by security-bot at 02:11, and a pod OOM restart at 02:31. Policy POL-RES-007 fired at 02:39: pod payments-api-3 is running without a memory limit, which is the same condition that triggered the OOM restart. There is no active exception. The most likely cause is the deploy at 01:23 changing the resource spec; check PR d7f3e2.
The operator reads the synthesis and goes straight to the PR. Total time from question to actionable context: 90 seconds. Pre-MCP, the same triage flow involves opening four different surfaces (PagerDuty for ownership, CloudTrail for drift, the GRC tool for policies, the runbook for exceptions) and mentally stitching the results.
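A minimal sketch of that fan-out, assuming a generic MCP client with a callTool method; the real client API depends on the agent runtime, and the exact four tools and arguments are the agent’s choice per question, not a fixed contract.

```typescript
interface McpClient {
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

// Four independent reads issued in parallel; wall-clock latency is the slowest
// call, not the sum of the four.
async function gatherContext(mcp: McpClient, arn: string) {
  const [ownership, drift, policies, exceptions] = await Promise.all([
    mcp.callTool("resource_ownership", { arn }),
    mcp.callTool("drift_events", { scope: arn, window: "24h" }),
    mcp.callTool("list_policies", { scope: arn }),
    mcp.callTool("exception_status", { policy: "*", resource: arn }), // "*" is illustrative
  ]);
  // The agent composes these four structured objects into the synthesis above.
  return { ownership, drift, policies, exceptions };
}
```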
Median triage time across ZopDev customers using the MCP integration drops 70 to 85% on this class of question. The cost is a handful of tool calls; the savings are 12 to 30 minutes of operator attention per incident, plus not having to remember which surface holds which piece of context.
Read-only is the right default
ZopNight’s MCP server ships read-only as the default surface. The seven tools above all read state and write nothing back to the cloud. Write capability exists (the platform can mutate cloud state through the same gRPC backend), but it lives behind a capability-tier gate that the customer enables per tool and the trust-score work decides per call.
The reasoning lines up with the read-write MCP failure modes work: most of the value an agent produces (faster triage, better context for decisions, runbook generation, recommendation reasoning) comes from reading state. Mutation is a smaller fraction of the agent’s actual workload and a much larger source of incidents. Read-only first; write capability second.
The capability tiers are exposed per MCP tool:
| Tier | Tool examples | Default availability |
|---|---|---|
| Read-only | list_policies, check_resource, resource_ownership, drift_events, exception_status, violation_history, resource_topology | Always enabled |
| Mutate-low-blast | Tag-correct, retention-adjust, non-prod-stop | Opt-in per customer per tool |
| Mutate-high-blast | Production resource changes, cross-region operations | Hard-gated; effectively pages a human for approval |
Customers who want to grant write capability for low-blast operations enable it per tool. The audit log captures every mutation with the agent identity and the operator who authorised the agent’s PAT. Customers who want zero write capability can leave the tier-2 and tier-3 tools disabled; the read-only surface still produces most of the value.
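A sketch of how a gateway might enforce the tiers, using the tool-to-tier mapping from the table above; this illustrates the pattern, not ZopNight’s implementation, and tag_correct below is a hypothetical tier-2 tool name.

```typescript
type Tier = "read-only" | "mutate-low-blast" | "mutate-high-blast";

// Tier assignments mirror the table above.
const TOOL_TIERS: Record<string, Tier> = {
  list_policies: "read-only",
  check_resource: "read-only",
  resource_ownership: "read-only",
  drift_events: "read-only",
  exception_status: "read-only",
  violation_history: "read-only",
  resource_topology: "read-only",
  tag_correct: "mutate-low-blast", // hypothetical tier-2 tool
};

interface PatGrants {
  allowedTiers: Set<Tier>; // per-customer, per-tool opt-ins collapse to this
}

// Called at the gateway, before the request ever reaches the gRPC backend.
function authoriseCall(tool: string, pat: PatGrants): void {
  const tier = TOOL_TIERS[tool];
  if (tier === undefined) throw new Error(`unknown tool: ${tool}`);
  if (!pat.allowedTiers.has(tier)) {
    throw new Error(`tool "${tool}" requires tier "${tier}", which this PAT does not grant`);
  }
}
```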
Auth: PATs, per-user, one-click revocation
Authentication to the MCP server uses Personal Access Tokens (zn_pat_*). Each token is per-user, scoped to a single ZopNight organisation, and tied to the same RBAC policy graph that governs the dashboard.
The MCP server itself holds no state. It validates the PAT, applies the RBAC scope to the request (the user’s policy scope determines which resources are visible), and proxies to the existing gRPC backend services (Config, Discoverer, Aggregator, Recommender). The same backend that powers the dashboard powers the MCP surface; the data is identical.
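A minimal sketch of that request path; every name here (validatePat, resolveRbacScope, proxyToBackend, writeAuditLine) is a placeholder for illustration, not a ZopNight API.

```typescript
interface Identity { userId: string; org: string; patId: string; }
interface RbacScope { visibleResources: string[]; }

// Dependencies the stateless server delegates to; all placeholders.
interface Deps {
  validatePat(token: string): Promise<Identity>;                                    // rejects revoked or unknown tokens
  resolveRbacScope(id: Identity): Promise<RbacScope>;                               // same policy graph as the dashboard
  proxyToBackend(tool: string, params: object, scope: RbacScope): Promise<unknown>; // existing gRPC services
  writeAuditLine(line: object): Promise<void>;                                      // one line per call
}

// Validate, scope, proxy, log; the MCP server itself keeps no state.
async function handleMcpCall(deps: Deps, token: string, tool: string, params: object) {
  const identity = await deps.validatePat(token);
  const scope = await deps.resolveRbacScope(identity);
  const started = Date.now();
  const result = await deps.proxyToBackend(tool, params, scope);
  await deps.writeAuditLine({
    identity,
    tool,
    params,
    status: "ok",
    latencyMs: Date.now() - started,
    timestamp: new Date().toISOString(),
  });
  return result;
}
```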
Token management is in the user’s Settings page. Generating a new PAT is one click; copying it to the agent’s config is one paste; revoking is one click. Token rotation can be automated through the same surface. The customer does not have to think about cross-account IAM trust policies the way they would for a vendor SaaS integration; the auth model is the same as their personal access tokens for any other tool.
Per-call audit logging is the system of record. Every MCP call (regardless of whether the underlying tool is read or write) writes one line: the agent’s PAT identity, the user the PAT belongs to, the tool name, the input parameters, the response status, the latency, the timestamp. The audit log is queryable from the dashboard; compliance teams can answer “show me every action this agent took in the last 90 days” with one filter.
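As a sketch of what that one-filter question looks like, assuming an audit-line shape with the fields listed above; the shape and helper are illustrative, not the dashboard’s actual query API.

```typescript
interface AuditLine {
  agentIdentity: string;   // the PAT identity that made the call
  user: string;            // the user the PAT belongs to
  tool: string;
  parameters: Record<string, unknown>;
  status: "ok" | "error";
  latencyMs: number;
  timestamp: string;       // ISO 8601
}

// "Show me every action this agent took in the last 90 days."
function actionsByAgent(lines: AuditLine[], agent: string, days = 90): AuditLine[] {
  const cutoff = Date.now() - days * 24 * 60 * 60 * 1000;
  return lines.filter(
    (l) => l.agentIdentity === agent && Date.parse(l.timestamp) >= cutoff
  );
}
```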
Composition with auto-remediation
The MCP server composes naturally with auto-remediation. The agent can read the policy graph (via list_policies and violation_history), identify a recommendation that fits the customer’s stated intent, and propose the Remediate action. The customer still clicks the button; the agent’s job is context-gathering and proposal, not unilateral action.
A typical interaction:
| Step | Agent action | Cloud effect |
|---|---|---|
| 1 | Customer asks “why is my dev bill high this week” | Nothing yet |
| 2 | Agent calls list_policies(scope=dev-account) + violation_history(scope=dev-account, window=7d) | Reads |
| 3 | Agent identifies that 12 idle-EC2 recommendations are open with combined savings of $1,840/month | Reads (still) |
| 4 | Agent surfaces the recommendations with a one-paragraph summary | Reads (still) |
| 5 | Customer clicks Remediate on 8 of the 12 | Auto-remediation path runs |
| 6 | Each remediation goes through precondition → action → validation | Cloud state changes |
| 7 | Agent reports the result back to the customer | Reads |
The agent does not click Remediate on the customer’s behalf. The capability tier model enforces this even if the agent tries to call a tier-2 tool directly — the read-only PAT would be rejected at the gateway. The pattern is “agent reads, agent recommends, customer authorises.”
For customers who want the agent to be more autonomous, the capability-tier model has the opt-in path: enable tier-2 tools per agent, set the trust-score threshold, let the agent execute on safe rules without human approval. Most customers do not need this; the read + propose pattern handles the majority of the agent’s actual usage.
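A small sketch of the read-and-propose shape, where the agent’s output is a proposal the customer acts on from the dashboard rather than a call to a mutate tool; the types and helper below are illustrative only.

```typescript
interface Recommendation {
  id: string;
  resourceArn: string;
  monthlySavingsUsd: number;
}

interface Proposal {
  summary: string;               // the one-paragraph synthesis the customer reads
  recommendationIds: string[];   // the customer clicks Remediate on these in the dashboard
}

// The agent reads open recommendations via the read-only tools and proposes;
// it never calls a mutate tool itself.
function propose(recs: Recommendation[]): Proposal {
  const total = recs.reduce((sum, r) => sum + r.monthlySavingsUsd, 0);
  return {
    summary:
      `${recs.length} idle-resource recommendations are open, worth roughly ` +
      `$${total.toLocaleString()}/month combined. Review and remediate from the dashboard.`,
    recommendationIds: recs.map((r) => r.id),
  };
}
```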
How to enable and use the MCP server
Setup is three steps.
| Step | Where | Result |
|---|---|---|
| 1. Enable MCP for your org | Settings → Integrations → MCP | Org-level toggle |
| 2. Generate a PAT | Settings → Personal Access Tokens → New | zn_pat_* token |
| 3. Install the MCP config in your agent | Claude Code: ~/.config/claude/mcp.json. Cursor: Settings → MCP. Codex: similar | Agent can call the seven tools |
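For Claude Code, the config entry looks something like the sketch below. The exact key names vary by client and version, and both the endpoint URL and the token are placeholders, so check your client’s MCP documentation and the ZopNight setup page for the real values.

```json
{
  "mcpServers": {
    "zopnight": {
      "type": "http",
      "url": "https://mcp.zopnight.example/v1",
      "headers": {
        "Authorization": "Bearer zn_pat_xxxxxxxxxxxx"
      }
    }
  }
}
```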
After step 3, ask the agent a natural-language question that requires cloud context. Typical first prompts:
| Prompt | What the agent does |
|---|---|
| “What is the context for [resource ARN]?” | Calls resource_ownership + drift_events + list_policies + exception_status in parallel |
| “Which policies apply to my prod accounts?” | Calls list_policies(scope=prod-*), summarises |
| “What changed in the last 24h on the recommendation engine cluster?” | Calls drift_events(scope=rec-engine, window=24h) |
| “Why is my dev bill high this week?” | Calls violation_history + open recommendations + cost data |
The agent’s first response includes the tool calls it made (visible in the agent UI) so the operator can audit the reasoning. The audit log on ZopNight’s side captures the same calls for the security and compliance teams.
Most customers see useful agent answers within the first session. The pattern that produces the best results is to ask focused questions about specific resources rather than open-ended “tell me about my cloud” — the seven tools are sharp on specific scopes, and the agent’s synthesis quality is highest when the input scope is small.
What’s next for the MCP surface
The seven tools cover the highest-volume context questions. Future surface additions follow customer signal rather than speculation.
| Coming work | What it adds |
|---|---|
| Per-org tool subset | Customers can disable specific tools (e.g., hide violation_history) for compliance reasons |
| Streaming MCP responses | Long-running calls (e.g., topology of a 5,000-resource account) stream incrementally |
| Saved agent workflows | Operators can save “context-for-this-resource” as a one-click workflow callable from the dashboard |
| Bidirectional events | The MCP server pushes notifications to the agent (e.g., a new policy violation just landed) |
The read-only surface is the foundation. Everything else layers on top without changing the existing seven tools.
If you have Claude Code, Cursor, or Codex installed and a ZopNight account connected, the MCP integration is a 5-minute setup. Generate a PAT, drop the config into your agent, ask the first context question. The synthesis you get back is the same data the dashboard has, in the shape the agent can act on. That is the difference between an agent that retries blindly and one that reads the policy graph first.