Incident Triage in One Query: Asking Your AI Agent 'Who Owns This and Why Is It Broken' and Getting a Real Answer

By Aryan Mehrotra
Published: May 8, 2026 · 9 min read

An alert fires at 2:47 AM. A pod in the payments-prod namespace is in CrashLoopBackOff. The on-call engineer reads the alert, opens Slack to find the team that owns payments-prod, opens the wiki to find what policies apply to that namespace, opens CloudTrail to see what changed in the last hour, opens the deploy history to see whose PR landed last, and starts piecing the picture together. It is 3:09 AM. They have not yet looked at the actual error.

This is the first 20 minutes of every incident. Context-gathering. The actual fix is usually 5-15 minutes once context is in place. The triage tax (finding the owner, the policy context, what changed, the audit trail) is a third to half of total incident time, and it is the same set of queries every time.

This post describes the pattern that compresses that first 20 minutes into one query. A policy-graph MCP exposes typed tools for ownership, drift events, recent violations, and exception status. The agent (Claude Code, Cursor, the on-call engineer’s chat client) composes those tools into one answer to “who owns this and why is it broken.” The pattern composes with policy-aware MCP governance, closed-loop SRE remediation, and closed-loop IAM remediation.

The four queries every triage starts with

Across hundreds of incidents, the first four questions are the same.

  1. Who owns this resource? Team, on-call rotation, escalation path. This question fails without good ownership data, often because tags are inconsistent and the team that knows the answer is asleep.
  2. What policy applies to this resource? Compliance, security, operational. Without this, the engineer doesn’t know if a fix has policy implications (e.g., disabling encryption “to debug” violates SOC 2).
  3. What changed recently? Deploys, IAM updates, infra changes in the last 1-24 hours. The cause is almost always a recent change.
  4. Are there active exceptions or known issues? A resource with an active exception (“this bucket is intentionally public for the public CDN, expires 2026-08-01”) changes the triage path entirely.

A senior on-call engineer answers these by hitting four different systems: Slack/PagerDuty for ownership, the wiki for policy, CloudTrail/deploy logs for change history, and the runbook or an issue tracker for exceptions. The 20-minute tax is the navigation cost, not the data acquisition cost. Each system answers in seconds; the engineer’s time is spent in the gaps between them.

What the policy-graph MCP exposes

A policy-graph MCP server exposes the four queries as typed tools. Each is a single call. The agent (Claude in a chat, Claude Code in an IDE, or a chatbot wired into the alerting system) composes the calls into a unified answer.

| MCP tool | Returns | Question it answers |
| --- | --- | --- |
| resource_ownership(arn) | Team, on-call, escalation, business owner | “Who owns this?” |
| list_policies(scope) | Active policies + severity + recent violations | “What rules apply?” |
| drift_events(scope, window) | Recent drift / deploy / IAM-change events | “What changed?” |
| exception_status(policy, resource) | Active exception details + expiry + approver | “Is this a known case?” |
| violation_history(scope) | Recent denies / triggers, with frequency | “Has this been flagged before?” |
| check_resource(arn, action) | Allow/deny + cited policy + reasoning | “Can I take this action?” |

These are the six tools ZopNight exposes via MCP for policy-aware governance. The same tools serve incident triage because the data they expose IS the triage context.

The prompt to the agent during a triage looks like a single thought: “What is the context for arn:aws:eks:us-east-1:1234:cluster/payments-prod right now?” The agent decomposes it into four calls, gets four structured answers, and composes them into a one-paragraph summary; the engineer reads the paragraph instead of doing the navigation themselves.
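Scripted by hand, that decomposition is three independent calls fanned out in parallel, with exception_status following once the firing policy is known. A minimal client-side sketch using the TypeScript MCP SDK; the policy-graph-mcp launch command is a hypothetical stand-in for however the server is actually deployed:

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// The example resource from this post.
const ARN = "arn:aws:eks:us-east-1:1234:cluster/payments-prod";

async function gatherTriageContext(client: Client) {
  // Ownership, change history, and policy state are independent: fan out in parallel.
  const [ownership, drift, policies] = await Promise.all([
    client.callTool({ name: "resource_ownership", arguments: { arn: ARN } }),
    client.callTool({ name: "drift_events", arguments: { scope: "payments-prod", window: "24h" } }),
    client.callTool({ name: "list_policies", arguments: { scope: "payments-prod" } }),
  ]);
  // exception_status depends on which policy fired, so it follows list_policies;
  // the values here are the ones from the 2:47 AM example below.
  const exceptions = await client.callTool({
    name: "exception_status",
    arguments: { policy: "POL-RES-007", resource: "payments-api-3" },
  });
  return { ownership, drift, policies, exceptions };
}

async function main() {
  // Assumes the policy-graph MCP server is runnable as a local command (hypothetical name).
  const transport = new StdioClientTransport({ command: "policy-graph-mcp" });
  const client = new Client({ name: "triage-client", version: "1.0.0" });
  await client.connect(transport);
  console.log(JSON.stringify(await gatherTriageContext(client), null, 2));
  await client.close();
}

main().catch(console.error);
```

In practice the agent does this decomposition itself; the script is just the same four calls made explicit.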

The 2:47 AM example, replayed

Same alert, with the policy-graph MCP wired up. The engineer types into Claude Code:

What is the context for the payments-prod EKS cluster, and is anything misconfigured or recently changed?

Claude makes four MCP calls in parallel and gets four structured responses back:

| Tool call | Returns |
| --- | --- |
| resource_ownership(arn:aws:eks:us-east-1:1234:cluster/payments-prod) | team payments-platform; on-call jen.li@example.com; escalation payments-eng-leads@example.com; business owner payments; cost center PMT-007 |
| drift_events(scope=payments-prod, window=24h) | 01:23 UTC deploy d7f3e2 from CI (jen.li); 02:11 UTC IAM role update (security-bot, ticket SEC-4892); 02:31 UTC pod OOM restart (payments-api-3) |
| list_policies(scope=payments-prod) | 14 active policies; 1 recent violation (POL-RES-007: pod missing memory limit, 02:39 UTC); severity medium |
| exception_status(policy=POL-RES-007, resource=payments-api-3) | No active exception; last violation 2026-04-22 (resolved by deploy) |

Claude composes the answer:

The payments-prod EKS cluster is owned by team payments-platform, on-call is jen.li. There were three changes in the last 24 hours: a deploy at 01:23 UTC by jen.li (PR d7f3e2), an IAM role update by security-bot at 02:11 (ticket SEC-4892), and a pod OOM restart at 02:31. The policy POL-RES-007 fired at 02:39: pod payments-api-3 is running without a memory limit, which is the same condition that triggered the OOM restart. There is no active exception for this. The most likely cause is the deploy at 01:23 changing the resource spec; check PR d7f3e2 for the resource block.

Time elapsed from alert to context: 90 seconds. The engineer goes straight to the PR, finds the missing memory limit, fixes it. Total incident time: 12 minutes instead of 35.

This is not a productivity gimmick. It is a structural shift: the agent is the layer that composes context, the engineer is the layer that decides. The four queries that used to live in the engineer’s head live in the agent’s tool calls. The engineer reads the synthesis and skips the navigation.

Why the data has to be a graph, not a query layer

The reason this works is that the underlying data is a graph: resources have ownership, policies, exceptions, and drift events linked by ARNs and tags. The MCP tools traverse the graph along those edges.

Most teams’ triage data is not a graph. It is four query layers that happen to know about ARNs:

  • Slack channels keyed by team name
  • A wiki page per service
  • CloudTrail filtered by ARN
  • An issue tracker filtered by tag

Each of these is queryable, but the joins are manual. The engineer is the join engine. They look up the team in Slack, find the wiki link in the team’s channel, search CloudTrail for the ARN, search the tracker for the tag, and hold all four results in their head while reasoning.

A policy graph collapses this. The graph stores: resource ARN → team (edge: ownership), team → escalation path, ARN → policy (edge: applies-to), policy → exceptions, ARN → recent events. One question, one traversal. The agent reads the graph; the engineer reads the synthesis.
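A sketch of those node and edge shapes in TypeScript. The field names are illustrative, not a published schema; they mirror the edges listed above:

```typescript
// Node and edge shapes for the policy graph; field names are illustrative.

export interface Ownership {
  team: string;            // e.g. "payments-platform"
  onCall: string;          // current on-call from the rotation
  escalation: string;      // escalation path / distribution list
  businessOwner: string;
  costCenter: string;
}

export interface PolicyBinding {
  policyId: string;        // e.g. "POL-RES-007"
  severity: "low" | "medium" | "high";
  recentViolations: number;
}

export interface DriftEvent {
  ts: string;              // ISO timestamp
  kind: "deploy" | "iam_change" | "config_change" | "restart";
  actor: string;           // human or bot principal
  detail: string;
}

export interface PolicyException {
  policyId: string;
  resourceArn: string;
  reason: string;          // e.g. "intentionally public for the public CDN"
  approver: string;
  expires: string;         // ISO date; expired exceptions are pruned
}

// One resource node with its outgoing edges. Each MCP tool is a
// single-hop traversal over this shape, keyed by ARN.
export interface ResourceNode {
  arn: string;
  ownership: Ownership;
  policies: PolicyBinding[];
  events: DriftEvent[];
  exceptions: PolicyException[];
}
```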

| Triage step | Without graph | With policy-graph MCP |
| --- | --- | --- |
| Find owner | Slack search, wiki lookup | resource_ownership(arn) |
| Find policy context | Wiki, GRC system | list_policies(scope) |
| Find recent changes | CloudTrail, deploy log | drift_events(scope, window) |
| Find known exceptions | Runbook, ticketing | exception_status(policy, resource) |
| Stitch into narrative | Engineer’s head | Agent composition |

The data sources that feed the graph

The graph is only useful if it is current. Stale ownership = wrong page. Stale policy state = wrong fix.

Ownership data. Best source: the Terraform / Pulumi module that created the resource, which usually has team / cost-center metadata. Second-best: tags on the resource itself (team, service, cost-center). Third-best: an ownership registry (Backstage catalog, CMDB) that maps services to teams. The graph should accept all three and prefer the most recent.
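A sketch of that precedence rule, assuming each source stamps its assertion with a timestamp (the record shape and helper are hypothetical):

```typescript
// Candidate ownership assertions from the three sources, each stamped with
// when the source last asserted it.
interface OwnershipCandidate {
  team: string;
  source: "iac_module" | "resource_tags" | "service_registry";
  observedAt: string; // ISO timestamp
}

// Source rank breaks ties: IaC module metadata over tags over the registry.
const SOURCE_RANK: Record<OwnershipCandidate["source"], number> = {
  iac_module: 0,
  resource_tags: 1,
  service_registry: 2,
};

// Prefer the most recently asserted owner; fall back to source rank on ties.
function resolveOwner(candidates: OwnershipCandidate[]): OwnershipCandidate | undefined {
  return [...candidates].sort((a, b) => {
    const byTime = Date.parse(b.observedAt) - Date.parse(a.observedAt);
    return byTime !== 0 ? byTime : SOURCE_RANK[a.source] - SOURCE_RANK[b.source];
  })[0];
}
```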

Policy state. Best source: the policy engine itself (OPA, Custodian, ZopNight, etc.). The graph subscribes to the engine’s events and updates the violation history in near-real-time. Second-best: AWS Config / Azure Policy state, which is eventually-consistent (5-15 min lag).

Drift events. Best source: CloudTrail (AWS) or its equivalent on other clouds, plus the CI/CD deploy log. The graph indexes these by resource ARN. EventBridge or its equivalent streams events into the graph as they happen.

Exception state. Best source: a structured exception registry (a small DB or even a Git repo of YAML files). The graph cross-references against this on every exception_status call.

The 60-second cache mentioned in the ZopNight policy MCP architecture is the right balance: fresh enough to be trustworthy during an incident, cached enough to handle the agent’s repeated calls without melting the underlying systems. Longer cache windows make the data unreliable during fast-moving incidents; shorter ones make the MCP server expensive to run.
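The cache itself is a few lines. A minimal read-through sketch keyed on tool name plus serialized params, so the agent’s repeated calls within one incident stay cheap:

```typescript
// Read-through cache keyed on tool name + serialized params, 60-second TTL.
const TTL_MS = 60_000;

interface CacheEntry {
  value: unknown;
  expiresAt: number;
}

const cache = new Map<string, CacheEntry>();

async function cached<T>(tool: string, params: object, fetchFresh: () => Promise<T>): Promise<T> {
  const key = `${tool}:${JSON.stringify(params)}`;
  const hit = cache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.value as T;
  const value = await fetchFresh(); // miss or expired: hit the underlying system
  cache.set(key, { value, expiresAt: Date.now() + TTL_MS });
  return value;
}
```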

What the MCP call looks like under the hood

For implementers, here is the structural shape of the MCP server.

A typed tool definition follows the Claude / MCP SDK shape: a tool name, a one-sentence description for the agent, and a JSON-schema input. For ownership, the tool is named resource_ownership with description "Get team, on-call, escalation, and business owner for a cloud resource". The schema accepts a single required string arn (the cloud resource ARN). The implementation walks the graph: ARN → tag table → team registry → on-call schedule. Returns a structured object. The agent receives the object as a tool result and uses it as context for the next reasoning step.

The drift_events tool follows the same shape with description "Recent deploy, IAM, and configuration events for a scope". Its schema requires two strings: scope (resource ARN or namespace pattern) and window (time window like "24h" or "7d"). The implementation queries CloudTrail (filtered by resource), the deploy log (filtered by namespace or service), and the IAM event stream. Aggregates, sorts by time, returns a structured list. The agent composes this with the ownership and policy results.
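A sketch of both registrations using the TypeScript MCP SDK. lookupOwnership and queryDriftEvents are hypothetical helpers standing in for the graph traversals described above:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// Hypothetical graph-lookup helpers: ARN -> tags -> team registry -> on-call
// for ownership; CloudTrail + deploy log + IAM stream for drift.
import { lookupOwnership, queryDriftEvents } from "./graph.js";

const server = new McpServer({ name: "policy-graph", version: "1.0.0" });

server.tool(
  "resource_ownership",
  "Get team, on-call, escalation, and business owner for a cloud resource",
  { arn: z.string().describe("Cloud resource ARN") },
  async ({ arn }) => ({
    content: [{ type: "text" as const, text: JSON.stringify(await lookupOwnership(arn)) }],
  })
);

server.tool(
  "drift_events",
  "Recent deploy, IAM, and configuration events for a scope",
  {
    scope: z.string().describe("Resource ARN or namespace pattern"),
    window: z.string().describe('Time window like "24h" or "7d"'),
  },
  async ({ scope, window }) => ({
    content: [{ type: "text" as const, text: JSON.stringify(await queryDriftEvents(scope, window)) }],
  })
);

// Serve over stdio so Claude Code or a chat client can spawn it locally.
await server.connect(new StdioServerTransport());
```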

Every MCP call also writes one audit log line for security review. The fields:

| Field | Type | Example |
| --- | --- | --- |
| ts | ISO timestamp | 2026-05-08T02:48:11.482Z |
| user | email | jen.li@example.com |
| session | string | claude-code-3-incident-triage |
| tool | string | drift_events |
| params | object | {"scope": "payments-prod", "window": "24h"} |
| response_status | string | ok |
| response_size_bytes | number | 4128 |
| duration_ms | number | 234 |
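A sketch of a wrapper that emits one such line per call, whether the call succeeds or fails; the sink here is stdout, but any append-only target works:

```typescript
// One audit line per MCP call, matching the fields above.
interface AuditLine {
  ts: string;
  user: string;
  session: string;
  tool: string;
  params: Record<string, unknown>;
  response_status: "ok" | "error";
  response_size_bytes: number;
  duration_ms: number;
}

// Any append-only sink works; stdout shown for the sketch.
function emit(line: AuditLine): void {
  console.log(JSON.stringify(line));
}

// Wrap every tool handler so the line is written on success and on failure.
async function withAudit<T>(
  meta: Pick<AuditLine, "user" | "session" | "tool" | "params">,
  call: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    emit({
      ...meta,
      ts: new Date().toISOString(),
      response_status: "ok",
      response_size_bytes: new TextEncoder().encode(JSON.stringify(result)).length,
      duration_ms: Date.now() - start,
    });
    return result;
  } catch (err) {
    emit({
      ...meta,
      ts: new Date().toISOString(),
      response_status: "error",
      response_size_bytes: 0,
      duration_ms: Date.now() - start,
    });
    throw err;
  }
}
```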

This works when the graph is kept fresh by event subscription, and the audit log captures the agent’s reasoning trace. It breaks when the data sources fall behind during a major incident: the spike in CloudTrail events causes the graph indexer to back up, and the agent serves stale data exactly when it matters most.

The mitigation is to fail fast: when the graph cannot guarantee freshness, the MCP tool returns a staleness_warning field. The agent reads the warning and surfaces it to the engineer (“note: drift events may be 5+ minutes behind real-time”) instead of presenting stale data as fresh.
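A sketch of that fail-fast path, assuming a hypothetical indexerLagSeconds() helper that reads the indexer’s lag metric:

```typescript
// Hypothetical helpers: indexer lag from metrics, and the graph query itself.
declare function indexerLagSeconds(): Promise<number>;
declare function queryDriftEvents(scope: string, window: string): Promise<unknown[]>;

// If the indexer is more than two minutes behind, say so rather than
// presenting stale data as fresh. The budget is a tunable assumption.
const FRESHNESS_BUDGET_S = 120;

async function driftEventsWithFreshness(scope: string, window: string) {
  const [lag, events] = await Promise.all([indexerLagSeconds(), queryDriftEvents(scope, window)]);
  return {
    events,
    // Only present when the freshness guarantee is broken.
    ...(lag > FRESHNESS_BUDGET_S && {
      staleness_warning: `drift events may be ${Math.ceil(lag / 60)}+ minutes behind real-time`,
    }),
  };
}
```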

Where this stops working

Three honest failure modes.

The agent goes off-narrative. Claude is not deterministic. Two calls with the same input may produce slightly different syntheses. For triage this is usually fine (the underlying data is the same; the prose differs). For audit-grade incident reports it is not. You want the synthesis to be reproducible. The fix is to log the structured tool results AND the prose synthesis; the structured results are the audit trail; the prose is the convenience layer.

Ownership ambiguity. A resource with two valid owners (a shared service used by multiple teams) returns multiple owners from resource_ownership. The agent has to pick one or surface both. The right shape is “primary owner: X; consuming teams: Y, Z”. The graph should encode the distinction.

Policy graph staleness during high-change windows. A deploy storm during an incident overwhelms the indexer. The fix is the staleness warning above, plus capacity planning on the indexer for peak event volume.

| Failure mode | Symptom | Mitigation |
| --- | --- | --- |
| Non-deterministic synthesis | Two prose answers differ | Log structured + prose; rely on structured for audit |
| Multiple owners | Single-team query returns N | Graph distinguishes primary vs consumer |
| Stale graph | Drift events miss recent changes | Staleness warning; indexer capacity planning |

The 21-day rollout

Days 1-7: Build the ownership data. Walk the inventory of cloud resources. Confirm tag coverage on team / service / cost-center. Backfill the missing 20-30% via the deploy modules or the service registry. The output of week 1 is a clean ownership lookup for >95% of resources.

Days 7-14: Stand up the MCP server with three tools first: resource_ownership, drift_events, list_policies. This is the smallest viable triage MCP. Wire it to Claude Code or the on-call engineer’s chat client. Run it on three or four simulated incidents to validate that the answers match what a senior engineer would have produced manually.

Days 14-21: Add exception_status and violation_history. Wire to the policy engine and the exception registry. Run on real incidents for a week. Watch what the agent gets right and what it misses. Tune the prompt and the tool descriptions based on what the on-call team actually needs in the synthesis.

By day 21 the team has a triage MCP that answers the four-question opening of every incident in <2 minutes. The on-call rotation reports incidents resolving 30-50% faster, with most of the time saved going from “context-gathering” to “actual fix.”

The closing call

The first 20 minutes of every incident is the same four questions. The same four data sources. The same manual stitching done over and over by every engineer who has ever been on call. None of it requires AI. All of it is sped up by AI plus a graph plus four typed tools.

The pattern is not “AI for incident response.” It is “structure the triage data as a graph, expose it via typed MCP tools, let the agent compose the queries.” The agent is the low-cost, fast layer that handles the navigation. The engineer is the irreplaceable layer that decides the fix.

Pick the next incident. Run the four queries manually first; record how long they take. Stand up the three-tool MCP. Run the next incident through it. The 20 minutes of context-gathering becomes 90 seconds. The 35-minute incident becomes a 12-minute incident. The on-call rotation stops dreading 2:47 AM. That is what the work is for.

Written by Aryan Mehrotra, Engineer at Zop.Dev
