Read-Only AI Tools for Cloud Infrastructure: What MCP Servers Make Possible
The trust ceiling on AI in cloud automation is not capability. It is write access.
Most platform teams I talk to have built a Claude or GPT-backed assistant in some form. A cost-forensics Slack bot. A K8s Q&A wrapper. An “ask the cloud” terminal demo. Almost none of them are in production. The reason is consistent: the moment the AI gets the credentials to actually change something, the security team raises a hand, and the project dies.
The Model Context Protocol (MCP), released by Anthropic in November 2024, changed the routing question. With an MCP server, an AI agent calls structured, named tools instead of guessing at API shapes. When those tools are read-only, the agent gets full read-state access without ever holding a write credential. The IAM role on the MCP server is ReadOnlyAccess. The blast radius is bounded. The audit trail is clean. The rollback plan is “there is nothing to roll back.”
This post is the argument that read-only AI cloud tools are not a stepping stone to write-access AI. They are the production-shaped product. The work that actually moves the needle in cloud operations, cost forensics, and security review is interpretive, not mutative. Read-only is the ceiling, not the floor.
The trust ceiling that froze AI cloud work
When a security review asks “what can this thing do if it goes wrong,” the honest answer for a write-capable AI cloud tool is “anything the IAM role allows.” That answer kills the project. So teams either over-scope the role into uselessness or pause indefinitely.
The trust math gets worse the more capable the model. A weaker model with ec2:TerminateInstances is a contained risk. A strong model with ec2:TerminateInstances is a strong agent that could, on a mistaken interpretation of a user’s question, terminate a production fleet. Strength makes the policy harder to write, not easier.
Read-only inverts the math. The role is a single AWS managed policy. The blast radius is “what could be inferred from internally readable cloud state.” The audit trail is whatever CloudTrail captures plus the MCP server’s own tool-call log. RBAC reduces to “use the AI’s IAM role, period.” There is no new authorization system to design.
| | Write-capable AI | Read-only AI | Human in the loop |
|---|---|---|---|
| Blast radius | full IAM scope | read-state only | bounded by reviewer |
| IAM role complexity | per-action scoped | ReadOnlyAccess | n/a |
| Audit story | every action + intent | every read + intent | every action + intent + reviewer |
| Rollback plan | required and difficult | not needed | required |
| Production-ready timeline | quarters | sprints | already shipping |
The teams I have seen actually ship AI cloud tooling all start in the read-only column. The teams in the write-capable column are still in proof-of-concept after twelve months.
What an MCP server actually exposes
The protocol is small. An MCP server speaks JSON-RPC 2.0 over stdio or HTTP and exposes three things: tools, resources, and prompts. For cloud infrastructure work, tools are what matter.
A typical AWS MCP server exposes a tool surface like this:
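A sketch of that surface as a plain tool registry. The tool names are the ones used throughout this post; the dict layout is illustrative, not the MCP wire format, which carries full JSON Schema per tool.

```python
# Illustrative read-only tool registry. Tool names match those used in this
# post; the schema shape is a sketch, not the MCP wire format.
TOOLS = {
    "aws:get_cost_and_usage": {
        "params": {"time_range": "required", "granularity": "required",
                   "dimensions": "optional"},
    },
    "aws:describe_vpcs":         {"params": {}},
    "aws:describe_nat_gateways": {"params": {"filters": "optional"}},
    "aws:describe_db_instances": {"params": {"filters": "optional"}},
    "aws:list_roles":            {"params": {}},
    "aws:get_role_policy": {
        "params": {"role_name": "required", "policy_name": "required"},
    },
}

# Every tool maps to a read-only API call. There is no
# aws:terminate_instances in the surface, so the agent cannot call one.
```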
Each tool has a typed schema. aws:get_cost_and_usage takes a time range, a granularity, and optional dimensions. The agent does not invent the API shape. It calls the tool with structured arguments and gets a structured response back. If the tool is not in the registered surface, the agent cannot call it.
The K8s MCP server exposes the same pattern: k8s:list_pods, k8s:describe_node, k8s:get_logs, k8s:top_pods. The Terraform MCP server exposes terraform:plan_summary, terraform:state_list, terraform:show_resource. None of these tools mutate state. The MCP server’s authority is bounded by the IAM role, the kubeconfig, or the Terraform Cloud API token it was given. When that token is read-only, every tool call is read-only by construction.
The schema is what makes the AI useful. When Claude knows that aws:get_cost_and_usage exists and takes a time range, Claude does not hallucinate an aws:get_billing_summary that doesn’t exist. The protocol enforces calling discipline that prompt-engineering alone cannot.
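That calling discipline is mechanical, not model-dependent. A minimal dispatcher sketch (the registry and error handling here are illustrative, not from any particular MCP SDK):

```python
def call_tool(name, args):
    """Dispatch a tool call against the registered surface.

    Unknown tools and missing required arguments are rejected before any
    cloud API is touched -- a hallucinated tool name cannot reach AWS.
    """
    # Inline registry for the sketch; a real server registers these at startup.
    registry = {
        "aws:get_cost_and_usage": {"required": {"time_range", "granularity"}},
        "aws:describe_nat_gateways": {"required": set()},
    }
    if name not in registry:
        raise KeyError(f"tool not registered: {name}")
    missing = registry[name]["required"] - set(args)
    if missing:
        raise ValueError(f"missing required arguments: {sorted(missing)}")
    return {"tool": name, "args": args}  # stub: a real server invokes the AWS SDK here
```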
The read-only use cases that actually win
Four use cases consume most of the value. All of them are read-only by nature.
Cost forensics. “Why is our compute bill 40% higher this month?” The dashboard answer requires clicking through Cost Explorer, filtering by service, comparing month-over-month, drilling into instance types, cross-referencing with tag groups. A platform engineer doing this carefully takes 30 to 45 minutes per question. The AI calls aws:get_cost_and_usage with month-over-month grouping, sees that NAT gateway charges jumped 4x, calls aws:describe_vpcs and aws:describe_nat_gateways, correlates with a CloudTrail event, and returns the answer in under 60 seconds. It’s not just faster. It’s a question that gets asked once instead of never.
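The core of that correlation step is a month-over-month delta rank. A sketch, assuming cost data has already been fetched and grouped by service (the dict shape is illustrative, not the raw Cost Explorer response):

```python
def top_cost_increases(prev, curr, n=3):
    """Rank cost groups by month-over-month spend increase.

    prev/curr: {service_or_usage_type: monthly_cost_usd}, as produced by a
    cost query grouped by service. Shape assumed for illustration.
    """
    deltas = {k: curr.get(k, 0.0) - prev.get(k, 0.0) for k in set(prev) | set(curr)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:n]

prev = {"EC2-Instances": 12000.0, "NatGateway": 800.0, "S3": 400.0}
curr = {"EC2-Instances": 12400.0, "NatGateway": 3200.0, "S3": 410.0}

# The 4x NAT gateway jump from the example above surfaces immediately.
print(top_cost_increases(prev, curr, n=1))  # → [('NatGateway', 2400.0)]
```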
Drift detection. Terraform state and live cloud state diverge constantly. Someone clicked through the console in an outage. An auto-scaling group resized. A tag was added manually. The AI calls terraform:state_list and aws:describe_resources, diffs them, and reports. Tools like driftctl already do this; what the AI adds is the natural-language summary for the humans who will not read a JSON diff. Read-only is sufficient for 90% of drift reporting.
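The diff itself is a set comparison. A sketch, assuming Terraform addresses have already been mapped to the corresponding live resource identifiers (a real implementation has to do that mapping; it is elided here):

```python
def drift_report(tf_state_ids, live_ids):
    """Compare Terraform-managed resource IDs against live cloud state.

    tf_state_ids: from a terraform state listing; live_ids: from describe
    calls, mapped to the same identifier space (assumed for illustration).
    """
    tf, live = set(tf_state_ids), set(live_ids)
    return {
        "unmanaged": sorted(live - tf),  # created in the console, never imported
        "missing":   sorted(tf - live),  # in state but deleted out-of-band
    }

report = drift_report(
    ["aws_instance.web", "aws_sqs_queue.jobs"],
    ["aws_instance.web", "aws_instance.debug-box"],
)
```

The AI's contribution is the paragraph it writes around this dict, not the dict itself.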
Security posture review. “Which IAM roles in this account are over-permissioned for what they actually do?” An average mid-size cloud account has 150 to 300 IAM roles. Manually auditing them takes a security engineer ten to fourteen days of focused time. With an IAM MCP tool plus CloudTrail access patterns, the AI can flag candidates in hours: roles with *:* wildcards, roles attached to identities that have not used 80% of the actions granted, roles created by individuals who have left. None of this needs write access. It needs structured read access plus reasoning.
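The flagging heuristics described above reduce to a small predicate. A sketch with illustrative thresholds (the 80%-unused cutoff comes from the text; the input shapes are assumptions):

```python
def flag_role(role_name, granted_actions, used_actions):
    """Flag an IAM role as an over-permission candidate.

    granted_actions: from the role's attached policies;
    used_actions: from CloudTrail access patterns. Returns reasons, or [].
    """
    reasons = []
    if any(a == "*" or a.endswith(":*") for a in granted_actions):
        reasons.append("wildcard action grant")
    unused = set(granted_actions) - set(used_actions)
    if granted_actions and len(unused) / len(granted_actions) >= 0.8:
        reasons.append(f"{len(unused)}/{len(granted_actions)} granted actions never used")
    return reasons

flags = flag_role(
    "legacy-deployer",  # hypothetical role name
    ["s3:*", "ec2:DescribeInstances", "iam:PassRole",
     "lambda:InvokeFunction", "sqs:SendMessage"],
    ["ec2:DescribeInstances"],
)
```

The reasoning the AI adds sits on top: which flagged roles matter, and why.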
Capacity and configuration Q&A. “Show me all RDS instances without backups enabled in production.” “Which security groups have 0.0.0.0/0 ingress on non-443 ports?” “What’s the failure rate of our pods over the last hour?” These questions get asked weekly by people who do not have the runbook in their head. They are interpretive. They are read-only. They do not need to be solved by giving the AI write access to fix them; they need to be solved by surfacing the answer when the question is asked.
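Each of these questions is a filter over describe output. One sketch for the RDS question, with a simplified instance shape (in real AWS output, BackupRetentionPeriod of 0 does mean automated backups are off; the flat Environment tag here is an illustrative simplification):

```python
def rds_without_backups(instances, env="production"):
    """Which RDS instances in the given environment have backups disabled?

    instances: simplified describe_db_instances output. A retention period
    of 0 means automated backups are off.
    """
    return [
        db["DBInstanceIdentifier"]
        for db in instances
        if db.get("Environment") == env and db.get("BackupRetentionPeriod", 0) == 0
    ]

prod_dbs = [
    {"DBInstanceIdentifier": "orders-db",  "Environment": "production", "BackupRetentionPeriod": 0},
    {"DBInstanceIdentifier": "users-db",   "Environment": "production", "BackupRetentionPeriod": 7},
    {"DBInstanceIdentifier": "scratch-db", "Environment": "staging",    "BackupRetentionPeriod": 0},
]
```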
| Use case | Question | Tools called | Dashboard time | AI time |
|---|---|---|---|---|
| Cost forensics | Why did compute spend jump 40%? | aws:get_cost_and_usage, aws:describe_nat_gateways | 30 to 45 min | under 60 sec |
| Drift detection | What’s drifted from Terraform? | terraform:state_list, aws:describe_resources | 1 to 2 hr | 2 to 5 min |
| IAM audit | Which roles are over-permissioned? | aws:list_roles, aws:get_role_policy, CloudTrail patterns | 10 to 14 days | 1 to 4 hr |
| Config Q&A | Which RDS lacks backups? | aws:describe_db_instances | 5 to 10 min | under 30 sec |
The dashboard times are not wrong. They are the cost of asking the question carefully today. AI does not replace the dashboard. It replaces the threshold at which someone bothers to ask.
What read-only does not give you
Be honest about the boundaries.
You do not get auto-remediation. The AI cannot stop a runaway Lambda or rotate a leaked credential. Those are write-capable workflows and they belong with closed-loop autonomous-security automation where the policy and rollback paths are explicit.
You do not get live policy enforcement at admission time. A read-only MCP can answer “would this Terraform plan violate policy 47?” but only if you call it before applying. It is advisory, not blocking. Blocking enforcement belongs in admission controllers or OPA Gatekeeper running inline.
You do not get cross-account write coordination. If you need to assume a role in account A, modify a resource in account B, and roll back if a check fails, that is a workflow-engine question and the AI is at most the orchestration brain. Read-only AI does not replace Step Functions or Argo Workflows.
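The advisory-but-not-blocking boundary from the policy point above is worth making concrete. A sketch: the check returns findings as text, and nothing in it can stop an apply (policy id and shapes are hypothetical):

```python
def advisory_check(plan_resources, policy):
    """Advisory-only policy check over a Terraform plan summary.

    Returns violations as strings. Nothing here blocks an apply -- that is
    the read-only boundary; blocking belongs to an inline admission layer.
    """
    violations = []
    for r in plan_resources:
        if r["type"] in policy["forbidden_types"]:
            violations.append(
                f'{r["address"]}: {r["type"]} is forbidden by {policy["id"]}'
            )
    return violations

# Hypothetical policy and plan summary for illustration.
policy = {"id": "policy-47", "forbidden_types": {"aws_nat_gateway"}}
plan = [{"address": "module.net.aws_nat_gateway.egress", "type": "aws_nat_gateway"}]
```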
That is fine. Read-only AI is the production-first wave because it is the wave teams can actually ship without a six-month security review. The audit story for read-only tools maps directly onto existing cloud infrastructure compliance controls: every tool call is logged, every IAM action traces to the MCP role, every output is reproducible from the log. The next wave layers narrowly-scoped write tools on top, gated behind explicit human approval, with the same audit and rollback machinery that any production change uses. Skipping straight to “AI with full IAM” is what kills projects.
Shipping a read-only MCP server this quarter
The architecture is small and the failure modes are well-understood. Five components.
The IAM role is the security boundary. Use the AWS managed ReadOnlyAccess policy as the starting point and trim down to the services your tool actually needs. Attach the role to the compute that runs the MCP server, not to the AI client.
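A sketch of what the trimmed policy ends up looking like, assuming the tool surface from earlier in this post; the action names are real AWS actions, but which ones you keep depends on your tools:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScopedReadOnly",
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ec2:Describe*",
        "rds:Describe*",
        "iam:ListRoles",
        "iam:GetRolePolicy",
        "cloudtrail:LookupEvents"
      ],
      "Resource": "*"
    }
  ]
}
```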
The MCP server runs in a private subnet with no public ingress. Use OAuth2 or signed JWTs from your existing identity provider for client authentication. The MCP server should know which user is asking; the audit log needs that.
The audit log captures every tool call, the caller identity, the structured arguments, and a hash of the response. This is what makes the system reviewable. When someone asks “what did the AI see about our IAM roles last week,” the answer is in this log, not in chat history.
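A minimal sketch of one log record, storing a hash of the response rather than the body so the log stays small but any replayed read can be verified against it (field names are illustrative):

```python
import hashlib
import json
import time

def audit_entry(caller, tool, args, response):
    """Build one audit-log record for a tool call.

    The response is hashed over its canonical JSON form, so the same read
    replayed later produces the same digest.
    """
    return {
        "ts": time.time(),
        "caller": caller,                 # identity from the auth layer
        "tool": tool,
        "args": args,                     # structured arguments, logged verbatim
        "response_sha256": hashlib.sha256(
            json.dumps(response, sort_keys=True).encode()
        ).hexdigest(),
    }
```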
The natural-language interface can be Slack, a CLI, an IDE plugin, or a web UI. The MCP protocol does not care. What matters is that the AI client only knows about the tools the MCP server advertises. No MCP, no tool. The client is bounded by the server’s surface.
The first deliverable is a single tool call answering a single question. “Show me yesterday’s AWS spend by service, top 10.” Wire that. Get the audit log working. Show it to the security team. Then add the next tool. The whole thing is a read-only governance loop, and the loop closes when humans ask better questions because the answers are now low-friction.
What composes on top: policy-aware governance
A read-only MCP server tells you what is. It does not tell you what should be. The S3 bucket is public. Is that a violation? The IAM role has a wildcard. Is that allowed for this workload? The Terraform plan adds a NAT gateway. Does that fit the cost budget for this team?
Those questions need a second layer. Policy state. Tag schema. Cost budgets. Compliance requirements. They live outside the cloud APIs and they are what makes a governance system aware of intent rather than just state.
That is the next post in this series. ZopNight + Claude via MCP shows what happens when the AI sees both the cloud state and the policy state in the same context window, and what changes when “show me the public buckets” becomes “show me the public buckets that violate our data-residency policy because they hold PII tags.”
The composition is where AI cloud governance gets real. Read-only state from the cloud. Read-only policy from the governance layer. Reasoning on top. None of it needs write access to be the most useful tool the platform team has.