Live Kubernetes Visibility: 21 Resource Pages and the Crashloop Overview

By Muskan Bandta
Published: May 11, 2026 · 11 min read

A 500-pod cluster has one pod that restarted three times in the last 10 minutes. The operator on call does not know which pod. kubectl get pods -A returns 500 lines of Running and a handful of CrashLoopBackOff interleaved through them. Finding the failing pod is a grep exercise. Understanding why it failed is a describe exercise followed by a logs --previous exercise. The pod is back to Running by the time the operator finishes scrolling.

The flat list is the wrong starting point. The question is not “list everything in the cluster”; the question is “what’s wrong right now and why.” ZopNight reorganises cluster visibility around that question. The Overview page surfaces the failure states as the landing tile row. The 21 typed resource pages give each Kubernetes resource kind its own grouping and filtering surface. Detail drawers are built per type so a Pod drawer shows what matters for a Pod and a Deployment drawer shows what matters for a Deployment.

This post walks through what the 21 pages cover, why the Overview is the right entry point, how the per-type drawers reduce the multi-command debugging dance, and how the same view ships to three audiences (customer cluster page, ZopDay post-provision handoff, and ZopDev internal admin) so support sees what the customer sees.

Why kubectl get pods stops working past 100 pods

A flat list of resources scales linearly with cluster size. The operator’s attention does not. At 50 pods the list is readable and the failures stand out. At 500 pods the failures are needles in a haystack of running rows. At 5,000 pods the list is unusable without grep.

| Cluster size | Flat-list view | Typed pages + Overview |
| --- | --- | --- |
| 50 pods, 2 failing | Failures visible at a glance | Failures visible as tile count |
| 500 pods, 2 failing | Failures buried in 498 running rows | Failures one click from the landing tile |
| 5,000 pods, 2 failing | grep exercise across multiple namespaces | Same: tile count + drill-down |
| 5,000 pods, 50 failing | Multi-grep, multi-describe, hours | Tile count, list, inline events, minutes |

The problem is not that kubectl is slow. The problem is that “list everything” is the wrong question. The operator’s actual question is “is anything broken,” and once the answer is yes, “what is it and why.” A view that opens with the failure summary answers the first question in a glance and the second one in a click.

The Live Kubernetes view does this by reversing the layout. The landing surface is the Overview, not a resource list. The list views exist (21 of them, one per resource kind), but they are not where you start.

The Overview page: “what’s wrong” as the landing surface

The Overview page is the cluster’s front door. Six tiles run across the top, one per pod state that an operator cares about.

| Tile | What it counts | When it matters |
| --- | --- | --- |
| Running | Pods in Running phase, all containers ready | Sanity check — most pods should be here |
| Healthy | Running pods that have passed all readiness probes recently | Catches “running but not actually serving” |
| Failed | Pods in Failed phase (non-zero exit, not restarted) | Job and one-shot pod failures |
| CrashLoopBackOff | Pods with container restart loop, backoff active | The classic “something is wrong” signal |
| OOMKilled | Pods killed by the kernel for exceeding memory limit | Memory misconfiguration or leak |
| Pending | Pods stuck without a node assignment | Scheduling, quota, or resource pressure |

Each tile is a count and a click. Clicking a non-zero failure tile drills into the list of pods in that state. The drill-down is the differentiator: each row in the list already carries the most-recent warning event that explains the state. A CrashLoopBackOff row shows the exit code, the restart count, and the last container message. An OOMKilled row shows the configured memory limit and the actual peak usage at the moment of the kill. A Pending row shows the scheduler’s last warning event (Insufficient cpu, no nodes available, node selector mismatch).
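The tile row is, at its core, a bucketing pass over pod status. A minimal sketch of that pass, using hypothetical simplified pod records (real data would come from the Kubernetes API: `pod.status.phase` plus container waiting/terminated reasons):

```python
from collections import Counter

def tile_counts(pods):
    """Bucket pods into the six Overview tile states.

    Each pod is a simplified dict; `reason` stands in for the container
    state reason the API reports (CrashLoopBackOff, OOMKilled, ...).
    """
    counts = Counter()
    for pod in pods:
        reason = pod.get("reason")
        phase = pod.get("phase")
        if reason in ("CrashLoopBackOff", "OOMKilled"):
            counts[reason] += 1          # failure reasons win over phase
        elif phase == "Failed":
            counts["Failed"] += 1
        elif phase == "Pending":
            counts["Pending"] += 1
        elif phase == "Running":
            counts["Running"] += 1
            if pod.get("ready", False):
                counts["Healthy"] += 1   # running AND passing readiness

    return counts

pods = [
    {"phase": "Running", "ready": True},
    {"phase": "Running", "ready": False},
    {"phase": "Running", "reason": "CrashLoopBackOff"},
    {"phase": "Pending"},
]
counts = tile_counts(pods)
# Running: 2, Healthy: 1, CrashLoopBackOff: 1, Pending: 1
```

Note the Running/Healthy split in the sketch: a pod can count as Running while failing readiness, which is exactly the “running but not actually serving” case the Healthy tile exists to catch.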

Diagram 1

The enrichment matters because the operator’s next step after seeing a CrashLoopBackOff count is normally kubectl describe pod <name> followed by kubectl logs --previous <name> <container>. Both commands extract information that already lives in the API server’s event stream and the kubelet’s last-known container state. ZopNight reads those once on render and shows them inline. The two-command debug dance collapses into a single click on the row.
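The fields that describe-plus-previous-logs would surface already sit in the container status object. A sketch of the extraction, with a dict whose shape mirrors the API’s `containerStatuses` entries (`restartCount`, `lastState.terminated`); the sample pod data is invented:

```python
def enrich_crashloop_row(container_status):
    """Pull the inline-row fields from one container status.

    `lastState.terminated` holds the previous container instance's
    exit code and message -- the same data `kubectl describe` and
    `kubectl logs --previous` dig out.
    """
    terminated = container_status.get("lastState", {}).get("terminated", {})
    return {
        "restartCount": container_status.get("restartCount", 0),
        "exitCode": terminated.get("exitCode"),
        "lastMessage": terminated.get("message"),
    }

status = {
    "restartCount": 3,
    "lastState": {
        "terminated": {"exitCode": 1, "message": "config file not found"},
    },
}
row = enrich_crashloop_row(status)
# row -> {'restartCount': 3, 'exitCode': 1, 'lastMessage': 'config file not found'}
```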

21 typed pages, one per resource kind

Outside the Overview, the Live Kubernetes view is 21 typed pages. Each page lists resources of one kind with type-aware filters and columns. The pages group naturally into six categories.

| Category | Pages |
| --- | --- |
| Workload | Pods, Deployments, DaemonSets, StatefulSets, ReplicaSets, Jobs |
| Config | ConfigMaps, Secrets, Resource Quotas, HPA / KEDA Scalers |
| Network | Services, Ingresses, Endpoints, Network Policies |
| Storage | PVCs, PVs, Storage Classes |
| Identity | Service Accounts |
| Cluster | CRDs, Nodes, Namespaces, Events |

The workload category is the largest because workloads are where most failures show up. Pods are the leaves; the four parents (Deployment, DaemonSet, StatefulSet, Job) own them; ReplicaSets sit between Deployments and Pods as the unit a rollout actually moves through. Putting each kind on its own page means the filters can be type-specific (by namespace, by owner, by phase for Pods; by replica count, by strategy, by paused for Deployments).

The config category groups the things workloads consume. ConfigMaps and Secrets are the configuration carriers. Resource Quotas are the namespace-level caps. HPA and KEDA Scalers are the autoscaling controllers. Each has its own page because each has its own debugging questions.

The network category is the four primitives that make traffic flow. Services define the virtual IPs and the selector that picks backing pods. Endpoints are the resolved selector at any moment — the actual backing pod IPs. When traffic isn’t flowing, the Service page tells you whether the selector is right and the Endpoints page tells you whether anything matches. Ingresses and Network Policies sit at the edges (north-south traffic and east-west traffic respectively).
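The Service-to-Endpoints question reduces to a label-subset test: a pod backs a Service when its labels contain every key/value in the selector. A minimal sketch of that test with hypothetical pod data:

```python
def matching_pods(selector, pods):
    """Return names of pods whose labels satisfy every selector entry.

    This is the subset match the endpoints controller performs when it
    resolves a Service selector into backing pod IPs.
    """
    return [
        p["name"]
        for p in pods
        if all(p.get("labels", {}).get(k) == v for k, v in selector.items())
    ]

pods = [
    {"name": "api-1", "labels": {"app": "api", "tier": "backend"}},
    {"name": "api-2", "labels": {"app": "api"}},          # missing tier label
    {"name": "web-1", "labels": {"app": "web"}},
]
selector = {"app": "api", "tier": "backend"}
matched = matching_pods(selector, pods)   # only api-1 carries both labels
```

An empty result here is exactly the “Endpoints page shows nothing matches” situation: the selector may be right and the pods mislabelled, or vice versa.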

Storage, identity, and cluster pages round out the surface. Storage answers “where does this pod’s data live, is the claim bound, is the underlying volume healthy.” Service Accounts answer “what identity does this pod run as.” Cluster pages (CRDs, Nodes, Namespaces, Events) answer the cluster-wide questions.

The Events page deserves a callout. Every warning event in the cluster lands on it in reverse chronological order. The Overview tiles enrich individual pod rows with their most-recent event; the Events page is the unfiltered firehose for the operator who wants to see what’s changing across the whole cluster in a window.
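The Events page ordering is a filter-and-sort: keep Warning events, newest first. A sketch with event dicts loosely following the Kubernetes Event object (`type`, `lastTimestamp`, `reason`); the sample data is invented:

```python
def warning_feed(events):
    """Warning events in reverse chronological order (newest first)."""
    warnings = [e for e in events if e["type"] == "Warning"]
    # ISO-8601 timestamps in the same zone sort correctly as strings.
    return sorted(warnings, key=lambda e: e["lastTimestamp"], reverse=True)

events = [
    {"type": "Normal",  "lastTimestamp": "2026-05-11T09:00:00Z", "reason": "Pulled"},
    {"type": "Warning", "lastTimestamp": "2026-05-11T09:05:00Z", "reason": "BackOff"},
    {"type": "Warning", "lastTimestamp": "2026-05-11T09:02:00Z", "reason": "FailedScheduling"},
]
feed = [e["reason"] for e in warning_feed(events)]
# feed -> ['BackOff', 'FailedScheduling']
```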

Detail drawers built per type

Clicking a row in any of the 21 pages opens a drawer on the right side of the screen. The drawer is built per resource type because what matters for a Pod is not what matters for a Service.

| Resource type | Drawer surfaces |
| --- | --- |
| Pod | Container statuses (per container), init container outcomes, mounted ConfigMaps and Secrets, mounted PVCs, recent events, live tailing logs per container |
| Deployment | Replica history (current vs desired), rollout strategy + paused state, backing ReplicaSets (current and previous), pods owned by each ReplicaSet |
| Service | Selector definition, resolved Endpoints (actual backing pod IPs), reachability test from a probe pod, traffic policy, session affinity |
| HPA / KEDA | Current vs target metric value, current vs min/max replica count, last scale event, trigger configuration |

The Pod drawer is the most commonly used because most debugging starts with a pod. Live logs are a particularly load-bearing feature: instead of kubectl logs -f, the operator opens the drawer and the logs tail in place. Switching containers (for multi-container pods) is a tab click. Switching from “current” to “previous” container instance (the equivalent of --previous) is a toggle.

The Deployment drawer is built around rollout history because that is the question most operators have when they open it. Was the last rollout successful? Is it paused? Did it produce a healthy ReplicaSet or did the new pods fail and the old ReplicaSet is still serving traffic? The drawer answers all three without leaving the page.

The Service drawer is built around reachability because that is the question Service debugging always reduces to: are there any backing pods, do they match the selector, can traffic actually reach them. The drawer shows the selector, the resolved Endpoints, and (where possible) a synthetic probe result.

The HPA / KEDA drawer is built around the scale decision because the question is always “why is this not scaling the way I expected.” Current vs target metric, current vs allowed replica bounds, last scale event with reason — the three numbers that explain any autoscaler behaviour.
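Those three numbers drive one formula. The core HPA scale decision, per the Kubernetes autoscaling algorithm, is desired = ceil(current × currentMetric ÷ targetMetric), clamped to the min/max bounds — and the clamp is the usual answer to “why is this not scaling”:

```python
import math

def desired_replicas(current, current_metric, target_metric, min_r, max_r):
    """Kubernetes HPA core formula:
    desired = ceil(current * currentMetric / targetMetric), clamped to bounds.
    """
    desired = math.ceil(current * current_metric / target_metric)
    return max(min_r, min(max_r, desired))

# 4 replicas at 90% average CPU against a 60% target -> scale up to 6.
up = desired_replicas(4, 90, 60, min_r=2, max_r=10)      # 6
# Metric still over target but already at maxReplicas: the clamp holds at 10.
capped = desired_replicas(10, 120, 60, min_r=2, max_r=10)  # 10
```

In the capped case the drawer’s three numbers tell the whole story: current metric over target, current replicas equal to the max bound, last scale event explaining that the bound was hit.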

The same view in three places: cluster page, ZopDay handoff, internal-admin

The Live Kubernetes view ships to three surfaces in ZopNight, and they are the same view rather than three variants.

Diagram 2

The customer cluster page is the default surface. A ZopNight user with cluster access lands on the Overview when they open their cluster.

The ZopDay handoff matters because ZopDay (the cluster provisioning wizard for EKS, GKE, and AKS) drops the operator straight into the Live Kubernetes view after the cluster comes up. There is no second product to learn after provisioning. The 21 pages are right there with whatever the cluster booted with (kube-system pods, default service accounts, the operator’s first workload if they deployed one).

Internal-admin parity is the more subtle one. ZopDev’s support team uses an internal admin surface for debugging customer clusters during incidents. That surface shows the same Live Kubernetes view the customer sees. There is no information asymmetry: when a support engineer asks “what’s the state of your pod,” they are looking at the same screen the customer is. The debug loop becomes “I see what you see, here’s what I think is wrong” instead of the customer screenshotting their dashboard for the support engineer.

Crashloop and OOMKilled with the event already attached

The most valuable enrichment in the Overview is on the two failure modes that account for the majority of pod incidents: CrashLoopBackOff and OOMKilled. Both have a root cause that lives in two or three places (container exit code, last container log, kernel event, configured limit, actual usage), and both have been hidden behind multi-command debugging dances since Kubernetes shipped.

| Failure reason | What the Overview row surfaces inline |
| --- | --- |
| CrashLoopBackOff | Exit code, restart count, backoff duration, last container message (from terminated container state) |
| OOMKilled | Memory limit, peak memory usage at kill, last 5 lines of stdout/stderr before kill |
| Pending — Insufficient cpu | Requested CPU vs cluster-available CPU, number of nodes evaluated at the requested CPU |
| Pending — node selector mismatch | The selector that didn’t match, the nodes that were considered |
| ImagePullBackOff | Image name, registry response, last pull attempt time |

The CrashLoopBackOff enrichment alone saves a measurable amount of time. The exit code answers “did the container crash, or did it exit cleanly and get restart-looped anyway.” The last container message answers “what was the last thing it printed before it died.” Without the inline render, both pieces of information require kubectl describe pod plus kubectl logs --previous. With the inline render, both are visible in the list view.

The OOMKilled enrichment is similar. The memory limit and peak usage tell the operator immediately whether the limit was set too low or whether the workload is actually leaking memory. A peak of 480 MiB against a 512 MiB limit is “raise the limit.” A peak that climbs every restart is “fix the leak.” Both decisions are visible without opening a single drawer.
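That “raise the limit vs fix the leak” read can be sketched as a simple heuristic over the per-kill peaks the row surfaces. The thresholds and function here are illustrative, not ZopNight’s actual rule:

```python
def oom_verdict(limit_mib, peaks_mib):
    """Classify an OOMKilled pattern from peak memory at each kill.

    peaks_mib: peak usage (MiB) at each kill, oldest first.
    Strictly climbing peaks across several restarts suggest a leak;
    a single peak near the limit suggests the limit is just too low.
    """
    climbing = all(b > a for a, b in zip(peaks_mib, peaks_mib[1:]))
    if len(peaks_mib) >= 3 and climbing:
        return "fix the leak"
    if peaks_mib[-1] >= 0.9 * limit_mib:   # illustrative 90% threshold
        return "raise the limit"
    return "investigate"

# Peak of 480 MiB against a 512 MiB limit -> the limit is simply too tight.
tight = oom_verdict(512, [480])             # "raise the limit"
# Peaks climbing every restart -> usage grows without bound.
leak = oom_verdict(512, [300, 380, 460])    # "fix the leak"
```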

How to use it day to day

The workflow is short and lands the operator on a decision in under a minute.

| Step | Action | Where |
| --- | --- | --- |
| 1 | Open the cluster | Sidebar → Clusters → pick one |
| 2 | Scan the failure tiles | Overview top row |
| 3 | If any failure tile is non-zero, click it | Tile drills into a filtered list |
| 4 | Read the inline event per row | List view |
| 5 | If the event explains it, fix the workload | Apply a manifest change or use auto-remediation |
| 6 | If not, click the row, open the drawer, read logs | Drawer right side |
| 7 | Choose: manual fix, auto-remediate, or escalate | Decision based on logs + events |

For day-to-day operation, most operators land on the Overview, see green tiles across the failure row, and close the tab. The view’s job is to be boring most of the time. When it’s not boring, the failure tile count is the only signal needed; everything else is one click away.

The 21 typed pages get used differently. The Pods page is the most common deep-dive surface for incident response. The Deployments page is the most common for rollout debugging. The Services and Endpoints pages are the most common for “why is traffic not reaching this.” The Events page is the most common for cluster-wide situational awareness.

The view does not replace kubectl for everything. Custom resources installed by operators (Argo Rollouts, Istio VirtualServices, Cert-Manager Certificates) live on the CRDs page but the drawer is generic; per-CRD detail drawers are on the roadmap. Live exec into a container is not in the drawer yet (logs are live; shell access requires kubectl exec for now). Historical replay (“show me what this pod looked like at 03:00 last Tuesday when it last restarted”) is a future direction.

What ZopNight ships is the view that answers the question the operator actually has when they open their cluster on a Monday morning. Not “list everything,” but “is anything broken, and if so, why.” The Overview answers the first half in a tile glance. The 21 typed pages and the per-type drawers answer the second half in one or two clicks. The same view ships to the customer, to ZopDay, and to internal-admin, so the answer is the same no matter who is looking. That is the job the view is built to do.

Written by Muskan Bandta, Engineer at Zop.Dev
