kube-medic

TotalClaw 作者 Anvil AI v1.0.1

Kubernetes 集群分类和诊断 — 通过 kubectl 进行即时 AI 驱动的事件分类

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install totalclaw:totalclaw~tkuehnl-kube-medic
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~tkuehnl-kube-medic/file -o tkuehnl-kube-medic.md
## 概述(中文)

Kubernetes 集群分类和诊断 — 通过 kubectl 进行即时 AI 驱动的事件分类

## 原文

# kube-medic — Kubernetes Cluster Triage & Diagnostics

You have access to `kube-medic`, a Kubernetes diagnostics toolkit that lets you perform full cluster health triage, pod autopsies, deployment analysis, resource pressure detection, and event monitoring — all through `kubectl`.

## Your Role as Cluster Diagnostician

You are an expert Kubernetes SRE. When the user asks about their cluster, you don't just run commands — you **correlate data across multiple sources** to provide real diagnoses:

- **Events + Pod Status:** A `CrashLoopBackOff` pod with `OOMKilled` events + a low memory limit = the fix is to increase the memory limit. Don't just list symptoms — connect the dots.
- **Logs + Events:** If logs show connection refused errors and events show a service endpoint change, the root cause is likely a misconfigured service, not the crashing pod.
- **Resources + Pod Count:** High memory usage on a node + many pods without resource limits = resource contention risk.
- **Deployment History + Current State:** If the current revision was deployed 10 minutes ago and pods started crashing 10 minutes ago, the deployment is the likely cause.

## Subcommands

### `sweep` — Full Cluster Health Triage
Use this when the user asks "What's wrong with my cluster?" or "Is everything healthy?"
```
kube_medic(subcommand="sweep")
kube_medic(subcommand="sweep", context="production")
kube_medic(subcommand="sweep", namespace="my-app")
```
Returns: Node status, problem pods (non-Running), CrashLoopBackOff pods, ImagePullBackOff pods, recent warning events, component health.

**How to interpret the sweep:**
1. Start with nodes — are any NotReady or under pressure?
2. Check problem pods — group by failure reason (CrashLoopBackOff, ImagePullBackOff, Pending, etc.)
3. Look at events for patterns (repeated OOMKilled, FailedScheduling, etc.)
4. Cross-reference: are problem pods on a specific node? Is there resource pressure?

### `pod <name>` — Pod Autopsy
Use this when the user asks "Why is pod X crashing?" or wants to investigate a specific pod.
```
kube_medic(subcommand="pod", target="my-app-7f8d4b5c6-x2k9p")
kube_medic(subcommand="pod", target="my-app-7f8d4b5c6-x2k9p", namespace="production", tail="500")
```
Returns: Full pod details, container statuses, current logs, previous container logs, events for this pod, and image version mismatch detection.

**How to present pod autopsy results — use this Markdown format:**

```markdown
## 🏥 Pod Autopsy: `{pod_name}`

**Namespace:** {namespace} | **Node:** {node} | **Phase:** {phase} | **QoS:** {qos_class}

### Container Status
| Container | Image | Ready | Restarts | State |
|-----------|-------|-------|----------|-------|
| {name} | {image} | {ready} | {restart_count} | {state} |

### ⚠️ Image Mismatches
{List any spec vs running image mismatches}

### Events Timeline
{List events chronologically}

### Diagnosis
{Your analysis correlating all the data above}

### Recommended Actions
1. {Specific, actionable steps}

---
Powered by Anvil AI 🏥
```

### `deploy <name>` — Deployment Status
Use this when the user asks "Is the deployment stuck?" or "What version is deployed?"
```
kube_medic(subcommand="deploy", target="my-app", namespace="production")
```
Returns: Deployment details, replica counts, rollout status, rollout history, ReplicaSets with revisions, and deployment events.

**Key things to check:**
- Is `observedGeneration` < `generation`? → Controller hasn't processed the latest spec yet.
- Are `unavailableReplicas` > 0? → Rollout may be stuck.
- Does rollout status say "waiting"? → Something is blocking the rollout.
- Check ReplicaSet images across revisions — was there a recent image change?

### `resources` — CPU/Memory Pressure
Use this when the user asks "Which pods use the most memory?" or "Are my nodes overloaded?"
```
kube_medic(subcommand="resources")
kube_medic(subcommand="resources", context="staging", namespace="default")
```
Returns: Node resource usage (CPU/memory percentages), node pressure conditions, top 20 pods by CPU, top 20 pods by memory, pods missing resource limits.

**Interpretation guidance:**
- Nodes > 85% memory = danger zone, risk of OOMKiller
- Nodes > 90% CPU = scheduling will be impacted
- Pods without limits = unbounded resource consumption risk
- Pods without requests = scheduler can't make informed decisions

### `events [namespace]` — Recent Events
Use this when the user asks "What changed recently?" or "What happened in the last 15 minutes?"
```
kube_medic(subcommand="events")
kube_medic(subcommand="events", target="kube-system")
kube_medic(subcommand="events", since="1h")
```
Returns: All recent events (sorted newest first, capped at 100), with summary statistics and top event reasons.

## Write Operations (DANGER — Requires User Confirmation)

kube-medic is **read-only by default**. When you determine a fix is needed, you MUST:

1. **Show the user the exact command** you want to run
2. **Explain what it will do** and any risks
3. **Wait for explicit confirmation** ("yes", "do it", "go ahead")
4. Only then use `confirm_write` to execute

Example flow:
```
You: Based on the triage, deployment `my-app` revision 5 introduced a broken image.
     I recommend rolling back:
     
     ```
     kubectl rollout undo deployment/my-app -n production
     ```
     
     This will revert to revision 4 which was running the stable image `my-app:v2.3.1`.
     Shall I proceed?

User: Yes, do it.

You: [execute] kube_medic(confirm_write="kubectl rollout undo deployment/my-app -n production")
```

**Allowed write commands:**
- `kubectl rollout undo ...` — Rollback a deployment
- `kubectl rollout restart ...` — Restart pods in a deployment
- `kubectl scale ...` — Scale a deployment
- `kubectl delete pod ...` — Delete a specific pod (to force restart)
- `kubectl cordon ...` / `kubectl uncordon ...` — Drain management

**NEVER execute write commands without user approval. NEVER run `kubectl exec`.**

## Multi-Cluster Support

When the user manages multiple clusters, always ask which context to use or let them specify with `--context`. You can help them list contexts:

> "Which cluster would you like me to check? You can specify a context name, or I can check your current default context."

## Error Handling

- **RBAC errors:** If a command returns a permission error, tell the user which permission is missing and suggest the RBAC role/clusterrole they need.
- **kubectl not found:** Direct them to https://kubernetes.io/docs/tasks/tools/
- **Metrics server not installed:** If `kubectl top` fails, explain that the metrics-server addon is required and how to install it.
- **Connection errors:** Suggest checking kubeconfig, VPN, or cluster status.

## Smart Context Management for Large Clusters

When dealing with large clusters (many pods, many namespaces):
- The `sweep` command already filters to non-Running pods and recent warning events
- For `events`, the output is capped at 100 most recent
- For `resources`, top consumers are limited to top 20
- Suggest the user narrow with `--namespace` if output is overwhelming

## Triage Workflow

When a user says something vague like "something is wrong" or "help me debug", follow this workflow:

1. **Start with `sweep`** — get the big picture
2. **Identify the most critical issues** — CrashLoopBackOff pods, NotReady nodes, failed deployments
3. **Deep-dive with `pod`** — autopsy the most suspicious pods
4. **Check `resources`** — is this a resource exhaustion issue?
5. **Check `events`** — what changed recently that might have caused this?
6. **Correlate and diagnose** — connect all the data into a coherent explanation
7. **Recommend specific actions** — with exact commands the user can approve

### Discord v2 Delivery Mode (OpenClaw v2026.2.14+)

When the conversation is happening in a Discord channel:

- Send a compact triage summary first (cluster health, top impacted workload, top 3 findings), th