linux-system-health

GitHub author: LeoYeAI/openclaw-master-skills v1.3.0

Diagnose Linux OS-level issues — slow server, OOM kills, disk full, high CPU/load, DNS failures, connection timeouts, port exhaustion, too many open files, zombie processes, browser automation failures, locale problems, and kernel misconfigurations.

Installation / Download

TotalClaw CLI (recommended)
totalclaw install github:LeoYeAI~openclaw-master-skills~linux-system-health
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/github%3ALeoYeAI~openclaw-master-skills~linux-system-health/file -o linux-system-health.md
# Linux System Health Diagnostic Skill

You are a Linux OS diagnostic expert. When a user reports any of the following problems, use this skill:
- **Performance**: server slow, high load, lag, unresponsive
- **Memory**: OOM killed, out of memory, memory leak, swap thrashing
- **Disk**: disk full, read-only filesystem, inode exhaustion, log files too large
- **CPU**: high CPU, IO wait, process stuck, load average spike
- **Network**: DNS failure, connection timeout, port exhaustion, CLOSE_WAIT accumulation, firewall blocking
- **Process**: crash, zombie processes, too many open files, file descriptor limit
- **Browser automation**: missing shared libraries, Chromium sandbox error, headless browser failures
- **Locale/Encoding**: garbled text, character encoding issues, locale not configured

Use the judgment rules below to systematically diagnose OS-level root causes.

**When NOT to use this skill**: For application-level issues specific to OpenClaw (gateway config, API keys, model configuration, service management, systemd units), use the `openclaw-diagnostic` skill instead. This skill only covers OS-level diagnostics.

**Diagnostic workflow**:
1. Always start with Section 1 (System Environment Baseline) to establish context
2. Then run the sections relevant to the user's reported symptoms
3. If the root cause is unclear, run all sections in order for a comprehensive check

> **Commands**: Run the corresponding section in [scripts/diagnostics.sh](scripts/diagnostics.sh). Run as root with `export LANG=C`.
>
> **Issue Registry**: See [reference.md](reference.md) for severity level definitions and the complete issue name table.

**Data access scope** — this skill collects OS-level diagnostic data. Review before running in sensitive environments:

| Category | What is accessed | Sections |
|----------|-----------------|----------|
| System config files | `/etc/os-release`, `/etc/resolv.conf`, `/etc/security/limits.conf`, `/etc/default/locale`, `/etc/locale.conf`, `/etc/systemd/journald.conf` | 1, 6, 8, 11, 17 |
| Kernel interfaces | `/proc/meminfo`, `/proc/stat`, `/proc/loadavg`, `/proc/sys/fs/*`, `/proc/sys/net/*`, `/sys/kernel/mm/*` | 2, 3, 5, 6, 7, 14 |
| Kernel ring buffer | `dmesg` — may contain process names and OOM kill details | 2, 7, 12 |
| Systemd journal | `journalctl -k` — kernel messages only | 2 |
| Log directory | `/var/log/` size enumeration only (does **not** read log content) | 11 |
| Process & socket table | `ps`, `ss -p` — exposes PIDs, command names, socket owners | 2, 3, 10, 15 |
| User home directories | `/root/.cache/ms-playwright`, `/home/*/.cache/ms-playwright` — Chromium binary search only | 16 |
| Outbound network probes | DNS resolution tests (`nslookup`/`dig`/`getent` to `github.com`), nameserver TCP/53 reachability, Chrome headless launch test (`about:blank`) | 8, 16 |
| Write operation | Creates and immediately removes `/tmp/.oc_write_test` to verify filesystem writability — the **only** write in the entire script | 12 |

**Output format**: After running diagnostics, report findings as a severity-sorted list (FATAL > CRITICAL > ERROR > WARNING > INFO). For each issue found, include:
- Issue name (e.g., `OpenClaw.Memory.SystemMemoryCritical`)
- Severity level
- Observed value vs threshold
- Recommended remediation
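
As a rough illustration of the severity ordering above, the sketch below sorts findings so that FATAL comes first and INFO last. It assumes each finding is a single line that starts with the severity keyword followed by the issue name; that line format and the function name `sort_by_severity` are hypothetical and not taken from `scripts/diagnostics.sh`.

```bash
# sort_by_severity: reads findings on stdin, one per line, each starting
# with a severity keyword (assumed format: "SEVERITY IssueName details").
# Prints them ordered FATAL > CRITICAL > ERROR > WARNING > INFO.
sort_by_severity() {
  awk '
    BEGIN {
      rank["FATAL"] = 0; rank["CRITICAL"] = 1; rank["ERROR"] = 2
      rank["WARNING"] = 3; rank["INFO"] = 4
    }
    # Prefix each line with its numeric rank; unknown keywords sort last.
    { r = ($1 in rank) ? rank[$1] : 9; print r "\t" $0 }
  ' | sort -s -n -k1,1 | cut -f2-
}
```

The stable sort (`-s`) keeps findings of equal severity in the order they were reported.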

---

## 1. System Environment Baseline

Collect OS context for subsequent analysis.

**Judgment rules**:
- Record output as **OpenClaw.System.EnvironmentBaseline** (INFO) — no issues, context only.

---

## 2. Memory & OOM

Detect low memory and past OOM kills that affect any workload on this server.

**Judgment rules**:
- MemAvailable / MemTotal < 5% → **OpenClaw.Memory.SystemMemoryCritical** (CRITICAL)
  - Remediation: Kill unnecessary processes, add swap, or increase instance RAM
- MemAvailable / MemTotal < 10% → **OpenClaw.Memory.SystemMemoryLow** (WARNING)
  - Remediation: Monitor closely; consider scaling up
- MemTotal < 2 GB → **OpenClaw.Memory.InsufficientTotalMemory** (ERROR)
  - Remediation: 4 GB+ RAM recommended for production workloads
- dmesg contains "oom-killer" → **OpenClaw.Memory.OOMKillerEvent** (WARNING)
  - Remediation: Identify which processes were killed; review memory allocation
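
The ratio and total-memory rules above could be evaluated roughly as follows. This is a minimal sketch, not the logic of `scripts/diagnostics.sh`: the function name `classify_memory` and the output format are hypothetical, and "2 GB" is assumed to mean 2 × 1024 × 1024 kB.

```bash
# classify_memory [FILE]
# Reads a file in /proc/meminfo format (values in kB) and prints the
# memory issues from the rules above that apply. Defaults to /proc/meminfo.
classify_memory() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) { print "UNKNOWN MemTotal not found"; exit 1 }
      pct = avail * 100 / total
      if (pct < 5)       print "CRITICAL OpenClaw.Memory.SystemMemoryCritical"
      else if (pct < 10) print "WARNING OpenClaw.Memory.SystemMemoryLow"
      # Assumption: 2 GB is treated as 2 * 1024 * 1024 kB.
      if (total < 2097152) print "ERROR OpenClaw.Memory.InsufficientTotalMemory"
    }
  ' "${1:-/proc/meminfo}"
}
```

Calling `classify_memory` with no argument checks the live system. Kernels that do not report `MemAvailable` would be read as 0% available here, so the result should be treated with caution on very old systems.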

---

## 3. CPU & Performance

Resource contention causes slow responses; high iowait indicates disk bottlenecks.

**Judgment rules**:
- Load average (1 min) > 2x `nproc` → **OpenClaw.CPU.SystemLoadHigh** (WARNING)
  - Remediation: Identify top CPU consumers; check for runaway processes
- CPU idle < 10% (i.e., total utilization > 90%) → **OpenClaw.CPU.SystemCPUExhausted** (CRITICAL)
  - Remediation: Identify top process; check for log flooding or computation storms
- iowait > 30% (from `/proc/stat`; the counters are cumulative since boot, so compare two samples to get a current value) → **OpenClaw.CPU.HighIOWait** (WARNING)
  - Remediation: Check disk I/O — likely excessive logging or disk-bound workload
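
One way to obtain the numbers these rules compare is sketched below. The helper names `load_is_high` and `cpu_pct` are hypothetical, and whether `scripts/diagnostics.sh` samples `/proc/stat` the same way is an assumption. Idle and iowait are treated as separate buckets, as tools such as `top` report them.

```bash
# load_is_high LOAD1 NCPU
# Succeeds when the 1-minute load average exceeds 2x the CPU count.
load_is_high() {
  awk -v load="$1" -v n="$2" 'BEGIN { exit !(load > 2 * n) }'
}

# cpu_pct FIRST_SAMPLE SECOND_SAMPLE
# Each sample is the aggregate "cpu ..." line from /proc/stat.
# Prints "<idle%> <iowait%>" for the interval between the two samples.
cpu_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      # Fields 2-9: user nice system idle iowait irq softirq steal.
      # guest and guest_nice (10-11) are already counted in user and nice.
      for (i = 2; i <= 9; i++) total += $i
      t[NR] = total; idle[NR] = $5; iow[NR] = $6
    }
    END {
      dt = t[2] - t[1]
      if (dt <= 0) exit 1   # samples identical or out of order
      printf "%.1f %.1f\n", (idle[2] - idle[1]) * 100 / dt, (iow[2] - iow[1]) * 100 / dt
    }'
}
```

A live check could look like `a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat); cpu_pct "$a" "$b"`, and `load_is_high "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` for the load rule.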

---

## 4. Network Infrastructure

Basic network configuration, DNS, IPv6, and firewall state.

**Judgment rules**:
- IPv6 enabled and services bind `::` but upstream resolves to IPv4 only → **OpenClaw.Network.IPv6Mismatch** (WARNING)
  - Remediation: Set `NODE_OPTIONS='--dns-result-order=ipv4first'` or `sysctl -w net.ipv6.conf.all.disable_ipv6=1`

---

## 5. Disk & inotify

Disk space exhaustion and inotify limits cause "ENOSPC" errors.

**Judgment rules**:
- Any filesystem usage >= 95% → **OpenClaw.Disk.FilesystemFull** (CRITICAL)
  - Remediation: Clean old logs and data; extend partition or add disk
- Any filesystem usage >= 80% → **OpenClaw.Disk.FilesystemHighUsage** (WARNING)
  - Remediation: Monitor; plan cleanup or expansion
- `max_user_watches` < 65536 → **OpenClaw.Disk.InotifyWatchesTooLow** (ERROR)
  - Remediation: `echo 'fs.inotify.max_user_watches=524288' >> /etc/sysctl.d/99-inotify.conf && sysctl -p /etc/sysctl.d/99-inotify.conf`
- `max_user_instances` < 256 → **OpenClaw.Disk.InotifyInstancesTooLow** (WARNING)
  - Remediation: `echo 'fs.inotify.max_user_instances=512' >> /etc/sysctl.d/99-inotify.conf && sysctl -p /etc/sysctl.d/99-inotify.conf`
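
The two filesystem thresholds above can be applied to `df -P` output along these lines. This is a sketch under stated assumptions: the function name `classify_disk` and the output format are hypothetical, and mount points containing spaces are not handled.

```bash
# classify_disk: reads `df -P` output on stdin and prints one line for each
# filesystem at or above the 80% and 95% thresholds from the rules above.
classify_disk() {
  awk 'NR > 1 {
    # Column 5 is "Capacity" (for example "96%"), column 6 the mount point.
    pct = $5; sub(/%$/, "", pct); pct += 0
    if (pct >= 95)      print "CRITICAL OpenClaw.Disk.FilesystemFull " $6 " " pct "%"
    else if (pct >= 80) print "WARNING OpenClaw.Disk.FilesystemHighUsage " $6 " " pct "%"
  }'
}
```

Running `df -P | classify_disk` checks every mounted filesystem. The `-P` flag keeps each filesystem on a single line, which makes the column positions predictable.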

---

## 6. File Descriptor & Process Limits

Low ulimits cause "too many open files" (EMFILE) errors under load.

**Judgment rules**:
- Shell `ulimit -n` < 4096 → **OpenClaw.Limits.NofileTooLow** (ERROR)
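  - Remediation: Add `* soft nofile 65536` and `* hard nofile 65536` to `/etc/security/limits.conf`, then log in again. Wildcard entries do not apply to root, so add matching `root` lines if the affected process runs as root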
- limits.conf `nofile` value > `fs.nr_open` → **OpenClaw.Limits.NofileExceedsKernelMax** (CRITICAL)
  - Remediation: Raise `fs.nr_open` to at least the configured `nofile` value first (for example `sysctl -w fs.nr_open=1048576`) and persist it in `/etc/sysctl.d/`
- `file-nr` allocated / max > 80% → **OpenClaw.Limits.SystemFileDescriptorsHigh** (WARNING)
  - Remediation: Identify processes holding many FDs (`for p in /proc/[0-9]*; do echo "$(ls "$p/fd" 2>/dev/null | wc -l) ${p##*/}"; done | sort -rn | head` lists FD count and PID); increase `fs.file-max` if needed
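
The system-wide descriptor ratio in the last rule comes from `/proc/sys/fs/file-nr`, which holds three numbers: allocated handles, unused handles, and the maximum. A minimal sketch of the calculation follows; the helper names `fd_usage_pct` and `fd_usage_high` are hypothetical.

```bash
# fd_usage_pct [FILE]
# Reads a file in /proc/sys/fs/file-nr format (allocated, unused, max)
# and prints allocated/max as a percentage with one decimal place.
fd_usage_pct() {
  awk '$3 > 0 { printf "%.1f\n", $1 * 100 / $3 }' "${1:-/proc/sys/fs/file-nr}"
}

# fd_usage_high [FILE]
# Succeeds when allocated/max exceeds the 80% threshold.
fd_usage_high() {
  awk '$3 > 0 && $1 * 100 / $3 > 80 { hit = 1 } END { exit !hit }' \
    "${1:-/proc/sys/fs/file-nr}"
}
```

With no argument both helpers read the live kernel counters, for example `fd_usage_high && echo "WARNING OpenClaw.Limits.SystemFileDescriptorsHigh ($(fd_usage_pct)%)"`.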

---

## 7. Kernel & Sysctl Tuning

nf_conntrack, TCP tuning, and somaxconn affect high-concurrency workloads.

**Judgment rules**:
- `nf_conntrack_max` < 65536 → **OpenClaw.Kernel.NfConntrackMaxTooLow** (ERROR)
  - Remediation: `sysctl -w net.netfilter.nf_conntrack_max=262144` and persist in `/etc/sysctl.d/99-sysctl.conf`
- dmesg contains "nf_conntrack: table full" → **OpenClaw.Kernel.NfConntrackTableFull** (CRITICAL)
  - Remediation: Increase `nf_conntrack_max`; check for connection leaks
- `somaxconn` < 1024 → **OpenClaw.Kernel.SomaxconnTooLow** (WARNING)
  - Remediation: `sysctl -w net.core.somaxconn=4096` and persist
- `tcp_max_tw_buckets` < 10000 → **OpenClaw.Kernel.TcpMaxTwBucketsTooLow** (WARNING)
  - Remediation: `sysctl -w net.ipv4.tcp_max_tw_buckets=262144`
- `tcp_tw_reuse = 0` → **OpenClaw.Kernel.TcpTwReuseNotEnabled** (WARNING)
  - Remediation: `sysctl -w net.ipv4.tcp_tw_reuse=1`
- TIME_WAIT count from `ss -s` > 10000 → **OpenClaw.Kernel.TimeWaitOverflow** (WARNING)
  - Remediation: Enable `tcp_tw_reuse`, increase `tcp_max_tw_buckets`, reduce `tcp_fin_timeout`
- List