error-recovery-automation

TotalClaw 作者 totalclaw

通过自动恢复步骤标准化常见 OpenClaw 错误（网关重新启动、浏览器服务不可用、cron 故障）的处理。当您需要自动检测已知故障模式并进行恢复时使用，从而减少手动干预并提高系统弹性。

安装 / 下载方式

TotalClaw CLI推荐

totalclaw install totalclaw:totalclaw~konscious0beast-error-recovery-automation

cURL直接下载，无需登录

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~konscious0beast-error-recovery-automation/file -o konscious0beast-error-recovery-automation.md

## 概述（中文）

通过自动恢复步骤标准化常见 OpenClaw 错误（网关重新启动、浏览器服务不可用、cron 故障）的处理。当您需要自动检测已知故障模式并进行恢复时使用，从而减少手动干预并提高系统弹性。

## 原文

# Error Recovery Automation Skill

This skill provides patterns for automating the detection and recovery of common OpenClaw errors: gateway unresponsiveness, browser service failures, cron scheduler issues, and other recurring problems. It builds on health‑monitoring and system‑diagnostics by adding **automated recovery workflows** that can be triggered by cron jobs, heartbeat checks, or external monitoring.

## When to use

- A service (gateway, browser, cron) fails intermittently and you want to automate its restart.
- You are setting up proactive monitoring and need a recovery plan beyond just detection.
- You want to reduce the manual steps required when “Läuft alles?” reveals a failure.
- You need to ensure critical OpenClaw components stay running with minimal user intervention.
- You are asked to “create a skill for error recovery automation” (this is that skill).

## Core patterns

### 1. Error Detection Patterns

Before automating recovery, you must reliably detect the error. Use these detection methods:

**Gateway unresponsive:**
- `openclaw gateway status` returns non‑zero exit code or shows `"running": false`.
- Gateway logs (`~/.openclaw/logs/gateway.err.log`) contain recent `CRITICAL` or `ERROR` entries.
- HTTP health endpoint (if configured) returns non‑2xx status.

**Browser service unavailable:**
- `openclaw browser --browser-profile openclaw status --json` shows `"running": false` or CDP not ready.
- Browser logs contain connection timeouts or Chrome process failures.
- Simple page load via `curl` to CDP endpoint fails.

**Cron scheduler not running:**
- `openclaw cron status` returns `"running": false` or error.
- Cron logs show no recent activity.
- Scheduled jobs are not triggered (check `openclaw cron list` for missed runs).

**Memory search disabled:**
- `memory_search` tool returns “disabled” or native‑module error.
- `openclaw doctor --fix` reports better‑sqlite3 mismatch.

**Permission errors:**
- File operations fail with `EACCES`/`EPERM`.
- Logs indicate permission denied on specific paths (archive, logs, config).

### 2. Automated Recovery Steps

For each error type, define a recovery script that attempts to restore service automatically. The script should:

1. **Detect** the error (using the patterns above).
2. **Attempt recovery** (restart service, fix permissions, rebuild module).
3. **Verify** recovery (re‑run detection after a short wait).
4. **Report outcome** (exit code 0 for success, non‑zero for persistent failure).

#### Gateway Recovery Script Template

```bash
#!/bin/bash
set -e

SERVICE="gateway"
MAX_ATTEMPTS=2
SLEEP_SECONDS=5

log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }

check() {
  openclaw gateway status > /dev/null 2>&1
}

restart() {
  openclaw gateway restart
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
  if check; then
    log "$SERVICE is healthy"
    exit 0
  fi
  log "$SERVICE is unhealthy, restarting (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
  restart
  ((attempt++))
done

log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```

#### Browser Service Recovery Script Template

```bash
#!/bin/bash
set -e

SERVICE="browser"
PROFILE="openclaw"
MAX_ATTEMPTS=2
SLEEP_SECONDS=8

log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }

check() {
  openclaw browser --browser-profile "$PROFILE" status --json 2>&1 | grep -q '"running":true'
}

restart() {
  openclaw browser --browser-profile "$PROFILE" stop
  sleep 2
  openclaw browser --browser-profile "$PROFILE" start
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
  if check; then
    log "$SERVICE ($PROFILE) is healthy"
    exit 0
  fi
  log "$SERVICE ($PROFILE) is unhealthy, restarting (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
  restart
  ((attempt++))
done

log "$SERVICE ($PROFILE) could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```

#### Cron Scheduler Recovery Script Template

```bash
#!/bin/bash
set -e

SERVICE="cron"
MAX_ATTEMPTS=1
SLEEP_SECONDS=3

log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }

check() {
  openclaw cron status 2>&1 | grep -q '"running":true'
}

restart() {
  # Cron is restarted automatically when gateway restarts.
  # If cron is not running, restart gateway.
  openclaw gateway restart
  sleep "$SLEEP_SECONDS"
}

attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
  if check; then
    log "$SERVICE scheduler is running"
    exit 0
  fi
  log "$SERVICE scheduler is not running, restarting gateway (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
  restart
  ((attempt++))
done

log "$SERVICE scheduler still not running after $MAX_ATTEMPTS attempts"
exit 1
```

#### Memory Search Recovery Script Template

```bash
#!/bin/bash
set -e

SERVICE="memory_search"
MAX_ATTEMPTS=1

log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }

check() {
  openclaw memory search --query "test" 2>&1 | grep -q -v "disabled\|Module did not self-register"
}

restart() {
  # Try rebuilding better‑sqlite3
  cd "$(dirname "$(which openclaw)")/../lib/node_modules/openclaw"
  npm rebuild better-sqlite3
  # Restart gateway to pick up the rebuilt module
  openclaw gateway restart
  sleep 5
}

attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
  if check; then
    log "$SERVICE is functional"
    exit 0
  fi
  log "$SERVICE is disabled, rebuilding native module (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
  restart
  ((attempt++))
done

log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```

### 3. Integration with Cron for Automated Recovery

Once you have a recovery script, schedule it as a cron job that runs **only when the service is likely to fail** (e.g., every 30 minutes for browser, every hour for gateway). Use an isolated agent session to execute the script and announce failures.

**Example cron job for browser recovery:**

```bash
openclaw cron add \
  --name "Browser‑Recovery‑Automation" \
  --schedule 'every 30 minutes' \
  --session isolated \
  --payload '{"kind":"agentTurn","message":"Run browser recovery automation script","model":"default","thinking":"low"}' \
  --delivery '{"mode":"announce","channel":"telegram"}'
```

**Agent response inside isolated session:** The agent reads the script (or inline logic) and executes it via `exec`. If the script exits with 0, the agent announces success; if non‑zero, the cron delivery forwards the failure message.

**Alternative:** You can embed the recovery logic directly in the agent’s response (without a separate script) for simplicity, but a script is easier to test and reuse.

### 4. Escalation When Automation Fails

If automated recovery fails after the maximum attempts, escalate:

- **Log the failure** in `memory/YYYY‑MM‑DD.md` with tag `error‑recovery‑failed`.
- **Add a task** to `inbox/agent‑aufgaben.md` for manual diagnosis.
- **Send a high‑priority notification** (if supported) to the user.
- **Fallback to a safe state** (e.g., disable the problematic component if possible).

Example escalation snippet:

```bash
if [ $? -ne 0 ]; then
  echo "Browser recovery failed. Adding manual diagnosis task."
  # Append to agent-aufgaben.md
  echo "| 99 | Diagnose browser recovery failure – automated recovery failed after 2 attempts | ⬜ |" >> inbox/agent-aufgaben.md
  # Store in memory
  echo "## [error] Browser recovery automation failed" >> memory/$(date +%Y-%m-%d).md
  echo "Date: $(date +%Y-%m-%d)" >> memory/$(date +%Y-%m-%d).md
  echo "Tags: error, browser, recovery-failed" >> memory/$(date +%Y-%m-%d).md
  echo "Browser recovery script exited with code $?. Manual intervention required." >> memory/$(date +%Y-%m-%d).md
fi
```

### 5. Testing Recovery Scripts

Before deploying a recovery script as a cron job, test it manually:

1. **Simulate the failure** (e.g., kill the gateway process, stop the browser service).
2. **Run the recovery script** and verify it detects the failure and