error-recovery-automation
通过自动恢复步骤标准化常见 OpenClaw 错误(网关重新启动、浏览器服务不可用、cron 故障)的处理。当您需要自动检测已知故障模式并进行恢复时使用,从而减少手动干预并提高系统弹性。
安装 / 下载方式
TotalClaw CLI推荐
totalclaw install totalclaw:totalclaw~konscious0beast-error-recovery-automationcURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~konscious0beast-error-recovery-automation/file -o konscious0beast-error-recovery-automation.md## 概述(中文)
通过自动恢复步骤标准化常见 OpenClaw 错误(网关重新启动、浏览器服务不可用、cron 故障)的处理。当您需要自动检测已知故障模式并进行恢复时使用,从而减少手动干预并提高系统弹性。
## 原文
# Error Recovery Automation Skill
This skill provides patterns for automating the detection and recovery of common OpenClaw errors: gateway unresponsiveness, browser service failures, cron scheduler issues, and other recurring problems. It builds on health‑monitoring and system‑diagnostics by adding **automated recovery workflows** that can be triggered by cron jobs, heartbeat checks, or external monitoring.
## When to use
- A service (gateway, browser, cron) fails intermittently and you want to automate its restart.
- You are setting up proactive monitoring and need a recovery plan beyond just detection.
- You want to reduce the manual steps required when “Läuft alles?” reveals a failure.
- You need to ensure critical OpenClaw components stay running with minimal user intervention.
- You are asked to “create a skill for error recovery automation” (this is that skill).
## Core patterns
### 1. Error Detection Patterns
Before automating recovery, you must reliably detect the error. Use these detection methods:
**Gateway unresponsive:**
- `openclaw gateway status` returns non‑zero exit code or shows `"running": false`.
- Gateway logs (`~/.openclaw/logs/gateway.err.log`) contain recent `CRITICAL` or `ERROR` entries.
- HTTP health endpoint (if configured) returns non‑2xx status.
**Browser service unavailable:**
- `openclaw browser --browser-profile openclaw status --json` shows `"running": false` or CDP not ready.
- Browser logs contain connection timeouts or Chrome process failures.
- Simple page load via `curl` to CDP endpoint fails.
**Cron scheduler not running:**
- `openclaw cron status` returns `"running": false` or error.
- Cron logs show no recent activity.
- Scheduled jobs are not triggered (check `openclaw cron list` for missed runs).
**Memory search disabled:**
- `memory_search` tool returns “disabled” or native‑module error.
- `openclaw doctor --fix` reports better‑sqlite3 mismatch.
**Permission errors:**
- File operations fail with `EACCES`/`EPERM`.
- Logs indicate permission denied on specific paths (archive, logs, config).
### 2. Automated Recovery Steps
For each error type, define a recovery script that attempts to restore service automatically. The script should:
1. **Detect** the error (using the patterns above).
2. **Attempt recovery** (restart service, fix permissions, rebuild module).
3. **Verify** recovery (re‑run detection after a short wait).
4. **Report outcome** (exit code 0 for success, non‑zero for persistent failure).
#### Gateway Recovery Script Template
```bash
#!/bin/bash
set -e
SERVICE="gateway"
MAX_ATTEMPTS=2
SLEEP_SECONDS=5
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
check() {
openclaw gateway status > /dev/null 2>&1
}
restart() {
openclaw gateway restart
sleep "$SLEEP_SECONDS"
}
attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
if check; then
log "$SERVICE is healthy"
exit 0
fi
log "$SERVICE is unhealthy, restarting (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
restart
((attempt++))
done
log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```
#### Browser Service Recovery Script Template
```bash
#!/bin/bash
set -e
SERVICE="browser"
PROFILE="openclaw"
MAX_ATTEMPTS=2
SLEEP_SECONDS=8
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
check() {
openclaw browser --browser-profile "$PROFILE" status --json 2>&1 | grep -q '"running":true'
}
restart() {
openclaw browser --browser-profile "$PROFILE" stop
sleep 2
openclaw browser --browser-profile "$PROFILE" start
sleep "$SLEEP_SECONDS"
}
attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
if check; then
log "$SERVICE ($PROFILE) is healthy"
exit 0
fi
log "$SERVICE ($PROFILE) is unhealthy, restarting (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
restart
((attempt++))
done
log "$SERVICE ($PROFILE) could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```
#### Cron Scheduler Recovery Script Template
```bash
#!/bin/bash
set -e
SERVICE="cron"
MAX_ATTEMPTS=1
SLEEP_SECONDS=3
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
check() {
openclaw cron status 2>&1 | grep -q '"running":true'
}
restart() {
# Cron is restarted automatically when gateway restarts.
# If cron is not running, restart gateway.
openclaw gateway restart
sleep "$SLEEP_SECONDS"
}
attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
if check; then
log "$SERVICE scheduler is running"
exit 0
fi
log "$SERVICE scheduler is not running, restarting gateway (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
restart
((attempt++))
done
log "$SERVICE scheduler still not running after $MAX_ATTEMPTS attempts"
exit 1
```
#### Memory Search Recovery Script Template
```bash
#!/bin/bash
set -e
SERVICE="memory_search"
MAX_ATTEMPTS=1
log() { echo "[$(date +'%Y-%m-%d %H:%M:%S')] $*"; }
check() {
openclaw memory search --query "test" 2>&1 | grep -q -v "disabled\|Module did not self-register"
}
restart() {
# Try rebuilding better‑sqlite3
cd "$(dirname "$(which openclaw)")/../lib/node_modules/openclaw"
npm rebuild better-sqlite3
# Restart gateway to pick up the rebuilt module
openclaw gateway restart
sleep 5
}
attempt=0
while [ $attempt -lt $MAX_ATTEMPTS ]; do
if check; then
log "$SERVICE is functional"
exit 0
fi
log "$SERVICE is disabled, rebuilding native module (attempt $((attempt+1))/$MAX_ATTEMPTS)..."
restart
((attempt++))
done
log "$SERVICE could not be recovered after $MAX_ATTEMPTS attempts"
exit 1
```
### 3. Integration with Cron for Automated Recovery
Once you have a recovery script, schedule it as a cron job that runs **only when the service is likely to fail** (e.g., every 30 minutes for browser, every hour for gateway). Use an isolated agent session to execute the script and announce failures.
**Example cron job for browser recovery:**
```bash
openclaw cron add \
--name "Browser‑Recovery‑Automation" \
--schedule 'every 30 minutes' \
--session isolated \
--payload '{"kind":"agentTurn","message":"Run browser recovery automation script","model":"default","thinking":"low"}' \
--delivery '{"mode":"announce","channel":"telegram"}'
```
**Agent response inside isolated session:** The agent reads the script (or inline logic) and executes it via `exec`. If the script exits with 0, the agent announces success; if non‑zero, the cron delivery forwards the failure message.
**Alternative:** You can embed the recovery logic directly in the agent’s response (without a separate script) for simplicity, but a script is easier to test and reuse.
### 4. Escalation When Automation Fails
If automated recovery fails after the maximum attempts, escalate:
- **Log the failure** in `memory/YYYY‑MM‑DD.md` with tag `error‑recovery‑failed`.
- **Add a task** to `inbox/agent‑aufgaben.md` for manual diagnosis.
- **Send a high‑priority notification** (if supported) to the user.
- **Fallback to a safe state** (e.g., disable the problematic component if possible).
Example escalation snippet:
```bash
if [ $? -ne 0 ]; then
echo "Browser recovery failed. Adding manual diagnosis task."
# Append to agent-aufgaben.md
echo "| 99 | Diagnose browser recovery failure – automated recovery failed after 2 attempts | ⬜ |" >> inbox/agent-aufgaben.md
# Store in memory
echo "## [error] Browser recovery automation failed" >> memory/$(date +%Y-%m-%d).md
echo "Date: $(date +%Y-%m-%d)" >> memory/$(date +%Y-%m-%d).md
echo "Tags: error, browser, recovery-failed" >> memory/$(date +%Y-%m-%d).md
echo "Browser recovery script exited with code $?. Manual intervention required." >> memory/$(date +%Y-%m-%d).md
fi
```
### 5. Testing Recovery Scripts
Before deploying a recovery script as a cron job, test it manually:
1. **Simulate the failure** (e.g., kill the gateway process, stop the browser service).
2. **Run the recovery script** and verify it detects the failure and