runbook-generator
Runbook Generator
安装 / 下载方式
TotalClaw CLI推荐
totalclaw install clawskills:alirezarezvani~runbook-generatorcURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aalirezarezvani~runbook-generator/file -o runbook-generator.mdGit 仓库获取源码
git clone https://github.com/openclaw/skills/commit/1505192d54de6e1243ad66e9103fa6a37efa2b17# Runbook Generator
**Tier:** POWERFUL
**Category:** Engineering
**Domain:** DevOps / Site Reliability Engineering
---
## Overview
Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.
---
## Core Capabilities
- **Stack detection** — auto-identify CI/CD, database, hosting, orchestration from repo files
- **Runbook types** — deployment, incident response, database maintenance, scaling, monitoring setup
- **Format discipline** — numbered steps, copy-paste commands, ✅ verification checks, time estimates
- **Escalation paths** — L1 → L2 → L3 with contact info and decision criteria
- **Rollback procedures** — every deployment step has a corresponding undo
- **Staleness detection** — runbook sections reference config files; flag when source changes
- **Testing methodology** — dry-run framework for staging validation, quarterly review cadence
---
## When to Use
Use when:
- A codebase has no runbooks and you need to bootstrap them fast
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
- Onboarding a new engineer who needs clear operational procedures
- Preparing for an incident response drill or audit
- Setting up monitoring and on-call rotation from scratch
Skip when:
- The system is too early-stage to have stable operational patterns
- Runbooks already exist and only need minor updates (edit directly)
---
## Stack Detection
When given a repo, scan for these signals before writing a single runbook line:
```bash
# CI/CD
ls .github/workflows/ → GitHub Actions
ls .gitlab-ci.yml → GitLab CI
ls Jenkinsfile → Jenkins
ls .circleci/ → CircleCI
ls bitbucket-pipelines.yml → Bitbucket Pipelines
# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml → PostgreSQL
grep -r "mysql\|mariadb" package.json → MySQL
grep -r "mongodb\|mongoose" package.json → MongoDB
grep -r "redis" package.json → Redis
ls prisma/schema.prisma → Prisma ORM (check provider field)
ls drizzle.config.* → Drizzle ORM
# Hosting
ls vercel.json → Vercel
ls railway.toml → Railway
ls fly.toml → Fly.io
ls .ebextensions/ → AWS Elastic Beanstalk
ls terraform/ ls *.tf → Custom AWS/GCP/Azure (check provider)
ls kubernetes/ ls k8s/ → Kubernetes
ls docker-compose.yml → Docker Compose
# Framework
ls next.config.* → Next.js
ls nuxt.config.* → Nuxt
ls svelte.config.* → SvelteKit
cat package.json | jq '.scripts' → Check build/start commands
```
Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
- Deployment runbook (Vercel + GitHub Actions)
- Database runbook (PostgreSQL backup, migration, vacuum)
- Incident response (with Vercel logs + pg query debugging)
- Monitoring setup (Vercel Analytics, pg_stat, alerting)
---
## Runbook Types
### 1. Deployment Runbook
```markdown
# Deployment Runbook — [App Name]
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel
**Last verified:** 2025-03-01
**Source configs:** vercel.json (modified: git log -1 --format=%ci -- vercel.json)
**Owner:** Platform Team
**Est. total time:** 15–25 min
---
## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed
## Steps
### Step 1 — Run CI checks locally (3 min)
```bash
pnpm test
pnpm lint
pnpm build
```
✅ Expected: All pass with 0 errors. Build output in `.next/`
### Step 2 — Apply database migrations (5 min)
```bash
# Staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
```
✅ Expected: `All migrations have been successfully applied.`
```bash
# Verify migration applied
psql $STAGING_DATABASE_URL -c "\d" | grep -i migration
```
✅ Expected: Migration table shows new entry with today's date
### Step 3 — Deploy to production (5 min)
```bash
git push origin main
# OR trigger manually:
vercel --prod
```
✅ Expected: Vercel dashboard shows deployment in progress. URL format:
`https://app-name-<hash>-team.vercel.app`
### Step 4 — Smoke test production (5 min)
```bash
# Health check
curl -sf https://your-app.vercel.app/api/health | jq .
# Critical path
curl -sf https://your-app.vercel.app/api/users/me \
-H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
```
✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
### Step 5 — Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
---
## Rollback
If smoke tests fail or error rate spikes:
```bash
# Instant rollback via Vercel (preferred — < 30 sec)
vercel rollback [previous-deployment-url]
# Database rollback (only if migration was applied)
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate reset --skip-seed
# WARNING: This resets to previous migration. Confirm data impact first.
```
✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.
---
## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
```
---
### 2. Incident Response Runbook
```markdown
# Incident Response Runbook
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)
**Est. total time:** P1: 30–60 min, P2: 1–4 hours
## Phase 1 — Triage (5 min)
### Confirm the incident
```bash
# Is the app responding?
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
# Check Vercel function errors (last 15 min)
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
```
✅ 200 = app up. 5xx or timeout = incident confirmed.
Declare severity:
- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify team channel
- Single feature broken → P3 — create ticket, fix in business hours
---
## Phase 2 — Diagnose (10–15 min)
```bash
# Recent deployments — did something just ship?
vercel ls --limit=5
# Database health
psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
# Long-running queries (> 30 sec)
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
# Connection pool saturation
psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
```
Diagnostic decision tree:
- Recent deploy + new errors → rollback (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops
---
## Phase 3 — Mitigate (variable)
```bash
# Kill a runaway DB query
psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
# Scale DB connections (Supabase/Neon — adjust pool size)
# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX
# Enable maintenance mode (if you have a feature flag)
vercel env add MAINTENANCE_MODE true prod