AutoForge

SkillDB 作者 akrimm702 v1.0.4

AutoForge is a production-grade autonomous optimization framework for AI agents. It replaces subjective "reflection" with mathematically rigorous convergence loops — tracking every iteration in TSV, cross-validating with multiple models, and stopping only when pass rates confirm real improvement. Four specialized modes: prompt (skill & doc optimization via scenario simulation), code (sandboxed test execution with measurable criteria), audit (CLI verification against live tool behavior), and project (whole-repo cross-file consistency analysis). Battle-tested across 50+ iterations on production skills. Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".

源码 ↗

安装 / 下载方式

TotalClaw CLI推荐

totalclaw install skilldb:akrimm702~autoforge

cURL直接下载，无需登录

curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Aakrimm702~autoforge/file -o autoforge.md

Git 仓库获取源码

git clone https://github.com/openclaw/skills/commit/d8b92e0323a797cd8b76614a50100b3d12dee713

# AutoForge — Autonomous Optimization Framework

> Stop reflecting. Start converging. Every iteration is measured, logged, and validated — not vibed.

AutoForge replaces ad-hoc "improve this" prompts with a rigorous optimization loop: define evals, run iterations, track pass rates in TSV, report live to your channel, and stop only when math says you're done. Multi-model cross-validation prevents the "same model grades its own homework" blind spot.

**Four modes. One convergence standard.**

| Mode | What it does | Best for |
|------|-------------|----------|
| `prompt` | Simulate 5 scenarios/iter, evaluate Yes/No | SKILL.md, prompts, doc templates |
| `code` | Sandboxed test execution, measure exit/stdout/stderr | Shell scripts, Python tools, pipelines |
| `audit` | Test CLI commands live, verify SKILL.md matches reality | CLI skill documentation |
| `project` | Scan whole repo, cross-file consistency analysis | README↔CLI drift, Dockerfile↔deps, CI gaps |

---

# AutoForge — Top-Agent Architecture

## Overview

```
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
```

### Sub-Agent = You
"Sub-Agent" is a **conceptual role**, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.

For code mode, run tests using the `exec` tool.

### Multi-Model Setup (recommended for Deep Audits)

For complex audits, you can **split two roles across different models**:

| Role | Model | Task |
|------|-------|------|
| **Optimizer** | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| **Validator** | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |

**Flow:** Optimizer and Validator alternate. Optimizer iterations have status `improved`/`retained`/`discard`. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with `sessions_spawn` and explicit `model`.

**When to use Multi-Model:** Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.

**When Single-Model suffices:** Simple CLI audits, prompt optimization, code with clear tests.

---

## Configuration

AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.

| Variable | Default | Description |
|----------|---------|-------------|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | _(none)_ | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | _(none)_ | Thread/topic ID within the chat |

---

## Hard Invariants

These rules apply **always**, regardless of mode:

1. **TSV is mandatory.** Every iteration writes exactly one row to `results/[target]-results.tsv`.
2. **Reporting is mandatory.** Call `report.sh` immediately after every TSV row.
3. **--dry-run never overwrites the target.** Only TSV, `*-proposed.md`, and reports are written.
4. **Mode isolation is strict.** Only execute steps for the assigned mode.
5. **Iteration 1 = Baseline.** Evaluate the original version unchanged, status `baseline`.

---

## Modes — Read ONLY Your Mode!

You are assigned ONE mode. **Ignore all sections for other modes.**

| Mode | What happens | Output |
|------|-------------|--------|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Run tests in sandbox, measure results | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |

**Your mode is in the task prompt.** Everything else is irrelevant to you.

---

## TSV Format (same for ALL modes)

### Header (once at loop start):
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
```

### Row per iteration:
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
```
> **Use `printf` not `echo -e`!** `echo -e` interprets backslashes in field values. `printf '%s'` outputs strings literally.

### 5 columns, TAB-separated, EXACTLY this order:

| # | Column | Type | Rules |
|---|--------|------|-------|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + `%`: `58%`, `92%`, `100%`. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: `baseline` · `improved` · `retained` · `discard` |

### Escaping rules:
- **Tabs** in text fields → replace with spaces
- **Newlines** in text fields → replace with ` | `
- **Empty fields** → use hyphen `-` (never leave empty)
- **`$` and backticks** → use `printf '%s'` or escape with `\$` (prevents unintended variable interpolation)
- **Unicode/Emoji** allowed, count as 1 character (not bytes)

### Status rules (based on pass-rate comparison):
- `baseline` — **Mandatory for Iteration 1.** Evaluate original version only.
- `improved` — Pass rate **higher** than previous best → new version becomes current state
- `retained` — Pass rate **equal or marginally better** → predecessor remains
- `discard` — Pass rate **lower** → change discarded, revert to best state

---

## Reporting (same for ALL modes)

**After EVERY TSV row** (including baseline):
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
```

**After loop ends**, additionally with `--final`:
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
```

The report script reads `AF_CHANNEL`, `AF_CHAT_ID`, and `AF_TOPIC_ID` from environment. Without them, it prints to stdout with ANSI colors.

---

## Stop Conditions (for ALL modes)

Priority — first matching condition wins, top to bottom:

1. 🛑 **Minimum iterations** — If specified in task (e.g. "min 5"), this count MUST be reached. No other condition can stop before.
2. 🛑 **Max 30 iterations** — Hard safety net, stop immediately.
3. ❌ **3× `discard` in a row** → structural problem, stop + analyze.
4. ✅ **3× 100% pass rate** (after minimum) → confirmed perfect, done.
5. ➡️ **5× `retained` in a row** → converged, done.

### Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = **consecutive** (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.

**At 100% in early iterations:** Keep going! Test harder edge cases. Only 3× 100% *after the minimum* confirms true perfection.

### Recognizing Validator Noise

In multi-model setups, the Validator can produce **false positives** — fails that aren't real issues:

- **Config path vs tool name confusion** (e.g. `agents.list[]` ≠ `agents_list` tool)
- **Inverted checks** ("no X" → Validator looks for X as required)
- **Normal English as forbidden reference** (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- **Overcounting** (thread commands counted as subagent commands)

**Rule:** If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → **declare convergence**, don't validate endlessly.

---

## Execution Modes

| Flag | Behavior |
|------|----------|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |

---

##