AutoForge
AutoForge is a production-grade autonomous optimization framework for AI agents. It replaces subjective "reflection" with mathematically rigorous convergence loops — tracking every iteration in TSV, cross-validating with multiple models, and stopping only when pass rates confirm real improvement. Four specialized modes: prompt (skill & doc optimization via scenario simulation), code (sandboxed test execution with measurable criteria), audit (CLI verification against live tool behavior), and project (whole-repo cross-file consistency analysis). Battle-tested across 50+ iterations on production skills. Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".
安装 / 下载方式
totalclaw install skilldb:akrimm702~autoforgecurl -fsSL https://skills.taituai.com/api/skills/skilldb%3Aakrimm702~autoforge/file -o autoforge.mdgit clone https://github.com/openclaw/skills/commit/d8b92e0323a797cd8b76614a50100b3d12dee713# AutoForge — Autonomous Optimization Framework
> Stop reflecting. Start converging. Every iteration is measured, logged, and validated — not vibed.
AutoForge replaces ad-hoc "improve this" prompts with a rigorous optimization loop: define evals, run iterations, track pass rates in TSV, report live to your channel, and stop only when math says you're done. Multi-model cross-validation prevents the "same model grades its own homework" blind spot.
**Four modes. One convergence standard.**
| Mode | What it does | Best for |
|------|-------------|----------|
| `prompt` | Simulate 5 scenarios/iter, evaluate Yes/No | SKILL.md, prompts, doc templates |
| `code` | Sandboxed test execution, measure exit/stdout/stderr | Shell scripts, Python tools, pipelines |
| `audit` | Test CLI commands live, verify SKILL.md matches reality | CLI skill documentation |
| `project` | Scan whole repo, cross-file consistency analysis | README↔CLI drift, Dockerfile↔deps, CI gaps |
---
# AutoForge — Top-Agent Architecture
## Overview
```
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
```
### Sub-Agent = You
"Sub-Agent" is a **conceptual role**, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.
For code mode, run tests using the `exec` tool.
### Multi-Model Setup (recommended for Deep Audits)
For complex audits, you can **split two roles across different models**:
| Role | Model | Task |
|------|-------|------|
| **Optimizer** | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| **Validator** | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |
**Flow:** Optimizer and Validator alternate. Optimizer iterations have status `improved`/`retained`/`discard`. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with `sessions_spawn` and explicit `model`.
**When to use Multi-Model:** Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.
**When Single-Model suffices:** Simple CLI audits, prompt optimization, code with clear tests.
---
## Configuration
AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.
| Variable | Default | Description |
|----------|---------|-------------|
| `AF_CHANNEL` | `telegram` | Messaging channel for reports |
| `AF_CHAT_ID` | _(none)_ | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | _(none)_ | Thread/topic ID within the chat |
---
## Hard Invariants
These rules apply **always**, regardless of mode:
1. **TSV is mandatory.** Every iteration writes exactly one row to `results/[target]-results.tsv`.
2. **Reporting is mandatory.** Call `report.sh` immediately after every TSV row.
3. **--dry-run never overwrites the target.** Only TSV, `*-proposed.md`, and reports are written.
4. **Mode isolation is strict.** Only execute steps for the assigned mode.
5. **Iteration 1 = Baseline.** Evaluate the original version unchanged, status `baseline`.
---
## Modes — Read ONLY Your Mode!
You are assigned ONE mode. **Ignore all sections for other modes.**
| Mode | What happens | Output |
|------|-------------|--------|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Run tests in sandbox, measure results | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |
**Your mode is in the task prompt.** Everything else is irrelevant to you.
---
## TSV Format (same for ALL modes)
### Header (once at loop start):
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
```
### Row per iteration:
```bash
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
```
> **Use `printf` not `echo -e`!** `echo -e` interprets backslashes in field values. `printf '%s'` outputs strings literally.
### 5 columns, TAB-separated, EXACTLY this order:
| # | Column | Type | Rules |
|---|--------|------|-------|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + `%`: `58%`, `92%`, `100%`. Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: `baseline` · `improved` · `retained` · `discard` |
### Escaping rules:
- **Tabs** in text fields → replace with spaces
- **Newlines** in text fields → replace with ` | `
- **Empty fields** → use hyphen `-` (never leave empty)
- **`$` and backticks** → use `printf '%s'` or escape with `\$` (prevents unintended variable interpolation)
- **Unicode/Emoji** allowed, count as 1 character (not bytes)
### Status rules (based on pass-rate comparison):
- `baseline` — **Mandatory for Iteration 1.** Evaluate original version only.
- `improved` — Pass rate **higher** than previous best → new version becomes current state
- `retained` — Pass rate **equal or marginally better** → predecessor remains
- `discard` — Pass rate **lower** → change discarded, revert to best state
---
## Reporting (same for ALL modes)
**After EVERY TSV row** (including baseline):
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
```
**After loop ends**, additionally with `--final`:
```bash
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
```
The report script reads `AF_CHANNEL`, `AF_CHAT_ID`, and `AF_TOPIC_ID` from environment. Without them, it prints to stdout with ANSI colors.
---
## Stop Conditions (for ALL modes)
Priority — first matching condition wins, top to bottom:
1. 🛑 **Minimum iterations** — If specified in task (e.g. "min 5"), this count MUST be reached. No other condition can stop before.
2. 🛑 **Max 30 iterations** — Hard safety net, stop immediately.
3. ❌ **3× `discard` in a row** → structural problem, stop + analyze.
4. ✅ **3× 100% pass rate** (after minimum) → confirmed perfect, done.
5. ➡️ **5× `retained` in a row** → converged, done.
### Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = **consecutive** (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.
**At 100% in early iterations:** Keep going! Test harder edge cases. Only 3× 100% *after the minimum* confirms true perfection.
### Recognizing Validator Noise
In multi-model setups, the Validator can produce **false positives** — fails that aren't real issues:
- **Config path vs tool name confusion** (e.g. `agents.list[]` ≠ `agents_list` tool)
- **Inverted checks** ("no X" → Validator looks for X as required)
- **Normal English as forbidden reference** (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- **Overcounting** (thread commands counted as subagent commands)
**Rule:** If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → **declare convergence**, don't validate endlessly.
---
## Execution Modes
| Flag | Behavior |
|------|----------|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |
---
##