hle-benchmark-evolver

TotalClaw, author: totalclaw

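Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use it when the user asks to optimize a Humanity's Last Exam (HLE) score, ingest question-level benchmark results, prioritize an easy-first queue, or get an immediate snapshot of benchmark progress.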

Installation / Download

TotalClaw CLI (recommended):
totalclaw install totalclaw:totalclaw~wanng-ide-hle-benchmark-evolver
cURL (direct download, no login required):
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~wanng-ide-hle-benchmark-evolver/file -o wanng-ide-hle-benchmark-evolver.md

## Original Text

# HLE Benchmark Evolver

This skill turns HLE (Humanity's Last Exam) benchmark scores into reward and curriculum signals that drive evolution in OpenClaw.

## When to Use

- The user asks to improve the HLE score (for example, a target of >= 60%).
- The user provides question-level benchmark output and wants it converted to a reward.
- The user wants an easy-first curriculum queue and the next questions to focus on.
- The user asks for an immediate snapshot of benchmark results.

## Inputs

- Benchmark report JSON path (`--report=/abs/path/report.json`)
- Optional benchmark ID (defaults to `cais/hle`)

## Workflow

1. Validate the report JSON exists and is parseable.
2. Ingest report into `capability-evolver` benchmark reward state.
3. Generate curriculum signals:
   - `benchmark_*`
   - `curriculum_stage:*`
   - `focus_subject:*`
   - `focus_modality:*`
   - `question_focus:*`
4. Return a compact result summary for this run.
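
Step 1 above can be sketched as follows. This is a minimal illustration, not the skill's actual code: `loadReport` is a hypothetical helper name, and the error messages are assumptions.

```javascript
const fs = require("node:fs");

// Hypothetical helper (not part of the skill): returns the parsed report,
// or throws a descriptive error if the file is missing or is not valid JSON.
function loadReport(reportPath) {
  if (!fs.existsSync(reportPath)) {
    throw new Error(`Report not found: ${reportPath}`);
  }
  const raw = fs.readFileSync(reportPath, "utf8");
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(`Report is not valid JSON (${reportPath}): ${err.message}`);
  }
}
```

Failing early with a clear message keeps a malformed report from reaching the reward state in step 2.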

## Run

```bash
node skills/hle-benchmark-evolver/run_result.js --report=/absolute/path/hle_report.json
```

Full automatic loop (starts an evolution cycle):

```bash
node skills/hle-benchmark-evolver/run_pipeline.js --report=/absolute/path/hle_report.json --cycles=1
```

If your evaluator can be called from the shell, let the pipeline generate the report on each cycle:

```bash
node skills/hle-benchmark-evolver/run_pipeline.js \
  --report=/absolute/path/hle_report.json \
  --eval_cmd="python /path/to/eval_hle.py --out {{report}}" \
  --cycles=3 --interval_ms=2000
```

If `--report` is omitted, the report path defaults to:

`skills/capability-evolver/assets/gep/hle_report.template.json`
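
The fallback can be illustrated with a small argument-parsing sketch. The function name `resolveReportPath` is hypothetical, and the real scripts may parse their arguments differently.

```javascript
const DEFAULT_REPORT =
  "skills/capability-evolver/assets/gep/hle_report.template.json";

// Hypothetical helper (not part of the skill): returns the value of the
// first --report=<path> argument, or the default template path when the
// flag is absent or empty.
function resolveReportPath(argv) {
  const prefix = "--report=";
  const flag = argv.find((arg) => arg.startsWith(prefix));
  const value = flag ? flag.slice(prefix.length) : "";
  return value !== "" ? value : DEFAULT_REPORT;
}
```

In a script this would typically be called as `resolveReportPath(process.argv.slice(2))`.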

## Output Contract

Always print JSON with these fields:

- `benchmark_id`
- `run_id`
- `accuracy`
- `reward`
- `trend`
- `curriculum_stage`
- `queue_size`
- `focus_subjects`
- `focus_modalities`
- `next_questions`
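
A caller can check this contract before relying on a run's output. The sketch below only tests that each field is present, because this document does not specify field types. `missingFields` is a hypothetical helper name.

```javascript
const REQUIRED_FIELDS = [
  "benchmark_id", "run_id", "accuracy", "reward", "trend",
  "curriculum_stage", "queue_size", "focus_subjects",
  "focus_modalities", "next_questions",
];

// Hypothetical helper (not part of the skill): returns the names of the
// required fields that are absent from a parsed result object.
function missingFields(result) {
  if (result === null || typeof result !== "object") {
    return [...REQUIRED_FIELDS];
  }
  return REQUIRED_FIELDS.filter((name) => !(name in result));
}
```

An empty array means every required field is present. Anything else names the fields the run failed to emit.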

## Notes

- This skill handles reward/curriculum ingestion. It does not directly solve HLE questions.
- `run_pipeline.js` chains the ingestion, evolve, and solidify steps into one executable loop.