arxiv-summarizer-orchestrator

TotalClaw 作者 totalclaw

使用三个子技能进行定期 arXiv 收集和报告的端到端编排技能:arxiv-search-collector、arxiv-paper-processor 和 arxiv-batch-reporter。支持跨所有 Markdown 输出和 Stage-B 处理策略的手动语言控制(“subagent_parallel”默认最大 5 或串行)。

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install totalclaw:totalclaw~xukp20-arxiv-summarizer-orchestrator
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~xukp20-arxiv-summarizer-orchestrator/file -o xukp20-arxiv-summarizer-orchestrator.md
## 概述(中文)

使用三个子技能进行定期 arXiv 收集和报告的端到端编排技能:arxiv-search-collector、arxiv-paper-processor 和 arxiv-batch-reporter。支持跨所有 Markdown 输出和 Stage-B 处理策略的手动语言控制(“subagent_parallel”默认最大 5 或串行)。

## 原文

# ArXiv Summarizer Orchestrator

Run the full pipeline by composing three sub-skills.

## Sub-skill Order

1. `arxiv-search-collector`
2. `arxiv-paper-processor`
3. `arxiv-batch-reporter`

## Workflow Parameters

- `language`: manual language parameter used by all stages. Default is English when omitted.
- `paper_processing_mode`: `subagent_parallel` or `serial`.
- `max_parallel_papers`: default `5` when `paper_processing_mode=subagent_parallel`.

## Workflow

### Stage A: Collection Setup + Query Retrieval

1. Initialize one run with `arxiv-search-collector/scripts/init_collection_run.py`.
2. Model generates multiple focused queries from original topic and writes a minimal `query_plan.json` (`label` + `query` only).
3. Run `arxiv-search-collector/scripts/fetch_queries_batch.py` with the plan file (recommended).
4. (Optional fallback) call `arxiv-search-collector/scripts/fetch_query_metadata.py` manually for one-by-one fetch.
5. Model reads each indexed query list and decides keep indexes.
6. Merge selected items with `arxiv-search-collector/scripts/merge_selected_papers.py`.
7. If relevance/coverage is still not good, iterate Stage A:
   - generate another query plan with new labels,
   - fetch again,
   - re-merge with `--incremental` and updated `selection-json`.
   - set weak labels to empty keep list (`[]`) to explicitly drop them.

Pass `--language <LANG>` to collector scripts so all generated markdown files in Stage A follow the selected language.
Use serial query fetch in Stage A with conservative controls (for example `--min-interval-sec 5`, `--retry-max 4`).
Default collector settings already include retries/backoff and run-local throttle state (`<run_dir>/.runtime/arxiv_api_state.json`), so manual tuning is usually unnecessary.
Prefer cache reuse (no `--force`) unless query parameters changed or data refresh is required.

Output: one run directory with per-paper metadata subdirectories.

### Stage B: Per-paper Artifact Download + Manual Summary

For each paper directory, invoke sub-skill `arxiv-paper-processor` once and let that skill produce `<paper_dir>/summary.md`.

Recommended pre-step for many papers:

1. Run one batch artifact download before per-paper reading:

```bash
python3 arxiv-paper-processor/scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language <LANG>
```

Per-paper execution steps (inside `arxiv-paper-processor`):

1. If `<paper_dir>/summary.md` already exists and is complete, skip this paper.
2. If usable source (`source/source_extract/*.tex`) or PDF (`source/paper.pdf`) already exists, skip download.
3. If artifacts are missing, download source with `arxiv-paper-processor/scripts/download_arxiv_source.py`.
4. If source is unusable, download PDF with `arxiv-paper-processor/scripts/download_arxiv_pdf.py`.
5. Model reads content and manually writes `<paper_dir>/summary.md` by reference format, in `language`.

Parallel strategy for many papers:

- Default: `paper_processing_mode=subagent_parallel` with `max_parallel_papers=5`.
- Optional: `paper_processing_mode=serial` to process one paper at a time.
- In parallel mode, run multiple `arxiv-paper-processor` instances in batches; concurrent papers must not exceed `max_parallel_papers`.
- Wait for one batch to finish before starting the next batch.
- In serial mode, run exactly one `arxiv-paper-processor` instance at a time.
- Subagent workers should only own one paper directory each to avoid file conflicts.
- Do not use scripts to auto-compose summary text; scripts are download-only tools.

Output: all paper directories contain `summary.md`.

### Stage C: Bundle + Final Hierarchical Report

1. Run `arxiv-batch-reporter/scripts/collect_summaries_bundle.py --language <LANG>`.
2. Model reads `summaries_bundle.md` and writes `collection_report_template.md` in base dir.
3. In template, each paper leaf entry must include one standalone placeholder line: `{{ARXIV_BRIEF:<arxiv_id>}}`.
4. Run `arxiv-batch-reporter/scripts/render_collection_report.py` to generate final `collection_report.md`.
5. Do not manually paraphrase per-paper conclusion lines in final report; they must come from per-paper `summary.md` section 10 via script injection.

If `language` is non-English (for example Chinese), all intermediate markdown files and final reports should follow that language.

## Periodic Scheduling

This orchestrator is suitable for cron/scheduled execution in OpenClaw:

- Frequency examples: daily, weekly, monthly.
- For rolling windows, use lookback (`1d`, `7d`, `30d`) when initializing runs.

## Output Layout

`<output-root>/<topic>-<timestamp>-<range>/`

- `task_meta.json`, `task_meta.md`
- `query_results/`, `query_selection/`
- `<arxiv_id>/metadata.md` + downloaded source/pdf + `summary.md`
- `summaries_bundle.md`
- `collection_report_template.md`
- final rendered collection report (e.g. `collection_report.md`)

Use `references/workflow-checklist.md` as execution checklist.

## Related Skills

This is the top-level orchestration skill.

Before using it, install and enable these three sub-skills:

- `arxiv-search-collector`
- `arxiv-paper-processor`
- `arxiv-batch-reporter`

Execution order inside this orchestrator:

1. `arxiv-search-collector` (Stage A)
2. `arxiv-paper-processor` (Stage B)
3. `arxiv-batch-reporter` (Stage C)