arxiv-paper-processor

ClawSkills 作者 clawskills

Tool-only paper processing skill with a manual language parameter: supports batch artifact download for many papers or single-paper download, then the model manually reads source/PDF and writes summary.md in the selected language. Use when per-paper comprehension should be model-driven instead of script-generated.

安装 / 下载方式

TotalClaw CLI推荐
totalclaw install clawskills:clawskills~xukp20-arxiv-paper-processor
cURL直接下载,无需登录
curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Aclawskills~xukp20-arxiv-paper-processor/file -o xukp20-arxiv-paper-processor.md
# ArXiv Paper Processor

Use this skill for per-paper manual summarization, with optional batch artifact download.

- Single-paper mode: process one paper directory (e.g. `<run_dir>/<arxiv_id>/`).
- Batch predownload mode: process many paper directories under one run dir before writing summaries.

## Language Parameter

- Use a workflow language parameter (for example `English` or `Chinese`) and apply it manually.
- The per-paper `summary.md` must be written in the selected language.
- If download scripts are called directly, pass `--language <LANG>` for traceability.

## Core Principle

Scripts only fetch artifacts. The model performs reading and writing.

## Non-negotiable Constraint

- Do not generate `summary.md` by script-based snippet extraction, regex harvesting, or template autofill.
- Do not use Python/shell scripts to auto-compose section text from abstract/introduction fragments.
- Scripts in this skill are only for artifact download (`source`/`pdf`) and trace logs.
- The final `summary.md` must come from model-side reading and synthesis of the paper content.

## Optional Batch Artifact Download (Many Papers)

Use this first when Stage B has many papers:

```bash
python3 scripts/download_papers_batch.py \
  --run-dir /path/to/run \
  --artifact source_then_pdf \
  --max-workers 3 \
  --min-interval-sec 5 \
  --language English
```

Key behavior:

- Supports `--artifact source`, `--artifact pdf`, or `--artifact source_then_pdf` (default).
- Supports concurrency (`--max-workers`) and safe throttling/retry (`--min-interval-sec`, retry args).
- Uses run-local throttle state by default (`<run_dir>/.runtime/arxiv_download_state.json`) to reduce 429 risk.
- Skips papers that already have usable `source/source_extract/*.tex` or existing `source/paper.pdf` (unless `--force`).
- Resume-friendly: if a paper already has a completed `summary.md`, you can skip that paper's summary-writing step.
- Writes batch log to `<run_dir>/download_batch_log.json` by default.

## Step 1: Download Source (Preferred)

```bash
python3 scripts/download_arxiv_source.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English
```

This writes:

- `source/source_bundle.bin`
- `source/source_extract/`
- `source/download_source_log.json`

If usable source already exists and `--force` is not set, the script reuses local artifacts.

## Step 2: If Needed, Download PDF

```bash
python3 scripts/download_arxiv_pdf.py \
  --paper-dir /path/to/run/2602.00528 \
  --language English
```

This writes:

- `source/paper.pdf`
- `source/download_pdf_log.json`

If PDF already exists and `--force` is not set, the script reuses local artifacts.

## Step 3: Model Reads and Summarizes

1. If `summary.md` already exists and follows the required format, skip this paper and mark it complete.
2. Read `metadata.md` first.
3. If `source/source_extract/` already exists with readable `.tex` files, use it directly.
4. Otherwise, if `source/paper.pdf` already exists, use PDF directly.
5. If neither exists, run download scripts (single-paper scripts or batch script) first.
6. Manually write `summary.md` in the same paper directory, in the selected language.

Do not rely on rule-based auto summarization.
Do not rely on auto-extracted snippets as the primary writing basis.

## Quality Requirement

- Every section should include paper-specific details that are traceable to full-text reading.
- Section 4/5/10 should reflect concrete method and evaluation details, not generic wording.
- If key details are unclear in the source, explicitly note uncertainty instead of guessing.
- Match the detail level shown in `references/summary-example-en.md` and `references/summary-example-zh.md`.
- If your draft is clearly shorter or less specific than the examples, expand it before finishing.

## Required Output

- `<paper_dir>/summary.md` in fixed section format.
- Pay special attention to section `## 10. Brief Conclusion`: write a 3-4 sentence mini-conclusion that covers contribution, method, evaluation setup, and results with paper-specific details.
- In section `## 1. Paper Snapshot`, use exact keys: `ArXiv ID`, `Title`, `Authors`, `Publish date`, `Primary category`, `Reading basis`.
- Do not use key variants such as `Reading source`, `Author list`, `Published on`, or lowercase key names.

See `references/summary-format.md` for exact section requirements.

## Related Skills

This skill is a sub-skill of `arxiv-summarizer-orchestrator`.

Pipeline position:

1. Step 1 (upstream): `arxiv-search-collector` produces the selected paper directories and metadata.
2. Step 2 (this skill): `arxiv-paper-processor` downloads artifacts and writes one `summary.md` per paper.
3. Step 3 (downstream): `arxiv-batch-reporter` uses these per-paper summaries to generate the final collection report.

Use this skill together with Step 1 and Step 3 for full end-to-end execution.