Doc Process

ClawSkills, by piyush-zinc, v4.1.1

Document intelligence: categorize, autofill forms, analyze contracts, scan receipts/invoices, analyze bank statements, parse resumes/CVs, scan IDs/passports (MRZ), summarize medical records, redact PII (light/standard/full, 50+ rule types, global coverage), extract meeting minutes/action items, extract tables to CSV/JSON, translate documents, scan/dewarp document photos (edge detection, perspective correction, scan-quality output).

Trigger: fill this form, autofill, review contract, red flags, scan receipt, log expense, bank statement, subscriptions, parse resume, scan passport, read id, lab report, redact, remove pii, anonymize, meeting minutes, action items, extract table, table to csv, translate, scan photo, make scanned, dewarp, correct perspective, what is this, analyze this.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install clawskills:piyush-zinc~doc-process

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/clawskills%3Apiyush-zinc~doc-process/file -o doc-process.md

Git (get the source from the repository)

git clone https://github.com/openclaw/skills
git -C skills checkout 1c6125cdf2705a8eb67d5240f3cd62bd8da48eb0

# Doc-Process — Document Intelligence Skill

## Step 0 — Auto-Setup (run once on first use)

Before invoking any script for the first time in a session, check whether the script dependencies are available. If any are missing, run the setup script automatically — no prompting needed:

```bash
bash skills/doc-process/setup.sh
```

This installs all Python packages (`pymupdf`, `Pillow`, `pytesseract`, `opencv-python-headless`, `numpy`, `img2pdf`, `pdfplumber`, `openai-whisper`) and attempts to install system binaries (`tesseract`, `ffmpeg`) via `brew` or `apt` depending on the platform.

**When to run Step 0:**
- First time any script-assisted mode is used in a session
- After a fresh `clawhub install piyush-zinc/doc-process`
- If a script fails with `ModuleNotFoundError` or `ImportError`

To install Python packages only (no system packages):
```bash
bash skills/doc-process/setup.sh --light
```

Or install directly from the skill's requirements file:
```bash
pip install -r skills/doc-process/requirements.txt
```
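The check described under Step 0 (are the script dependencies importable?) could be approximated as below. This is a minimal sketch, not the skill's own code: the function name `missing_packages` is hypothetical, and the mapping relies on the usual import names for these packages (for example `fitz` for `pymupdf` and `cv2` for `opencv-python-headless`).

```python
import importlib.util

# pip package name -> module name used in "import ...".
# Several packages are imported under a different name than they are installed.
PACKAGE_TO_MODULE = {
    "pymupdf": "fitz",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "opencv-python-headless": "cv2",
    "numpy": "numpy",
    "img2pdf": "img2pdf",
    "pdfplumber": "pdfplumber",
    "openai-whisper": "whisper",
}


def missing_packages(package_to_module=PACKAGE_TO_MODULE):
    """Return the pip package names whose import module cannot be located.

    Hypothetical helper: it only looks the module up, it does not import it.
    """
    missing = []
    for package, module in package_to_module.items():
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Run: bash skills/doc-process/setup.sh")
    else:
        print("All Python packages are importable.")
```

If the list is non-empty, running the setup script (or the `pip install -r` command above) is the remedy the section describes.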

> **Note:** `openai-whisper` downloads its model (~140 MB) on first audio transcription — not at install time.

---

## Overview

This skill handles all document-related tasks using Claude's native vision/language capabilities for reading and analysis, and Python scripts for file-output operations. Most modes require **no installation** — only the file-output scripts need third-party libraries.

---

## How Features Are Implemented

| Feature | Implementation | External libraries |
|---|---|---|
| OCR / reading images | Claude built-in vision | None |
| MRZ decoding (passport/ID) | Claude reads MRZ visually, applies the ICAO Doc 9303 check-digit algorithm | None |
| PDF reading | Claude reads PDF text layer or visually | None |
| Form autofill | Claude reads form fields, outputs fill table | None |
| Contract analysis | Claude applies reference rule set | None |
| Receipt / invoice scanning | Claude reads image or PDF | None |
| Bank statement (PDF) | Claude reads PDF pages | None |
| Bank statement (CSV) | `statement_parser.py` — pure stdlib | None |
| Expense logging | `expense_logger.py` — pure stdlib | None |
| Bank report generation | `report_generator.py` — pure stdlib | None |
| Resume / CV parsing | Claude reads document | None |
| Medical summarizer | Claude reads document | None |
| Legal redaction (display) | Claude marks up output | None |
| **Legal redaction (file output)** | `redactor.py` | **pymupdf** (PDF); **Pillow + pytesseract** (image) |
| Meeting minutes (text/PDF) | Claude reads document | None |
| Translation | Claude's multilingual capabilities | None |
| Document categorizer | Claude reads first 1–2 pages (with consent gate) | None |
| Timeline logging | `timeline_manager.py` — pure stdlib | None |
| **Table extraction (PDF)** | `table_extractor.py` | **pdfplumber** |
| **Audio transcription** | `audio_transcriber.py` | **openai-whisper + ffmpeg** |
| **Doc scan / perspective correction** | `doc_scanner.py` | **opencv-python-headless, numpy, Pillow**; img2pdf optional |
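
For the MRZ row in the table above, the check-digit rule defined in ICAO Doc 9303 can be sketched as follows. The skill itself has Claude apply the rule while reading the MRZ, so this is only an illustration; the function names are hypothetical and the sample field values are illustrative.

```python
def mrz_char_value(ch):
    """Map one MRZ character to its numeric value (ICAO Doc 9303)."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def mrz_field_is_valid(field, check_char):
    """Compare the computed check digit with the one printed in the MRZ."""
    return check_char in "0123456789" and mrz_check_digit(field) == int(check_char)


if __name__ == "__main__":
    print(mrz_check_digit("L898902C3"))  # document number field
    print(mrz_check_digit("740812"))     # date of birth, YYMMDD
```

A mismatch between the computed and printed digit usually means a misread character, which is a useful signal to re-read that field.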

---

## Dependencies & Installation

### No installation required for core functionality
Reading, analysis, form filling, contract review, receipt scanning, bank statement analysis (PDF), resume parsing, ID scanning, medical summarising, redaction markup, meeting minutes, and translation all run on Claude's built-in capabilities.

### Optional — install only for file-output scripts

```bash
# PII redaction to PDF/image files  (redactor.py)
# Version specifiers are quoted so the shell does not treat ">" as a redirect.
pip install "pymupdf>=1.23"          # required for PDF redaction
pip install "Pillow>=10.0"           # required for image redaction
pip install "pytesseract>=0.3"       # required for image redaction (also: brew install tesseract)

# Document scanning / perspective correction  (doc_scanner.py)
pip install "opencv-python-headless>=4.9" "numpy>=1.24" "Pillow>=10.0"
pip install "img2pdf>=0.5"           # optional, for PDF output; Pillow fallback used if absent

# Table extraction from PDFs  (table_extractor.py)
pip install "pdfplumber>=0.11"

# Audio transcription  (audio_transcriber.py)
# Also requires ffmpeg binary: brew install ffmpeg  /  apt install ffmpeg
pip install "openai-whisper>=20231117"
```

All dependencies are also listed in `requirements.txt` at the skill root (`skills/doc-process/requirements.txt`).

### Binary dependencies

| Binary | Required by | Install |
|---|---|---|
| `tesseract` | `redactor.py` (image mode) | `brew install tesseract` / `apt install tesseract-ocr` |
| `ffmpeg` | `audio_transcriber.py` | `brew install ffmpeg` / `apt install ffmpeg` |
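
Whether the two binaries in the table above are on `PATH` can be checked before running the scripts that need them. The sketch below uses only the standard library; `missing_binaries` is a hypothetical helper, not part of the skill.

```python
import shutil

# Binary name -> script that needs it (taken from the table above).
REQUIRED_BINARIES = {
    "tesseract": "redactor.py (image mode)",
    "ffmpeg": "audio_transcriber.py",
}


def missing_binaries(binaries=REQUIRED_BINARIES):
    """Return {binary: dependent script} for binaries not found on PATH."""
    return {
        name: needed_by
        for name, needed_by in binaries.items()
        if shutil.which(name) is None
    }


if __name__ == "__main__":
    for name, needed_by in missing_binaries().items():
        print(f"{name} not found on PATH; required by {needed_by}")
```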

### Network access

`openai-whisper` downloads model files (~140 MB) from OpenAI/HuggingFace servers **on first run only**. The files are cached at `~/.cache/whisper/`. All other scripts are fully local after installation.
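
To tell in advance whether a transcription run will need network access, one could check the cache directory named above. This is a sketch under stated assumptions: the source does not say which model is used or how the cached file is named, so the `base` model name and the `<model>.pt` file name are both assumptions, and `whisper_model_cached` is a hypothetical helper.

```python
from pathlib import Path


def whisper_model_cached(model_name="base", cache_dir=None):
    """Return True if a cached model file appears to exist.

    Assumptions (not confirmed by the skill's documentation):
    - the cache lives at ~/.cache/whisper/ unless cache_dir is given
    - the cached file is named "<model_name>.pt"
    """
    directory = Path(cache_dir) if cache_dir else Path.home() / ".cache" / "whisper"
    return (directory / f"{model_name}.pt").is_file()


if __name__ == "__main__":
    if whisper_model_cached():
        print("Model appears to be cached; no download expected.")
    else:
        print("Model not found in cache; first run will download it.")
```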

---

## Script Reference

| Script | Dependencies | Purpose | Example |
|---|---|---|---|
| `redactor.py` | pymupdf (PDF mode); Pillow + pytesseract (image mode) | PII redaction to file (PDF/image/text) | `python scripts/redactor.py --file doc.pdf --mode full --log` |
| `doc_scanner.py` | opencv-python-headless, numpy, Pillow; img2pdf optional | Document scanning: edge detection, perspective correction, scan-quality output | `python scripts/doc_scanner.py --input photo.jpg --output scanned.png --mode bw` |
| `expense_logger.py` | None | Add/list/edit/delete expense entries in CSV | `python scripts/expense_logger.py add --date 2024-03-15 --merchant "Starbucks" --amount 13.12 --file expenses.csv` |
| `statement_parser.py` | None | Parse bank CSV export, categorize transactions | `python scripts/statement_parser.py --file statement.csv --output categorized.json` |
| `report_generator.py` | None | Format categorized JSON into a markdown report | `python scripts/report_generator.py --file categorized.json --type bank` |
| `timeline_manager.py` | None | Manage opt-in document processing timeline | `python scripts/timeline_manager.py show` |
| `audio_transcriber.py` | openai-whisper, ffmpeg | Transcribe audio files to text | `python scripts/audio_transcriber.py --file meeting.mp3 --output transcript.txt` |
| `table_extractor.py` | pdfplumber | Extract tables from PDFs to CSV or JSON | `python scripts/table_extractor.py --file document.pdf --output data.csv` |

All scripts import only what they declare. Scripts with no declared deps use Python stdlib only. You can verify any script: "show me the source of [script name]".

---

## Script Import Verification

| Script | Stdlib imports | Third-party | Network |
|---|---|---|---|
| `timeline_manager.py` | argparse, json, sys, datetime, pathlib, uuid, collections | None | Never |
| `redactor.py` | argparse, re, sys, pathlib, dataclasses | pymupdf (PDF); Pillow + pytesseract (image) | Never |
| `doc_scanner.py` | argparse, json, sys, time, pathlib | opencv-python-headless, numpy, Pillow; img2pdf optional | Never |
| `expense_logger.py` | argparse, csv, json, sys, pathlib | None | Never |
| `statement_parser.py` | argparse, csv, json, re, sys, collections, datetime, pathlib | None | Never |
| `report_generator.py` | argparse, json, sys, collections, pathlib | None | Never |
| `utils.py` | re, unicodedata, datetime, pathlib | None | Never |
| `audio_transcriber.py` | argparse, sys, pathlib | openai-whisper | First-run model download only |
| `table_extractor.py` | argparse, csv, io, json, sys, pathlib | pdfplumber | Never |

---

## Privacy & Data Handling

| Aspect | Policy |
|---|---|
| Document content | Read locally within this session only. Not stored, indexed, or transmitted. |
| Personal data for form autofill | Used only to complete the current form. Not written to any file. Not retained after session. |
| Timeline log | Opt-in only. Confirmed by user before any entry is written. Contains no raw document content — only category-level summaries. |
| Redacted output files | Written only to a path the user explicitly confirms. |
| Audio transcripts | Written to a local file the user specifies. Model download on first Whisper use only. |
| No telemetry | This skill has no analytics, usage r