web-scraping

TotalClaw, author: totalclaw

Web scraping tools for fetching and extracting data from web pages.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~paulgnz-xpr-web-scraping

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~paulgnz-xpr-web-scraping/file -o paulgnz-xpr-web-scraping.md

## Web Scraping

You have web scraping tools for fetching and extracting data from web pages:

**Single page:**
- `scrape_url` — fetch a URL and get cleaned text content + metadata (title, description, link count)
  - Use `format="text"` (default) for most tasks; it strips all HTML
  - Use `format="markdown"` to preserve headings, links, lists, bold/italic
  - Use `format="html"` only when you need raw HTML
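
As a rough illustration of choosing between the three formats, the sketch below builds a call payload for `scrape_url` and rejects unknown format values. The envelope shape (`tool`, `arguments`), the `url` argument name, and the helper `build_scrape_request` are assumptions made for illustration; only the tool name and the `format` values come from the description above.

```python
# Format values listed for scrape_url in the text above.
ALLOWED_FORMATS = ("text", "markdown", "html")


def build_scrape_request(url: str, fmt: str = "text") -> dict:
    """Build a hypothetical scrape_url call payload.

    The {"tool": ..., "arguments": ...} envelope and the "url" key are
    assumptions; the real calling convention may differ.
    """
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    return {"tool": "scrape_url", "arguments": {"url": url, "format": fmt}}


# Default: plain text with all HTML stripped.
default_request = build_scrape_request("https://example.com/article")

# Markdown when headings, links, and lists should survive.
markdown_request = build_scrape_request("https://example.com/article", fmt="markdown")
```

Validating the format up front keeps a typo such as `"md"` from reaching the tool at all.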

**Link discovery:**
- `extract_links` — fetch a page and extract all links with text and type (internal/external)
  - Use the `pattern` parameter to filter by regex (e.g. `"\\.pdf$"` for PDF links)
  - Links are deduplicated and resolved to absolute URLs
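
The filtering, deduplication, and URL resolution described above can be sketched with the Python standard library. This is not the tool's implementation, only a minimal model of the behavior it describes. The function name `process_links` and the rule that a link is "internal" when its host matches the page's host are assumptions.

```python
import re
from urllib.parse import urljoin, urlparse


def process_links(base_url, hrefs, pattern=None):
    """Resolve hrefs against base_url, drop duplicates, classify, and filter."""
    base_host = urlparse(base_url).netloc
    regex = re.compile(pattern) if pattern else None
    seen = set()
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # relative -> absolute URL
        if absolute in seen:  # deduplicate on the resolved URL
            continue
        seen.add(absolute)
        if regex and not regex.search(absolute):
            continue
        # Assumed rule: same host as the page means "internal".
        kind = "internal" if urlparse(absolute).netloc == base_host else "external"
        links.append({"url": absolute, "type": kind})
    return links


links = process_links(
    "https://example.com/docs/",
    ["report.pdf", "/docs/report.pdf", "https://other.org/a.pdf", "index.html"],
    pattern=r"\.pdf$",
)
```

The first two hrefs resolve to the same absolute URL, so only one survives. Note that `"\\.pdf$"` in a JSON argument and `r"\.pdf$"` in Python denote the same regex, `\.pdf$`.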

**Multi-page research:**
- `scrape_multiple` — fetch up to 10 URLs in parallel for comparison/research
  - One failure doesn't block others (uses Promise.allSettled)
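
To show why one failure does not block the rest, here is a small Python sketch of the same isolation pattern. Python has no `Promise.allSettled`, so the example swaps in `asyncio.gather(..., return_exceptions=True)`, which plays a similar role. The `fake_fetch` coroutine is a hypothetical stand-in for a real page fetch, and the result dictionaries only mimic the fulfilled/rejected shape.

```python
import asyncio

MAX_URLS = 10  # batch limit stated in the text above


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a page fetch; fails for URLs containing 'broken'."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"content of {url}"


async def scrape_all(urls):
    """Fetch all URLs concurrently and report each outcome separately."""
    if len(urls) > MAX_URLS:
        raise ValueError(f"at most {MAX_URLS} URLs per batch, got {len(urls)}")
    # return_exceptions=True collects failures as values instead of raising.
    outcomes = await asyncio.gather(
        *(fake_fetch(u) for u in urls), return_exceptions=True
    )
    results = []
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            results.append({"url": url, "status": "rejected", "reason": str(outcome)})
        else:
            results.append({"url": url, "status": "fulfilled", "value": outcome})
    return results


results = asyncio.run(
    scrape_all(["https://a.example", "https://broken.example", "https://b.example"])
)
```

Each entry in `results` carries its own status, so a caller can use the pages that succeeded and retry or report the ones that failed.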

**Best practices:**
- Prefer "text" format for content extraction, "markdown" for preserving structure
- Don't scrape the same domain more than 5 times per minute
- Combine with `store_deliverable` to save scraped content as job evidence
- Content is limited to 5 MB per page; keep this in mind for very large pages
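
One way to stay under the per-domain limit is a sliding-window counter on the caller's side. The sketch below is an illustration only: the class `DomainRateLimiter` is hypothetical, and whether the tools enforce the limit themselves is not stated above. The current time is passed in explicitly so the behavior is easy to check; in real use it would typically come from `time.monotonic()`.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class DomainRateLimiter:
    """Allow at most max_requests per domain within a sliding time window."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # domain -> timestamps of allowed requests

    def allow(self, url: str, now: float) -> bool:
        """Return True and record the request if the domain is under its limit."""
        history = self._history[urlparse(url).netloc]
        # Forget requests that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_requests:
            return False
        history.append(now)
        return True


limiter = DomainRateLimiter()
# Six requests to the same domain, one second apart: the sixth is refused.
decisions = [limiter.allow("https://example.com/page", now=float(t)) for t in range(6)]
```

Other domains are counted separately, and the refused domain becomes available again once its oldest request is more than 60 seconds old.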
