web-scraping
Web scraping tools for fetching and extracting data from web pages
Installation / Download
TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~paulgnz-xpr-web-scraping
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~paulgnz-xpr-web-scraping/file -o paulgnz-xpr-web-scraping.md

## Web Scraping

You have web scraping tools for fetching and extracting data from web pages:

**Single page:**
- `scrape_url`: fetch a URL and get cleaned text content plus metadata (title, description, link count)
  - Use format="text" (default) for most tasks; it strips all HTML
  - Use format="markdown" to preserve headings, links, lists, bold/italic
  - Use format="html" only when you need raw HTML

**Link discovery:**
- `extract_links`: fetch a page and extract all links with text and type (internal/external)
  - Use the `pattern` parameter to filter by regex (e.g. `"\\.pdf$"` for PDF links)
  - Links are deduplicated and resolved to absolute URLs

**Multi-page research:**
- `scrape_multiple`: fetch up to 10 URLs in parallel for comparison/research
  - One failure doesn't block others (uses Promise.allSettled)

**Best practices:**
- Prefer "text" format for content extraction, "markdown" for preserving structure
- Don't scrape the same domain more than 5 times per minute
- Combine with `store_deliverable` to save scraped content as job evidence
- For very large pages, the content is limited to 5MB
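
The text above asks callers not to scrape the same domain more than 5 times per minute, but does not say how to keep track of that. The following is a minimal sketch of one way to do it with a sliding one-minute window per hostname. The class name `DomainThrottle` and its method `tryAcquire` are hypothetical helpers invented for this illustration; they are not part of the skill, and the skill may or may not enforce a limit of its own.

```typescript
// Hypothetical helper (not part of the skill): remembers recent request
// times per hostname so a caller can stay within 5 scrapes per domain
// per minute before invoking scrape_url, extract_links or scrape_multiple.
class DomainThrottle {
  private readonly history = new Map<string, number[]>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number = 5, windowMs: number = 60_000) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the URL's hostname is still
  // under its limit at time `nowMs`; returns false and records nothing
  // if the limit has already been reached within the window.
  tryAcquire(url: string, nowMs: number = Date.now()): boolean {
    const host = new URL(url).hostname;

    // Keep only the timestamps that are still inside the sliding window.
    const recent = (this.history.get(host) ?? []).filter(
      (t) => nowMs - t < this.windowMs,
    );

    const allowed = recent.length < this.maxRequests;
    if (allowed) {
      recent.push(nowMs);
    }
    this.history.set(host, recent);
    return allowed;
  }
}
```

A caller would create one `DomainThrottle`, call `tryAcquire(url)` before each scrape, and delay or skip the request when it returns `false`. Passing the timestamp explicitly, as the second argument allows, makes the behavior easy to check without waiting for real time to pass.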