scrapling-official
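Scrape web pages using Scrapling, with anti-bot bypass (such as Cloudflare Turnstile), stealthy headless browsing, a spider framework, adaptive scraping, and JavaScript rendering. Use when asked to scrape, crawl, or extract data from websites; when web_fetch fails; when the site has anti-bot protection; when writing Python code for scraping or crawling; or when writing spiders.
Installation / Download
TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~scrapling-official
cURL direct download (no login required)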
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~scrapling-official/file -o scrapling-official.md

## Original text

# Scrapling

Scrapling is an adaptive web scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. Its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by web scrapers for web scrapers and regular users, so there's something for everyone.

**Requires: Python 3.10+**

**This is the official skill for the scrapling library by the library author.**

## Setup (once)

Create a Python virtual environment by any available means, such as `venv`, then run the following inside it:

`pip install "scrapling[all]>=0.4.2"`

Then download all the browsers' dependencies:

```bash
scrapling install --force
```

Make note of the `scrapling` binary path and, if `scrapling` is not on `$PATH`, use that path in place of `scrapling` in all commands from now on.

### Docker

If the user doesn't have Python or doesn't want to use it, the Docker image is another option. It supports only the CLI commands, so you cannot write Python code for scrapling this way:

```bash
docker pull pyd4vinci/scrapling
```

or

```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```

## CLI Usage

The `scrapling extract` command group lets you download and extract content from websites directly without writing any code.

```bash
Usage: scrapling extract [OPTIONS] COMMAND [ARGS]...

Commands:
  get             Perform a GET request and save the content to a file.
  post            Perform a POST request and save the content to a file.
  put             Perform a PUT request and save the content to a file.
  delete          Perform a DELETE request and save the content to a file.
  fetch           Use a browser to fetch content with browser automation and flexible options.
  stealthy-fetch  Use a stealthy browser to fetch content with advanced stealth features.
```

### Usage pattern

- Choose your output format by changing the file extension. Here are some examples for the `scrapling extract get` command:
  - Convert the HTML content to Markdown, then save it to the file (great for documentation): `scrapling extract get "https://blog.example.com" article.md`
  - Save the HTML content as it is to the file: `scrapling extract get "https://example.com" page.html`
  - Save a clean version of the text content of the webpage to the file: `scrapling extract get "https://example.com" content.txt`
- Output to a temp file, read it back, then clean up.
- All commands can use CSS selectors to extract specific parts of the page through `--css-selector` or `-s`.

Which command to use generally:

- Use **`get`** with simple websites, blogs, or news articles.
- Use **`fetch`** with modern web apps, or sites with dynamic content.
- Use **`stealthy-fetch`** with protected sites, Cloudflare, or anti-bot systems.

> When unsure, start with `get`. If it fails or returns empty content, escalate to `fetch`, then `stealthy-fetch`. The speed of `fetch` and `stealthy-fetch` is nearly the same, so you are not sacrificing anything.
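The escalation advice and the temp-file pattern above can be combined in a small wrapper. The following is a minimal sketch, not part of Scrapling itself: `build_command` and `extract_text` are hypothetical helper names, the check for a zero exit code assumes the CLI signals failure through its return code, and the injectable `run` parameter exists only so the logic can be exercised without a network connection.

```python
import subprocess
import tempfile
from pathlib import Path

# Escalation order from the guidance above: cheapest command first.
COMMANDS = ("get", "fetch", "stealthy-fetch")


def build_command(command, url, output, css_selector=None, binary="scrapling"):
    """Build the argv list for one `scrapling extract <command>` call.

    `binary` can be set to the full path of the scrapling executable
    when it is not on $PATH.
    """
    argv = [binary, "extract", command, url, str(output)]
    if css_selector:
        argv += ["--css-selector", css_selector]
    return argv


def extract_text(url, suffix=".md", css_selector=None, run=subprocess.run):
    """Try get, then fetch, then stealthy-fetch until one yields content.

    Writes to a temporary file, reads it back, and cleans up afterwards.
    Returns (command_name, content), or (None, "") if every command failed.
    Assumption: a failed command exits non-zero or leaves an empty file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # The file extension selects the output format (.md, .html, .txt).
        output = Path(tmp) / f"page{suffix}"
        for command in COMMANDS:
            # Remove any stale file so old content is not mistaken for success.
            output.unlink(missing_ok=True)
            result = run(build_command(command, url, output, css_selector))
            if result.returncode == 0 and output.exists():
                content = output.read_text(encoding="utf-8")
                if content.strip():
                    return command, content
    return None, ""
```

A call such as `extract_text("https://example.com", suffix=".txt")` would then return the name of the first command that produced content together with the text itself, and the temporary directory is removed when the function returns.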
#### Key options (requests)

These options are shared between the 4 HTTP request commands:

| Option | Input type | Description |
|:---|:---:|:---|
| -H, --headers | TEXT | HTTP headers in format "Key: Value" (can be used multiple times) |
| --cookies | TEXT | Cookies string in format "name1=value1; name2=value2" |
| --timeout | INTEGER | Request timeout in seconds (default: 30) |
| --proxy | TEXT | Proxy URL in format "http://username:password@host:port" |
| -s, --css-selector | TEXT | CSS selector to extract specific content from the page. It returns all matches. |
| -p, --params | TEXT | Query parameters in format "key=value" (can be used multiple times) |
| --follow-redirects / --no-follow-redirects | None | Whether to follow redirects (default: True) |
| --verify / --no-verify | None | Whether to verify SSL certificates (default: True) |
| --impersonate | TEXT | Browser to impersonate. Can be a single browser (e.g., Chrome) or a comma-separated list for random selection (e.g., Chrome, Firefox, Safari). |
| --stealthy-headers / --no-stealthy-headers | None | Use stealthy browser headers (default: True) |

Options shared between `post` and `put` only:

| Option | Input type | Description |
|:---|:---:|:---|
| -d, --data | TEXT | Form data to include in the request body (as string, ex: "param1=value1&param2=value2") |
| -j, --json | TEXT | JSON data to include in the request body (as string) |

Examples:

```bash
# Basic download
scrapling extract get "https://news.site.com" news.md

# Download with custom timeout
scrapling extract get "https://example.com" content.txt --timeout 60

# Extract only specific content using CSS selectors
scrapling extract get "https://blog.example.com" articles.md --css-selector "article"

# Send a request with cookies
scrapling extract get "https://scrapling.requestcatcher.com" content.md --cookies "session=abc123; user=john"

# Add user agent
scrapling extract get "https://api.site.com" data.json -H "User-Agent: MyBot 1.0"

# Add multiple headers
scrapling extract get "https://site.com" page.html -H "Accept: text/html" -H "Accept-Language: en-US"
```

#### Key options (browsers)

Both browser commands (`fetch` and `stealthy-fetch`) share these options:

| Option | Input type | Description |
|:---|:---:|:---|
| --headless / --no-headless | None | Run browser in headless mode (default: True) |
|