scrapling
An adaptive web scraping framework with anti-bot bypass and spider crawling.
Installation / Download
TotalClaw CLI (recommended):
`totalclaw install totalclaw:totalclaw~zendenho7-scrapling`
cURL (direct download, no login required):
`curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~zendenho7-scrapling/file -o zendenho7-scrapling.md`
# Scrapling - Adaptive Web Scraping
> "Effortless web scraping for the modern web."
---
## Credits
### Core Library
- **Repository:** https://github.com/D4Vinci/Scrapling
- **Author:** D4Vinci (Karim Shoair)
- **License:** BSD-3-Clause
- **Documentation:** https://scrapling.readthedocs.io
### API Reverse Engineering Methodology
- **GitHub:** https://github.com/paoloanzn/free-solscan-api
- **X Post:** https://x.com/paoloanzn/status/2026361234032046319
- **Author:** @paoloanzn
- **Insight:** "Web scraping is 80% reverse engineering"
---
## Installation
```bash
# Core library (parser only)
pip install scrapling
# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install
# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"
# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"
# Everything
pip install "scrapling[all]"
# Browser for stealth/dynamic mode
playwright install chromium
# For Cloudflare bypass (advanced)
pip install cloudscraper
```
---
## Agent Instructions
### When to Use Scrapling
**Use Scrapling to:**
- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites
**Do NOT use for:**
- X/Twitter (use x-tweet-fetcher skill)
- Login-protected sites (unless credentials provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
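
The "respect robots.txt" rule above can be checked before any fetch. The following is a minimal sketch using only Python's standard `urllib.robotparser`; the helper name `is_allowed` and the sample robots.txt content are assumptions made for illustration and are not part of Scrapling.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, inlined so the sketch runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```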
---
## Quick Commands
### 1. Basic Fetch (Most Common)
```python
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://example.com')
# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()
```
### 2. Stealthy Fetch (Anti-Bot/Cloudflare)
```python
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)
```
### 3. Dynamic Fetch (Full Browser Automation)
```python
from scrapling.fetchers import DynamicFetcher
page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)
```
### 4. Adaptive Parsing (Survives Design Changes)
```python
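from scrapling.fetchers import Fetcher
# Enable adaptive mode first (same switch as in the stealthy example);
# auto_save and adaptive lookups rely on it
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')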
# First scrape - saves selectors
items = page.css('.product', auto_save=True)
# Later - if site changes, use adaptive=True to relocate
items = page.css('.product', adaptive=True)
```
### 5. Spider (Multiple Pages)
```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}

        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])


MySpider().start()
```
### 6. CLI Usage
```bash
# Simple fetch to file
scrapling extract get https://example.com content.html
# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html
# Interactive shell
scrapling shell https://example.com
```
---
## Common Patterns
### Extract Article Content
```python
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://example.com/article')
# Try multiple selectors for title
title = (
    page.css('[itemprop="headline"]::text').get()
    or page.css('article h1::text').get()
    or page.css('h1::text').get()
)
# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()
print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")
```
### Research Multiple Pages
```python
from scrapling.spiders import Spider, Response


class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        # Keep the first 10 story titles on each page
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}

        # Follow the "More" link to the next page, if present
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)


ResearchSpider().start()
```
### Crawl Entire Site (Easy Mode)
Auto-crawl all pages on a domain by following internal links:
```python
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse


class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.visited = set()  # URLs that have already been parsed

    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }

        # Follow internal links (limit to 50 pages)
        if len(self.visited) >= 50:
            return
        self.visited.add(response.url)

        domain = urlparse(response.url).netloc
        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            # Skip external links and pages that were already parsed
            if urlparse(full_url).netloc == domain and full_url not in self.visited:
                yield response.follow(full_url)


# Usage
result = EasyCrawl().start()
```
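
Following only internal links, as the crawl above sets out to do, comes down to resolving each `href` and comparing hosts. The sketch below uses only the standard library; `internal_links` is a hypothetical helper name, and dropping `#fragment` suffixes and non-HTTP schemes is an added assumption that keeps the visited set from filling up with duplicates.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(base_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against base_url and keep unique same-host http(s) links."""
    host = urlparse(base_url).netloc
    seen = set()
    result = []
    for href in hrefs:
        # Make the link absolute, then drop any #fragment part
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, ...
        if parsed.netloc != host or absolute in seen:
            continue  # external link or duplicate
        seen.add(absolute)
        result.append(absolute)
    return result


links = internal_links(
    "https://example.com/a/",
    ["b", "/c#top", "https://other.com/x", "mailto:hi@example.com", "/c"],
)
print(links)
```

Inside a spider's `parse` method, the `hrefs` argument would be the list returned by `response.css('a::attr(href)').getall()`.
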
### Sitemap Crawl
Crawl pages from `sitemap.xml` (with fallback to link discovery):
```python
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re


def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    all_urls = []

    # First check robots.txt for sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        pass  # robots.txt is optional; fall back to the default locations

    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            text = page.text
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            continue  # unreachable or malformed sitemap; try the next one

    # De-duplicate while keeping the order in which URLs were found
    return list(dict.fromkeys(all_urls))[:max_urls]


def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from sitemap."""
    print(f"Fetching sitemap for {domain_url}...")
    urls = get_sitemap_urls(domain_url)
    if not urls:
        print("No sitemap found. Use EasyCrawl instead!")
        return []

    print(f"Found {len(urls)} URLs, crawling first {max_pages}...")
    results = []
    for url in urls[:max_pages]:
        try:
            page = Fetcher.get(url, timeout=10)
            results.append({
                'url': url,
                'status': page.status,
                'title': page.css('title::text').get(),
            })
        except Exception as e:
            # Record the failure (message truncated to 50 chars) and keep going
            results.append({'url': url, 'error': str(e)[:50]})
    return results

# Usage
print("=== S