scrapling

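TotalClaw · Author: totalclaw · v1.0.8

An adaptive web scraping framework with anti-bot bypass and spider crawling capabilities.

Installation / Download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~zendenho7-scrapling

cURL direct download, no login required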

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~zendenho7-scrapling/file -o zendenho7-scrapling.md

## Original

# Scrapling - Adaptive Web Scraping

> "Effortless web scraping for the modern web."

---

## Credits

### Core Library
- **Repository:** https://github.com/D4Vinci/Scrapling
- **Author:** D4Vinci (Karim Shoair)
- **License:** BSD-3-Clause
- **Documentation:** https://scrapling.readthedocs.io

### API Reverse Engineering Methodology
- **GitHub:** https://github.com/paoloanzn/free-solscan-api
- **X Post:** https://x.com/paoloanzn/status/2026361234032046319
- **Author:** @paoloanzn
- **Insight:** "Web scraping is 80% reverse engineering"

---

## Installation

```bash
# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - RECOMMENDED
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - RECOMMENDED
pip install "scrapling[shell]"

# With AI (MCP server) - OPTIONAL
pip install "scrapling[ai]"

# Everything
pip install "scrapling[all]"

# Browser for stealth/dynamic mode
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper
```

---

## Agent Instructions

### When to Use Scrapling

**Use Scrapling to:**
- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites

**Do NOT use for:**
- X/Twitter (use x-tweet-fetcher skill)
- Login-protected sites (unless credentials are provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
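
The robots.txt check mentioned above can be automated before any fetch. The following is a minimal sketch using only Python's standard library; the helper name `is_allowed` and the sample rules are illustrative assumptions, not part of Scrapling.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
```

In practice the robots.txt text would come from the target site itself, and a `False` result would be a reason to skip the URL. This covers robots.txt only; a site's terms of service still need to be read separately.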

---

## Quick Commands

### 1. Basic Fetch (Most Common)

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()
```

### 2. Stealthy Fetch (Anti-Bot/Cloudflare)

```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)
```

### 3. Dynamic Fetch (Full Browser Automation)

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)
```

### 4. Adaptive Parsing (Survives Design Changes)

```python
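from scrapling.fetchers import Fetcher

# Enable adaptive mode before fetching (as with StealthyFetcher above);
# without it, auto_save and adaptive are expected to be ignored
Fetcher.adaptive = True
page = Fetcher.get('https://example.com')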

# First scrape - saves selectors
items = page.css('.product', auto_save=True)

# Later - if site changes, use adaptive=True to relocate
items = page.css('.product', adaptive=True)
```

### 5. Spider (Multiple Pages)

```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {"item": item.css('h2::text').get()}
        
        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])

MySpider().start()
```

### 6. CLI Usage

```bash
# Simple fetch to file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com
```

---

## Common Patterns

### Extract Article Content

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com/article')

# Try multiple selectors for title
title = (
    page.css('[itemprop="headline"]::text').get() or
    page.css('article h1::text').get() or
    page.css('h1::text').get()
)

# Get paragraphs
content = page.css('article p::text, .article-body p::text').getall()

print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")
```
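
The fallback chain above can be factored into a small reusable helper. This is a sketch, not a Scrapling API: `first_text` and `TITLE_SELECTORS` are hypothetical names, and the helper only assumes that `page.css(selector).get()` returns a string or `None`, as in the example above.

```python
from typing import Any, Iterable, Optional


def first_text(page: Any, selectors: Iterable[str]) -> Optional[str]:
    """Try each selector in order and return the first non-empty, stripped text."""
    for selector in selectors:
        value = page.css(selector).get()
        if value and value.strip():
            return value.strip()
    return None


# Selectors ordered from most specific to most generic
TITLE_SELECTORS = [
    '[itemprop="headline"]::text',
    'article h1::text',
    'h1::text',
]
```

With a fetched page, `title = first_text(page, TITLE_SELECTORS)` replaces the chained `or` expression and also skips matches that contain only whitespace.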

### Research Multiple Pages

```python
from scrapling.spiders import Spider, Response

class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5
    
    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {"title": item, "source": "HN"}
        
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)

ResearchSpider().start()
```

### Crawl Entire Site (Easy Mode)

Auto-crawl all pages on a domain by following internal links:

```python
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse

class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    
    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3
    
    def __init__(self):
        super().__init__()
        self.visited = set()
    
    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
        
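        # Follow internal links only (limit to 50 pages)
        self.visited.add(response.url)
        domain = urlparse(response.url).netloc

        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            if len(self.visited) >= 50:
                break
            full_url = urljoin(response.url, link)
            # Skip external domains and URLs that are already queued
            if urlparse(full_url).netloc == domain and full_url not in self.visited:
                self.visited.add(full_url)
                yield response.follow(full_url)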

# Usage
spider = EasyCrawl()
spider.start()
```
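
When following internal links, it helps to normalize each `href` before comparing it against the visited set, so that `page#top` and `page` are not crawled twice and `mailto:` or off-site links are dropped. The helper below is a standard-library sketch; `normalize_internal_link` is a hypothetical name and is not part of Scrapling.

```python
from typing import Optional
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_internal_link(page_url: str, href: str) -> Optional[str]:
    """Resolve href against page_url and return it only if it stays on the same host."""
    # Make the link absolute, then drop any #fragment
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    target = urlparse(absolute)
    # Ignore non-web schemes such as mailto: or javascript:
    if target.scheme not in ("http", "https"):
        return None
    # Ignore links that leave the current host
    if target.netloc != urlparse(page_url).netloc:
        return None
    return absolute
```

Inside a spider's `parse` method, each extracted `href` could be passed through this helper and followed only when the result is not `None` and not already visited.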

### Sitemap Crawl

Crawl pages from `sitemap.xml` (with fallback to link discovery):

```python
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re

def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""
    
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    
    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]
    
    all_urls = []
    
    # First check robots.txt for sitemap URL
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        # robots.txt is optional; fall back to the default sitemap locations
        pass
    
    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            
            text = page.text
            
            # Check if it's XML
            if '<?xml' in text or '<urlset' in text or '<sitemapindex' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            # Skip sitemap locations that fail to load
            continue
    
    # De-duplicate while keeping the order in which URLs were discovered
    return list(dict.fromkeys(all_urls))[:max_urls]

def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from sitemap."""
    
    print(f"Fetching sitemap for {domain_url}...")
    urls = get_sitemap_urls(domain_url)
    
    if not urls:
        print("No sitemap found. Use EasyCrawl instead!")
        return []
    
    print(f"Found {len(urls)} URLs, crawling first {max_pages}...")
    
    results = []
    for url in urls[:max_pages]:
        try:
            page = Fetcher.get(url, timeout=10)
            results.append({
                'url': url,
                'status': page.status,
                'title': page.css('title::text').get(),
            })
        except Exception as e:
            results.append({'url': url, 'error': str(e)[:50]})
    
    return results

# Usage
print("=== S