Web Scraper Pro

TotalClaw · Author: web-scraper · v1.0.0

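An intelligent web scraping tool that fetches any URL and returns clean Markdown content. It is triggered by requests such as "帮我抓取网页" (scrape this page for me), "获取这个网页内容" (get this page's content), "fetch this URL", "scrape this page", "读取网页" (read this page), "get web content", "爬取" (crawl), or "抓取" (scrape), or when the user provides a URL whose content they want to read or extract.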


Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:codehourra~web-scraper-pro
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Acodehourra~web-scraper-pro/file -o web-scraper-pro.md
Git (get the source from the repository)
git clone https://github.com/openclaw/skills.git
git -C skills checkout ed6fd3483babd5fbcf49e31f25d76a46646200d0
# Web Scraper Pro

Intelligent web page fetcher that converts any URL into clean, readable Markdown content. Uses a multi-layer fallback strategy to ensure maximum compatibility.

**💰 Pricing: 0.001 USDT per web fetch**

## ⚠️ CRITICAL: Payment Integration (MUST READ FIRST)

This skill is monetized via [SkillPay](https://skillpay.me). **Before fetching ANY URL, you MUST execute the payment verification step.**

### Payment Configuration

```python
# ═══════════════════════════════════════════════════
# SkillPay Billing Integration
# ═══════════════════════════════════════════════════
import requests

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}
```

### Billing API Functions (MANDATORY)

**⚠️ THIS CODE MUST BE EXECUTED BEFORE ANY WEB FETCH:**

```python
import requests
import os

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

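# ① Check balance
def check_balance(user_id: str) -> float:
    resp = requests.get(
        f'{BILLING_API_URL}/api/v1/billing/balance',
        params={'user_id': user_id},
        headers=HEADERS,
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return resp.json()['balance']  # USDT amount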

# ② Charge per call
def charge_user(user_id: str) -> dict:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/charge',
        headers=HEADERS,
        json={
            'user_id': user_id,
            'skill_id': SKILL_ID,
            'amount': 0.001,  # USDT per call
        },
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    data = resp.json()
    # Treat a missing 'success' field as a failed charge rather than raising KeyError
    if data.get('success'):
        return {'ok': True, 'balance': data['balance']}
    # Insufficient balance: the response carries a BNB Chain USDT payment link
    return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

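# ③ Generate payment link
def get_payment_link(user_id: str, amount: float) -> str:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/payment-link',
        headers=HEADERS,
        json={'user_id': user_id, 'amount': amount},
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    resp.raise_for_status()  # surface HTTP errors instead of failing on a missing key
    return resp.json()['payment_url']  # BNB Chain USDT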

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user")
```

### Payment Workflow Integration

```python
# ========================================
# STEP 0: CHARGE USER (MANDATORY)
# ========================================
# This must be the FIRST step before ANY web fetch

charge_result = charge_user(USER_ID)

if not charge_result['ok']:
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  💳 INSUFFICIENT BALANCE                                      ║
║                                                               ║
║  This web fetch costs 0.001 USDT.                            ║
║  Current balance: {charge_result['balance']:<41}║
║                                                               ║
║  Please top up at:                                            ║
║  {str(charge_result.get('payment_url') or 'unavailable'):<56}║
║                                                               ║
║  After payment, please retry your request.                   ║
╚══════════════════════════════════════════════════════════════╝
    """)
    raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT")
```

---

## Multi-Layer Fetch Strategy

This skill uses a multi-layer fallback strategy to maximize compatibility:

| Layer | Service | URL prefix | Characteristics | Best for |
|------|------|----------|------|----------|
| **Layer 1** | markdown.new | `https://markdown.new/` | Cloudflare-native, three-stage internal fallback, fastest | Most websites (first choice) |
| **Layer 2** | defuddle.md | `https://defuddle.md/` | Open source, lightweight, supports YAML frontmatter | Non-Cloudflare sites |
| **Layer 3** | Jina Reader | `https://r.jina.ai/` | AI-driven, precise content extraction | Complex pages |
| **Layer 4** | Scrapling | Python library | Adaptive scraper with strong anti-bot evasion | Last resort |
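
The fallback order in the table can be sketched as a small driver that tries each layer in turn and stops at the first one that returns content. This is only an illustration: `fetch_with_fallback` and the stub layers are hypothetical names, not part of the skill. It assumes every layer is a function that takes a URL and returns Markdown, or `None` to signal failure.

```python
from typing import Callable, Iterable, Optional, Tuple

# A fetcher takes a URL and returns Markdown, or None to mean "try the next layer".
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, layers: Iterable[Tuple[str, Fetcher]]) -> Optional[str]:
    """Try each (name, fetcher) pair in order; return the first non-empty result."""
    for name, fetcher in layers:
        try:
            content = fetcher(url)
        except Exception as exc:  # a crashing layer should not stop the chain
            print(f"[{name}] raised {exc!r}, trying the next layer")
            continue
        if content and content.strip():
            return content
        print(f"[{name}] returned no content, trying the next layer")
    return None  # every layer failed


if __name__ == "__main__":
    # Stub layers stand in for the real fetchers so the sketch runs offline.
    stub_layers = [
        ("layer-1", lambda u: None),           # simulates a rate-limited service
        ("layer-2", lambda u: f"# Page at {u}"),
    ]
    print(fetch_with_fallback("https://example.com", stub_layers))
```

Passing the layers in as a list keeps the order configurable, so a caller could skip a layer that is known to be rate limited.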

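### Layer 1: markdown.new (First Choice, Fastest)

A Cloudflare-powered URL-to-Markdown conversion service with a built-in three-stage fallback:
- **Native Markdown**: content negotiation via `Accept: text/markdown`
- **Workers AI**: AI-based HTML-to-Markdown conversion
- **Browser rendering**: a headless browser for JavaScript-heavy pages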

```python
import requests
from typing import Optional

def fetch_via_markdown_new(url: str, method: str = "auto", retain_images: bool = True) -> Optional[str]:
    """
    Layer 1: fetch a web page via markdown.new.

    Args:
        url: Target page URL
        method: Conversion method - "auto" | "ai" | "browser"
        retain_images: Whether to keep image links

    Returns:
        Optional[str]: The page content as Markdown, or None if this layer failed
    """
    api_url = "https://markdown.new/"

    try:
        response = requests.post(
            api_url,
            headers={"Content-Type": "application/json"},
            json={
                "url": url,
                "method": method,
                "retain_images": retain_images
            },
            timeout=60
        )

        if response.status_code == 200:
            token_count = response.headers.get("x-markdown-tokens", "unknown")
            print(f"✅ [markdown.new] Fetch succeeded (tokens: {token_count})")
            return response.text
        elif response.status_code == 429:
            print("⚠️ [markdown.new] Rate limited, falling back to the next layer...")
            return None
        else:
            print(f"⚠️ [markdown.new] Returned status code {response.status_code}, falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [markdown.new] Request failed: {e}, falling back to the next layer...")
        return None
```

**Supported query parameters**:
- `method=auto|ai|browser` - selects the conversion method
- `retain_images=true|false` - whether to keep images
- Rate limit: 500 requests per IP per day
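
As a rough illustration of how these query parameters might be combined with the `https://markdown.new/` URL prefix, the sketch below builds a GET-style URL. The helper name `build_markdown_new_url` is hypothetical. Appending the parameters after the target URL is an assumption about the service; if the target URL carries its own query string, the result may be ambiguous, in which case a JSON POST body is the safer route.

```python
from urllib.parse import urlencode

MARKDOWN_NEW_PREFIX = "https://markdown.new/"
ALLOWED_METHODS = {"auto", "ai", "browser"}


def build_markdown_new_url(target_url: str, method: str = "auto", retain_images: bool = True) -> str:
    """Build a prefix-style markdown.new URL with the documented query parameters.

    Assumption: the service reads `method` and `retain_images` from the query
    string that follows the target URL.
    """
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}, got {method!r}")
    query = urlencode({
        "method": method,
        "retain_images": str(retain_images).lower(),  # True -> "true"
    })
    return f"{MARKDOWN_NEW_PREFIX}{target_url}?{query}"


if __name__ == "__main__":
    print(build_markdown_new_url("https://example.com", method="browser", retain_images=False))
```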

### Layer 2: defuddle.md (Alternative)

An open-source web-page-to-Markdown extraction service, developed by the creator of Obsidian Web Clipper.

```python
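import re
from typing import Optional

import requests

def fetch_via_defuddle(url: str) -> Optional[str]:
    """
    Layer 2: fetch a web page via defuddle.md.

    Args:
        url: Target page URL (the https:// prefix is optional)

    Returns:
        Optional[str]: Markdown with YAML frontmatter, or None if this layer failed
    """
    # defuddle accepts the target URL appended directly to its path.
    # Strip only a leading scheme so "http://" inside a query string is left alone.
    clean_url = re.sub(r"^https?://", "", url)
    api_url = f"https://defuddle.md/{clean_url}"

    try:
        response = requests.get(api_url, timeout=60)

        # Very short bodies are treated as empty or failed extractions
        if response.status_code == 200 and len(response.text.strip()) > 50:
            print("✅ [defuddle.md] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [defuddle.md] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [defuddle.md] Request failed: {e}, falling back to the next layer...")
        return None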
```
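
Because defuddle.md returns Markdown with a YAML frontmatter block, a caller may want to separate the metadata from the body. The following is a minimal sketch, not part of the skill: `split_frontmatter` is a hypothetical helper that handles only flat `key: value` lines (a full YAML parser would be needed for nested values), and the `title` and `source` fields in the sample are illustrative rather than a documented schema.

```python
from typing import Dict, Tuple


def split_frontmatter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a leading '---' delimited frontmatter block from the Markdown body.

    Returns (metadata, body). If no complete frontmatter block is present,
    metadata is empty and the input is returned unchanged.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown

    # Find the closing delimiter; bail out if the block is never closed.
    for end, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            break
    else:
        return {}, markdown

    metadata: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep and key.strip():
            metadata[key.strip()] = value.strip().strip("\"'")

    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body


if __name__ == "__main__":
    sample = '---\ntitle: "Example Domain"\nsource: https://example.com\n---\n\n# Example\n\nBody text.\n'
    meta, body = split_frontmatter(sample)
    print(meta)
    print(body)
```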

### Layer 3: Jina Reader (AI Content Extraction)

Jina AI's reader service, which handles complex pages well.

```python
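from typing import Optional

import requests

def fetch_via_jina(url: str) -> Optional[str]:
    """
    Layer 3: fetch a web page via Jina Reader.

    Args:
        url: Full URL of the target page

    Returns:
        Optional[str]: The extracted main text, or None if this layer failed
    """
    api_url = f"https://r.jina.ai/{url}"

    try:
        response = requests.get(
            api_url,
            headers={"Accept": "text/markdown"},
            timeout=60
        )

        # Very short bodies are treated as empty or failed extractions
        if response.status_code == 200 and len(response.text.strip()) > 50:
            print("✅ [Jina Reader] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [Jina Reader] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [Jina Reader] Request failed: {e}, falling back to the next layer...")
        return None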
```

**Extra feature**: Jina also supports a search mode at `https://s.jina.ai/YOUR_SEARCH_QUERY`
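
A search query usually contains spaces or non-ASCII characters, so it needs to be percent-encoded before it is placed in the URL path. The sketch below shows one way to do that. `build_jina_search_url` is a hypothetical helper name, and it assumes the search endpoint accepts a percent-encoded query as the path segment.

```python
from urllib.parse import quote

JINA_SEARCH_PREFIX = "https://s.jina.ai/"


def build_jina_search_url(query: str) -> str:
    """Return the search-mode URL for a free-text query.

    safe="" makes quote() encode every reserved character, including "/",
    so the whole query stays inside a single path segment.
    """
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return JINA_SEARCH_PREFIX + quote(cleaned, safe="")


if __name__ == "__main__":
    print(build_jina_search_url("python web scraping"))
```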

### Layer 4: Scrapling (Last Resort, Anti-Bot Evasion)

A powerful adaptive scraping framework that can bypass anti-bot mechanisms such as Cloudflare Turnstile.

```bash
# Install Scrapling
pip install scrapling
# For browser features (anti-bot evasion)
pip install "scrapling[fetchers]"
scrapling install
```
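
Since Scrapling is an optional dependency, it can help to check whether it is importable before relying on this layer. The following is a small sketch using only the standard library; `is_installed` is a hypothetical helper, and it is meant for top-level module names only.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("scrapling"):
        print("Scrapling is available; Layer 4 can be used.")
    else:
        print("Scrapling is not installed; run `pip install scrapling` first.")
```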

```python
def fetch_via_scrapling(url: str, use_stealth: bool = False) -> str:
    """
    Layer 4: fetch a web page via Scrapling (last-resort fallback).
    
    Args:
        url: Target page URL
        use_stealth: Whether to use stealth mode (to bypass Cloudflare and similar protections)
    
    Returns: