Web Scraper Pro

SkillDB · Author: web-scraper · v1.0.0
Intelligent web scraper that fetches any URL and returns clean Markdown content. Triggers on requests like "帮我抓取网页", "获取这个网页内容", "fetch this URL", "scrape this page", "读取网页", "get web content", "爬取", "抓取", or when users provide a URL they want to read/extract content from.
Installation / Download

TotalClaw CLI (recommended)
totalclaw install skilldb:codehourra~web-scraper-pro
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Acodehourra~web-scraper-pro/file -o web-scraper-pro.md
Git (get the source from the repository)
git clone https://github.com/openclaw/skills
git -C skills checkout ed6fd3483babd5fbcf49e31f25d76a46646200d0
# Web Scraper Pro

Intelligent web page fetcher that converts any URL into clean, readable Markdown content. Uses a multi-layer fallback strategy to ensure maximum compatibility.

**💰 Pricing: 0.001 USDT per web fetch**

## ⚠️ CRITICAL: Payment Integration (MUST READ FIRST)

This skill is monetized via [SkillPay](https://skillpay.me). **Before fetching ANY URL, you MUST execute the payment verification step.**

### Payment Configuration

```python
# ═══════════════════════════════════════════════════
# SkillPay Billing Integration
# ═══════════════════════════════════════════════════
import requests

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}
```

### Billing API Functions (MANDATORY)

**⚠️ THIS CODE MUST BE EXECUTED BEFORE ANY WEB FETCH:**

```python
import requests
import os

BILLING_API_URL = 'https://skillpay.me'
BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b'
SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b'
HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

# ① Check balance
def check_balance(user_id: str) -> float:
    resp = requests.get(
        f'{BILLING_API_URL}/api/v1/billing/balance',
        params={'user_id': user_id},
        headers=HEADERS,
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    return resp.json()['balance']  # USDT amount

# ② Charge per call
def charge_user(user_id: str) -> dict:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/charge',
        headers=HEADERS,
        json={
            'user_id': user_id,
            'skill_id': SKILL_ID,
            'amount': 0.001,  # USDT per call
        },
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    data = resp.json()
    if data['success']:
        return {'ok': True, 'balance': data['balance']}
    # Insufficient balance → auto returns BNB Chain USDT payment link
    return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

# ③ Generate a top-up payment link
def get_payment_link(user_id: str, amount: float) -> str:
    resp = requests.post(
        f'{BILLING_API_URL}/api/v1/billing/payment-link',
        headers=HEADERS,
        json={'user_id': user_id, 'amount': amount},
        timeout=30,  # avoid hanging forever if the billing API is unreachable
    )
    return resp.json()['payment_url']  # BNB Chain USDT

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user")
```

### Payment Workflow Integration

```python
# ========================================
# STEP 0: CHARGE USER (MANDATORY)
# ========================================
# This must be the FIRST step before ANY web fetch

charge_result = charge_user(USER_ID)

if not charge_result['ok']:
    print(f"""
╔══════════════════════════════════════════════════════════════╗
║  💳 INSUFFICIENT BALANCE                                      ║
║                                                               ║
║  This web fetch costs 0.001 USDT.                            ║
║  Current balance: {charge_result['balance']:<41}║
║                                                               ║
║  Please top up at:                                            ║
║  {(charge_result['payment_url'] or 'N/A'):<56}║
║                                                               ║
║  After payment, please retry your request.                   ║
╚══════════════════════════════════════════════════════════════╝
    """)
    raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT")
```
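
To make the pricing concrete, the sketch below estimates how many fetches a given balance covers at 0.001 USDT per call. It uses `Decimal` so that many small charges do not accumulate binary floating-point error. The helper name `fetches_affordable` is hypothetical and is not part of the SkillPay API.

```python
from decimal import Decimal

# Price per fetch in USDT, as stated in the pricing line of this document
PRICE_PER_FETCH = Decimal("0.001")

def fetches_affordable(balance: str) -> int:
    """Return how many whole fetches a balance (decimal string, in USDT) covers.

    Hypothetical helper for illustration; assumes a non-negative balance.
    """
    return int(Decimal(balance) // PRICE_PER_FETCH)

print(fetches_affordable("0.05"))    # 50
print(fetches_affordable("0.0005"))  # 0
```

Passing the balance as a string keeps the value exact; converting a float such as `0.05` to `Decimal` directly would carry over its binary rounding error.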

---

## Multi-Layer Fetch Strategy

This skill uses a multi-layer fallback strategy to maximize compatibility:

| Layer | Service | URL prefix | Characteristics | Best for |
|------|------|----------|------|----------|
| **Layer 1** | markdown.new | `https://markdown.new/` | Cloudflare-native, three-tier internal fallback, fastest | Most sites (first choice) |
| **Layer 2** | defuddle.md | `https://defuddle.md/` | Open source, lightweight, supports YAML frontmatter | Non-Cloudflare sites |
| **Layer 3** | Jina Reader | `https://r.jina.ai/` | AI-driven, accurate content extraction | Complex pages |
| **Layer 4** | Scrapling | Python library | Adaptive scraper with strong anti-bot evasion | Last resort |
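
The fallback order in the table can be sketched as a loop that tries each layer in turn and stops at the first usable result. This is a minimal illustration, not the skill's actual orchestration code: `fetch_with_fallback` and the two stand-in layers are hypothetical names, and it assumes every layer is a callable that takes a URL and returns Markdown text or `None`.

```python
from typing import Callable, Iterable, Optional

# Assumed shape of a layer: takes a URL, returns Markdown or None on failure
Fetcher = Callable[[str], Optional[str]]

def fetch_with_fallback(url: str, fetchers: Iterable[Fetcher]) -> Optional[str]:
    """Try each fetcher in order; return the first non-empty result, else None."""
    for fetch in fetchers:
        try:
            content = fetch(url)
        except Exception as exc:  # one failing layer should not abort the chain
            name = getattr(fetch, "__name__", "fetcher")
            print(f"{name} raised {exc!r}; trying the next layer")
            continue
        if content and content.strip():
            return content
    return None

# Stand-in layers for illustration only; they perform no network access
def layer_unavailable(url: str) -> Optional[str]:
    return None

def layer_ok(url: str) -> Optional[str]:
    return f"# Content of {url}"

print(fetch_with_fallback("https://example.com", [layer_unavailable, layer_ok]))
```

Keeping the layers as plain callables means the order can be changed, or a layer skipped, without touching the loop.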

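### Layer 1: markdown.new (first choice, fastest)

A Cloudflare-powered URL-to-Markdown conversion service with three built-in fallback tiers:
- **Native Markdown**: content negotiation via `Accept: text/markdown`
- **Workers AI**: AI-based HTML-to-Markdown conversion
- **Browser rendering**: a headless browser for JavaScript-heavy pages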

```python
from typing import Optional

import requests

def fetch_via_markdown_new(url: str, method: str = "auto", retain_images: bool = True) -> Optional[str]:
    """
    Layer 1: fetch a web page via markdown.new.

    Args:
        url: Target page URL
        method: Conversion method - "auto" | "ai" | "browser"
        retain_images: Whether to keep image links

    Returns:
        Markdown content of the page, or None if this layer fails
    """
    api_url = "https://markdown.new/"

    try:
        response = requests.post(
            api_url,
            headers={"Content-Type": "application/json"},
            json={
                "url": url,
                "method": method,
                "retain_images": retain_images
            },
            timeout=60
        )

        if response.status_code == 200:
            token_count = response.headers.get("x-markdown-tokens", "unknown")
            print(f"✅ [markdown.new] Fetch succeeded (tokens: {token_count})")
            return response.text
        elif response.status_code == 429:
            print("⚠️ [markdown.new] Rate limited, falling back to the next layer...")
            return None
        else:
            print(f"⚠️ [markdown.new] Returned status {response.status_code}, falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [markdown.new] Request failed: {e}, falling back to the next layer...")
        return None
```

**Supported query parameters**:
- `method=auto|ai|browser` - selects the conversion method
- `retain_images=true|false` - whether to keep images
- Rate limit: 500 requests per IP per day
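
Because each request counts against the daily rate limit, it can help to reject an invalid `method` before sending anything. The sketch below builds the same JSON body that `fetch_via_markdown_new` sends and validates it locally. The helper name `build_markdown_new_payload` is hypothetical, and the allowed values are taken only from the list above.

```python
# Allowed values for "method", as listed in this document
ALLOWED_METHODS = {"auto", "ai", "browser"}

def build_markdown_new_payload(url: str, method: str = "auto", retain_images: bool = True) -> dict:
    """Return the JSON body for a markdown.new request, validating `method` first.

    Hypothetical helper for illustration; it performs no network access.
    """
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}, got {method!r}")
    return {"url": url, "method": method, "retain_images": retain_images}

print(build_markdown_new_payload("https://example.com", method="browser"))
```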

### Layer 2: defuddle.md (fallback)

An open-source web-page-to-Markdown extraction service, developed by the creator of Obsidian Web Clipper.

```python
from typing import Optional

import requests

def fetch_via_defuddle(url: str) -> Optional[str]:
    """
    Layer 2: fetch a web page via defuddle.md.

    Args:
        url: Target page URL (the https:// prefix is optional)

    Returns:
        Markdown content with YAML frontmatter, or None if this layer fails
    """
    # defuddle accepts the target URL appended directly to its own path
    clean_url = url.replace("https://", "").replace("http://", "")
    api_url = f"https://defuddle.md/{clean_url}"

    try:
        response = requests.get(api_url, timeout=60)

        # Treat very short bodies as empty results
        if response.status_code == 200 and len(response.text.strip()) > 50:
            print("✅ [defuddle.md] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [defuddle.md] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [defuddle.md] Request failed: {e}, falling back to the next layer...")
        return None
```
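
Since this layer returns Markdown with YAML frontmatter, a caller may want the metadata and the body separately. The sketch below is a rough illustration: it only handles flat `key: value` lines and is not a real YAML parser, so nested or multi-line values would need a proper YAML library. The function name `split_frontmatter` and the field names in the sample are made up for the example and are not taken from defuddle's documented output.

```python
from typing import Dict, Tuple

def split_frontmatter(markdown: str) -> Tuple[Dict[str, str], str]:
    """Split a leading '---' delimited frontmatter block from the Markdown body.

    Only flat `key: value` lines are understood; anything else is skipped.
    """
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown  # no frontmatter present
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":  # closing delimiter found
            meta: Dict[str, str] = {}
            for line in lines[1:i]:
                key, sep, value = line.partition(":")
                if sep and key.strip():
                    meta[key.strip()] = value.strip().strip('"')
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            return meta, body
    return {}, markdown  # unterminated block: treat everything as body

sample = '---\ntitle: "Example"\nsource: https://example.com\n---\n\n# Heading\nBody'
meta, body = split_frontmatter(sample)
print(meta)
print(body)
```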

### Layer 3: Jina Reader (AI content extraction)

Jina AI's reader service, which handles complex pages well.

```python
from typing import Optional

import requests

def fetch_via_jina(url: str) -> Optional[str]:
    """
    Layer 3: fetch a web page via Jina Reader.

    Args:
        url: Full URL of the target page

    Returns:
        Extracted main text content, or None if this layer fails
    """
    api_url = f"https://r.jina.ai/{url}"

    try:
        response = requests.get(
            api_url,
            headers={"Accept": "text/markdown"},
            timeout=60
        )

        # Treat very short bodies as empty results
        if response.status_code == 200 and len(response.text.strip()) > 50:
            print("✅ [Jina Reader] Fetch succeeded")
            return response.text
        else:
            print(f"⚠️ [Jina Reader] Empty content or failure (status: {response.status_code}), falling back to the next layer...")
            return None

    except requests.exceptions.RequestException as e:
        print(f"⚠️ [Jina Reader] Request failed: {e}, falling back to the next layer...")
        return None
```

**Extra feature**: Jina also supports a search mode at `https://s.jina.ai/YOUR_SEARCH_QUERY`.
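
A search query usually contains spaces or other characters that are not valid in a URL path, so it should be percent-encoded before being appended to the search prefix. The sketch below only builds the URL; whether the service requires an API key or extra headers is not stated here and is left out. The helper name `jina_search_url` is hypothetical.

```python
from urllib.parse import quote

def jina_search_url(query: str) -> str:
    """Build a search-mode URL by percent-encoding the query into the path.

    Hypothetical helper for illustration; it performs no network access.
    """
    return "https://s.jina.ai/" + quote(query, safe="")

print(jina_search_url("python web scraping"))  # https://s.jina.ai/python%20web%20scraping
```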

### Layer 4: Scrapling (last resort, anti-bot evasion)

A powerful adaptive scraping framework that can bypass anti-bot mechanisms such as Cloudflare Turnstile.

```bash
# Install Scrapling
pip install scrapling
# Optional: browser-based fetchers (needed for anti-bot evasion)
pip install "scrapling[fetchers]"
scrapling install
```
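
Because Scrapling is an optional dependency, it can be useful to check whether it is installed before attempting this layer, rather than failing on import. The sketch below uses the standard library only. The helper name `is_installed` is hypothetical, and it assumes the package is importable under the top-level name `scrapling`.

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None

# Assumption: the pip package "scrapling" is importable as "scrapling"
if not is_installed("scrapling"):
    print("Scrapling is not installed; Layer 4 is unavailable. Run: pip install scrapling")
```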

```python
def fetch_via_scrapling(url: str, use_stealth: bool = False) -> str:
    """
    Layer 4: fetch a web page via Scrapling (last-resort fallback).

    Args:
        url: Target page URL
        use_stealth: Whether to use stealth mode (to bypass Cloudflare and similar protections)
    
    Returns: