web-scraper-as-a-service

TotalClaw | Author: totalclaw

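Build client-ready web scrapers with clean data output. Use when creating scrapers for clients, extracting data from websites, or delivering scraping projects.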

Install / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~seanwyngaard-web-scraper-as-a-service
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~seanwyngaard-web-scraper-as-a-service/file -o seanwyngaard-web-scraper-as-a-service.md

# Web Scraper as a Service

Turn scraping briefs into deliverable scraping projects. This skill generates the scraper, runs it, cleans the data, and packages everything for the client.

## How to Use

```
/web-scraper-as-a-service "Scrape all products from example-store.com — need name, price, description, images. CSV output."
/web-scraper-as-a-service https://example.com --fields "title,price,rating,url" --format csv
/web-scraper-as-a-service brief.txt
```

## Scraper Generation Pipeline

### Step 1: Analyze the Target

Before writing any code:

1. **Fetch the target URL** to understand the page structure
2. **Identify**:
   - Is the site server-rendered (static HTML) or client-rendered (JavaScript/SPA)?
   - What anti-scraping measures are visible? (Cloudflare, CAPTCHAs, rate limits)
   - Pagination pattern (URL params, infinite scroll, load more button)
   - Data structure (product cards, table rows, list items)
   - Total estimated volume (number of pages/items)
3. **Choose the right tool**:
   - Static HTML → Python + `requests` + `BeautifulSoup`
   - JavaScript-rendered → Python + `playwright`
   - API available → Direct API calls (check the browser's network tab for request patterns)
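
One rough way to answer the static-versus-rendered question is to look for a value you can see in the browser inside the raw, un-rendered HTML. The sketch below is only a heuristic; `choose_tool` and its return labels are hypothetical names introduced for illustration, and it assumes you already hold the raw response body as a string.

```python
def choose_tool(raw_html: str, visible_sample: str) -> str:
    """Suggest a scraping approach from the raw (un-rendered) HTML.

    visible_sample is a piece of text seen on the page in a browser,
    for example a product name. Both names are illustrative only.
    """
    if visible_sample.lower() in raw_html.lower():
        # The data is already present in the server response.
        return "requests+beautifulsoup"
    # The data is missing from the raw HTML, so it is probably
    # injected by JavaScript after the page loads.
    return "playwright"


static_page = "<html><body><h2 class='name'>Blue Kettle</h2></body></html>"
spa_shell = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"

print(choose_tool(static_page, "Blue Kettle"))
print(choose_tool(spa_shell, "Blue Kettle"))
```

A miss does not prove the site is an SPA (the sample text may simply be formatted differently in the markup), so treat the result as a starting point and confirm in the browser's developer tools.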

### Step 2: Build the Scraper

Generate a complete Python script in the `scraper/` directory:

```
scraper/
  scrape.py           # Main scraper script
  requirements.txt    # Dependencies
  config.json         # Target URLs, fields, settings
  README.md           # Setup and usage instructions for client
```

**`scrape.py` must include**:

```python
# Required features in every scraper:
import json
import time  # used with time.sleep() for rate limiting and retries

# 1. Configuration
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# 2. Rate limiting (ALWAYS: be respectful)
DELAY_BETWEEN_REQUESTS = 2  # seconds, adjustable in config

# 3. Retry logic
MAX_RETRIES = 3
RETRY_DELAY = 5  # seconds to wait before retrying a failed request

# 4. User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    # ... at least 5 user agents
]

# 5. Progress tracking
def report_progress(current, total, items_collected):
    print(f"Scraping page {current}/{total}: {items_collected} items collected")

# 6. Error handling
# - Log errors but don't crash on individual page failures
# - Save progress incrementally (don't lose data on crash)
# - Write errors to error_log.txt

# 7. Output
# - Save data incrementally (append to file, don't hold in memory)
# - Support CSV and JSON output
# - Clean and normalize data before saving

# 8. Resume capability
# - Track last successfully scraped page/URL
# - Can resume from where it left off if interrupted
```
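
The rate-limiting, retry, and error-handling requirements above could be combined roughly as follows. This is a minimal sketch, not the skill's actual implementation: `fetch_with_retry` is a hypothetical helper, the `fetch` callable stands in for whatever HTTP client the scraper uses, and the User-Agent string is a placeholder. The `sleep` parameter exists only so the example can run without real delays.

```python
import logging
import random
import time
from typing import Callable, Optional

log = logging.getLogger("scraper")  # attach a FileHandler for error_log.txt in real use

USER_AGENTS = [
    "example-scraper/1.0 (+https://example.com/contact)",  # placeholder value
]


def fetch_with_retry(
    url: str,
    fetch: Callable[[str, dict], str],
    max_retries: int = 3,
    retry_delay: float = 5.0,
    delay_between_requests: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try a URL up to max_retries times; return the body, or None on failure."""
    for attempt in range(1, max_retries + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            body = fetch(url, headers)
            sleep(delay_between_requests)  # polite pause before the next request
            return body
        except Exception as exc:  # log and retry instead of crashing the run
            log.warning("attempt %d/%d failed for %s: %s", attempt, max_retries, url, exc)
            if attempt < max_retries:
                sleep(retry_delay)
    return None  # the caller records the failed URL and moves on


# Demo with a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url: str, headers: dict) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retry("https://example.com/page/1", flaky_fetch, sleep=lambda s: None)
print(result, calls["n"])
```

Returning `None` rather than raising keeps a single bad page from stopping the whole run, which matches the "don't crash on individual page failures" requirement.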

### Step 3: Data Cleaning

After scraping, clean the data:

1. **Remove duplicates** (by unique identifier or composite key)
2. **Normalize text** (strip extra whitespace, fix encoding issues, consistent capitalization)
3. **Validate data** (no empty required fields, prices are numbers, URLs are valid)
4. **Standardize formats** (dates to ISO 8601, currency to numbers, consistent units)
5. **Generate data quality report**:
   ```
   Data Quality Report
   ───────────────────
   Total records: 2,487
   Duplicates removed: 13
   Empty fields filled: 0
   Fields with issues: price (3 records had non-numeric values — cleaned)
   Completeness: 99.5%
   ```
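
The cleaning steps above might be sketched like this. Everything here is illustrative: `clean_records` and `parse_price` are hypothetical helpers, completeness is assumed to mean the share of records with every required field filled, and the price parser assumes US-style separators (comma for thousands, dot for decimals).

```python
import re


def parse_price(text):
    """Convert a price string such as "$1,299.00" to a float, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    return float(match.group().replace(",", "")) if match else None


def clean_records(records, key_fields, required_fields):
    """Normalize whitespace, drop duplicates, and compute simple quality metrics."""
    seen = set()
    cleaned = []
    duplicates = 0
    for rec in records:
        # Normalize text: collapse runs of whitespace and trim both ends.
        row = {
            k: re.sub(r"\s+", " ", v).strip() if isinstance(v, str) else v
            for k, v in rec.items()
        }
        # Deduplicate on a unique identifier or composite key.
        key = tuple(row.get(f) for f in key_fields)
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        cleaned.append(row)
    complete = sum(
        1 for r in cleaned
        if all(r.get(f) not in (None, "") for f in required_fields)
    )
    completeness = 100.0 * complete / len(cleaned) if cleaned else 0.0
    report = {
        "total": len(cleaned),
        "duplicates_removed": duplicates,
        "completeness_pct": round(completeness, 1),
    }
    return cleaned, report


raw = [
    {"sku": "A1", "name": "  Blue   Kettle ", "price": "$1,299.00"},
    {"sku": "A1", "name": "Blue Kettle", "price": "$1,299.00"},
    {"sku": "B2", "name": "Red Mug", "price": ""},
]
rows, report = clean_records(raw, key_fields=["sku"], required_fields=["name", "price"])
print(rows[0]["name"], parse_price(rows[0]["price"]))
print(report)
```

The `report` dictionary holds the raw numbers that a data quality report like the one above would be rendered from.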

### Step 4: Client Deliverable Package

Generate a complete deliverable:

```
delivery/
  data.csv                    # Clean data in requested format
  data.json                   # JSON alternative
  data-quality-report.md      # Quality metrics
  scraper-documentation.md    # How the scraper works
  README.md                   # Quick start guide
```

**`scraper-documentation.md`** includes:
- What was scraped and from where
- How many records collected
- Data fields and their descriptions
- How to re-run the scraper
- Known limitations
- Date of scraping
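
Writing the two data files of the package could look roughly like the following. This is a sketch under stated assumptions: `write_deliverables` is a hypothetical helper, the rows are assumed to be a list of dictionaries, and the demo writes to a temporary directory instead of `delivery/` so it leaves nothing behind.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_deliverables(rows, fields, out_dir="delivery"):
    """Write data.csv and data.json into out_dir and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # CSV: fixed column order; keys outside `fields` are ignored.
    with open(out / "data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

    # JSON alternative: keep non-ASCII text readable.
    with open(out / "data.json", "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)

    return out


with tempfile.TemporaryDirectory() as tmp:
    out = write_deliverables(
        [{"name": "Blue Kettle", "price": 1299.0}],
        fields=["name", "price"],
        out_dir=tmp,
    )
    print(sorted(p.name for p in out.iterdir()))
```

The Markdown files in the package (quality report, documentation, README) can be written the same way from plain strings.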

### Step 5: Output to User

Present:
1. **Summary**: X records scraped from Y pages, Z% data quality
2. **Sample data**: First 5 rows of the output
3. **File locations**: Where the deliverables are saved
4. **Client handoff notes**: What to tell the client about the data

## Scraper Templates

Based on the target type, use the appropriate template:

### E-commerce Product Scraper
Fields: name, price, original_price, discount, description, images, category, sku, rating, review_count, availability, url

### Real Estate Listings
Fields: address, price, bedrooms, bathrooms, sqft, lot_size, listing_type, agent, description, images, url

### Job Listings
Fields: title, company, location, salary, job_type, description, requirements, posted_date, url

### Directory/Business Listings
Fields: business_name, address, phone, website, category, rating, review_count, hours, description

### News/Blog Articles
Fields: title, author, date, content, tags, url, image
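
One way to make these templates usable from code is a plain mapping from target type to field list. The field names below are copied from the templates above; the dictionary keys (`ecommerce`, `real_estate`, and so on) and the `empty_record` helper are assumptions made for this sketch.

```python
TEMPLATES = {
    "ecommerce": [
        "name", "price", "original_price", "discount", "description", "images",
        "category", "sku", "rating", "review_count", "availability", "url",
    ],
    "real_estate": [
        "address", "price", "bedrooms", "bathrooms", "sqft", "lot_size",
        "listing_type", "agent", "description", "images", "url",
    ],
    "jobs": [
        "title", "company", "location", "salary", "job_type", "description",
        "requirements", "posted_date", "url",
    ],
    "directory": [
        "business_name", "address", "phone", "website", "category", "rating",
        "review_count", "hours", "description",
    ],
    "articles": ["title", "author", "date", "content", "tags", "url", "image"],
}


def empty_record(template: str) -> dict:
    """Return a record with every template field present and set to None."""
    return {field: None for field in TEMPLATES[template]}


record = empty_record("articles")
record["title"] = "Example headline"
print(record)
```

Starting every row from `empty_record` keeps the column set stable even when a page is missing some fields.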

## Ethical Scraping Rules

1. **Always respect robots.txt**: check it before scraping
2. **Rate limit**: keep a minimum 2-second delay between requests
3. **Identify yourself**: use a realistic but honest User-Agent
4. **Don't scrape personal data** (emails, phone numbers) unless the client has explicitly authorized it AND the data is publicly displayed
5. **Cache responses**: don't re-scrape pages unnecessarily
6. **Check ToS**: note whether the site's terms prohibit scraping and inform the client
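
The robots.txt rule can be checked with Python's standard `urllib.robotparser`. In the sketch below, `is_allowed`, the sample robots.txt text, and the `example-scraper` agent name are illustrative assumptions; the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(ROBOTS, "example-scraper", "https://example.com/products"))
print(is_allowed(ROBOTS, "example-scraper", "https://example.com/private/data"))
```

Against a live site, the same parser can load the file itself with `set_url()` followed by `read()` before calling `can_fetch()`.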

---

### Step 4: Client Deliverable Package

Generate a complete deliverable:

```
delivery/
  data.csv                    # Clean data in requested format
  data.json                   # JSON alternative
  data-quality-report.md      # Quality metrics
  scraper-documentation.md    # How the scraper works
  README.md