web-scraper-as-a-service
Build client-ready web scrapers with clean data output. Use when creating scrapers for clients, extracting data from websites, or delivering scraping projects.
Installation / download
TotalClaw CLI (recommended): `totalclaw install totalclaw:totalclaw~seanwyngaard-web-scraper-as-a-service`
cURL direct download, no login required: `curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~seanwyngaard-web-scraper-as-a-service/file -o seanwyngaard-web-scraper-as-a-service.md`
# Web Scraper as a Service
Turn scraping briefs into deliverable scraping projects: generate the scraper, run it, clean the data, and package everything for the client.
## How to Use
```
/web-scraper-as-a-service "Scrape all products from example-store.com — need name, price, description, images. CSV output."
/web-scraper-as-a-service https://example.com --fields "title,price,rating,url" --format csv
/web-scraper-as-a-service brief.txt
```
## Scraper Generation Pipeline
### Step 1: Analyze the Target
Before writing any code:
1. **Fetch the target URL** to understand the page structure
2. **Identify**:
   - Is the site server-rendered (static HTML) or client-rendered (JavaScript/SPA)?
   - What anti-scraping measures are visible? (Cloudflare, CAPTCHAs, rate limits)
   - Pagination pattern (URL params, infinite scroll, "load more" button)
   - Data structure (product cards, table rows, list items)
   - Total estimated volume (number of pages/items)
3. **Choose the right tool**:
   - Static HTML → Python + `requests` + `BeautifulSoup`
   - JavaScript-rendered → Python + `playwright`
   - API available → Direct API calls (check network tab patterns)
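One way to make the static-versus-client-rendered call is to check whether values you can see in the browser also appear in the raw, un-rendered HTML. The sketch below is only a heuristic under that assumption; `choose_tool`, its arguments, and the 50% threshold are hypothetical and not part of the skill.

```python
def choose_tool(raw_html: str, visible_samples: list[str]) -> str:
    """Suggest a scraping stack from raw (un-rendered) HTML.

    raw_html: the response body fetched without executing JavaScript.
    visible_samples: strings seen on the rendered page (e.g. a product name).
    """
    if not visible_samples:
        raise ValueError("provide at least one sample string from the rendered page")
    found = sum(1 for sample in visible_samples if sample in raw_html)
    # If most visible values are already in the raw HTML, the page is
    # probably server-rendered, so requests + BeautifulSoup should suffice.
    if found / len(visible_samples) >= 0.5:
        return "requests+beautifulsoup"
    # Otherwise the content is likely injected by JavaScript.
    return "playwright"
```

A result of `"playwright"` is also a cue to look at the network tab first, since a client-rendered page often loads its data from an API that can be called directly.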
### Step 2: Build the Scraper
Generate a complete Python scraper in the `scraper/` directory:
```
scraper/
  scrape.py           # Main scraper script
  requirements.txt    # Dependencies
  config.json         # Target URLs, fields, settings
  README.md           # Setup and usage instructions for client
```
**`scrape.py` must include**:
```python
# Required features in every scraper:
import json
import time

# 1. Configuration
with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

# 2. Rate limiting (ALWAYS: be respectful)
DELAY_BETWEEN_REQUESTS = 2  # seconds, adjustable in config

# 3. Retry logic
MAX_RETRIES = 3
RETRY_DELAY = 5

# 4. User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    # ... at least 5 user agents
]

# 5. Progress tracking
def report_progress(current, total, items_collected):
    """Print one progress line per page."""
    print(f"Scraping page {current}/{total}: {items_collected} items collected")

# 6. Error handling
#    - Log errors but don't crash on individual page failures
#    - Save progress incrementally (don't lose data on crash)
#    - Write errors to error_log.txt

# 7. Output
#    - Save data incrementally (append to file, don't hold in memory)
#    - Support CSV and JSON output
#    - Clean and normalize data before saving

# 8. Resume capability
#    - Track last successfully scraped page/URL
#    - Can resume from where it left off if interrupted
```
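The retry, rate-limit, and resume requirements above could be combined roughly as follows. This is a minimal sketch, not the skill's actual implementation: `fetch` stands in for whatever function downloads a page (for example a thin wrapper around `requests.get`), and the function names and the JSON checkpoint format are assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def fetch_with_retries(fetch, url, max_retries=3, retry_delay=5, sleep=time.sleep):
    """Call fetch(url), retrying on any exception up to max_retries attempts.

    Returns the fetch result, or None if every attempt failed, so that one
    bad page does not crash the whole run.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: log and keep going
            print(f"Attempt {attempt}/{max_retries} failed for {url}: {exc}")
            if attempt < max_retries:
                sleep(retry_delay)
    return None


def load_checkpoint(path):
    """Return the set of URLs already scraped, or an empty set on first run."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text(encoding="utf-8")))


def save_checkpoint(path, done):
    """Persist the completed URLs so an interrupted run can resume."""
    Path(path).write_text(json.dumps(sorted(done)), encoding="utf-8")


def crawl(urls, fetch, checkpoint_path, delay=2, sleep=time.sleep):
    """Fetch each URL once, skipping those recorded in the checkpoint."""
    done = load_checkpoint(checkpoint_path)
    results = {}
    for url in urls:
        if url in done:
            continue
        body = fetch_with_retries(fetch, url, sleep=sleep)
        if body is not None:
            results[url] = body
            done.add(url)
            save_checkpoint(checkpoint_path, done)  # save after every page
        sleep(delay)  # rate limit between requests
    return results
```

Passing `sleep` as a parameter keeps the delays real in production while letting a test swap in a no-op.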
### Step 3: Data Cleaning
After scraping, clean the data:
1. **Remove duplicates** (by unique identifier or composite key)
2. **Normalize text** (strip extra whitespace, fix encoding issues, consistent capitalization)
3. **Validate data** (no empty required fields, prices are numbers, URLs are valid)
4. **Standardize formats** (dates to ISO 8601, currency to numbers, consistent units)
5. **Generate data quality report**:
```
Data Quality Report
───────────────────
Total records: 2,487
Duplicates removed: 13
Empty fields filled: 0
Fields with issues: price (3 records had non-numeric values — cleaned)
Completeness: 99.5%
```
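The cleaning steps and the report figures could be produced along these lines. This is a sketch under stated assumptions: records are plain dicts, the key and required field names come from the caller, and `clean_records` and `parse_price` are hypothetical helpers. The price parser assumes `.` is the decimal separator and `,` a thousands separator.

```python
import re


def clean_records(records, key_fields, required_fields):
    """Deduplicate, normalize, and validate a list of dict records.

    Returns (clean_rows, report).
    """
    seen = set()
    clean = []
    duplicates = 0
    incomplete = 0
    for raw in records:
        row = {}
        for field, value in raw.items():
            if isinstance(value, str):
                value = re.sub(r"\s+", " ", value).strip()  # collapse whitespace
            row[field] = value
        key = tuple(row.get(f) for f in key_fields)  # composite key
        if key in seen:
            duplicates += 1
            continue
        seen.add(key)
        if any(row.get(f) in (None, "") for f in required_fields):
            incomplete += 1
        clean.append(row)
    total = len(clean)
    report = {
        "total_records": total,
        "duplicates_removed": duplicates,
        "incomplete_records": incomplete,
        "completeness_pct": round(100 * (total - incomplete) / total, 1) if total else 0.0,
    }
    return clean, report


def parse_price(text):
    """Convert a price string such as '$1,299.00' to a float, or None."""
    digits = re.sub(r"[^\d.]", "", text or "")
    try:
        return float(digits)
    except ValueError:
        return None
```

Incomplete rows are counted rather than dropped here, so the decision to discard or repair them stays visible in the report.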
### Step 4: Client Deliverable Package
Generate a complete deliverable:
```
delivery/
  data.csv                  # Clean data in requested format
  data.json                 # JSON alternative
  data-quality-report.md    # Quality metrics
  scraper-documentation.md  # How the scraper works
  README.md                 # Quick start guide
```
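Writing the two data files from the same cleaned records might look like the sketch below. It assumes the records are a list of flat dicts; `write_deliverables` is a hypothetical helper name, and the remaining Markdown files in the package are not covered here.

```python
import csv
import json
from pathlib import Path


def write_deliverables(records, out_dir="delivery"):
    """Write the same records to data.csv and data.json under out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Union of keys in first-seen order, so rows with missing fields still fit.
    fieldnames = list(dict.fromkeys(k for row in records for k in row))
    with open(out / "data.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(records)
    with open(out / "data.json", "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return out / "data.csv", out / "data.json"
```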
**`scraper-documentation.md`** includes:
- What was scraped and from where
- How many records collected
- Data fields and their descriptions
- How to re-run the scraper
- Known limitations
- Date of scraping
### Step 5: Output to User
Present:
1. **Summary**: X records scraped from Y pages, Z% data quality
2. **Sample data**: First 5 rows of the output
3. **File locations**: Where the deliverables are saved
4. **Client handoff notes**: What to tell the client about the data
## Scraper Templates
Based on the target type, use the appropriate template:
### E-commerce Product Scraper
Fields: name, price, original_price, discount, description, images, category, sku, rating, review_count, availability, url
### Real Estate Listings
Fields: address, price, bedrooms, bathrooms, sqft, lot_size, listing_type, agent, description, images, url
### Job Listings
Fields: title, company, location, salary, job_type, description, requirements, posted_date, url
### Directory/Business Listings
Fields: business_name, address, phone, website, category, rating, review_count, hours, description
### News/Blog Articles
Fields: title, author, date, content, tags, url, image
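The templates above map naturally onto a lookup table that seeds `config.json`. The sketch below is illustrative only: the template keys (`"ecommerce"`, `"jobs"`, and so on) and the config keys (`start_url`, `fields`, `format`, `delay_seconds`) are assumptions, since the source does not specify the layout of `config.json`.

```python
TEMPLATES = {
    "ecommerce": ["name", "price", "original_price", "discount", "description",
                  "images", "category", "sku", "rating", "review_count",
                  "availability", "url"],
    "real_estate": ["address", "price", "bedrooms", "bathrooms", "sqft",
                    "lot_size", "listing_type", "agent", "description",
                    "images", "url"],
    "jobs": ["title", "company", "location", "salary", "job_type",
             "description", "requirements", "posted_date", "url"],
    "directory": ["business_name", "address", "phone", "website", "category",
                  "rating", "review_count", "hours", "description"],
    "articles": ["title", "author", "date", "content", "tags", "url", "image"],
}


def build_config(start_url, template, output_format="csv"):
    """Return a config dict for the given template (hypothetical layout)."""
    if template not in TEMPLATES:
        raise KeyError(f"unknown template: {template!r}")
    if output_format not in ("csv", "json"):
        raise ValueError("output_format must be 'csv' or 'json'")
    return {
        "start_url": start_url,
        "fields": list(TEMPLATES[template]),  # copy so callers can edit freely
        "format": output_format,
        "delay_seconds": 2,
    }
```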
## Ethical Scraping Rules
1. **Always respect robots.txt**: check it before scraping
2. **Rate limit**: keep a minimum 2-second delay between requests
3. **Identify yourself**: use a realistic but honest User-Agent
4. **Don't scrape personal data** (emails, phone numbers) unless explicitly authorized by the client AND the data is publicly displayed
5. **Cache responses**: don't re-scrape pages unnecessarily
6. **Check ToS**: note if the site's terms prohibit scraping and inform the client
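
Rules 1 and 2 can be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses robots.txt text that has already been downloaded. `robots_policy` is a hypothetical helper, and taking the larger of the 2-second floor and any `Crawl-delay` is one reasonable reading of the rules, not something the source prescribes.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt, user_agent, url, default_delay=2):
    """Return (allowed, delay_seconds) for url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    crawl_delay = parser.crawl_delay(user_agent)  # None if not specified
    # Never go below the default floor; honor a longer Crawl-delay if set.
    delay = max(default_delay, crawl_delay or 0)
    return allowed, delay
```

In a live scraper the same parser can fetch the file itself via `set_url()` followed by `read()`. The text-based form here keeps the check easy to test offline.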