MrScraper

SkillDB | Author: ai-mrscraper | v1.0.4

Run AI-powered, unblockable web scraping and data extraction from natural-language instructions via the MrScraper API.

Installation / Download

TotalClaw CLI (recommended)
totalclaw install skilldb:ai-mrscraper~mrscraper
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/skilldb%3Aai-mrscraper~mrscraper/file -o mrscraper.md
Git (get the source from the repository)
git clone https://github.com/openclaw/skills
git -C skills checkout 17df49e67eb181cf1e37d26c13d3fa1f17c0ad10
# MrScraper

## Actions

This skill supports:

- Opening blocked pages through unblocker (stealth browser + IP rotation)
- Starting AI scraper runs from natural-language instructions
- Rerunning existing scraper configurations on one or multiple URLs
- Running manual workflow-based reruns
- Fetching paginated results and detailed results by ID

This skill is API-only and does not depend on bundled local scripts.

## Base URLs

- Unblocker API: `https://api.mrscraper.com`
- Platform API: `https://api.app.mrscraper.com`

## Authentication

### Unblocker API auth

Use query-parameter auth on the Unblocker endpoint:

- `token=<MRSCRAPER_API_TOKEN>`

### Platform API auth

Use header-based auth on platform endpoints:

```http
x-api-token: <MRSCRAPER_API_TOKEN>
accept: application/json
content-type: application/json
```

### How to get `MRSCRAPER_API_TOKEN`?

An API token lets your applications securely interact with MrScraper APIs and rerun scrapers created in the dashboard.

Follow these steps in the dashboard:

1. Click your **User Profile** at the top-right corner.
2. Select **API Tokens**.
3. Click **New Token**.
4. Enter a **name** and set an **expiration date**.
5. Click **Create**.
6. Copy the new token and store it securely as `MRSCRAPER_API_TOKEN`.
7. Use it in requests through the `x-api-token` header (Platform API) or the `token` query parameter (Unblocker API).

Security rules:

- Never expose tokens in client-side code (browser/mobile app bundles).
- Store tokens in environment variables or server-side secret managers.
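
As a minimal sketch of the environment-variable approach, the helpers below read the token at call time and refuse to continue when it is missing. The function names `require_token` and `platform_auth_header` are hypothetical and not part of the MrScraper API; only the `MRSCRAPER_API_TOKEN` variable name and the `x-api-token` header come from this document.

```bash
# Hypothetical helper: print the token from the environment, or fail with a
# message on stderr if MRSCRAPER_API_TOKEN is unset or empty.
require_token() {
  if [ -z "${MRSCRAPER_API_TOKEN:-}" ]; then
    echo "MRSCRAPER_API_TOKEN is not set" >&2
    return 1
  fi
  printf '%s' "$MRSCRAPER_API_TOKEN"
}

# Hypothetical helper: build the Platform API auth header for curl's -H flag.
# Returns non-zero (and prints nothing) when the token is missing.
platform_auth_header() {
  _token="$(require_token)" || return 1
  printf 'x-api-token: %s' "$_token"
}
```

A call such as `curl -H "$(platform_auth_header)" ...` then keeps the token out of the script source itself.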

Notes from the auth docs:

- The API key works for all V3 Platform endpoints.
- The same key can be used for endpoints on `sync.scraper.mrscraper.com`.
- For access to endpoints on other hosts, contact `support@mrscraper.com`.

## Install and Runtime

- No local install step is required by this skill document.
- No bundled `scripts/` are required.
- Calls are direct HTTPS requests to the two base URLs above.

## Data and Scope

- Data is sent only to `api.app.mrscraper.com` and `api.mrscraper.com`.
- Responses may contain extracted page content and scrape metadata.
- This skill does not define hidden persistence or background jobs.
- Never expose tokens in logs, commits, or output.
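
Because the Unblocker API carries the token in the query string, a request URL written to a log would leak it. One possible safeguard, sketched here under the assumption that URLs are logged as plain strings, is to mask the `token` value first. The function name `redact_token_in_url` is hypothetical.

```bash
# Hypothetical helper: replace the value of the `token` query parameter with
# `***` so the URL can be logged without exposing the secret.
redact_token_in_url() {
  printf '%s\n' "$1" | sed -E 's/([?&]token=)[^&]*/\1***/'
}
```

Only the `token` parameter is masked; other query parameters pass through unchanged.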

## Endpoints

### 1. Unblocker

- Method: `GET`
- URL: `https://api.mrscraper.com`
- Auth: `token` query parameter

Opens a target URL through stealth browsing and IP rotation, then returns the page HTML. Use this when direct access is blocked by CAPTCHAs or anti-bot protections.

#### Query parameters:

| Field            | Type      | Required | Default | Description                             |
| ---------------- | --------- | -------- | ------- | --------------------------------------- |
| `token`          | `string`  | Yes      | —       | Unblocker token (`MRSCRAPER_API_TOKEN`) |
| `url`            | `string`  | Yes      | —       | URL-encoded target URL                  |
| `timeout`        | `number`  | No       | 60      | Max wait in seconds (example `120`)     |
| `geoCode`        | `string`  | No       | None    | Geographic routing code (example `SG`)  |
| `blockResources` | `boolean` | No       | false   | Block non-essential resources           |

#### Request example:

```bash
curl --location 'https://api.mrscraper.com?token=<MRSCRAPER_API_TOKEN>&timeout=120&geoCode=SG&url=https%3A%2F%2Fwww.lazada.sg%2Fproducts%2Fpdp-i111650098-s23209659764.html&blockResources=false'
```
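
The `url` parameter has to be percent-encoded before it is placed in the query string. The following is a rough sketch of how a request URL could be assembled in portable shell; the function names `urlencode` and `build_unblocker_url` are hypothetical, the encoder assumes ASCII input, and the optional arguments follow the parameter table above.

```bash
# Percent-encode a string for use as a query-parameter value.
# RFC 3986 unreserved characters are kept as-is. Assumes ASCII input.
urlencode() {
  _rest="$1"
  _out=""
  while [ -n "$_rest" ]; do
    _tail="${_rest#?}"        # everything after the first character
    _c="${_rest%"$_tail"}"    # the first character itself
    _rest="$_tail"
    case "$_c" in
      [A-Za-z0-9.~_-]) _out="${_out}${_c}" ;;
      *) _out="${_out}$(printf '%%%02X' "'$_c")" ;;
    esac
  done
  printf '%s' "$_out"
}

# Build an Unblocker request URL.
# Arguments: TOKEN TARGET_URL [TIMEOUT] [GEO_CODE] [BLOCK_RESOURCES]
# Optional arguments that are empty or omitted are left out of the URL.
build_unblocker_url() {
  _url="https://api.mrscraper.com?token=$(urlencode "$1")&url=$(urlencode "$2")"
  if [ -n "${3:-}" ]; then _url="${_url}&timeout=$3"; fi
  if [ -n "${4:-}" ]; then _url="${_url}&geoCode=$4"; fi
  if [ -n "${5:-}" ]; then _url="${_url}&blockResources=$5"; fi
  printf '%s' "$_url"
}
```

The result can be passed straight to curl, for example `curl --location "$(build_unblocker_url "$MRSCRAPER_API_TOKEN" "https://www.lazada.sg/products/pdp-i111650098-s23209659764.html" 120 SG)"`.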

#### Response example:

```html
<!doctype html>
<html>
  <head>...</head>
  <body>...</body>
</html>
```

#### Notes:

- Prefer explicit `geoCode` and practical timeouts for repeatable behavior.
- Only pass cookies when session-specific content is required.

### 2. Create AI Scraper

- Method: `POST`
- Host: `https://api.app.mrscraper.com`
- Path: `/api/v1/scrapers-ai`
- Auth: `x-api-token`

Create a new AI scraper run from natural-language instructions.

#### Payload parameters (for `agent`: `general` or `agent`: `listing`):

| Field          | Type     | Required | Default | Description                                          |
| -------------- | -------- | -------- | ------- | ---------------------------------------------------- |
| `url`          | `string` | Yes      | —       | Target URL                                           |
| `message`      | `string` | Yes      | —       | Extraction instruction                               |
| `agent`        | `string` | No       | general | AI agent type to use: `general`, `listing`, or `map` |
| `proxyCountry` | `string` | No       | None    | ISO country code for proxy-based scraping            |

#### Payload parameters (for `agent`: `map`):

| Field             | Type     | Required | Default | Description                                                                                                                                  |
| ----------------- | -------- | -------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------- |
| `url`             | `string` | Yes      | —       | Target URL                                                                                                                                   |
| `agent`           | `string` | No       | map     | AI agent type to use (in this case, `map`)                                                                                                   |
| `maxDepth`        | `number` | No       | 2       | Maximum depth for crawling links from the starting URL.<br>0 = only the starting URL, 1 = the starting URL plus pages it links to directly  |
| `maxPages`        | `number` | No       | 50      | Maximum number of pages to scrape during the crawl                                                                                           |
| `limit`           | `number` | No       | 1000    | Maximum number of data records to extract across all pages. Scraping stops when this limit is reached                                        |
| `includePatterns` | `string` | No       | ""      | Regex patterns to include (separate multiple with `\|\|`)                                                                                    |
| `excludePatterns` | `string` | No       | ""      | Regex patterns to exclude (separate multiple with `\|\|`)                                                                                    |

#### Request example:

```bash
curl -X POST "https://api.app.mrscraper.com/api/v1/scrapers-ai" \
  -H "x-api-token: <MRSCRAPER_API_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "message": "Extract title, price, stocks, and rating",
    "agent": "general"
  }'
```
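
For the `map` agent, the payload uses the crawl fields instead of `message`. The sketch below shows one way such a payload might be assembled; the function name `build_map_payload` is hypothetical, the defaults are taken from the `map` parameter table, and the URL is assumed to contain no characters that need JSON escaping. The response shape for `map` runs is not shown in this document, so none is assumed here.

```bash
# Hypothetical helper: print a JSON payload for a `map` agent run.
# Arguments: URL [MAX_DEPTH] [MAX_PAGES] [LIMIT]
# Omitted arguments fall back to the documented defaults (2, 50, 1000).
# Assumption: URL contains no double quotes or backslashes.
build_map_payload() {
  printf '{"url":"%s","agent":"map","maxDepth":%d,"maxPages":%d,"limit":%d}' \
    "$1" "${2:-2}" "${3:-50}" "${4:-1000}"
}

# Example call (commented out so the snippet has no side effects):
# curl -X POST "https://api.app.mrscraper.com/api/v1/scrapers-ai" \
#   -H "x-api-token: ${MRSCRAPER_API_TOKEN}" \
#   -H "Content-Type: application/json" \
#   -d "$(build_map_payload "https://books.toscrape.com" 1 10 100)"
```

Keeping `maxDepth` and `maxPages` small on a first run limits how far the crawl spreads while the patterns are being tuned.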

#### Response example:

```json
{
  "id": "497f6eca-6276-4993-bfeb-53cbbbba6f08",
  "createdAt": "2019-08-24T14:15:22Z",
  "createdById": "e13e432a-5323-4484-a91d-b5969bc564d9",
  "updatedAt": "2019-08-24T14:15:22Z",
  "updatedById": "d8bc6076-4141-4a88-80b9-0eb31643066f",
  "deletedAt": "2019-08-24T14:15:22Z",
  "deletedById": "8ef578ad-7f1e-4656-b48b-b1b4a9aaa1cb",
  "userId": "2c4a230c-5085-4924-a3e1-25fb4fc5965b",
  "scraperId": "6695bf87-aaa6-46b0-b1ee-88586b222b0b",
  "type": "AI",
  "url": "http://example.com",
  "status": "Finished",
  "error": "string",
  "tokenUsage": 0,
  "runtime": 0,
  "data": {},
  "htmlPath": "string",
  "recordingPath": "string",
  "screenshotPath": "string",
  "dataPath": "string"
}
```

The `data` field holds the main scraped data.

#### Notes:

- Choose the agent type carefully, because each agent is specialized for specific use cases.
- Use `general` for most standard web scraping tasks. It is the go-to agent when the user does not specify a type or the connected LLM is not confident about the type of page. It is mostly used for product pages but also handles other page types well.
- Use `listing` for listing pages such as product listings or job listings. Choose it when the connected LLM can confidently identify the given URL as a listing page.
- Use `map` for crawling a website and collecting all of its subdomains or subpages. Choose it when the user specifies that the given URL is a website and not a specific page. For `map` agent type, there is a special args