pulpminer

TotalClaw · Author: totalclaw

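Use AI to convert any webpage into structured JSON data. Scrape websites, extract data into custom JSON schemas, and call saved APIs programmatically. Useful for web scraping, data extraction, content monitoring, lead generation, price tracking, and building data pipelines.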

Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~melvin2016-webscraper-pulpminer

cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~melvin2016-webscraper-pulpminer/file -o melvin2016-webscraper-pulpminer.md

# PulpMiner — AI Web Scraping & JSON API

PulpMiner converts any webpage into structured JSON using AI. You provide a URL and optionally a JSON template, and PulpMiner scrapes the page, runs it through an LLM, and returns clean structured data.

## Authentication

All API calls require the `apikey` header:

```
apikey: <PULPMINER_API_KEY>
```

Get your API key from https://pulpminer.com/api — click "Regenerate Key" if you don't have one.

## Core Workflow

PulpMiner works in two phases:

1. **Create a saved API** — Configure a URL, scraper, LLM, and optional JSON template via the PulpMiner dashboard at https://pulpminer.com/api
2. **Call the saved API** — Use the external endpoint with your API key to fetch structured JSON

## Calling a Saved API

### Static API (fixed URL)

```bash
curl -X GET "https://api.pulpminer.com/external/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>"
```

Returns JSON extracted from the configured webpage.
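
For callers who prefer Python over curl, the same request can be sketched with the standard library. This is a minimal sketch, not an official client: the helper name `build_static_request` and the `BASE_URL` constant are illustrative, and only the endpoint path and the `apikey` header come from the curl example above.

```python
import urllib.request

# Base of the external endpoint shown in the curl example.
BASE_URL = "https://api.pulpminer.com/external"


def build_static_request(api_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a saved static API.

    Hypothetical helper for illustration; it only assembles the URL and header.
    """
    return urllib.request.Request(
        f"{BASE_URL}/{api_id}",
        headers={"apikey": api_key},
        method="GET",
    )


# Sending the request uses the network and consumes credits:
# with urllib.request.urlopen(build_static_request("<apiId>", "<PULPMINER_API_KEY>")) as resp:
#     body = resp.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to inspect the URL and headers before spending any credits.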

### Dynamic API (URL with variables)

For APIs saved with template URLs like `https://example.com/search?q={{query}}&page={{page}}`:

```bash
curl -X POST "https://api.pulpminer.com/external/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "javascript frameworks", "page": "1"}'
```

The `{{variable}}` placeholders in the saved URL get replaced with the values you provide.
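
Because the substitution happens on PulpMiner's side, a client can only make sure it sends a value for every placeholder. The sketch below checks that before building the POST request. The helper names are hypothetical, and the regular expression assumes placeholder names consist of letters, digits, and underscores, which the source does not specify.

```python
import json
import re
import urllib.request

BASE_URL = "https://api.pulpminer.com/external"
# Assumption: placeholder names are word characters, e.g. {{query}} or {{page}}.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def placeholder_names(template_url: str) -> list[str]:
    """Return the {{variable}} names in a template URL, in order of appearance."""
    return PLACEHOLDER.findall(template_url)


def build_dynamic_request(api_id, api_key, variables, template_url=None):
    """Build (but do not send) the POST request for a saved dynamic API.

    If the saved template URL is known locally, verify that every
    placeholder has a value before the request is built.
    """
    if template_url is not None:
        missing = [n for n in placeholder_names(template_url) if n not in variables]
        if missing:
            raise ValueError(f"missing values for: {', '.join(missing)}")
    return urllib.request.Request(
        f"{BASE_URL}/{api_id}",
        data=json.dumps(variables).encode("utf-8"),
        headers={"apikey": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing `template_url` is optional; without it the helper simply forwards whatever variables it is given.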

## Response Format

Successful responses return:

```json
{
  "data": { ... },
  "errors": null
}
```

Error responses return:

```json
{
  "data": null,
  "errors": "Error message describing what went wrong"
}
```
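
Given this envelope, a caller can turn a non-null `errors` field into an exception and otherwise return `data`. The following is a minimal sketch; `PulpMinerError` and `unwrap` are hypothetical names, and it assumes the body always parses as a JSON object with the two fields shown above.

```python
import json


class PulpMinerError(Exception):
    """Raised when the response envelope carries a non-null `errors` field."""


def unwrap(body: str):
    """Parse a response body and return its `data`, or raise on `errors`."""
    payload = json.loads(body)
    if payload.get("errors") is not None:
        raise PulpMinerError(str(payload["errors"]))
    return payload.get("data")
```

For example, `unwrap('{"data": {"name": "Widget"}, "errors": null}')` returns the inner dictionary, while an error body raises `PulpMinerError` with the message from the `errors` field.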

## Caching

- API responses are cached for **24 hours** by default
- If the cached response is older than 15 minutes, PulpMiner serves it while refreshing the cache in the background
- Cache can be disabled per-API in the dashboard settings
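
One way to read these rules is as a three-way decision on the age of the cached entry. The sketch below is an interpretation for illustration only, not PulpMiner's implementation: the function name, the returned labels, and the handling of the exact boundaries are all assumptions.

```python
from datetime import timedelta
from typing import Optional

CACHE_TTL = timedelta(hours=24)        # default cache lifetime from the list above
REFRESH_AFTER = timedelta(minutes=15)  # age beyond which a background refresh starts


def cache_action(age: Optional[timedelta]) -> str:
    """Return what a server following the rules above would plausibly do.

    `age` is the time since the response was cached, or None if nothing is cached.
    """
    if age is None or age >= CACHE_TTL:
        return "fetch"                     # nothing usable cached: scrape now
    if age > REFRESH_AFTER:
        return "serve_cached_and_refresh"  # stale but usable: serve, refresh in background
    return "serve_cached"                  # recent enough: serve as-is
```

Under this reading, a monitoring job that polls every hour would usually receive the previous result while triggering a refresh for the next poll.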

## Configuration Options (Set in Dashboard)

When creating a saved API at https://pulpminer.com/api, you can configure:

| Option | Description |
|--------|-------------|
| **URL** | The webpage to scrape |
| **JSON Template** | Optional JSON structure for the LLM to follow (e.g., `{"name": "", "price": ""}`) |
| **Render JS** | Enable for SPAs and JS-heavy pages (uses headless browser) |
| **CSS Selector** | Extract only a specific part of the page (e.g., `.product-list`, `#main-content`) |
| **Extra Instructions** | Additional guidance for the AI (e.g., "Only extract items with prices above $50") |
| **Dynamic URL** | Enable template variables in the URL with `{{variable}}` syntax |
| **Cache** | Toggle response caching on/off |
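
Since the output is produced by an LLM, it may be worth checking that a returned record actually carries the fields named in the JSON template. The sketch below assumes a flat template and a flat record whose keys mirror it. The source does not state how strictly the template is followed, and `missing_fields` is a hypothetical helper.

```python
import json


def missing_fields(template_json: str, record: dict) -> list[str]:
    """Return template keys that are absent from a returned record."""
    template = json.loads(template_json)
    return [key for key in template if key not in record]
```

With the template `{"name": "", "price": ""}`, a record containing only `name` yields `["price"]`, which a pipeline could log or use to reject the record.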

## Integration with Zapier

For async scraping in Zapier workflows:

```bash
# Static API: the callback URL is the only field in the JSON body
curl -X POST "https://api.pulpminer.com/external/zapier/get/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"callbackURL": "https://hooks.zapier.com/..."}'

# Dynamic API: template variables sit alongside the callback URL
curl -X POST "https://api.pulpminer.com/external/zapier/post/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"callbackURL": "https://hooks.zapier.com/...", "query": "value"}'
```

The endpoint returns `201` immediately and sends the scraped data to the callback URL once scraping completes.
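
The two Zapier endpoints differ only in the path segment (`get` or `post`) and in whether template variables accompany `callbackURL`. A small helper can pick the right one. This is a sketch under assumptions: `build_zapier_request` is a hypothetical name, and the `Content-Type: application/json` header is an assumption based on the JSON body.

```python
import json
import urllib.request

ZAPIER_BASE = "https://api.pulpminer.com/external/zapier"


def build_zapier_request(api_id, api_key, callback_url, variables=None):
    """Build (but do not send) an async scrape request for a Zapier workflow.

    Uses the `post` path when template variables are supplied, else `get`.
    Both paths are called with HTTP POST, as in the curl examples.
    """
    kind = "post" if variables else "get"
    # Variables are merged flat next to callbackURL, mirroring the curl body.
    payload = {**(variables or {}), "callbackURL": callback_url}
    return urllib.request.Request(
        f"{ZAPIER_BASE}/{kind}/{api_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"apikey": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```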

## Integration with n8n

Verify authentication:

```bash
curl -X GET "https://api.pulpminer.com/external/n8n/auth" \
  -H "apikey: <PULPMINER_API_KEY>"
```

Then use the standard `/external/<apiId>` endpoints for data fetching.

## Credits

- Each API call costs **0.25–0.4 credits** depending on the endpoint
- JavaScript rendering adds **0.1 credits** extra
- New users get **5 free credits**
- Purchase more at https://pulpminer.com/credits
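
To get a feel for these numbers, the per-call range can be turned into a rough budget estimate. The sketch below assumes the JavaScript surcharge applies once per call and ignores any effect of caching on billing, neither of which the list above spells out. The function names are illustrative.

```python
from decimal import Decimal

BASE_COST_MIN = Decimal("0.25")  # cheapest endpoint, per the list above
BASE_COST_MAX = Decimal("0.4")   # most expensive endpoint
JS_SURCHARGE = Decimal("0.1")    # assumed to be added once per call


def per_call_cost(render_js: bool = False):
    """Return the (lowest, highest) credit cost of a single call."""
    extra = JS_SURCHARGE if render_js else Decimal("0")
    return BASE_COST_MIN + extra, BASE_COST_MAX + extra


def calls_covered(credits, render_js: bool = False):
    """Return the (fewest, most) whole calls a credit balance can pay for."""
    low, high = per_call_cost(render_js)
    balance = Decimal(str(credits))
    return int(balance // high), int(balance // low)
```

Under these assumptions, the 5 free credits cover between 12 and 20 calls without JS rendering, and between 10 and 14 calls with it.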

## Tips

- Use **CSS selectors** to narrow down the scraped content and improve accuracy
- Provide a **JSON template** for consistent, predictable output structures
- Enable **JS rendering** only when needed — static pages scrape faster and cost fewer credits
- Use **extra instructions** to guide the AI (e.g., "Return dates in ISO 8601 format")
- For monitoring use cases, keep **caching enabled** to reduce credit usage
- Use the **playground** first to verify a URL is scrapable before saving an API config
- Dynamic APIs are ideal for search pages, paginated content, and parameterized URLs

## Links

- Website: https://pulpminer.com
- API Dashboard: https://pulpminer.com/api
