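pulpminer
Convert any webpage into structured JSON data using AI. Scrape websites, extract data into custom JSON schemas, and call saved APIs programmatically. Useful for web scraping, data extraction, content monitoring, lead generation, price tracking, and building data pipelines.
Installation / Download
TotalClaw CLI (recommended):
`totalclaw install totalclaw:totalclaw~melvin2016-webscraper-pulpminer`
cURL (direct download, no login required):
`curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~melvin2016-webscraper-pulpminer/file -o melvin2016-webscraper-pulpminer.md`
# PulpMiner: AI Web Scraping & JSON API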
PulpMiner converts any webpage into structured JSON using AI. You provide a URL and optionally a JSON template, and PulpMiner scrapes the page, runs it through an LLM, and returns clean structured data.
## Authentication
All API calls require the `apikey` header:
```
apikey: <PULPMINER_API_KEY>
```
Get your API key from https://pulpminer.com/api. If you don't have one yet, click "Regenerate Key" to generate it.
## Core Workflow
PulpMiner works in two phases:
1. **Create a saved API** — Configure a URL, scraper, LLM, and optional JSON template via the PulpMiner dashboard at https://pulpminer.com/api
2. **Call the saved API** — Use the external endpoint with your API key to fetch structured JSON
## Calling a Saved API
### Static API (fixed URL)
```bash
curl -X GET "https://api.pulpminer.com/external/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>"
```
Returns JSON extracted from the configured webpage.
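For scripts that call several saved APIs, the request above can be wrapped in a small shell function. This is a minimal sketch, not an official client: the names `pulpminer_url` and `pulpminer_get` are hypothetical, and it assumes the key has been exported as the `PULPMINER_API_KEY` environment variable.

```bash
# Hypothetical helpers (not part of PulpMiner); assumes the key is exported
# as the PULPMINER_API_KEY environment variable.

# Build the external endpoint URL for a saved API id.
pulpminer_url() {
  printf 'https://api.pulpminer.com/external/%s\n' "$1"
}

# Fetch a static saved API and print the JSON response on stdout.
pulpminer_get() {
  if [ -z "${PULPMINER_API_KEY:-}" ]; then
    echo "PULPMINER_API_KEY is not set" >&2
    return 1
  fi
  # -sS: hide the progress meter but still print transport errors
  curl -sS -X GET "$(pulpminer_url "$1")" \
    -H "apikey: ${PULPMINER_API_KEY}"
}
```

A call such as `pulpminer_get "<apiId>" > result.json` would then save the response to a file. Keeping the key in an environment variable avoids hard-coding it in scripts.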
### Dynamic API (URL with variables)
For APIs saved with template URLs like `https://example.com/search?q={{query}}&page={{page}}`:
```bash
curl -X POST "https://api.pulpminer.com/external/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"query": "javascript frameworks", "page": "1"}'
```
The `{{variable}}` placeholders in the saved URL get replaced with the values you provide.
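To make the substitution concrete, the sketch below reproduces it locally. It is only an illustration: PulpMiner performs the replacement on its side, `render_url` is a hypothetical name, and values are inserted verbatim because the source does not say whether the service URL-encodes them.

```bash
# Local illustration only: PulpMiner performs this substitution server-side.
# render_url is a hypothetical name; values are inserted verbatim, and whether
# the service URL-encodes them is left open (an assumption either way).
render_url() {
  template=$1
  shift
  for pair in "$@"; do
    key=${pair%%=*}     # text before the first "="
    value=${pair#*=}    # text after the first "="
    # Replace every literal {{key}} with value; index() avoids regex pitfalls.
    # Note: awk -v interprets backslash escapes in the value.
    template=$(printf '%s\n' "$template" | awk -v k="{{${key}}}" -v v="$value" '
      {
        out = ""
        rest = $0
        while ((i = index(rest, k)) > 0) {
          out = out substr(rest, 1, i - 1) v
          rest = substr(rest, i + length(k))
        }
        print out rest
      }')
  done
  printf '%s\n' "$template"
}

render_url 'https://example.com/search?q={{query}}&page={{page}}' \
  'query=javascript' 'page=1'
# prints https://example.com/search?q=javascript&page=1
```

Each `key=value` argument corresponds to one field of the JSON body sent in the POST request.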
## Response Format
Successful responses return:
```json
{
"data": { ... },
"errors": null
}
```
Error responses return:
```json
{
"data": null,
"errors": "Error message describing what went wrong"
}
```
## Caching
- API responses are cached for **24 hours** by default
- If a cached response is older than 15 minutes, PulpMiner serves the cached version while refreshing it in the background
- Cache can be disabled per-API in the dashboard settings
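The caching rules above can be read as a three-way decision on the age of a cached response. The sketch below encodes that reading; it is not PulpMiner's implementation, `cache_action` is a hypothetical name, and treating entries older than 24 hours as expired (so a fresh scrape is needed) is an assumption.

```bash
# Sketch of the documented caching rules, not PulpMiner's implementation.
# cache_action is a hypothetical name; treating entries past 24 hours as
# expired (fetch fresh) is an assumption.
FRESH_SECONDS=$((15 * 60))        # 15 minutes
TTL_SECONDS=$((24 * 60 * 60))     # 24 hours

cache_action() {
  age=$1   # age of the cached response in seconds
  if [ "$age" -gt "$TTL_SECONDS" ]; then
    echo "fetch-fresh"
  elif [ "$age" -gt "$FRESH_SECONDS" ]; then
    echo "serve-cached-and-refresh"
  else
    echo "serve-cached"
  fi
}

cache_action 60      # prints serve-cached
cache_action 3600    # prints serve-cached-and-refresh
cache_action 90000   # prints fetch-fresh
```

In practice this means a caller polling more often than every 15 minutes should expect identical responses between refreshes.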
## Configuration Options (Set in Dashboard)
When creating a saved API at https://pulpminer.com/api, you can configure:
| Option | Description |
|--------|-------------|
| **URL** | The webpage to scrape |
| **JSON Template** | Optional JSON structure for the LLM to follow (e.g., `{"name": "", "price": ""}`) |
| **Render JS** | Enable for SPAs and JS-heavy pages (uses headless browser) |
| **CSS Selector** | Extract only a specific part of the page (e.g., `.product-list`, `#main-content`) |
| **Extra Instructions** | Additional guidance for the AI (e.g., "Only extract items with prices above $50") |
| **Dynamic URL** | Enable template variables in the URL with `{{variable}}` syntax |
| **Cache** | Toggle response caching on/off |
## Integration with Zapier
For async scraping in Zapier workflows:
```bash
# Static API
curl -X POST "https://api.pulpminer.com/external/zapier/get/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"callbackURL": "https://hooks.zapier.com/..."}'

# Dynamic API
curl -X POST "https://api.pulpminer.com/external/zapier/post/<apiId>" \
  -H "apikey: <PULPMINER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"callbackURL": "https://hooks.zapier.com/...", "query": "value"}'
```
The endpoint returns `201` immediately and sends the scraped data to the callback URL once scraping completes.
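Because the two Zapier endpoints differ only in one path segment, a script can choose between them with a small helper and treat `201` as the success signal. The following is a sketch under stated assumptions: `zapier_url` and `zapier_enqueue` are hypothetical names, the key is read from a `PULPMINER_API_KEY` environment variable, and sending the body as `application/json` is an assumption.

```bash
# Hypothetical helpers. zapier_url picks the path segment documented above:
# "get" for static saved APIs, "post" for dynamic ones.
zapier_url() {
  case "$1" in
    static)  segment=get ;;
    dynamic) segment=post ;;
    *) echo "unknown API kind: $1" >&2; return 1 ;;
  esac
  printf 'https://api.pulpminer.com/external/zapier/%s/%s\n' "$segment" "$2"
}

# Queue a scrape and succeed only if the service answers 201.
# $3 is the JSON payload, which must include the callbackURL field.
# Sending it as application/json is an assumption.
zapier_enqueue() {
  url=$(zapier_url "$1" "$2") || return 1
  status=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$url" \
    -H "apikey: ${PULPMINER_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$3")
  [ "$status" = "201" ]
}
```

The scraped data itself never appears in this response; it arrives later at the callback URL.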
## Integration with n8n
Verify authentication:
```bash
curl -X GET "https://api.pulpminer.com/external/n8n/auth" \
  -H "apikey: <PULPMINER_API_KEY>"
```
Then use the standard `/external/<apiId>` endpoints for data fetching.
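In a script, the authentication check can be reduced to an exit status. This sketch assumes that any 2xx status means the key was accepted, since the exact status codes of the endpoint are not stated here; `is_success_status` and `check_pulpminer_key` are hypothetical names, and the key is read from a `PULPMINER_API_KEY` environment variable.

```bash
# Hypothetical helpers; assumes a 2xx status means the key was accepted
# (the exact status codes of this endpoint are not documented here).
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Call the n8n auth endpoint and report success or failure via exit status.
check_pulpminer_key() {
  # -w '%{http_code}' prints only the HTTP status; the body is discarded.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -X GET "https://api.pulpminer.com/external/n8n/auth" \
    -H "apikey: ${PULPMINER_API_KEY}")
  is_success_status "$status"
}
```

A workflow could then run `check_pulpminer_key || echo "API key rejected" >&2` before fetching data.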
## Credits
- Each API call costs **0.25–0.4 credits** depending on the endpoint
- JavaScript rendering adds **0.1 credits** extra
- New users get **5 free credits**
- Purchase more at https://pulpminer.com/credits
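As a worked example of the figures above, the sketch below estimates how many calls the 5 free credits cover. Amounts are expressed in hundredths of a credit so that shell integer arithmetic stays exact; `max_calls` is a hypothetical name, and adding the 0.1 rendering surcharge directly to the per-call cost is an assumption about how the two charges combine.

```bash
# Worked arithmetic from the figures above, in hundredths of a credit so that
# plain shell integer math stays exact. max_calls is a hypothetical name, and
# adding the 0.1 rendering surcharge to the per-call cost is an assumption.
max_calls() {
  credits_hundredths=$1
  cost_hundredths=$2
  echo $(( credits_hundredths / cost_hundredths ))
}

FREE_CREDITS=500                         # 5 free credits
max_calls "$FREE_CREDITS" 40             # costliest call, no JS rendering: 12
max_calls "$FREE_CREDITS" 25             # cheapest call, no JS rendering: 20
max_calls "$FREE_CREDITS" $((40 + 10))   # costliest call plus JS rendering: 10
```

Under these assumptions, the free credits cover roughly 12 to 20 calls without JavaScript rendering and as few as 10 with it.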
## Tips
- Use **CSS selectors** to narrow down the scraped content and improve accuracy
- Provide a **JSON template** for consistent, predictable output structures
- Enable **JS rendering** only when needed — static pages scrape faster and cost fewer credits
- Use **extra instructions** to guide the AI (e.g., "Return dates in ISO 8601 format")
- For monitoring use cases, keep **caching enabled** to reduce credit usage
- Use the **playground** first to verify a URL is scrapable before saving an API config
- Dynamic APIs are ideal for search pages, paginated content, and parameterized URLs
## Links
- Website: https://pulpminer.com
- API Dashboard: https://pulpminer.com/api