xiaohongshu-extract

TotalClaw, by totalclaw

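Extracts metadata from Xiaohongshu (XHS) share or discovery URLs by parsing `window.__INITIAL_STATE__` and returning note details. Use it when asked to fetch XHS page content, note metadata, video information, or engagement statistics from a public XHS link.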

Installation / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~jovijovi-xiaohongshu-extract
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~jovijovi-xiaohongshu-extract/file -o jovijovi-xiaohongshu-extract.md
# Xiaohongshu Extract

## Overview

Extract note metadata (title, desc, type, time, user, engagement, tags, video stream info) from an XHS share or discovery URL using the bundled script.

## Quick Start

Run the extractor and print JSON to stdout:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --pretty
```

Write JSON to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --output /tmp/xhs_note.json
```

Output only the flattened record:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --flat-only --pretty
```

Write only the flattened record to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --flat-only --output /tmp/xhs_flat.json
```

Emit errors as JSON:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --error-json
```

Emit errors as JSON to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --error-json --output /tmp/xhs_error.json
```
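When the extractor is driven from another Python program, the flag combinations above can be assembled programmatically. The following is a minimal sketch, assuming the script sits at `scripts/xiaohongshu_extract.py` relative to the working directory and prints JSON to stdout when `--output` is not given. `build_command` and `run_extractor` are hypothetical helper names, not part of the skill.

```python
import json
import subprocess
import sys


def build_command(url, pretty=False, flat_only=False, error_json=False,
                  output=None, script="scripts/xiaohongshu_extract.py"):
    """Assemble the argv list for the extractor using only the documented flags."""
    cmd = [sys.executable, script, url]
    if flat_only:
        cmd.append("--flat-only")
    if error_json:
        cmd.append("--error-json")
    if pretty:
        cmd.append("--pretty")
    if output is not None:
        cmd.extend(["--output", str(output)])
    return cmd


def run_extractor(url, **options):
    """Run the extractor; return (exit code, parsed stdout or None).

    Assumption: stdout carries a single JSON document. Anything else
    (empty output, plain-text errors) is reported as None.
    """
    result = subprocess.run(build_command(url, **options),
                            capture_output=True, text=True, check=False)
    try:
        parsed = json.loads(result.stdout)
    except json.JSONDecodeError:
        parsed = None
    return result.returncode, parsed
```

Passing the arguments as a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.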

## Workflow

1. Run `scripts/xiaohongshu_extract.py` with the user-provided URL.
2. If the script fails to find `window.__INITIAL_STATE__`, ask the user for a direct discovery URL.
3. Use the JSON output to summarize note metadata or to feed downstream analysis.
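Step 2 depends on the page embedding a `window.__INITIAL_STATE__` assignment. How the bundled script locates it is not documented here; the sketch below shows one plausible approach under stated assumptions: the assignment sits inside a `<script>` tag, the payload is a JavaScript object literal that may contain bare `undefined` values, and the keys in `sample` (`note`, `title`, `video`) are illustrative only. `parse_initial_state` is a hypothetical name.

```python
import json
import re

# Capture the object literal assigned to window.__INITIAL_STATE__,
# up to the closing </script> tag.
_STATE_RE = re.compile(
    r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*</script>", re.DOTALL
)


def parse_initial_state(html):
    """Return the decoded state dict, or None when the assignment is absent."""
    match = _STATE_RE.search(html)
    if match is None:
        return None
    payload = match.group(1)
    # Assumption: the payload is JavaScript rather than strict JSON, so bare
    # `undefined` values are replaced with null before decoding. This simple
    # substitution would also rewrite ":undefined" inside a string value.
    payload = re.sub(r"(?<=[:\[,])\s*undefined\b", "null", payload)
    return json.loads(payload)


sample = (
    '<html><script>window.__INITIAL_STATE__='
    '{"note":{"title":"demo","video":undefined}}</script></html>'
)
state = parse_initial_state(sample)
print(state["note"]["title"])
```

A `None` return corresponds to the failure case in step 2, where asking for a direct discovery URL is the suggested fallback.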

## Output Notes

The script returns a JSON object with:

- `note_id`, `title`, `desc`, `type`, `time`, `ip_location`
- `user` (nickname, user_id, avatar)
- `interact` (liked/collected/comment/share counts, plus normalized `*_num` values)
- `tags`
- `video` (video_id, duration, width, height, fps, size, stream_url)
- `field_mapping` (nested-to-flat field name map)
- `flat` (flattened record with normalized counts and ISO timestamp)
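The normalized counts and ISO timestamp mentioned in the list can be illustrated with a short sketch. It is not taken from the bundled script: it assumes display counts may use the suffixes 万 (10,000) or 亿 (100,000,000), thousands separators, or a trailing `+`, and that the raw time is epoch milliseconds. `normalize_count` and `to_iso` are hypothetical names.

```python
from datetime import datetime, timezone


def normalize_count(raw):
    """Convert a display count such as "1.2万" or "3,456" to an int.

    Returns None when the value cannot be parsed.
    """
    if raw is None:
        return None
    text = str(raw).strip().replace(",", "").rstrip("+")
    multiplier = 1
    if text.endswith("万"):       # 万 = 10,000
        multiplier, text = 10_000, text[:-1]
    elif text.endswith("亿"):     # 亿 = 100,000,000
        multiplier, text = 100_000_000, text[:-1]
    try:
        return int(round(float(text) * multiplier))
    except ValueError:
        return None


def to_iso(ms):
    """Convert epoch milliseconds (assumed unit) to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()
```

Keeping the raw string alongside the normalized integer, as the `interact` field does, lets downstream code sort numerically without losing the value as displayed on the page.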

If the stream list is empty, `video` fields may be null or empty.

If `--flat-only` is set, only `flat` is printed. If `--error-json` is set, errors are emitted as JSON and may include `final_url` and `status_code` when available.
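Downstream code has to tolerate both empty `video` fields and error payloads. The sketch below summarizes a full (not `--flat-only`) record using only the field names listed above. The shape of an error payload is not specified in this document, so treating any record without `note_id` as an error is an assumption, and `summarize` is a hypothetical helper.

```python
def summarize(record):
    """Build a one-line summary from a full extractor record."""
    if "note_id" not in record:
        # Assumption: error payloads lack note_id. final_url and status_code
        # are optional, so read them with .get().
        return "error (status={}, final_url={})".format(
            record.get("status_code"), record.get("final_url")
        )
    user = record.get("user") or {}
    video = record.get("video") or {}
    parts = [
        record.get("title") or "(untitled)",
        "by " + (user.get("nickname") or "unknown"),
    ]
    # stream_url may be null or empty when the stream list is empty.
    if video.get("stream_url"):
        parts.append("video " + video["stream_url"])
    return " | ".join(parts)
```

For example, a record with a title, a nickname, and a null `stream_url` yields a summary without the video segment.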

## Resources

### scripts/

- `scripts/xiaohongshu_extract.py` extracts note metadata from XHS share/discovery URLs.
