xiaohongshu-extract
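Extract metadata from Xiaohongshu (XHS) share or discovery URLs by parsing window.__INITIAL_STATE__ and returning note details. Use when asked to fetch XHS page content, note metadata, video information, or engagement statistics from a public XHS link.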
Install / Download
TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~jovijovi-xiaohongshu-extract
cURL direct download, no login required
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~jovijovi-xiaohongshu-extract/file -o jovijovi-xiaohongshu-extract.md

# Xiaohongshu Extract

## Overview

Extract note metadata (title, desc, type, time, user, engagement, tags, video stream info) from an XHS share or discovery URL using the bundled script.

## Quick Start

Run the extractor and print JSON to stdout:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --pretty
```

Write JSON to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --output /tmp/xhs_note.json
```

Output only the flattened record:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --flat-only --pretty
```

Write only the flattened record to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --flat-only --output /tmp/xhs_flat.json
```

Emit errors as JSON:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --error-json
```

Emit errors as JSON to a file:

```bash
python scripts/xiaohongshu_extract.py "<xhs_url>" --error-json --output /tmp/xhs_error.json
```

## Workflow

1. Run `scripts/xiaohongshu_extract.py` with the user-provided URL.
2. If the script fails to find `window.__INITIAL_STATE__`, ask the user for a direct discovery URL.
3. Use the JSON output to summarize note metadata or to feed downstream analysis.

## Output Notes

The script returns a JSON object with:

- `note_id`, `title`, `desc`, `type`, `time`, `ip_location`
- `user` (nickname, user_id, avatar)
- `interact` (liked/collected/comment/share counts, plus normalized `*_num` values)
- `tags`
- `video` (video_id, duration, width, height, fps, size, stream_url)
- `field_mapping` (nested-to-flat field name map)
- `flat` (flattened record with normalized counts and ISO timestamp)

If the stream list is empty, `video` fields may be null or empty.

If `--flat-only` is set, only `flat` is printed. If `--error-json` is set, errors are emitted as JSON and may include `final_url` and `status_code` when available.
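The workflow and output fields above can be sketched as a small Python wrapper. This is a minimal sketch, not part of the skill: `run_extractor` and `summarize_note` are hypothetical helper names, and it assumes the script prints its JSON to stdout and that `tags` is a list. The shape of the error object is not specified beyond the optional `final_url` and `status_code`, so a caller should check for `note_id` before treating the result as a note.

```python
import json
import subprocess
import sys
from typing import Optional


def run_extractor(url: str, script: str = "scripts/xiaohongshu_extract.py") -> Optional[dict]:
    """Run the bundled script; return parsed stdout, or None if it is not a JSON object.

    Assumption: the JSON (note or error) is written to stdout.
    """
    proc = subprocess.run(
        [sys.executable, script, url, "--error-json"],
        capture_output=True,
        text=True,
    )
    try:
        data = json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


def summarize_note(record: dict) -> str:
    """Build a short text summary from the documented top-level fields."""
    # Any of these may be missing or null, e.g. `video` when the stream list is empty.
    user = record.get("user") or {}
    video = record.get("video") or {}
    tags = record.get("tags") or []
    parts = [
        f"title: {record.get('title') or '(untitled)'}",
        f"type: {record.get('type') or 'unknown'}",
        f"author: {user.get('nickname') or 'unknown'}",
        f"tags: {', '.join(str(t) for t in tags) if tags else '(none)'}",
    ]
    if video.get("stream_url"):
        parts.append(f"video: {video.get('duration')} @ {video['stream_url']}")
    return "\n".join(parts)
```

A caller would pass the result of `run_extractor(url)` to `summarize_note` only when it is not `None` and contains `note_id`; otherwise it falls back to step 2 of the workflow and asks for a direct discovery URL.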
## Resources

### scripts/

- `scripts/xiaohongshu_extract.py` extracts note metadata from XHS share/discovery URLs.
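To give a sense of what parsing `window.__INITIAL_STATE__` involves, here is a rough Python sketch. It is not the bundled script's code: `extract_initial_state` is a hypothetical name, and the handling of bare JavaScript `undefined` values is an assumption about how such embedded state blobs are often written.

```python
import json
import re
from typing import Optional

# Locate the assignment; the JSON-like object literal starts right after it.
_STATE_RE = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")
# Bare `undefined` in value position (after ':', '[' or ','), which JSON cannot represent.
# Caveat: this naive pattern would also rewrite such text inside a string value.
_UNDEFINED_RE = re.compile(r"(?<=[:\[,])\s*undefined\b")


def extract_initial_state(html: str) -> Optional[dict]:
    """Return the embedded state object, or None if it is absent or unparsable."""
    match = _STATE_RE.search(html)
    if match is None:
        return None
    payload = _UNDEFINED_RE.sub("null", html[match.end():])
    try:
        # raw_decode parses the first JSON value and ignores trailing markup.
        state, _ = json.JSONDecoder().raw_decode(payload)
    except json.JSONDecodeError:
        return None
    return state if isinstance(state, dict) else None
```

Returning `None` in both failure cases matches the workflow's fallback: when no state can be found, the next step is to ask for a direct discovery URL.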