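Crawl requirements from Confluence

TotalClaw · Author: mcxxtyhd · v1.0.1

Reads Confluence requirement documents and organizes them into a specified format. The collection principle is "faithful recording", not "requirements analysis". The output consists of {index}_{title}.md (one Markdown file per page), requirement-meta.md (metadata), and images/ (all images, referenced from the Markdown files).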


Installation / download

TotalClaw CLI (recommended)

totalclaw install totalclaw:mcxxtyhd~theo-confluence-reader

cURL: direct download, no login required

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Amcxxtyhd~theo-confluence-reader/file -o theo-confluence-reader.md

Git: get the source from the repository

git clone https://github.com/openclaw/skills.git
git -C skills checkout a5f705d2ec60c35b63cdd5a1bb15f134a0bac6a0

# Theo Confluence Reader

## Global configuration

```powershell
# Confluence configuration (adjust to your environment)
$confluenceBaseUrl = "https://confluence.xxx.com"
$outputDir = "C:\Users\xxx\.openclaw\workspace\output"
$workspaceDir = "C:\Users\xxx\.openclaw\workspace"
$maxSize = 1GB
$warnThreshold = 0.8
```

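## Core principles

**This skill's job is "faithful collection", not "requirements analysis".**
- Only convert the format (HTML → MD); do not cut, summarize, or condense content
- Preserve every detail of the original, including tables, lists, notes, and comments
- Do not make subjective judgments about the original, and do not delete or reorganize it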

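## Execution rules (mandatory)

**⚠️ Every page must be crawled; partial crawls and shortcuts are not allowed**

### Default behavior
1. **Crawl everything automatically**: given a Confluence page link, crawl the full content of that page **and all of its child pages**
2. **Do not stop at the index**: do not stop after the parent or index page; descend to every leaf node
3. **Do not skip pages**: do not skip pages because there is "too much content" or it "takes too long"

### The only exception: confirmation mode (ask only once)
If the estimated collection time exceeds 5 minutes or there are more than 10 pages, you may first prepare an outline with a resource estimate and return it to the user for confirmation.

**The following rules still apply**:
- **Ask only once**: after confirmation, run to completion without asking again
- **The outline must include**:
  - The complete list of pages to crawl
  - The estimated number of pages
  - The estimated time and resource consumption
- **Execute after confirmation**: once the user confirms, start crawling all pages immediately, without stopping or asking midway
- **No repeated questions**: once the user has confirmed, do not ask whether to continue for any reason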
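
The confirmation-mode rule can be sketched as a small decision helper. This is only an illustration, not part of the skill: `needsConfirmation` and `createConfirmationGate` are hypothetical names, and the thresholds are the 5-minute and 10-page limits stated in the rule.

```javascript
// Hypothetical helper; thresholds come from the rule text
// (more than 5 minutes, or more than 10 pages).
const MAX_MINUTES = 5;
const MAX_PAGES = 10;

function needsConfirmation(pageCount, estimatedMinutes) {
  return estimatedMinutes > MAX_MINUTES || pageCount > MAX_PAGES;
}

// Enforces "ask only once": the first qualifying call returns true,
// every later call returns false regardless of the job size.
function createConfirmationGate() {
  let alreadyAsked = false;
  return function shouldAsk(pageCount, estimatedMinutes) {
    if (alreadyAsked || !needsConfirmation(pageCount, estimatedMinutes)) {
      return false;
    }
    alreadyAsked = true;
    return true;
  };
}
```

Keeping the "already asked" flag inside the gate means no later step can trigger a second question by accident.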

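### Examples

**Wrong (forbidden)**:
- Crawling only the parent index page, handing it to the user, and saying "the details need another crawl"
- Crawling part of the tree and then asking "there is a lot, should I continue?"
- Asking "should I crawl this child page?" after the user has already confirmed

**Right**:
- Either crawl everything directly
- Or present the outline for confirmation first, then **crawl everything once confirmed**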

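## Storage limit

**Total size limit for the output/ directory: 1GB**

- Check the total size before creating each new directory
- Above 80% (about 800MB): automatically delete the oldest directories
- Above 100% (1GB): force-delete the oldest directories until usage drops below 80%

**Check-and-clean script**: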
```powershell
$outputDir = "C:\Users\xxx\.openclaw\workspace\output"
$workspaceDir = "C:\Users\xxx\.openclaw\workspace"
$maxSize = 1GB
$warnThreshold = 0.8

if (!(Test-Path $outputDir)) {
    New-Item -ItemType Directory -Path $outputDir -Force | Out-Null
}

# Total = size of the output directory + zip files in the workspace root
$outputSize = 0
$zipSize = 0

# -File skips directory entries, which have no Length property
$outputItems = Get-ChildItem $outputDir -Recurse -File -ErrorAction SilentlyContinue
if ($outputItems) {
    $outputSize = ($outputItems | Measure-Object -Property Length -Sum).Sum
}

$zipFiles = Get-ChildItem $workspaceDir -Filter "*.zip" -File -ErrorAction SilentlyContinue
if ($zipFiles) {
    $zipSize = ($zipFiles | Measure-Object -Property Length -Sum).Sum
}

$totalSize = $outputSize + $zipSize

if ($totalSize -gt ($maxSize * $warnThreshold)) {
    Write-Host "Storage is above 80%, starting cleanup..."

    # 1. Delete the oldest zip files first
    if ($zipSize -gt 0) {
        $zipFilesSorted = $zipFiles | Sort-Object LastWriteTime
        foreach ($zip in $zipFilesSorted) {
            if ($totalSize -lt ($maxSize * $warnThreshold)) { break }
            $z = $zip.Length
            Remove-Item $zip.FullName -Force
            $totalSize -= $z
            Write-Host "Deleted archive: $($zip.Name)"
        }
    }

    # 2. Then delete the oldest output subdirectories
    $dirs = Get-ChildItem $outputDir -Directory -ErrorAction SilentlyContinue | Sort-Object LastWriteTime
    foreach ($dir in $dirs) {
        if ($totalSize -lt ($maxSize * $warnThreshold)) { break }
        $dirSize = (Get-ChildItem $dir.FullName -Recurse -File -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum).Sum
        if ($dirSize -gt 0) {
            Remove-Item $dir.FullName -Recurse -Force
            $totalSize -= $dirSize
            Write-Host "Deleted: $($dir.Name)"
        }
    }
}
```

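## Output directory structure

**Important: each run creates a separate directory under output/ named "page title_timestamp", so files from different runs do not get mixed up.**

**How to obtain the pageId**:
- Extract it from the Confluence URL: `/pages/viewpage.action?pageId=272188760` → `pageId=272188760`
- Or read it from the page source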

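```
output/
├── PageTitle_2026-03-13_2030/
│   ├── 01_Requirements_Overview.md     # Markdown conversion of page 1
│   ├── 02_Login_and_Registration.md    # Markdown conversion of page 2
│   ├── 03_System_Home.md               # Markdown conversion of page 3
│   ├── ...                             # other pages
│   ├── requirement-meta.md             # metadata for the whole collection task
│   └── images/                         # associated images
│       ├── 01_Requirements_Overview_architecture_diagram.png
│       ├── 02_Login_and_Registration.png
│       └── ...
```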

**Timestamp format**: `YYYY-MM-DD_HHmm` (year-month-day_hour-minute)
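
The naming scheme can be sketched in a few lines. This is an illustration under assumptions: `formatTimestamp`, `sanitizeTitle`, `pageFileName`, and `runDirName` are hypothetical names, the two-digit index padding is inferred from the `01_`, `02_` entries in the tree, and the replaced characters are the ones Windows does not allow in file names.

```javascript
function pad2(n) {
  return String(n).padStart(2, "0");
}

// Formats a Date as YYYY-MM-DD_HHmm using local time.
function formatTimestamp(date) {
  const day = `${date.getFullYear()}-${pad2(date.getMonth() + 1)}-${pad2(date.getDate())}`;
  const time = `${pad2(date.getHours())}${pad2(date.getMinutes())}`;
  return `${day}_${time}`;
}

// Replaces characters that are illegal in Windows file names with "_".
function sanitizeTitle(title) {
  return title.replace(/[\\\/:*?"<>|]/g, "_");
}

// Per-page Markdown file name: {index}_{title}.md
function pageFileName(index, title) {
  return `${pad2(index)}_${sanitizeTitle(title)}.md`;
}

// Per-run directory name: {root page title}_{timestamp}
function runDirName(rootTitle, date) {
  return `${sanitizeTitle(rootTitle)}_${formatTimestamp(date)}`;
}
```

Sanitizing the title in one place keeps the directory name and the per-page file names consistent with each other.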

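## Prerequisites

**You must log in to Confluence first**:
1. Open a browser on the machine running OpenClaw
2. Visit a Confluence page and complete the login
3. Stay logged in; subsequent image downloads and content retrieval depend on this logged-in session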

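## Workflow

### Step 1: Analyze the page structure + create the output directory + check the storage limit

**⚠️ Important: all child pages must be crawled**
- This step must expand the page tree and list **all** child pages (second, third, fourth level, and every level below)
- Do not stop after crawling only the parent page
- For each child page, record: pageId, title, level

**Input validation**: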
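```powershell
param(
    [Parameter(Mandatory=$true)]
    [string]$confluenceUrl
)

# Validate the URL format (viewpage.action with either pageId= or title=)
if ($confluenceUrl -notmatch 'https?://[^/]+/pages/viewpage\.action\?(pageId|title)=') {
    throw "Invalid Confluence URL. Provide a URL such as https://confluence.xxx.com/pages/viewpage.action?pageId=123456"
}

# Extract the pageId (if the URL contains one)
if ($confluenceUrl -match 'pageId=(\d+)') {
    $rootPageId = $Matches[1]
    Write-Host "Root page ID: $rootPageId"
} else {
    # title= URLs carry no pageId; it has to be read from the page source instead
    Write-Warning "No pageId in the URL; obtain it from the page source before continuing"
}
```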

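**Two ways to get the complete page tree:**

#### Method 1: via the browser page tree (recommended default)

1. Open the Confluence URL provided by the user in the browser
2. Find the page tree in the left sidebar
3. Click "Expand all" or manually expand **every level** (second, third, fourth... down to the leaf nodes)
4. Use browser evaluate to get the complete page tree structure:
   ```javascript
   // Run in the browser console to collect all page links
   const links = Array.from(document.querySelectorAll('a[href*="pageId="]')).map(a => ({
       title: a.textContent.trim(),
       pageId: a.href.match(/pageId=(\d+)/)?.[1],
       href: a.href
   }));
   console.log(JSON.stringify(links, null, 2));
   ```
5. Parse the returned data and extract every pageId and title
6. **⚠️ Critical**: make sure the page tree is fully expanded, with child pages at every level visible
7. Confirm the total number of pages to fetch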
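
The selector in step 4 matches every link on the page whose URL contains `pageId=`, so the same page may show up more than once (for instance when it is linked from both the tree and the page body), and a link without a numeric id yields an `undefined` pageId. A minimal clean-up sketch, assuming the `{ title, pageId, href }` shape produced by that snippet; `dedupePages` is a hypothetical helper name:

```javascript
// Returns one entry per pageId, in first-seen order.
// Entries without a pageId are dropped; if the first occurrence has an
// empty title, a later occurrence with a non-empty title replaces it.
function dedupePages(links) {
  const byId = new Map();
  for (const link of links) {
    if (!link.pageId) continue;
    const existing = byId.get(link.pageId);
    if (!existing || (!existing.title && link.title)) {
      byId.set(link.pageId, link);
    }
  }
  return Array.from(byId.values());
}
```

The length of the returned array is then the page count to confirm in step 7.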

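#### Method 2: via the Confluence REST API (fallback)

```powershell
# Get all descendant pages of a page (the root page itself is not part of the result)
function Get-ConfluencePageTree {
    param(
        [string]$baseUrl = "https://confluence.xxx.com",
        [string]$rootPageId,
        [string]$cookie,      # login cookie
        [int]$level = 1       # depth of the direct children relative to the root
    )

    $pages = @()
    $headers = @{ "Cookie" = $cookie }
    $start = 0
    $limit = 50

    # The child listing is paginated (assumes the standard start/limit/size fields),
    # so keep requesting until a short page comes back
    do {
        $apiUrl = "$baseUrl/rest/api/content/$rootPageId/child/page?start=$start&limit=$limit"
        $response = Invoke-RestMethod -Uri $apiUrl -Headers $headers -Method Get

        foreach ($page in $response.results) {
            $pages += @{
                pageId = $page.id
                title = $page.title
                type = $page.type
                level = $level
            }

            # Recurse into this child's own children; @() keeps an empty result
            # from being appended as a $null entry
            $pages += @(Get-ConfluencePageTree -baseUrl $baseUrl -rootPageId $page.id -cookie $cookie -level ($level + 1))
        }

        $start += $response.size
    } while ($response.size -gt 0 -and $response.size -ge $response.limit)

    return $pages
}

# Usage ($cookie must hold the cookie string of the logged-in browser session)
$allPages = @(Get-ConfluencePageTree -rootPageId $rootPageId -cookie $cookie)
Write-Host "Fetched $($allPages.Count) pages"
```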

**Check and clean up storage (when usage exceeds 80% of the limit)**: before creating the new directory, run the check-and-clean script given earlier under the storage limit heading; it applies unchanged here.

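**Create the timestamped output directory**:
```powershell
# Current timestamp
$timestamp = Get-Date -Format "yyyy-MM-dd_HHmm"
# Page title with characters that are illegal in file names replaced
$safeTitle = "PageTitle" -replace '[\\/:*?"<>|]', '_'
# Create the run directory together with its images subdirectory
$outputDir = "C:\Users\xxx\.openclaw\workspace\output\${safeTitle}_${timestamp}"
New-Item -ItemType Directory -Path "$outputDir\images" -Force | Out-Null
```

**All subsequent files are written into this newly created directory**, not directly into the `output/` root.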

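### Step 2: Fetch the content of each page (using sessions_spawn)

**⚠️ Every page must be crawled; selective crawling is not allowed**

**Important**: because multiple pages have to be fetched, sessions_spawn must be used to dispatch subtasks that run in parallel.

**Steps**: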
1. 分析所有pageId列