visual-rpa

TotalClaw, author: totalclaw

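Visual RPA desktop automation skill. Use it when the user asks to operate desktop applications, click icons, open applications, type text into input fields, click buttons, scroll pages, or send messages through WeChat or other applications. It locates targets purely visually, using screen capture and the Qwen vision model, with no DOM or accessibility API required.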

Installation / download

TotalClaw CLI (recommended)

totalclaw install totalclaw:totalclaw~neilhexiaoning-alt-visual-rpa-skill

cURL (direct download, no login required)

curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~neilhexiaoning-alt-visual-rpa-skill/file -o neilhexiaoning-alt-visual-rpa-skill.md

## Original text

# Visual RPA Desktop Automation

> Auto-execute all steps without waiting for user confirmation between steps.

Desktop automation via screen capture + Qwen vision model (Qwen-VL). No DOM or accessibility API needed.

## How it works

1. Capture the screen and locate the target roughly on a thumbnail.
2. Crop the full-resolution screenshot and refine the coordinates precisely.
3. Execute the mouse or keyboard action, then verify the result with a screenshot.
4. Compound instructions are decomposed into atomic steps automatically.
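
The coordinate arithmetic behind steps 1 and 2 can be sketched as follows. This is an illustrative sketch only, not the skill's actual code: the helper names, the `Box` type, the 200-pixel crop half-width, and the 2:1 thumbnail ratio in the usage note are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop rectangle in full-resolution screen pixels."""
    left: int
    top: int
    right: int
    bottom: int


def thumb_to_screen(x, y, thumb_size, screen_size):
    """Scale a point found on the thumbnail up to full-resolution pixels."""
    sx = screen_size[0] / thumb_size[0]
    sy = screen_size[1] / thumb_size[1]
    return round(x * sx), round(y * sy)


def crop_around(cx, cy, screen_size, half=200):
    """Build a crop box centered on the rough point, clamped to the screen."""
    w, h = screen_size
    left = max(0, min(cx - half, w - 2 * half))
    top = max(0, min(cy - half, h - 2 * half))
    return Box(left, top, min(w, left + 2 * half), min(h, top + 2 * half))


def crop_to_screen(x, y, box):
    """Translate a point refined inside the crop back to screen coordinates."""
    return box.left + x, box.top + y
```

For example, on a hypothetical 2560x1600 screen with a 1280x800 thumbnail, a rough hit at thumbnail point (187, 795) scales to (374, 1590). The clamped crop box is then (174, 1200, 574, 1600), and a refined hit at (201, 391) inside that crop maps back to screen point (375, 1591).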

## Usage

Use the exec tool to run commands. Script path (PowerShell syntax): `$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py`

The `DASHSCOPE_API_KEY` environment variable must be set.
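
A pre-flight check along these lines can catch a missing variable before any screenshot is taken. It is a minimal sketch: the `preflight` helper is hypothetical and not part of the skill, and it only inspects the two environment variables and the script path named above.

```python
import os
from pathlib import Path


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set")
    root = env.get("TAXBOT_ROOT")
    if not root:
        problems.append("TAXBOT_ROOT is not set")
    else:
        # Path layout taken from the script path given in this section.
        script = Path(root) / "skills" / "visual-rpa" / "scripts" / "visual_rpa.py"
        if not script.is_file():
            problems.append(f"script not found: {script}")
    return problems
```

Calling `preflight()` with no arguments checks the current process environment. Passing a plain dict makes the check easy to test.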

### Single task

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click to open WeChat"
```

### Compound task (auto-decomposed)

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer Assistant chat, type hello in input box, click send"
```

### Multi-step task (manually specified)

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type baidu.com in address bar and press enter" "type weather in search box" "click search button"
```

### Skip verification (faster)

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --no-verify --task "click to open Calculator"
```

### Parameters

| Parameter | Description |
|-----------|-------------|
| `--mode task` | Batch task mode (required for the commands shown here) |
| `--mode interactive` | Interactive mode (default) |
| `--task "step1" "step2"` | Task instructions; accepts one or more quoted steps |
| `--no-verify` | Skip post-action verification |
| `--model MODEL` | Vision model name (default: `qwen-vl-max-latest`) |
| `--api-key KEY` | API key (defaults to the `DASHSCOPE_API_KEY` environment variable) |
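
When the script is driven from Python rather than typed into a shell, the flags in the table can be assembled into an argument list. The sketch below assumes nothing beyond those flags. `build_command` is a hypothetical helper, and placing `--task` last simply mirrors the ordering of the examples in this document.

```python
import os
import sys
from pathlib import Path


def build_command(steps, root=None, verify=True, model=None):
    """Assemble the argv list for a batch run of visual_rpa.py."""
    if not steps:
        raise ValueError("at least one task instruction is required")
    root = root or os.environ["TAXBOT_ROOT"]
    script = Path(root) / "skills" / "visual-rpa" / "scripts" / "visual_rpa.py"
    argv = [sys.executable, str(script), "--mode", "task"]
    if not verify:
        argv.append("--no-verify")
    if model:
        argv += ["--model", model]
    # Each step becomes its own argument after --task, as in the examples.
    argv += ["--task", *steps]
    return argv
```

The resulting list can be passed to `subprocess.run(argv)`. Using a list instead of a single command string avoids shell quoting problems when a step contains spaces or quotes.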

## Supported actions

| Action | Example instructions |
|--------|---------------------|
| Click | "click start menu", "click Chrome icon" |
| Double click | "double click Recycle Bin on desktop" |
| Right click | "right click on desktop blank area" |
| Type text | "type weather in search box", "type hello in input box" |
| Hotkey | "press Ctrl+C" |
| Scroll | "scroll down the page" |
| Wait | "wait for page to load" |

## Instruction tips

- Be specific: "click WeChat icon on taskbar" works better than "open WeChat".
- Instructions can be in Chinese or English; the model understands both.
- Complex operations can be written as a single compound instruction; the system decomposes it automatically.
- For text input, say "type XXX in YYY"; the system detects this phrasing as an input action.
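
If instructions are generated by a program, small helpers can keep them in the phrasing recommended above. These functions are hypothetical conveniences for building instruction strings. They say nothing about how the skill itself parses or decomposes instructions.

```python
def click_step(target, where=None):
    """Phrase a click with an optional location hint, e.g. 'on taskbar'."""
    return f"click {target} {where}" if where else f"click {target}"


def type_step(text, field):
    """Phrase text entry in the 'type XXX in YYY' form."""
    return f"type {text} in {field}"


def compound(*steps):
    """Join atomic steps into one comma-separated compound instruction."""
    return ", ".join(steps)
```

For instance, `compound(click_step("WeChat icon", "on taskbar"), type_step("hello", "input box"))` yields a single string that can be passed as one `--task` argument.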

## Output format

```
  [OK] Step 0: click to open WeChat
       click @ (375,1591)
  [OK] Step 1: click File Transfer Assistant in WeChat
       click @ (154,97)
  [FAIL] Step 2: type hello in input box
       type @ (300,1364)
  2/3 succeeded
```

- **OK** = the action succeeded and passed verification
- **FAIL** = the action or its verification failed; failed steps are retried automatically, up to 3 times
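
A caller that needs to react to individual failures can parse this report. The sketch below is based only on the sample output shown above. The regular expressions and the `parse_report` helper are assumptions, and steps that print no `@ (x,y)` line (for example hotkeys or waits, if they omit coordinates) are left with `action` and `xy` set to `None`.

```python
import re

STEP_RE = re.compile(r"^\s*\[(OK|FAIL)\]\s+Step\s+(\d+):\s*(.+?)\s*$")
ACTION_RE = re.compile(r"^\s*(\w+)\s*@\s*\((\d+),\s*(\d+)\)\s*$")
SUMMARY_RE = re.compile(r"^\s*(\d+)/(\d+)\s+succeeded\s*$")


def parse_report(text):
    """Parse the step report into a list of step dicts plus the summary counts."""
    steps, summary = [], None
    for line in text.splitlines():
        if m := STEP_RE.match(line):
            steps.append({"ok": m[1] == "OK", "index": int(m[2]),
                          "instruction": m[3], "action": None, "xy": None})
        elif (m := ACTION_RE.match(line)) and steps:
            # An action line belongs to the most recent step header.
            steps[-1]["action"] = m[1]
            steps[-1]["xy"] = (int(m[2]), int(m[3]))
        elif m := SUMMARY_RE.match(line):
            summary = (int(m[1]), int(m[2]))
    return steps, summary
```

Applied to the sample above, this returns three step entries, the last with `ok` set to `False`, and the summary `(2, 3)`.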

## Common scenarios

### Send WeChat message

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer Assistant chat, type hello in input box, click send"
```

### Open app and navigate

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type https://www.baidu.com in address bar and press enter"
```

### Desktop operations

```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "right click on desktop blank area" "click New Folder"
```

## Notes

- Each step takes 3-8 seconds (screenshot, API calls, and verification).
- Chinese text is entered via clipboard paste, which overwrites the current clipboard contents.
- Only the primary screen is supported.
- Logs and screenshots are saved in the `./rpa_logs/` directory for debugging.
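
The per-step timing gives a rough way to budget a run, for example when choosing a subprocess timeout. This is a back-of-the-envelope sketch: `estimate_seconds` is a hypothetical helper, and because this document does not say whether the 3-8 second figure already includes retries, the `attempts` multiplier is an assumption left to the caller.

```python
def estimate_seconds(n_steps, per_step=(3, 8), attempts=1):
    """Return a (low, high) wall-clock estimate in seconds for n_steps."""
    if n_steps < 0 or attempts < 1:
        raise ValueError("n_steps must be >= 0 and attempts >= 1")
    low, high = per_step
    # Best case: every step succeeds first time; worst case: every step
    # uses all of its attempts at the slow end of the range.
    return n_steps * low, n_steps * high * attempts
```

A four-step task is therefore estimated at 12 to 32 seconds without retries, and the upper bound grows to 96 seconds if each step is allowed three attempts.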