visual-rpa
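A visual RPA skill for desktop automation. Use it when the user asks to operate a desktop application: click an icon, open an application, type text into an input field, click a button, scroll a page, or send a message through WeChat or another application. It locates targets purely visually, using screen capture and a Qwen vision model, and needs no DOM or accessibility API.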
Installation / Download
TotalClaw CLI (recommended):
`totalclaw install totalclaw:totalclaw~neilhexiaoning-alt-visual-rpa-skill`
cURL (direct download, no login required):
`curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~neilhexiaoning-alt-visual-rpa-skill/file -o neilhexiaoning-alt-visual-rpa-skill.md`
## Original text
# Visual RPA Desktop Automation
> Auto-execute all steps without waiting for user confirmation between steps.
Desktop automation via screen capture + Qwen vision model (Qwen-VL). No DOM or accessibility API needed.
## How it works
1. Capture the screen and locate the target roughly on a thumbnail
2. Crop that region at full resolution and refine the coordinates
3. Execute the mouse or keyboard action, then verify it with a screenshot
4. Compound instructions are decomposed automatically into atomic steps
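
The coarse-to-fine localization in steps 1 and 2 comes down to coordinate arithmetic. The sketch below is only an illustration of that arithmetic, not the skill's actual code: the function names, the 512-pixel crop size, and the screen and thumbnail resolutions are all assumptions.

```python
def thumb_to_full(tx, ty, thumb_size, full_size):
    """Scale a point found on the thumbnail up to full-resolution pixels."""
    tw, th = thumb_size
    fw, fh = full_size
    return round(tx * fw / tw), round(ty * fh / th)


def crop_box(cx, cy, full_size, crop=512):
    """Return (left, top, right, bottom) of a crop centred on (cx, cy),
    shifted as needed so that it stays inside the screen."""
    fw, fh = full_size
    w, h = min(crop, fw), min(crop, fh)
    left = min(max(cx - w // 2, 0), fw - w)
    top = min(max(cy - h // 2, 0), fh - h)
    return left, top, left + w, top + h


def crop_to_full(px, py, box):
    """Translate a point located inside the crop back to screen coordinates."""
    left, top, _, _ = box
    return left + px, top + py


full = (2560, 1600)   # hypothetical screen resolution
thumb = (640, 400)    # hypothetical thumbnail size
rough = thumb_to_full(94, 398, thumb, full)   # rough hit reported on the thumbnail
box = crop_box(*rough, full)                  # region to re-examine at full resolution
final = crop_to_full(255, 503, box)           # refined hit reported inside the crop
print(rough, box, final)
```

Clamping the crop to the screen edges matters for targets such as taskbar icons, which sit close to the bottom of the screen.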
## Usage
Use the exec tool to run commands. Script path: `$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py` (the `$env:` prefix is PowerShell syntax for environment variables).
The `DASHSCOPE_API_KEY` environment variable must be set.
### Single task
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click to open WeChat"
```
### Compound task (auto-decomposed)
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer chat, type hello in input box, click send"
```
### Multi-step task (manually specified)
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type baidu.com in address bar and press enter" "type weather in search box" "click search button"
```
### Skip verification (faster)
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --no-verify --task "click to open Calculator"
```
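
When these commands are launched from another Python program rather than typed by hand, the argument list can be assembled programmatically. The helper below is a sketch under assumptions: `build_command` and `run_steps` are hypothetical names, the interpreter is taken from `sys.executable`, and the script's exit code is returned as-is because its meaning is not documented here.

```python
import os
import subprocess
import sys


def build_command(steps, verify=True, root=None):
    """Build the argv list for one batch run of visual_rpa.py."""
    root = root or os.environ.get("TAXBOT_ROOT", ".")
    script = os.path.join(root, "skills", "visual-rpa", "scripts", "visual_rpa.py")
    cmd = [sys.executable, script, "--mode", "task"]
    if not verify:
        cmd.append("--no-verify")
    cmd.append("--task")
    cmd.extend(steps)  # each step is passed as its own argument
    return cmd


def run_steps(steps, verify=True):
    """Run the steps and return (exit code, captured stdout)."""
    result = subprocess.run(build_command(steps, verify), capture_output=True, text=True)
    return result.returncode, result.stdout


cmd = build_command(
    ["click Chrome browser", "click search button"], verify=False, root="C:/taxbot"
)
print(cmd[2:])
```

Passing the arguments as a list avoids shell quoting problems when an instruction contains spaces or quotation marks.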
### Parameters
| Parameter | Description |
|-----------|-------------|
| `--mode task` | Batch task mode (must be set explicitly to run `--task` instructions) |
| `--mode interactive` | Interactive mode (the default when `--mode` is omitted) |
| `--task "step1" "step2"` | One or more task instructions |
| `--no-verify` | Skip post-action verification |
| `--model MODEL` | Vision model name (default: `qwen-vl-max-latest`) |
| `--api-key KEY` | API key (defaults to the `DASHSCOPE_API_KEY` environment variable) |
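
To make the table concrete, here is how a command-line interface with these flags could be declared using `argparse`. This is a reconstruction from the table alone, not the parser in `visual_rpa.py`; the `build_parser` name and the `STEP` metavar are assumptions.

```python
import argparse
import os


def build_parser():
    """Declare the flags listed in the parameter table (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(prog="visual_rpa.py")
    parser.add_argument("--mode", choices=["task", "interactive"], default="interactive")
    parser.add_argument("--task", nargs="+", default=[], metavar="STEP")
    parser.add_argument("--no-verify", action="store_true")
    parser.add_argument("--model", default="qwen-vl-max-latest")
    parser.add_argument("--api-key", default=os.environ.get("DASHSCOPE_API_KEY"))
    return parser


args = build_parser().parse_args(
    ["--mode", "task", "--no-verify", "--task", "click Chrome browser", "click search button"]
)
print(args.mode, args.no_verify, args.task)
```

With `nargs="+"`, every quoted string after `--task` becomes one step, which matches the multi-step usage shown earlier.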
## Supported actions
| Action | Example instructions |
|--------|---------------------|
| Click | "click start menu", "click Chrome icon" |
| Double click | "double click Recycle Bin on desktop" |
| Right click | "right click on desktop blank area" |
| Type text | "type weather in search box", "type hello in input box" |
| Hotkey | "press Ctrl+C" |
| Scroll | "scroll down the page" |
| Wait | "wait for page to load" |
## Instruction tips
- Be specific: "click WeChat icon on taskbar" works better than "open WeChat"
- Instructions can be written in Chinese or English; the model understands both
- Complex operations can be written as a single compound instruction; the system decomposes it automatically
- For text input, say "type XXX in YYY"; the system detects this as an input action automatically
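
How the skill itself recognizes input actions is not documented here. A caller who wants to check beforehand that an instruction follows the recommended "type XXX in YYY" phrasing could use a small pattern match like the one below; `TYPE_PATTERN` and `parse_type_instruction` are hypothetical names, and the regular expression is an assumption rather than the skill's own logic.

```python
import re

# Hypothetical pattern for the recommended "type XXX in YYY" phrasing.
# The greedy first group splits on the last " in ", so the text may itself contain "in".
TYPE_PATTERN = re.compile(r"^type\s+(?P<text>.+)\s+in\s+(?P<target>.+)$", re.IGNORECASE)


def parse_type_instruction(instruction):
    """Return (text, target) if the instruction follows "type XXX in YYY", else None."""
    match = TYPE_PATTERN.match(instruction.strip())
    if match is None:
        return None
    return match.group("text"), match.group("target")


print(parse_type_instruction("type hello in input box"))
print(parse_type_instruction("click send"))
```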
## Output format
```
[OK] Step 0: click to open WeChat
click @ (375,1591)
[OK] Step 1: click File Transfer Assistant in WeChat
click @ (154,97)
[FAIL] Step 2: type hello in input box
type @ (300,1364)
2/3 succeeded
```
- **OK** = the action succeeded and was verified
- **FAIL** = the action or its verification failed; failed steps are retried automatically, up to 3 times
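
A calling program may want this output in structured form. The parser below is a minimal sketch that assumes the log always has the shape of the sample above (a status line, an action line, and a final summary); the regular expressions and the `parse_output` name are assumptions, not part of the skill.

```python
import re

STEP_RE = re.compile(r"^\[(OK|FAIL)\] Step (\d+): (.+)$")
ACTION_RE = re.compile(r"^(\w+) @ \((\d+),(\d+)\)$")
SUMMARY_RE = re.compile(r"^(\d+)/(\d+) succeeded$")


def parse_output(text):
    """Parse the log into a list of step dicts and a (succeeded, total) summary."""
    steps, summary = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if m := STEP_RE.match(line):
            steps.append({"index": int(m[2]), "ok": m[1] == "OK", "instruction": m[3]})
        elif (m := ACTION_RE.match(line)) and steps:
            # An action line belongs to the most recent step line.
            steps[-1]["action"] = m[1]
            steps[-1]["position"] = (int(m[2]), int(m[3]))
        elif m := SUMMARY_RE.match(line):
            summary = (int(m[1]), int(m[2]))
    return steps, summary


sample = """[OK] Step 0: click to open WeChat
click @ (375,1591)
[OK] Step 1: click File Transfer Assistant in WeChat
click @ (154,97)
[FAIL] Step 2: type hello in input box
type @ (300,1364)
2/3 succeeded"""

steps, summary = parse_output(sample)
failed = [s["instruction"] for s in steps if not s["ok"]]
print(summary, failed)
```

The list of failed instructions can then be resubmitted or reported to the user.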
## Common scenarios
### Send WeChat message
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer Assistant chat, type hello in input box, click send"
```
### Open app and navigate
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type https://www.baidu.com in address bar and press enter"
```
### Desktop operations
```
python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "right click on desktop blank area" "click New Folder"
```
## Notes
- Each step takes 3-8 seconds (screenshot, API calls, and verification)
- Chinese text is entered by clipboard paste, which overwrites the current clipboard contents
- Only the primary screen is operated on
- Logs and screenshots are saved in the `./rpa_logs/` directory for debugging
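
Because Chinese text input overwrites the clipboard, a caller may want to save the clipboard before a run and restore it afterwards. The context manager below sketches that idea. It is not part of the skill: `preserved_clipboard` is a hypothetical name, the `read` and `write` callables must be supplied by the caller (for example from a clipboard library), and the demo uses an in-memory stand-in instead of the real clipboard.

```python
from contextlib import contextmanager


@contextmanager
def preserved_clipboard(read, write):
    """Save the clipboard text on entry and put it back on exit."""
    saved = read()
    try:
        yield
    finally:
        write(saved)  # restore even if the wrapped run raises


# In-memory stand-in for the real clipboard, used only for this demonstration.
clipboard = {"text": "important note"}


def read():
    return clipboard["text"]


def write(value):
    clipboard["text"] = value


with preserved_clipboard(read, write):
    write("你好")  # stands in for the RPA run pasting Chinese text
    during = read()

print(during, "->", read())
```

This restores text only; other clipboard content such as images or files would need a platform-specific approach.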