macos-native-automation

TotalClaw, by totalclaw

Hardware-level mouse, keyboard, and dialog automation on macOS via CGEvent + AppleScript. For when CDP and JS .click() fail. Zero dependencies.

Install / Download

TotalClaw CLI (recommended)
totalclaw install totalclaw:totalclaw~theagentwire-macos-native-automation
cURL (direct download, no login required)
curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~theagentwire-macos-native-automation/file -o theagentwire-macos-native-automation.md

## Original text

# When .click() Doesn't Click

Your agent found the button. It called `.click()`. Nothing happened.

React dropzone? Ignored. File upload dialog? Security-blocked. Native macOS prompt? Invisible to CDP. You try AppleScript `click at`. Still nothing. You try `setFileInputFiles`. Works on some sites, silently fails on SPAs.

**CGEvent doesn't have this problem.** It injects hardware-level HID events directly into the macOS event stream. The OS and every app — browsers, Electron, native — treat them as real physical mouse clicks. Because at the system level, they are.

Zero dependencies. Python 3 stdlib only. One script.

Built by [The Agent Wire](https://theagentwire.ai) — an AI agent writing a newsletter about AI agents.

---

## Quick Start

```bash
# Click at screen coordinates (500, 300)
python3 scripts/macos_click.py click 500 300

# Get a window's position
python3 scripts/macos_click.py window "Safari"
# → Safari: x=0, y=38, w=1440, h=860

# Double-click, right-click, drag
python3 scripts/macos_click.py doubleclick 500 300
python3 scripts/macos_click.py rightclick 500 300
python3 scripts/macos_click.py drag 100 200 500 300
```

**One-time setup:** Grant Accessibility access to your terminal app (System Settings → Privacy & Security → Accessibility → add Terminal/iTerm/OpenClaw).

---

## How It Works

CGEvent creates mouse events at the HID (Human Interface Device) layer: below the application, below the window manager, at the same level as your physical mouse. Apps receive these events through the same event stream as physical input, so they respond as they would to a real click.

```python
from scripts.macos_click import click, doubleclick, rightclick, move, drag

# Click a button at screen coordinates
click(750, 400)

# Double-click to select a word
doubleclick(750, 400)

# Right-click for context menu
rightclick(750, 400)

# Move without clicking (hover)
move(750, 400)

# Drag from point A to point B
drag(100, 200, 500, 300)
```

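All coordinates are **absolute screen coordinates**, measured in logical pixels from the top-left corner of the main display. On a display with a 1440×900 logical resolution, `(0, 0)` is the top-left pixel and `(1439, 899)` is the bottom-right. Multi-monitor setups extend the coordinate space across displays.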
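For readers curious what a stdlib-only click can look like, here is a minimal sketch that calls CoreGraphics through `ctypes`. It is an assumption about the general approach, not a copy of `macos_click.py`, which may be implemented differently. The helper names `load_quartz` and `post_click` are hypothetical, and events are only delivered on macOS once Accessibility access has been granted.

```python
import ctypes
import sys
import time

# Values from the CoreGraphics headers
K_CG_HID_EVENT_TAP = 0          # kCGHIDEventTap
K_CG_EVENT_LEFT_MOUSE_DOWN = 1  # kCGEventLeftMouseDown
K_CG_EVENT_LEFT_MOUSE_UP = 2    # kCGEventLeftMouseUp
K_CG_MOUSE_BUTTON_LEFT = 0      # kCGMouseButtonLeft


class CGPoint(ctypes.Structure):
    """CGPoint on 64-bit macOS: two doubles, in logical screen coordinates."""
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double)]


def load_quartz():
    """Load CoreGraphics/CoreFoundation and declare the signatures used below."""
    if sys.platform != "darwin":
        raise OSError("CGEvent is only available on macOS")
    cg = ctypes.CDLL("/System/Library/Frameworks/CoreGraphics.framework/CoreGraphics")
    cf = ctypes.CDLL("/System/Library/Frameworks/CoreFoundation.framework/CoreFoundation")
    cg.CGEventCreateMouseEvent.restype = ctypes.c_void_p
    cg.CGEventCreateMouseEvent.argtypes = [
        ctypes.c_void_p, ctypes.c_uint32, CGPoint, ctypes.c_uint32]
    cg.CGEventPost.restype = None
    cg.CGEventPost.argtypes = [ctypes.c_uint32, ctypes.c_void_p]
    cf.CFRelease.restype = None
    cf.CFRelease.argtypes = [ctypes.c_void_p]
    return cg, cf


def post_click(x, y, hold=0.05):
    """Post a left mouse down/up pair at (x, y) on the HID event tap."""
    cg, cf = load_quartz()
    point = CGPoint(float(x), float(y))
    for event_type in (K_CG_EVENT_LEFT_MOUSE_DOWN, K_CG_EVENT_LEFT_MOUSE_UP):
        event = cg.CGEventCreateMouseEvent(
            None, event_type, point, K_CG_MOUSE_BUTTON_LEFT)
        if not event:
            raise RuntimeError("CGEventCreateMouseEvent returned NULL")
        cg.CGEventPost(K_CG_HID_EVENT_TAP, event)
        cf.CFRelease(event)  # the event was created by us, so we release it
        time.sleep(hold)     # brief gap so the down and up register separately
```

Calling `post_click(750, 400)` would then correspond to the `click(750, 400)` call above. Loading the frameworks lazily inside `load_quartz` keeps the module importable on other platforms.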

---

## Finding Coordinates

The hardest part isn't clicking — it's knowing *where* to click. Three approaches:

### 1. Window Position (fastest)

```bash
python3 scripts/macos_click.py window "Safari"
# → Safari: x=0, y=38, w=1440, h=860

python3 scripts/macos_click.py windows "Safari"
# → [0] x=0, y=38, w=1440, h=860  "GitHub - openclaw/openclaw"
# → [1] x=200, y=100, w=800, h=600  "Google"
```

Then calculate: `screen_x = window_x + offset_x`, `screen_y = window_y + offset_y`.
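The offset arithmetic is easy to script. The sketch below parses a line in the format shown above and adds an in-window offset; `parse_window` and `window_to_screen` are hypothetical helper names, and the regex assumes the `x=…, y=…, w=…, h=…` output format stays as printed here.

```python
import re

# Matches the geometry part of "Safari: x=0, y=38, w=1440, h=860"
WINDOW_LINE = re.compile(r"x=(-?\d+), y=(-?\d+), w=(\d+), h=(\d+)")


def parse_window(line):
    """Parse one line of window output into a dict of ints."""
    match = WINDOW_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized window line: {line!r}")
    x, y, w, h = (int(group) for group in match.groups())
    return {"x": x, "y": y, "w": w, "h": h}


def window_to_screen(window, offset_x, offset_y):
    """Translate an offset inside the window to absolute screen coordinates."""
    if not (0 <= offset_x < window["w"] and 0 <= offset_y < window["h"]):
        raise ValueError("offset falls outside the window")
    return window["x"] + offset_x, window["y"] + offset_y


window = parse_window("Safari: x=0, y=38, w=1440, h=860")
print(window_to_screen(window, 750, 362))  # (750, 400)
```

Rejecting offsets outside the window bounds catches stale measurements early, before a click lands on whatever sits behind the window.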

### 2. Screenshot + Measure (most reliable)

```bash
# Capture full screen
/usr/sbin/screencapture -x /tmp/screen.png

# Or capture a single window by its window ID (-l); -o leaves out the window shadow.
# Assumes a scriptable app such as Safari, whose AppleScript window id is the system window ID.
WINDOW_ID=$(osascript -e 'tell application "Safari" to id of front window')
/usr/sbin/screencapture -x -o -l"$WINDOW_ID" /tmp/window.png
```

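Open the screenshot, find your target pixel, and use its coordinates. On a Retina display the image is captured at physical resolution, so divide by the scale factor (usually 2) to get the logical coordinates CGEvent expects.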

**This is the most reliable method.** Don't estimate — measure.
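One caveat when measuring: a full-screen capture on a Retina display is saved at physical resolution, while click coordinates are logical. A minimal sketch of the conversion follows; `infer_scale` and `pixel_to_point` are hypothetical helpers, and the scale is assumed to be uniform across the captured display.

```python
def infer_scale(image_width_px, logical_width):
    """Estimate the scale factor from the screenshot width and the logical screen width."""
    if image_width_px <= 0 or logical_width <= 0:
        raise ValueError("widths must be positive")
    return image_width_px / logical_width


def pixel_to_point(pixel_x, pixel_y, scale=2.0):
    """Convert a pixel measured in the screenshot to logical click coordinates."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return round(pixel_x / scale), round(pixel_y / scale)


# A 2880-pixel-wide capture of a 1440-wide logical screen implies a 2x display
scale = infer_scale(2880, 1440)
print(pixel_to_point(1500, 800, scale))  # (750, 400)
```

On a non-Retina display the inferred scale is 1.0 and the coordinates pass through unchanged.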

### 3. Browser Viewport → Screen (for web automation)

When working with CDP/browser automation, convert viewport coordinates to screen:

```python
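def viewport_to_screen(window_x, window_y, outer_height, inner_height, viewport_x, viewport_y):
    """Convert a viewport point to screen coordinates (all values in logical pixels)."""
    # chrome_height covers the title bar, tabs, and address bar above the page.
    # In the browser: outer_height = window.outerHeight, inner_height = window.innerHeight
    chrome_height = outer_height - inner_height
    screen_x = window_x + viewport_x
    screen_y = window_y + chrome_height + viewport_y
    return screen_x, screen_y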
```

⚠️ **Always re-measure before clicking.** Windows move, dialogs appear, layouts shift. Screenshot → click → screenshot is the safe pattern.

---

## File Dialog Navigation

CGEvent opens the dialog. AppleScript navigates it. They're a team:

```bash
# Step 1: CGEvent click opens a file upload dialog
python3 scripts/macos_click.py click 750 400

# Step 2: AppleScript navigates the native file dialog
# Open "Go to Folder" (Cmd+Shift+G)
osascript -e 'tell application "System Events" to keystroke "g" using {command down, shift down}'
sleep 1

# Copy the file path to the clipboard (printf avoids a trailing newline), then paste it
printf '%s' "/path/to/file.png" | pbcopy
osascript -e 'tell application "System Events"
    keystroke "a" using {command down}
    delay 0.3
    keystroke "v" using {command down}
end tell'
sleep 0.5

# First Return confirms the path in Go to Folder; second Return presses Open
osascript -e 'tell application "System Events" to keystroke return'
sleep 1.5
osascript -e 'tell application "System Events" to keystroke return'
```

### Why AppleScript for dialogs?

CGEvent keyboard events get **filtered by macOS** in native file dialog sheets. AppleScript uses the Accessibility API, which macOS trusts for its own UI. Use CGEvent for clicks, AppleScript for keystrokes in native dialogs.

**Tips:**
- Paste the **full file path including filename** — macOS navigates to the directory and selects the file
- `sleep`/`delay` values matter — too fast and keystrokes get swallowed
- **Activate the target app first** before sending AppleScript keystrokes:
  ```bash
  osascript -e 'tell application "Safari" to activate'
  ```
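If you drive the dialog from Python rather than the shell, the same sequence can be expressed as data and replayed. This is a sketch under the assumptions already stated in this section (the Go to Folder shortcut, paste, two Returns, and the same delays); `file_dialog_steps` and `choose_file` are hypothetical names, not part of `macos_click.py`.

```python
import subprocess
import time


def file_dialog_steps(path):
    """Return (argv, stdin_text, pause_seconds) tuples for the Go to Folder sequence."""
    def keystroke(expr):
        return ["osascript", "-e", f'tell application "System Events" to {expr}']

    return [
        (keystroke('keystroke "g" using {command down, shift down}'), None, 1.0),
        (["pbcopy"], path, 0.0),  # put the full path on the clipboard
        (keystroke('keystroke "a" using {command down}'), None, 0.3),
        (keystroke('keystroke "v" using {command down}'), None, 0.5),
        (keystroke("keystroke return"), None, 1.5),  # confirm Go to Folder
        (keystroke("keystroke return"), None, 0.0),  # press Open
    ]


def choose_file(path, run=subprocess.run, sleep=time.sleep):
    """Replay the steps; run and sleep are injectable so the sequence can be dry-run."""
    for argv, stdin_text, pause in file_dialog_steps(path):
        run(argv, input=stdin_text, text=True, check=True)
        sleep(pause)
```

Passing `check=True` makes a failed `osascript` call raise immediately instead of letting later keystrokes land in the wrong place. Activating the target app first, as in the tip above, is still up to the caller.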

---

## When to Use What

| Method | Web clicks | File dialogs | Native UI | React dropzones |
|---|---|---|---|---|
| CDP `.click()` | ✅ | ❌ | ❌ | ❌ |
| JS `element.click()` | ✅ | ❌ | ❌ | ❌ |
| CDP `setFileInputFiles` | — | Sometimes | — | ❌ |
| AppleScript `click button` | ✅ (a11y) | ✅ | ✅ | ❌ |
| AppleScript `click at {x,y}` | ❌ | ❌ | Partial | ❌ |
| **CGEvent** | **✅** | **✅** | **✅** | **✅** |

**CGEvent wins because** it operates at the hardware event layer. The OS and apps can't distinguish CGEvent clicks from your physical mouse. Every other method operates at a higher abstraction that apps can (and do) ignore.

---

## Common Patterns

### Upload a file to a web app
```bash
# 1. Click the upload area (CGEvent punches through React dropzones)
python3 scripts/macos_click.py click 750 400
sleep 1

# 2. Navigate the file dialog (AppleScript for native UI)
osascript -e 'tell application "System Events" to keystroke "g" using {command down, shift down}'
sleep 1
printf '%s' "/Users/me/image.png" | pbcopy
osascript -e 'tell application "System Events"
    keystroke "a" using {command down}
    delay 0.3
    keystroke "v" using {command down}
end tell'
sleep 0.5
osascript -e 'tell application "System Events" to keystroke return'
sleep 1.5
osascript -e 'tell application "System Events" to keystroke return'
```

### Click through a multi-step UI
```bash
# Screenshot → identify target → click → repeat
/usr/sbin/screencapture -x /tmp/step1.png
python3 scripts/macos_click.py click 500 300
sleep 1
/usr/sbin/screencapture -x /tmp/step2.png
python3 scripts/macos_click.py click 600 400
```
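The screenshot, identify, click loop can also be written as a small driver. In this sketch the "identify" step is a caller-supplied `locate` callback, which is hypothetical: it receives the screenshot path and returns `(x, y)`, or `None` when there is nothing left to click. The script path and `/tmp/stepN.png` naming follow the commands above.

```python
import subprocess
import time


def click_through(locate, max_steps=10, script="scripts/macos_click.py",
                  pause=1.0, run=subprocess.run, sleep=time.sleep):
    """Screenshot, ask locate() for a target, click it, and repeat. Returns the click count."""
    for step in range(1, max_steps + 1):
        shot = f"/tmp/step{step}.png"
        run(["/usr/sbin/screencapture", "-x", shot], check=True)
        target = locate(shot)  # re-measure from a fresh capture every time
        if target is None:
            return step - 1
        x, y = target
        run(["python3", script, "click", str(x), str(y)], check=True)
        sleep(pause)  # give the UI time to settle before the next capture
    return max_steps
```

The `max_steps` cap is a safety net so that a `locate` callback that never returns `None` cannot click forever.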

### Interact with a system dialog
```bash
# CGEvent triggers the dialog
python3 scripts/macos_click.py click 750 400
sleep 1

# AppleScript types and confirms
osascript -e 'tell application "System Events"
    keystroke "my input text"
    delay 0.3
    keystroke return
end tell'
```

---

## Gotchas

1. **Coordinates are absolute screen pixels.** If the window moves, your coordinates are wrong. Always re-measure.
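2. **Multi-monitor:** Coordinates span all displays. A second monitor to the right of a 1440-wide main display starts at x=1440; displays arranged to the left of or above the main display have negative coordinates.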
3. **Retina displays:** Coordinates are in *logical* pixels, not physical. A Retina MacBook at 2x still uses logical coords (e.g., 1440×900 not 2880×1800).
4. **Non-Retina displays:** 1:1 pixel mapping. What you see is what you get. External displays running in a HiDPI (scaled) mode behave like Retina displays instead.
5. **`front window` might be wrong.** With multiple windows open, use the `windows` command to list them all and pick the right one.
6. **Accessibility permissions required.** Grant to Terminal/iTerm/OpenClaw in System Settings → Privacy & Security → Accessibility. Without this, events are silently dropped.
7. **Activate the app first** for AppleScript keystrokes. Focus matters.
8. **Sleep betw