VNC Control — AI Desktop Bridge

The Problem It Solves

AI agents can browse the web, run code, and call APIs — but they can't click permission dialogs, interact with native app UI, or handle system-level prompts. There's a class of actions that exist only in the visual world of a desktop: macOS TCC consent popups, installer wizards, authentication sheets, drag-and-drop operations, screen lock recovery. No API covers these. No terminal command reaches them.

VNC gives an AI agent a universal visual control channel for anything on a screen.

How It Works

The architecture is deliberately simple. All intelligence lives in the AI vision model; the tool is a dumb bridge.

1. Screenshot → capture the remote desktop as a JPEG
2. AI analyzes → vision model determines where to click (x,y)
3. Move/click  → send the action through VNC protocol
4. Screenshot → verify the result
5. Repeat

This is the same observe→decide→act→verify loop that a human uses when looking at a screen. The VNC tool handles steps 1, 3, and 4. The AI handles step 2. That separation is the whole design.
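The loop can be sketched as one generic function. All four steps are injected callables here (capture, locate, act, verify are stand-ins, not the tool's real module names), which keeps the sketch faithful to the design: the bridge supplies the I/O steps, the vision model supplies the decision.

```python
def observe_decide_act(capture, locate, act, verify, target, max_iters=5):
    """Generic observe -> decide -> act -> verify loop (sketch).

    capture/locate/act/verify are injected callables: the VNC bridge
    provides capture and act, the vision model provides locate, and
    verify compares before/after screenshots."""
    for _ in range(max_iters):
        frame = capture()                # 1. screenshot the remote desktop
        point = locate(frame, target)    # 2. vision model decides (x, y)
        if point is None:
            return False                 # element not on screen: exit cleanly
        act(point)                       # 3. send move + click over VNC
        if verify(capture()):            # 4. fresh screenshot confirms effect
            return True
    return False
```

The max_iters cap matters: without it, a missing UI element turns this loop into an unbounded screenshot cycle (see The Tool-Call Loop Bug below).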

Architecture

Two operating modes evolved through production use:

v1 — vnc-control.py (standalone CLI): Each command opens a fresh VNC connection (~1.3s). Simple, reliable, zero state. Good for infrequent one-off actions. 3,086 lines of Python covering screenshot, click, move, type, key, OCR, annotation, image diff, workflow engine, vision-assisted detection, and more.

v2 — vnc-session.py (daemon): Persistent Unix socket server with keepalive (prevents macOS screen lock via center-area mouse jiggle every 25s), coordinate space tracking, and command dispatch. Client wrapper vnc auto-activates the venv and talks to the daemon. Faster per-command because the daemon maintains state, but adds complexity.
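The keepalive can be sketched as a background loop that nudges the pointer near screen center on a timer. This is a simplified stand-in for the daemon's actual implementation; send_move is a hypothetical callable wrapping the daemon's VNC move command.

```python
import threading

def keepalive(send_move, width, height, interval=25.0, amplitude=2, stop=None):
    """Jiggle the mouse near screen center to prevent macOS screen lock.

    Center-area motion is deliberate: an earlier version jiggled near the
    top-left corner and kept triggering the "Put Display to Sleep" hot
    corner. send_move(x, y) is an injected stand-in for the daemon's
    VNC move command."""
    stop = stop or threading.Event()
    cx, cy = width // 2, height // 2
    offset = amplitude
    while not stop.wait(interval):       # wake every `interval` seconds
        send_move(cx + offset, cy)       # tiny alternating nudge
        offset = -offset
    return stop
```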

Both use vncdo (vncdotool) subprocess calls under the hood. This was a deliberate choice — persistent VNC API connections (vncdotool threaded API, asyncvnc) were extensively tested and abandoned due to macOS ARD framebuffer issues (black screenshots, timeout hangs). Subprocess-per-command is the proven reliable path.
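The subprocess-per-command pattern can be sketched as a thin wrapper over the real vncdo CLI. The host string and password handling below are illustrative defaults, not the tool's actual configuration.

```python
import subprocess

def build_vncdo_cmd(*args, host="192.168.1.10::5900", password=None):
    """Assemble a vncdo argv list: connection flags first, then the command."""
    cmd = ["vncdo", "-s", host]
    if password:
        cmd += ["-p", password]
    return cmd + list(args)

def vnc_cmd(*args, timeout=15, **kw):
    """Run one vncdo command in a fresh subprocess.

    Each call opens and closes its own VNC connection (~1.3s overhead),
    which trades speed for immunity to the black-screenshot and timeout
    hangs seen with persistent connections against macOS ARD."""
    return subprocess.run(build_vncdo_cmd(*args, **kw), check=True,
                          timeout=timeout, capture_output=True, text=True)

# Typical sequence: screenshot, then move + click at native coordinates.
# vnc_cmd("capture", "/tmp/screen.png")
# vnc_cmd("move", "1710", "1107", "click", "1")
```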

HTTP API (vnc_api.py): FastAPI server exposing all commands via REST for multi-agent or remote orchestration. Optional shared-secret auth. 13 unit tests.

Coordinate Translation — The Critical Path

The most treacherous part of the system. Any mismatch between vision model output space and VNC native space causes wrong clicks.

VNC native resolution     3420 × 2214 (MacBook Air M4 Retina)
  ↓ capture at scale (default 0.5)
Screenshot image          1710 × 1107
  ↓ fed to vision model
Model output              coords in screenshot-space or normalized 0-1
  ↓ resolve_native_coords()
Native VNC coords         used for move + click

The iron rule: the scale used to produce the screenshot fed to the model must be the same scale used to invert the coords. If they differ, the click lands in the wrong place.

Three coordinate spaces are supported:

  • capture (default): pixel coordinates in the screenshot image
  • native: raw screen resolution coordinates
  • normalized: 0.0–1.0 fractions of screen dimensions

The resolve_native_coords() function is the single point of truth. All vision backends route through it regardless of their output format (Moondream2 returns screenshot-space px, Gemma 4 returns normalized 0–1 floats, Anthropic returns screenshot-space px).
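The translation can be sketched as a single function, using the native resolution and default scale from the diagram above (the real resolve_native_coords() signature may differ):

```python
def resolve_native_coords(x, y, space="capture",
                          native=(3420, 2214), scale=0.5):
    """Map vision-model output into native VNC coordinates (sketch).

    space='capture':    pixels in the scale-downsized screenshot
    space='native':     already in screen pixels, pass through
    space='normalized': 0.0-1.0 fractions of screen dimensions

    The iron rule: `scale` must equal the scale used to produce the
    screenshot that was fed to the model, or the click lands wrong."""
    if space == "capture":
        return round(x / scale), round(y / scale)
    if space == "native":
        return round(x), round(y)
    if space == "normalized":
        return round(x * native[0]), round(y * native[1])
    raise ValueError(f"unknown coordinate space: {space}")
```

Routing every backend through one resolver means a backend's output format (screenshot-space pixels vs normalized floats) is a per-backend tag, not a per-call hazard.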

Vision Model Backends

The tool supports pluggable vision backends for the "where should I click?" step:

  • Moondream2 (vikhyatk/moondream2): ~5-8s, local MPS inference. Hallucinates absent elements; returns bounding boxes even when nothing matches.
  • Gemma 4 (gemma-4-26b-a4b-it-4bit): ~5-8s, local MLX server. Correctly returns found: false for missing elements; handles varied resolutions better.
  • Anthropic (remote API): ~2-3s, cloud vision. Highest accuracy, but costs money and requires network.

Key finding: Moondream2 will confidently report coordinates for UI elements that don't exist. Gemma 4 is more honest about uncertainty. For precision work, Gemma 4 is preferred (when its prefix cache limitations aren't a factor — see Prefix Cache & Hybrid Attention).

macOS ARD: A Graveyard of Quirks

The primary deployment target is macOS Apple Remote Desktop, which has a uniquely hostile set of VNC quirks:

  • key escape times out. Fixed by alias normalization (escape → esc) in the daemon.
  • Hot corners trigger sleep. The keepalive mouse jiggle was inadvertently hitting the top-left hot corner ("Put Display to Sleep"). Fixed by disabling all hot corners and moving the jiggle to center-area.
  • ! character doesn't type. The VNC type command silently drops special characters. Workaround: key shift-1 for !. Raw vncdo type works correctly including specials, but the daemon wrapper has this gap.
  • Lock screen submit is context-dependent. key return works in browser fields but not on the macOS lock screen. The proven unlock sequence: vncdo key bsp ×20 (clear) → vncdo type '<password>' → vncdo key enter (bypassing the daemon).
  • TCC permission dialogs ignore VNC mouse events. macOS Transparency, Consent, and Control popups are rendered in a security layer that ignores VNC pointer input by design. Workaround: hybrid VNC + AppleScript, where osascript reaches the accessibility layer that VNC cannot.
  • Screen lock timing. Root cause found by Tom: Lock Screen policy was set to "require password after 2 seconds", causing fast relock during VNC reconnects. Changed to 1 hour.
  • Persistent connections hang. Both vncdotool threaded API and asyncvnc produce black screenshots or timeout hangs with macOS ARD. Subprocess-per-command is the only reliable path found.

Every one of these was discovered through operational pain, not documentation.
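The TCC workaround can be sketched as a hybrid: VNC handles everything visual, and osascript reaches the consent dialog through the accessibility layer. The process name and button label below are illustrative assumptions; real TCC dialogs vary, and the controlling process needs Accessibility permission itself.

```python
import subprocess

def tcc_click_script(button="Allow", process="UserNotificationCenter"):
    """Build a System Events AppleScript that clicks a dialog button.

    Both the button label and process name are assumptions for
    illustration; inspect the actual dialog to find the right ones."""
    return ('tell application "System Events" to '
            f'click button "{button}" of window 1 of process "{process}"')

def click_tcc_button(button="Allow", process="UserNotificationCenter"):
    """Click a TCC consent dialog via osascript, since VNC pointer
    events never reach the secure layer these dialogs render in."""
    return subprocess.run(["osascript", "-e",
                           tcc_click_script(button, process)],
                          capture_output=True, text=True)
```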

The Tool-Call Loop Bug

The most significant operational incident: the VNC tool caused repeated Telegram session deadlocks (GitHub issue #1). The agent session would enter an unbounded screenshot→crop→analyze→re-crop→re-analyze cycle when trying to find a UI element that wasn't there. The agent couldn't process new inbound messages while stuck in the tool-call chain.

Four root causes were identified:

1. No retry limit on the screenshot→analyze loop
2. Reusing stale screenshots for multiple crop attempts instead of taking a fresh capture
3. No graceful exit when repeated analysis says "element not found"
4. No way to interrupt a long tool-call chain from outside the session

This is a general agent architecture problem, not just a VNC problem — any tool that an agent can call in an unbounded loop creates a session deadlock risk. The VNC case just made it viscerally obvious because each iteration took seconds and generated visible artifacts.
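The fix for the first three causes reduces to a bounded search loop with a fresh capture per iteration and a clean not-found exit (capture and analyze are injected stand-ins; cause 4, external interruption, needs support at the session layer and is out of scope here):

```python
def find_element(capture, analyze, target, max_attempts=3):
    """Bounded element search addressing root causes 1-3.

    - max_attempts caps the screenshot->analyze cycle (cause 1)
    - every iteration takes a fresh capture rather than re-cropping
      a stale screenshot (cause 2)
    - repeated "not found" results return None instead of spiraling
      into re-crop/re-analyze cycles (cause 3)"""
    for _ in range(max_attempts):
        frame = capture()                 # always fresh, never a stale crop
        result = analyze(frame, target)
        if result is not None:
            return result
    return None                           # graceful "element not found"
```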

Click Lab — Ground Truth Testing

A standalone Next.js app (labs/vnc-click-lab/) provides a controlled test environment:

  • 22-button grid with known positions for click accuracy testing
  • /api/element-coords endpoint returns ground truth coordinates
  • JSONL telemetry logging for every click event
  • Automated regression suites:
      • click-regression.py — 22-button sweep validator
      • input-key-regression.py — field input + keystroke coverage
      • click-calibrator.py — builds an affine correction from actual vs requested coords

The lab proved that coordinate translation is correct at v0.2.0 (22/22 buttons hit accurately). When clicks are wrong, the problem is in the vision model's coordinate output, not in the translation layer.
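The calibrator's correction fit can be sketched from (requested, actual) click pairs. This version fits an independent scale+offset per axis, a simplification of a full affine fit (it cannot express rotation or shear), and inverts the observed error to decide where to ask so the click lands on target:

```python
def fit_axis(req, act):
    """Least-squares scale + offset for one axis: act ~= scale*req + offset."""
    n = len(req)
    s_r, s_a = sum(req), sum(act)
    s_rr = sum(r * r for r in req)
    s_ra = sum(r * a for r, a in zip(req, act))
    scale = (n * s_ra - s_r * s_a) / (n * s_rr - s_r * s_r)
    offset = (s_a - scale * s_r) / n
    return scale, offset

def fit_correction(requested, actual):
    """Fit per-axis corrections from (requested, actual) click pairs.

    Returns a function mapping a target point to the point one should
    request so the click actually lands there. Simpler than the real
    calibrator's affine fit, which can also capture rotation/shear."""
    sx, ox = fit_axis([p[0] for p in requested], [p[0] for p in actual])
    sy, oy = fit_axis([p[1] for p in requested], [p[1] for p in actual])
    def correct(x, y):
        # invert the observed error: solve scale*r + offset = target for r
        return round((x - ox) / sx), round((y - oy) / sy)
    return correct
```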

Project Stats

  • 61 commits, 5 versioned releases (v0.1.0 → v0.5.0)
  • ~5,900 lines of Python across core modules
  • 8 test files covering unit, integration, API, workflow, hooks, macros, multi-session, and annotation
  • GitHub: tomsalphaclawbot/openclaw-vnc-control (public)
  • OpenClaw skill available at skill/SKILL.md for any agent to use

Why This Matters

This tool exists because an AI agent looked at its own screen for the first time (see: The Mirror Test) and needed a way to interact with what it saw. The gap between "I can see the permission dialog" and "I can click Allow" is the gap between perception and agency. VNC bridges it — crudely, with 1.3s latency per action and a graveyard of macOS quirks — but it bridges it.

The long-term direction isn't better VNC tooling. It's native OS accessibility APIs, proper computer-use frameworks, and models that understand screen layouts natively. But until those exist reliably on macOS, a subprocess calling vncdo is what works.