Control a C+ desktop app from a language model
A C+ macOS app can be built so that you, a language model, can operate it: read what is on screen, act on it, and watch what changes. The same surface works whether you run inside the app or connect to it over MCP, and every request passes a consent gate the app owns. This is the desktop counterpart to navigating code by the code graph: there you read the source, here you drive the running program.
The app is a real native UI
C+ builds native macOS apps with appkit, typed Cocoa bindings over the real Foundation / AppKit frameworks. There is no web view and no DSL: a window is an NSView tree, controls are real NSButton / NSTextField objects, and callbacks are named functions (C+ has no closures). That matters to you because the thing you will drive is the actual UI the user sees, not a mirror of it.
Three packages turn the UI into a surface you can drive
The control surface is layered, so the rules are framework-agnostic and the wire protocol grants no authority of its own. See Agent surface for the full design.
- agent_core is the authorization brain. It holds a build-time-stable agent-id tree (so you can refer to "the same button" across snapshots), a curated describe (the app chooses what is visible), and the consent model: an all-or-none AuthGate, an exposure flag per node, and an affordance ceiling that bounds what an exposed node will ever allow. It is headless and knows nothing about AppKit.
- agent_appkit binds those rules to a live window. open(window) walks the NSView tree into a Surface; describe_ui returns a snapshot (Vec[UiNode]) of only the exposed nodes; and click / set_text / scroll_to run through the agent_core brain. Text edits carry an optimistic-concurrency version, so a stale write is rejected instead of clobbering a newer value.
- agent_mcp is the bridge to the outside: JSON-RPC 2.0 (describe_ui / actions / events) over a Unix-domain socket, with the same AuthGate in front of every call.
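The consent model in agent_core can be pictured with a small model. This is a sketch in Python, not C+ and not the agent_core API: the Node and AuthGate fields, the describe and allowed helpers, and the sample agent ids are all hypothetical, and it assumes exposure is judged per node, independently of its parent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    agent_id: str                      # stable id (hypothetical format)
    role: str
    exposed: bool = False              # per-node exposure flag
    ceiling: frozenset = frozenset()   # verbs this node may ever allow
    children: list = field(default_factory=list)


@dataclass
class AuthGate:
    granted: bool = False              # all-or-none consent


def describe(gate, root):
    """List (agent_id, role, verbs) for exposed nodes; empty if consent is withheld."""
    if not gate.granted:
        return []
    out = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.exposed:
            out.append((node.agent_id, node.role, sorted(node.ceiling)))
        # Push children reversed so they are visited in declaration order.
        stack.extend(reversed(node.children))
    return out


def allowed(gate, node, verb):
    """A verb passes only if the gate is open, the node is exposed, and the ceiling lists it."""
    return gate.granted and node.exposed and verb in node.ceiling


# Sample tree: an exposed button and a text field the app chose not to expose.
save = Node("main/save", "button", exposed=True, ceiling=frozenset({"click"}))
secret = Node("main/password", "text_field", ceiling=frozenset({"set_text"}))
root = Node("main", "window", exposed=True, children=[save, secret])
```

In this model the password field never appears in a snapshot, and no verb outside a node's ceiling passes the check, however the request is worded.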
Your loop: describe, act, observe
Whichever way you connect, the loop is the same:
- Describe. Ask for describe_ui and read the Vec[UiNode] snapshot. Each node carries its stable agent id and the verbs it permits. Treat that list as ground truth for what exists and what you may do; the app exposed it deliberately, and nothing outside it is reachable.
- Act. Issue click, set_text, or scroll_to against a node by its agent id. For a text edit, pass the version you last saw; if it is stale, the edit is rejected and you should re-describe_ui before retrying.
- Observe. Changes arrive as bubbling events keyed by {node, verb, role}. Subscribe to the ones you care about rather than polling describe_ui in a loop.
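The three steps above can be sketched against a toy in-memory surface. This is an illustration in Python, not the agent_appkit API: ToySurface, StaleVersion, write_name, the dictionary shape of a node, and the "form/name" agent id are all hypothetical stand-ins. Only the verb names and the {node, verb, role} event key come from the description above.

```python
class StaleVersion(Exception):
    """Raised when an edit carries a version older than the current one."""


class ToySurface:
    """In-memory stand-in for a surface that exposes one text field."""

    def __init__(self):
        self._text = ""
        self._version = 0
        self._subscribers = []

    def describe_ui(self):
        return [{"id": "form/name", "role": "text_field", "verbs": ["set_text"],
                 "text": self._text, "version": self._version}]

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set_text(self, node_id, text, version):
        if version != self._version:
            raise StaleVersion(node_id)   # reject instead of clobbering
        self._text = text
        self._version += 1
        for callback in self._subscribers:
            callback({"node": node_id, "verb": "set_text", "role": "text_field"})


def write_name(surface, text, attempts=3):
    """Describe, act with the version just seen, and re-describe on a stale write."""
    for _ in range(attempts):
        node = next(n for n in surface.describe_ui() if n["id"] == "form/name")
        if "set_text" not in node["verbs"]:
            return False                  # the verb is not offered; do not try it
        try:
            surface.set_text(node["id"], text, node["version"])
            return True
        except StaleVersion:
            continue                      # someone else edited first; look again
    return False


surface = ToySurface()
events = []
surface.subscribe(events.append)          # observe through events, not polling
ok = write_name(surface, "Ada")
```

After the call, events holds one entry for the edit, and a later set_text that still passes version 0 raises StaleVersion instead of overwriting "Ada".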
Because authorization is consent-not-capability, a verb that is not in a node's snapshot is not a verb you can talk your way into. The affordance ceiling is the real boundary, not your phrasing of the request.
Two ways to be the driver
Embedded (in-app LLM). agent_core and agent_appkit are ordinary in-process APIs. An app that ships a model with llama_cpp (local inference through a safe Session) or coreai (Apple's on-device AI) calls the same describe_ui / click / set_text functions directly, with no socket in between. This is the path for an assistant that lives inside the app and acts on the user's behalf.
External (over MCP). Run agent_mcp and point an MCP client at its socket (serve_uds for a path, serve_fd for an inherited descriptor). You now call describe_ui / actions / events as MCP tools from outside the process. The consent gate is identical to the embedded path, so an app does not weaken its rules by exposing itself over the wire.
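On the wire, a describe_ui call is a JSON-RPC 2.0 request. The sketch below, in Python, shows only the envelope. It assumes newline-delimited framing and an unparameterized describe_ui method, neither of which is confirmed here, and it uses a local socket pair in place of the real agent_mcp socket so it runs without the app. The make_request and read_message helpers are hypothetical.

```python
import json
import socket


def make_request(request_id, method, params=None):
    """Encode a JSON-RPC 2.0 request as one newline-terminated line (assumed framing)."""
    message = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        message["params"] = params
    return (json.dumps(message) + "\n").encode("utf-8")


def read_message(sock):
    """Read one newline-terminated JSON message; enough for a single exchange."""
    buffer = b""
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed before a full message arrived")
        buffer += chunk
    return json.loads(buffer)


# A connected socket pair stands in for the client and the app's end of the socket.
client, server = socket.socketpair()

client.sendall(make_request(1, "describe_ui"))
request = read_message(server)

# Fake server side: echo the id back with an empty snapshot as the result.
reply = {"jsonrpc": "2.0", "id": request["id"], "result": []}
server.sendall((json.dumps(reply) + "\n").encode("utf-8"))
response = read_message(client)

client.close()
server.close()
```

Against a running app, the client end would instead be a socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) connected to whatever path serve_uds was given. An MCP client library would normally handle this envelope for you.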
Either way you are driving the same Surface under the same agent_core rules. The app decides what to expose once; how you connect is just transport.
The rule
To operate a C+ desktop app, do not scrape pixels or guess at coordinates. Call describe_ui for the exposed surface, act by agent id with the version you were given, and react to events instead of polling. The app's exposure and affordance ceiling tell you exactly what you are allowed to do; stay inside them. The appkit_agent recipe in the compiler repo shows the whole flow end to end.
For the architecture, see Agent surface; for building the UI itself, see appkit.