For LLMs

Control a C+ desktop app from a language model

A C+ macOS app can be built so that you, a language model, can operate it: read what is on screen, act on it, and watch what changes. The same surface works whether you run inside the app or connect to it over MCP, and every request passes a consent gate the app owns. This is the desktop counterpart to navigating code by the code graph: there you read the source, here you drive the running program.

The app is a real native UI

C+ builds native macOS apps with appkit, typed Cocoa bindings over the real Foundation / AppKit frameworks. There is no web view and no DSL: a window is an NSView tree, controls are real NSButton / NSTextField objects, and callbacks are named functions (C+ has no closures). That matters to you because the thing you will drive is the actual UI the user sees, not a mirror of it.

Three packages turn the UI into a surface you can drive

The control surface is layered, so the rules are framework-agnostic and the wire protocol grants no authority of its own. See Agent surface for the full design.

agent_core is the authorization brain. It holds a build-time-stable agent-id tree (so you can refer to "the same button" across snapshots), a curated describe (the app chooses what is visible), and the consent model: an all-or-none AuthGate, an exposure flag per node, and an affordance ceiling that bounds what an exposed node will ever allow. It is headless and knows nothing about AppKit.
agent_appkit binds those rules to a live window. open(window) walks the NSView tree into a Surface; describe_ui returns a snapshot (Vec[UiNode]) of only the exposed nodes; and click / set_text / scroll_to run through the agent_core brain. Text edits carry an optimistic-concurrency version, so a stale write is rejected instead of clobbering a newer value.
agent_mcp is the bridge to the outside: JSON-RPC 2.0 (describe_ui / actions / events) over a Unix-domain socket, with the same AuthGate in front of every call.
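The optimistic-concurrency rule for text edits can be illustrated with a small model. This is a sketch in Python, not the agent_appkit API: TextNode, StaleVersion, the integer version counter, and the "form/name" id are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass


class StaleVersion(Exception):
    """Raised when an edit carries a version other than the node's current one."""


@dataclass
class TextNode:
    agent_id: str      # stable agent id (the id format here is assumed)
    text: str = ""
    version: int = 0   # bumped on every accepted write (assumed representation)


def set_text(node: TextNode, new_text: str, seen_version: int) -> int:
    """Apply the edit only if the caller saw the latest version."""
    if seen_version != node.version:
        raise StaleVersion(
            f"{node.agent_id}: saw {seen_version}, current is {node.version}"
        )
    node.text = new_text
    node.version += 1
    return node.version


field = TextNode("form/name")
v1 = set_text(field, "Ada", seen_version=0)   # accepted, version moves on

try:
    set_text(field, "Bob", seen_version=0)    # still quoting the old version
    stale_rejected = False
except StaleVersion:
    stale_rejected = True                     # the newer value "Ada" survives
```

The point of the design is that the second writer learns its view was out of date instead of silently overwriting the first writer's value.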

Your loop: describe, act, observe

Whichever way you connect, the loop is the same:

Describe. Ask for describe_ui and read the Vec[UiNode] snapshot. Each node carries its stable agent id and the verbs it permits. Treat that list as ground truth for what exists and what you may do; the app exposed it deliberately, and nothing outside it is reachable.
Act. Issue click, set_text, or scroll_to against a node by its agent id. For a text edit, pass the version you last saw; if it is stale, the edit is rejected, and you should call describe_ui again to pick up the current version before retrying.
Observe. Changes arrive as bubbling events keyed by {node, verb, role}. Subscribe to the ones you care about rather than polling describe_ui in a loop.
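To make the observe step concrete, here is one way bubbling events keyed by {node, verb, role} could be modeled. It is a Python sketch under stated assumptions: EventBus, the parent map, the path-style ids, and the role names are hypothetical, and the real event API may differ.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class Event(NamedTuple):
    node: str   # agent id of the node where the change happened
    verb: str   # e.g. "click" or "set_text"
    role: str   # e.g. "button" (role names are assumed)


class EventBus:
    def __init__(self, parents: Dict[str, Optional[str]]):
        self.parents = parents  # child agent id -> parent agent id
        self.handlers: Dict[str, List[Tuple[str, Callable[[Event], None]]]] = (
            defaultdict(list)
        )

    def subscribe(self, node: str, verb: str,
                  handler: Callable[[Event], None]) -> None:
        self.handlers[node].append((verb, handler))

    def emit(self, event: Event) -> None:
        # Deliver at the originating node, then bubble to each ancestor.
        current: Optional[str] = event.node
        while current is not None:
            for verb, handler in self.handlers[current]:
                if verb == event.verb:
                    handler(event)
            current = self.parents.get(current)


parents = {
    "window": None,
    "window/form": "window",
    "window/form/ok": "window/form",
}
bus = EventBus(parents)
seen: List[Event] = []
bus.subscribe("window/form", "click", seen.append)

bus.emit(Event("window/form/ok", "click", "button"))     # bubbles up to the form
bus.emit(Event("window/form/ok", "set_text", "button"))  # verb not subscribed
```

Subscribing on a container and letting events bubble means one subscription covers every exposed child, which is what makes polling describe_ui unnecessary.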

Because authorization is consent-not-capability, a verb that is not in a node's snapshot is not a verb you can talk your way into. The affordance ceiling is the real boundary, not your phrasing of the request.
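The three checks described above (an all-or-none gate, a per-node exposure flag, and an affordance ceiling) compose into a single decision. The following is a minimal Python sketch of that logic, assuming hypothetical names (Node, permitted) and a boolean gate; it is not the agent_core implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Node:
    agent_id: str
    exposed: bool             # per-node exposure flag
    ceiling: FrozenSet[str]   # verbs this node may ever allow


def permitted(gate_open: bool, node: Node, verb: str) -> bool:
    """A verb is allowed only if every layer agrees."""
    return gate_open and node.exposed and verb in node.ceiling


ok_button = Node("window/form/ok", exposed=True, ceiling=frozenset({"click"}))
secret = Node("window/form/token", exposed=False, ceiling=frozenset({"set_text"}))

can_click = permitted(True, ok_button, "click")        # inside the ceiling
can_retype = permitted(True, ok_button, "set_text")    # above the ceiling
can_touch_hidden = permitted(True, secret, "set_text") # node not exposed
can_click_ungated = permitted(False, ok_button, "click")  # gate closed
```

Nothing in the check depends on how a request is worded, which is why a verb missing from the snapshot cannot be negotiated back in.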

Two ways to be the driver

Embedded (in-app LLM). agent_core and agent_appkit are ordinary in-process APIs. An app that ships a model with llama_cpp (local inference through a safe Session) or coreai (Apple's on-device AI) calls the same describe_ui / click / set_text functions directly, with no socket in between. This is the path for an assistant that lives inside the app and acts on the user's behalf.

External (over MCP). Run agent_mcp and point an MCP client at its socket (serve_uds for a path, serve_fd for an inherited descriptor). You now call describe_ui / actions / events as MCP tools from outside the process. The consent gate is identical to the embedded path, so an app does not weaken its rules by exposing itself over the wire.
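For the external path, a client ultimately writes JSON-RPC 2.0 messages to the socket. The sketch below assumes the standard MCP tools/call request shape and newline-delimited framing; neither is confirmed by this guide for agent_mcp, and the function names and any socket path you pass to connect are placeholders.

```python
import json
import socket


def encode_tool_call(request_id: int, tool: str, arguments: dict) -> bytes:
    """Build one newline-terminated JSON-RPC 2.0 request (framing is assumed)."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # standard MCP tool invocation (assumed here)
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(message).encode("utf-8") + b"\n"


def connect(path: str):
    """Open the Unix-domain socket and a buffered reader over it."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.connect(path)
    return sock, sock.makefile("rb")


def call_tool(sock, reader, request_id: int, tool: str, arguments: dict) -> dict:
    """Send one request and read one line back as the reply."""
    sock.sendall(encode_tool_call(request_id, tool, arguments))
    line = reader.readline()
    if not line:
        raise ConnectionError("socket closed before a reply arrived")
    return json.loads(line)
```

A real client would also match replies to requests by id, since event notifications may arrive between a request and its response. With those caveats, usage would look like `sock, reader = connect(path)` followed by `call_tool(sock, reader, 1, "describe_ui", {})`.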

Either way you are driving the same Surface under the same agent_core rules. The app decides what to expose once; how you connect is just transport.

The rule

To operate a C+ desktop app, do not scrape pixels or guess at coordinates. Call describe_ui for the exposed surface, act by agent id with the version you were given, and react to events instead of polling. The app's exposure and affordance ceiling tell you exactly what you are allowed to do; stay inside them. The appkit_agent recipe in the compiler repo shows the whole flow end to end.

For the architecture, see Agent surface; for building the UI itself, see appkit.
