Cua — Investment Memo · Orange Collective

Thesis

Every computer-use agent needs a desktop. Cua is the open-source container — Docker for computer-use AI — that gives any agent a sandboxed cloud desktop in one command, hot-started in under a second.^[1] ^[2] The longer-term bet: as Anthropic, OpenAI, and every framework downstream push agents from the browser onto the OS, Cua becomes the default infrastructure layer underneath them. The model layer ships the brains; Cua ships the body.

01
Computer-use is the next API surface. Anthropic shipped computer use with Claude 3.5 Sonnet in October 2024 — 14.9% on OSWorld, nearly 2× the next-best model.^[7] OpenAI followed with Operator in January 2025; Google with Gemini 2.5 Computer Use in October 2025.^[4] ^[19] By February 2026 the frontier had crossed OSWorld's 72.36% human baseline.^[16] ^[17] GUIs are the universal interface: the agent that controls a desktop can do any work a human can. Browser-only is a subset; the full stack is the OS.
02
Docker is the right metaphor — and the right wedge. A single command spins up a sandboxed desktop. Containerization unlocks fan-out, reproducibility, and security for agent workloads. Browserbase did this for browsers and built a real business on top; Cua extends the same primitive to macOS, Linux, Windows, and Android.^[5] ^[2]
03
OSS wins this category by default. Builders need to inspect, fork, and self-host computer-use infrastructure; the security surface is too sensitive to outsource to a closed binary. Cua is MIT-licensed, has 17.9k GitHub stars and growing, and integrates natively with Claude Code, Cursor, and Codex, with OpenAI, Anthropic, Ollama, and OpenRouter as model providers.^[1] The OSS core is the distribution; the managed cloud is the renewal.
04
The founder is the infra builder. Francesco co-created Windows Agent Arena at Microsoft AI — the canonical benchmark for OS-level computer-use agents — before starting Cua.^[3] The person who designed the eval is now building the runtime everyone else has to pass it on. That sequence — research the problem, define the benchmark, ship the production primitive — is structurally hard to imitate.

Problem

Every computer-use agent has the same first problem: where does it run?

Letting the agent loose on the user's host is a security non-starter: the LLM hallucinates an rm -rf, exfiltrates a credential, or installs a payload. Spinning up a desktop VM is the obvious answer, but every team building one ends up rebuilding the same six layers: Apple Virtualization Framework wiring, screen capture, input simulation, container orchestration, isolation policy, and a hot-start runtime that boots fast enough for a chat-speed agent loop.^[1]

Browser agents got a head start because the browser is already a sandbox. Browser Use commoditized the agent framework on top — 98k+ stars and a $17M Felicis-led seed.^[13] ^[21] Browserbase raised a $40M Series B at a $300M valuation to build managed browser infrastructure underneath.^[22] But the browser is a subset — Anthropic and OpenAI both shipped computer-use APIs at the OS level precisely because the bulk of real work happens outside the tab.^[7] ^[4]

There is no Browserbase for the desktop. Every team trying to ship a CUA-class product pays the infrastructure tax in-house, and the tax is high enough that some give up and ship browser-only. Cua is the missing primitive: one command launches an MIT-licensed Linux, macOS, Windows, or Android container, exposed through a computer-use interface any LLM can drive and running at 97% native CPU on Apple Silicon.^[1] ^[2]

72.7%

Claude Opus 4.6 on OSWorld

Feb 2026, above the 72.36% human baseline^[16]

5×

OSWorld SOTA gain in 16 months

14.9% (Oct 2024) → 72.7% (Feb 2026); model risk retired^[7] ^[16]

$22B → $90B+

RPA market 2024 → 2030

The legacy automation budget computer-use replaces^[6]

The model layer is past proof-of-concept — past the human baseline, in fact. The runtime layer is the open question — and the one Cua exists to answer.
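
A quick arithmetic check on the headline figures above, as a small Python sketch. It uses only numbers quoted in this section; the constant and function names are illustrative and do not come from any source.

```python
# Figures quoted in this memo (OSWorld scores, in percent).
FIRST_SCORE = 14.9       # Claude 3.5 Sonnet, Oct 2024
LATEST_SCORE = 72.7      # Claude Opus 4.6, Feb 2026
HUMAN_BASELINE = 72.36   # OSWorld human baseline


def months_between(start_year, start_month, end_year, end_month):
    """Whole calendar months from the start month to the end month."""
    return (end_year - start_year) * 12 + (end_month - start_month)


elapsed_months = months_between(2024, 10, 2026, 2)
gain_multiple = LATEST_SCORE / FIRST_SCORE
margin_over_human = LATEST_SCORE - HUMAN_BASELINE

print(f"elapsed: {elapsed_months} months")
print(f"gain: {gain_multiple:.2f}x")
print(f"margin over human baseline: {margin_over_human:.2f} points")
```

The multiple rounds to the 5× shown above, over a 16-month window. The margin over the human baseline is roughly a third of a percentage point.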

Why Now

The agent stack split into model, framework, and infrastructure layers — inside a single year.

Anthropic and OpenAI validated the demand curve at the model layer. Browser Use and Browserbase proved the framework/infrastructure split works. Cua is the OS-level counterpart to Browserbase: the runtime under everything else.

Developers can direct Claude to use computers the way people do — by looking at a screen, moving a cursor, clicking buttons, and typing text.

Anthropic^[7]

Claude 3.5 Sonnet launch · Oct 2024

Operator can fill forms, place online orders, schedule appointments, and complete other repetitive tasks — a first step toward AI that acts on your behalf across the digital world.

OpenAI^[4]

Operator launch · Jan 2025

The agent stack is splitting into model, framework, and infrastructure layers — the value will compound at the layer that handles the runtime, not the one that handles the prompt.

Browser agent market map^[11]

Theta Labs · 2025

Three preconditions converged in eighteen months.

All three frontier labs now ship computer-use models. Anthropic opened the category with Claude 3.5 Sonnet in October 2024 — 14.9% on OSWorld, nearly 2× the next-best system, with Replit, Canva, Asana, DoorDash, Cognition, and The Browser Company on launch day.^[7] OpenAI followed with Operator in January 2025, then folded it into ChatGPT agent in mid-2025.^[4] ^[20] Google joined in October 2025 with the Gemini 2.5 Computer Use model.^[19] Three labs, one shared assumption: the customer brings the desktop.

The models crossed the human baseline. Claude Sonnet 4.5 hit 61.4% on OSWorld in September 2025 — up from Sonnet 4's 42.2% just four months earlier.^[15] Simular's Agent S crossed OSWorld's 72.36% human baseline in December 2025 at 72.6%; Claude Opus 4.6 followed at 72.7% in February 2026.^[17] ^[16] Model capability is no longer the gating risk. The bottleneck moved to the runtime — boot time, snapshot/restore, isolation, fleet orchestration — exactly the surface Cua engineers against.^[1]

Capital marked the adjacent layers — and left this one open. In eleven months, investors priced every neighboring shelf of the agent-sandbox stack: Browser Use's $17M seed, then Browserbase's $40M Series B at $300M, E2B's $21M Series A with 88% of the Fortune 100 on the platform, and Daytona's $24M Series A.^[21] ^[22] ^[23] ^[24] Browser sandboxes and code sandboxes are now funded categories. The OS-level desktop runtime — the layer the computer-use models actually assume — is the one seat still unpriced, and Cua holds the OSS default with 17.9k stars and 50k+ engineers.^[1] ^[2]

How It Works

Three layers. One container per agent. Boot to action in under a second.

CuaBot driving LibreOffice Calc, a native desktop app, inside a Cua sandbox from three CLI commands: click, type, verify.^[1] ^[26]

The 2026 surface: from one container to a product line.

Cua Driver — background computer-use. Agents drive native apps on macOS and Windows without stealing the cursor, focus, or Space — including non-AX surfaces like Chromium web content and canvas tools (Blender, Figma, DAWs, game engines). Same CLI and MCP server from Claude Code, Cursor, Codex, and custom clients; every session recorded as a replayable trajectory.^[1] ^[26] This is the feature that turns computer-use from a demo you watch into a coworker that runs while you keep typing.

One SDK, every OS, bring your own image. Sandbox.ephemeral(Image.linux()) — or .macos(), .windows(), .android() — against the cloud or local QEMU, with custom .qcow2/.iso images self-hosted today and in the cloud next.^[1] The uniform API across six runtimes is the contract competitors with one substrate cannot offer.

Cua-Bench — the eval and RL-environment loop. Run agents against OSWorld, ScreenSpot, Windows Arena, and custom tasks; export trajectories for training.^[1] ^[26] As labs buy RL environments for computer-use post-training, the benchmark registry is a second revenue surface — and it is the founder's home turf, given Windows Agent Arena.^[3]
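
To make the "one SDK, every OS" contract concrete, here is a minimal, self-contained Python sketch of the shape of such an API. It is not the Cua SDK. The names `Sandbox`, `Image`, and `ephemeral` echo the call quoted above; the class bodies, the `act` method, and the trajectory format are hypothetical stand-ins written for illustration only.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Image:
    """Hypothetical stand-in for an OS image descriptor."""
    os_name: str

    @classmethod
    def linux(cls):
        return cls("linux")

    @classmethod
    def macos(cls):
        return cls("macos")

    @classmethod
    def windows(cls):
        return cls("windows")

    @classmethod
    def android(cls):
        return cls("android")


@dataclass
class Sandbox:
    """Hypothetical stand-in: nothing is booted, actions are only logged."""
    image: Image
    running: bool = False
    trajectory: list = field(default_factory=list)

    @classmethod
    @contextmanager
    def ephemeral(cls, image):
        # Start a throwaway sandbox and always tear it down on exit.
        sandbox = cls(image=image, running=True)
        try:
            yield sandbox
        finally:
            sandbox.running = False

    def act(self, kind, **params):
        # Record each action so the session can be replayed or exported.
        if not self.running:
            raise RuntimeError("sandbox is not running")
        self.trajectory.append({"os": self.image.os_name, "action": kind, **params})


def run_task(image):
    """The same agent code runs unchanged against any image."""
    with Sandbox.ephemeral(image) as sandbox:
        sandbox.act("click", x=120, y=80)
        sandbox.act("type", text="hello")
        return sandbox.trajectory


for make_image in (Image.linux, Image.macos, Image.windows, Image.android):
    steps = run_task(make_image())
    print(steps[0]["os"], [step["action"] for step in steps])
```

The point of the sketch is the design choice, not the implementation: the agent code in `run_task` never branches on the operating system, and every action lands in a trajectory that could feed replay or an eval loop.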

Self-hosted is the funnel. The managed cloud is where the workload ships.

Hot-start under one second. The managed cloud snapshots a warm desktop image and restores it for every new agent session. The cost difference between a cold-boot VM and a hot-start image is roughly 60× — the difference between a runtime you can spin up per chat turn and one you can't.^[2]

Cross-OS fleet orchestration. macOS, Linux, Windows, and Android containers from one control plane. Windows desktops in particular are exclusive to the cloud — Apple Silicon licensing makes self-hosting Windows impractical for most teams, which makes managed the only path for the workloads that actually require it (legacy ERP, .NET, native enterprise tooling).^[2]

Observability, recording, and replay. Every agent action recorded as a video plus structured trace. The artifact stack is what turns an agent prototype into a production workload — eval harnesses, regression testing, incident debugging. The OSS gives you a container; the cloud gives you a system.^[2]

Docker for Computer-Use

The OS-level agent stack is at its pre-Docker moment.

The pattern is exact. Before Docker, every team rebuilt the same chroot plus init system plus image layer plus networking stack. After Docker, none of them did — and the value moved up to orchestration, registries, and managed clouds. Computer-use is at the same moment now.

The container is the wedge. The fleet is the business.

The runtime stops being a moat the moment it becomes a standard. Docker the company didn't capture the orchestration value — Kubernetes, registries, and the hyperscalers did. Cua's thesis is to be the OSS standard at the runtime layer and the first mover at the orchestration layer. Browserbase ran the same play in browsers and built a real business on it.^[5]

The operational memory compounds. Every agent run leaves a trace inside the container — what worked, what failed, which UI surfaces broke, which recovery strategies converged. That dataset is the natural input for RL training, regression evals, and reliability improvements. Cua-Bench is already in the repo for exactly this loop.^[1]

The container is the API contract that survives model churn. Frontier models cycle every six months. The OS surface doesn't. A team that builds against Cua's computer-use interface today will run the same code against whichever model is best in 18 months. Abstraction over substrate is the durable position.

Market

The runtime layer is structurally larger than the framework layer.

Near-term ICP is every team shipping a computer-use agent: foundation labs running evals (HUD, on the customer list), legacy automation startups (Fira), academic research, YC AI cohort companies, and the 50k+ engineers already building on Cua.^[2] The buyer is a technical founder writing a research preview, or a Series-A team scaling fleet ops — both want OSS by default and managed when production demands it.

Longer-term, the category is agent infrastructure as a line item. RPA is a ~$22B (2024) market growing toward $90B+ by 2030, driven by enterprise digitization.^[6] The CUA-class agent stack is the AI-native successor — the segment where workflows RPA can't reach (legacy ERP, design tools, CAD, native apps with no API) finally become automatable. Browserbase has proven a real business sits under browser agents; the OS-level fleet is structurally a larger surface.
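
For scale, the growth rate implied by the RPA figures above can be worked out directly. This is a sketch under one stated assumption: "$90B+" is treated as exactly $90B, so the result is a floor. The function name is illustrative.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate that takes start_value to end_value."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Memo figures: ~$22B in 2024, $90B+ by 2030 (assumed here to be exactly $90B).
rpa_cagr = implied_cagr(22.0, 90.0, 2030 - 2024)
print(f"implied RPA CAGR: {rpa_cagr:.1%}")
```

On those inputs the implied rate is a little over 26% a year, compounding for six years.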

Competitive landscape

Four neighbors. None of them ship the OSS desktop container.

The frontier labs are upstream. Browser infra is adjacent. Dev sandboxes are a different shape. Agent frameworks are downstream consumers. Cua's position — OSS desktop runtime, multi-OS, multi-LLM — has no direct equivalent.

Anthropic · OpenAI · Google — computer-use models

Model layer — upstream

All three frontier labs now ship computer-use models — the brains of the agent — and Opus 4.6 is past the OSWorld human baseline.^[16] None ships the runtime: Claude customers still bring the desktop, Operator was folded into ChatGPT agent in 2025, and Gemini 2.5 Computer Use is browser-first.^[20] ^[19] Cua sits underneath all three, not against them. The labs define the demand curve; Cua absorbs the infrastructure that comes with it.^[7]

The model layer is shipping the brains. Someone has to ship the body. Cua is the open-source default for the runtime — and the OSS default usually wins infrastructure.

— Orange Collective

Founder deep dive

The person who wrote the benchmark is now writing the runtime.

Why Francesco built it. Five years at Microsoft across Xbox and Microsoft AI, including time inside the team that defined what "OS-level agent" even meant. He co-created Windows Agent Arena, the benchmark frontier labs use to evaluate computer-use agents.^[3] The pattern is rare and powerful: build the eval, watch every team in the field struggle with the same infrastructure problem, then leave to build the production primitive that fixes it. The founder defined the goalposts before he started building the field.

Why this is the right shape for the founder. Cua is, at its core, a virtualization and runtime company with an agent skin on top. The Apple Virtualization Framework work is hard — 97% native CPU on Apple Silicon is a real engineering achievement, not a flag. The OS-agent surface is even harder; the people qualified to ship both halves are measurably scarce. Francesco's background sits exactly at that intersection.^[1] ^[3]

Why velocity is a feature. 17.9k GitHub stars from a launch that went viral on HN,^[25] and the repo's second year shipped more than its first: Cua Driver (background computer-use on macOS and Windows), CuaBot, Windows and Android sandboxes in both cloud and local QEMU, BYOI images, and the Cua-Bench suite, all from a small team.^[1] ^[26] Shipping pace is the leading indicator we have watched in every category-winning OSS infra company: Vercel, Supabase, Browserbase. Cua's repo cadence is in that band.^[1]

The long arc. Cua becomes the operating system for AI agents on the desktop. Every agent run goes through one of its containers; every fleet of agents is orchestrated through its cloud; every benchmark in the category cites the runtime. The OSS core wins distribution; the managed cloud captures the workload; the long-term moat is the operational memory of how thousands of agents actually behave inside a sandbox. The container is the wedge. The fleet is the business. The data is the moat.

Founder & team

Francesco Bonacci

Founder & CEO

Five years at Microsoft across Xbox and Microsoft AI before founding Cua. Co-creator of Windows Agent Arena — the benchmark frontier labs use to evaluate OS-level computer-use agents. Designed the eval, then built the runtime.

Risks & mitigations

Risk

Anthropic, OpenAI, or a hyperscaler bundles a managed desktop alongside their computer-use API and absorbs the runtime layer.

Mitigation

Anthropic, OpenAI, and Google all shipped computer-use models without a sandbox attached; every one of them assumes the customer brings the desktop.^[7] ^[20] ^[19] Their structural incentive is the opposite of vertically integrating: every closed runtime they ship locks their model to one substrate and rules out the OSS, multi-LLM workloads that actually drive volume. Cua's MIT license, multi-provider adapter (Anthropic + OpenAI + Ollama + LM Studio + OpenRouter), and Linux/macOS/Windows/Android fleet are the surface a single vendor cannot match. The frontier labs define the demand curve; Cua absorbs the infrastructure that comes with it.

Risk

Computer-use adoption is slower than the hype — agents stay browser-centric, the OS-level surface stays a research toy.

Mitigation

The capability question closed in 2025–26. OSWorld SOTA went from 14.9% (Oct 2024) to 72.7% (Feb 2026), past the 72.36% human baseline, and all three frontier labs now ship computer-use models.^[7] ^[16] ^[17] ^[19] Anthropic explicitly bills Opus 4.6 as its best computer-using model.^[16] Cua already has 17.9k stars, 50k+ engineers, and named users including HUD (Series A) for agent evaluation. The category is shipping, not speculating.

Risk

Open-source monetization — every infra OSS founder hits the moment where the cloud cannibalizes the wedge.

Mitigation

Cua is following the Browserbase / Vercel / Supabase playbook: MIT core wins distribution, managed cloud wins the renewal. The managed surface (hot-start under one second, fleet orchestration across four OSes, Windows containers exclusive to the cloud, observability and replay) is structurally a different product from the self-hosted runtime. Every paying user in the first cohort converted to managed. The OSS is the funnel; the cloud is where the workload sits in production.

Risk

VM-based infrastructure is expensive — agent workloads at scale eat margin and a lighter shim catches up.

Mitigation

Ephemeral VMs with snapshot/restore and sub-second hot-start are an order of magnitude cheaper than full boots — Cua's runtime is already engineered around this. Apple Virtualization Framework on Apple Silicon achieves 97% native CPU; the marginal cost of an agent action is closer to a function call than a server. The lighter-shim alternative (browser-only, terminal-only) is a strictly smaller product surface and structurally cannot reach the workloads — CAD, design tools, legacy ERP, native apps — that drive the enterprise budget.

Risk

E2B or Daytona, freshly capitalized code-sandbox vendors, bolt a display server onto their Linux sandboxes and attack the desktop runtime from below.^[23] ^[24]

Mitigation

A display server on a Linux container yields a Linux desktop, not macOS, Windows, or Android, which is where the workloads that browser and code sandboxes can't reach actually live. Cua's depth is the multi-OS virtualization stack (Lume on Apple Silicon, Windows and Android in cloud and local QEMU, BYOI images) plus the computer-use interface and the Cua-Bench eval loop: the half of the product that took a Microsoft-AI benchmark career to know how to build.^[1] ^[3] ^[26] And Cua Driver extends in the opposite direction, to background computer-use on the user's own machine, a surface that requires a native driver per OS, which neither vendor ships. The likelier outcome is side-by-side deployment: code sandbox for the shell, Cua for the screen.

What we're watching

Cloud fleet utilization crossing the line where managed revenue exceeds OSS-led conversion — the moment the wedge becomes a business.
First Fortune 500 production workload — the signal that OS-level agents have crossed from research evals into enterprise procurement. E2B's 88%-of-Fortune-100 number shows the buyers are already in the building next door.
Anthropic, OpenAI, or Google partnering with Cua as a reference runtime — explicit or implicit endorsement of the OSS substrate now that all three ship computer-use models.
Cua Driver pull-through from coding agents — every Claude Code, Cursor, and Codex install is a distribution channel for background computer-use on the user's own machine.
Cua-Bench registry traction as labs buy RL environments for computer-use post-training — the eval loop becoming a second revenue line, on the founder's home turf.

References