
Cua
The open-source Docker container for computer-use AI agents — a cloud desktop for every agent, in one command.
Thesis
- 01
Computer-use is the next API surface. Anthropic shipped computer use with Claude 3.5 Sonnet in October 2024 — 14.9% on OSWorld, nearly 2× the next-best model.[7] OpenAI followed with Operator in January 2025.[4] GUIs are the universal interface: the agent that controls a desktop can do any work a human can. Browser-only is a subset; the full stack is the OS.
- 02
Docker is the right metaphor — and the right wedge. A single command spins up a sandboxed desktop. Containerization unlocks fan-out, reproducibility, and security for agent workloads. Browserbase did this for browsers and built a real business on top; Cua extends the same primitive to macOS, Linux, Windows, and Android.[5] [2]
- 03
OSS wins this category by default. Builders need to inspect, fork, and self-host computer-use infrastructure — the security surface is too sensitive to outsource to a closed binary. MIT-licensed, 17.8k GitHub stars and growing, native integrations with Claude Code, Cursor, Codex, plus OpenAI, Anthropic, Ollama, and OpenRouter as model providers.[1] The OSS core is the distribution; the managed cloud is the renewal.
- 04
The founder is the infra builder. Francesco co-created Windows Agent Arena at Microsoft AI — the canonical benchmark for OS-level computer-use agents — before starting Cua.[3] The person who designed the eval is now building the runtime everyone else has to pass it on. That sequence — research the problem, define the benchmark, ship the production primitive — is structurally hard to imitate.
Problem
Every computer-use agent has the same first problem: where does it run?
Letting the agent loose on the user's host is a security non-starter: the LLM hallucinates an rm -rf, exfiltrates a credential, or installs a payload. Spinning up a desktop VM is the obvious answer, but every team building one ends up rebuilding the same six layers: Apple Virtualization Framework wiring, screen capture, input simulation, container orchestration, isolation policy, and a hot-start runtime that boots fast enough for a chat-speed agent loop.[1]
Browser agents got a head start because the browser is already a sandbox. Browser Use commoditized the agent framework on top (50k+ stars).[13] Browserbase took multiple rounds of funding to build managed browser infrastructure underneath.[5] But the browser is a subset — Anthropic and OpenAI both shipped computer-use APIs at the OS level precisely because the bulk of real work happens outside the tab.[7] [4]
There is no Browserbase for the desktop. Every team trying to ship a CUA-class product pays the infrastructure tax in-house, and the tax is high enough that some give up and ship browser-only. Cua is the missing primitive: one command starts a Linux, macOS, Windows, or Android container from an MIT-licensed runtime, exposed through a computer-use interface any LLM can drive and running at 97% of native CPU speed on Apple Silicon.[1] [2]
14.9% · Claude 3.5 Sonnet on OSWorld · Nearly 2× the next-best model, Oct 2024 launch[7]
38.1% · OpenAI Operator on OS-level tasks · Jan 2025 launch, production AI on the desktop[4]
$22B → $90B+ · RPA market 2024 → 2030 · The legacy automation budget computer-use replaces[6]
The model layer is past proof-of-concept. The runtime layer is the open question — and the one Cua exists to answer.
Why Now
The agent stack split into model, framework, and infrastructure layers — inside a single year.
Anthropic and OpenAI validated the demand curve at the model layer. Browser Use and Browserbase proved the framework/infrastructure split works. Cua is the OS-level counterpart to Browserbase: the runtime under everything else.
Developers can direct Claude to use computers the way people do — by looking at a screen, moving a cursor, clicking buttons, and typing text.
Anthropic[7]
Claude 3.5 Sonnet launch · Oct 2024
Operator can fill forms, place online orders, schedule appointments, and complete other repetitive tasks — a first step toward AI that acts on your behalf across the digital world.
OpenAI[4]
Operator launch · Jan 2025
The agent stack is splitting into model, framework, and infrastructure layers — the value will compound at the layer that handles the runtime, not the one that handles the prompt.
Browser agent market map[11]
Theta Labs · 2025
Three preconditions converged inside a single year.
Frontier labs validated the API surface. Anthropic shipped Claude 3.5 Sonnet computer use in October 2024, scoring 14.9% on OSWorld (nearly 2× the next-best system), and partnered with Replit, Canva, Asana, DoorDash, Cognition, and The Browser Company on launch day.[7] OpenAI followed three months later with Operator at 38.1% on OS-level tasks.[4] The category went from research demo to production primitive in twelve months.
Vision-language models hit the latency floor. Sub-second screen reasoning is now table stakes — Claude Haiku, GPT-4.1 mini, and a wave of open multimodal models cleared the bar. The bottleneck moved from model to runtime. Container boot time, snapshot/restore, and image distribution are now the rate-limiting steps in an agent loop, and they are exactly the surface Cua engineers against.[1]
Developer demand outran the infrastructure. 17.8k GitHub stars on Cua's OSS core, 50k+ engineers using the platform, a 600+ developer Discord, and parallel multi-agent products (Codex, Claude Code's parallel execution) that already spin up multiple desktops concurrently.[1] [2] The orchestration problem isn't a future bet — it's a present-tense one.
How It Works
Three layers. One container per agent. Boot to action in under a second.
Self-hosted is the funnel. The managed cloud is where the workload ships.
Hot-start under one second. The managed cloud snapshots a warm desktop image and restores it for every new agent session. The cost difference between a cold-boot VM and a hot-start image is roughly 60× — the difference between a runtime you can spin up per chat turn and one you can't.[2]
Cross-OS fleet orchestration. macOS, Linux, Windows, and Android containers from one control plane. Windows desktops in particular are exclusive to the cloud — Apple Silicon licensing makes self-hosting Windows impractical for most teams, which makes managed the only path for the workloads that actually require it (legacy ERP, .NET, native enterprise tooling).[2]
Observability, recording, and replay. Every agent action recorded as a video plus structured trace. The artifact stack is what turns an agent prototype into a production workload — eval harnesses, regression testing, incident debugging. The OSS gives you a container; the cloud gives you a system.[2]
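The "structured trace" idea above can be sketched in a few lines. This is a minimal illustration, not Cua's actual recording format: the `TraceEvent` schema, the `TraceRecorder` class, and the JSON Lines serialization are all assumptions made for the example.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class TraceEvent:
    """One agent action stored as structured data (hypothetical schema)."""
    step: int
    action: str               # e.g. "click", "type", "screenshot"
    params: dict[str, Any]
    ok: bool
    timestamp: float = field(default_factory=time.time)


class TraceRecorder:
    """Collects events in memory and serializes them as JSON Lines."""

    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, action: str, params: dict[str, Any], ok: bool = True) -> TraceEvent:
        # The step index is simply the position of the event in the run.
        event = TraceEvent(step=len(self.events), action=action, params=params, ok=ok)
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)

    @staticmethod
    def from_jsonl(text: str) -> list[TraceEvent]:
        # Rebuild events from a stored trace, e.g. for replay or regression tests.
        return [TraceEvent(**json.loads(line)) for line in text.splitlines() if line]

    def failures(self) -> list[TraceEvent]:
        return [e for e in self.events if not e.ok]


recorder = TraceRecorder()
recorder.record("click", {"x": 120, "y": 340})
recorder.record("type", {"text": "quarterly report"})
recorder.record("click", {"x": 900, "y": 40}, ok=False)  # e.g. target not found

replayed = TraceRecorder.from_jsonl(recorder.to_jsonl())
print([e.action for e in replayed])
print([e.step for e in recorder.failures()])
```

A trace stored this way can be diffed between runs or filtered for failed steps, which is the kind of input an eval harness or incident review would need.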
Docker for Computer-Use
The OS-level agent stack is at its pre-Docker moment.
The pattern is exact. Before Docker, every team rebuilt the same chroot plus init system plus image layer plus networking stack. After Docker, none of them did — and the value moved up to orchestration, registries, and managed clouds. Computer-use is at the same moment now.
The container is the wedge. The fleet is the business.
The runtime stops being a moat the moment it becomes a standard. Docker the company didn't capture the orchestration value — Kubernetes, registries, and the hyperscalers did. Cua's thesis is to be the OSS standard at the runtime layer and the first mover at the orchestration layer. Browserbase ran the same play in browsers and built a real business on it.[5]
The operational memory compounds. Every agent run leaves a trace inside the container — what worked, what failed, which UI surfaces broke, which recovery strategies converged. That dataset is the natural input for RL training, regression evals, and reliability improvements. Cua-Bench is already in the repo for exactly this loop.[1]
The container is the API contract that survives model churn. Frontier models cycle every six months. The OS surface doesn't. A team that builds against Cua's computer-use interface today will run the same code against whichever model is best in 18 months. Abstraction over substrate is the durable position.
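One way to picture the "API contract" argument is a narrow interface that the agent loop depends on, with the model behind a second interface that can be swapped. The sketch below is illustrative only: `Computer`, `Model`, `FakeComputer`, `ScriptedModel`, and `run_agent` are hypothetical names, not Cua's actual API.

```python
from typing import Protocol


class Computer(Protocol):
    """Hypothetical minimal desktop interface; not Cua's actual API."""
    def screenshot(self) -> bytes: ...
    def click(self, x: int, y: int) -> None: ...
    def type_text(self, text: str) -> None: ...


class Model(Protocol):
    """Any model that maps a goal and a screenshot to the next action."""
    def next_action(self, goal: str, screen: bytes) -> dict: ...


class FakeComputer:
    """In-memory stand-in for a sandboxed desktop, used only for this demo."""
    def __init__(self) -> None:
        self.log: list[tuple] = []

    def screenshot(self) -> bytes:
        return b"fake-png-bytes"

    def click(self, x: int, y: int) -> None:
        self.log.append(("click", x, y))

    def type_text(self, text: str) -> None:
        self.log.append(("type", text))


class ScriptedModel:
    """Stand-in for an LLM: replays a fixed list of actions, then stops."""
    def __init__(self, script: list[dict]) -> None:
        self._script = iter(script)

    def next_action(self, goal: str, screen: bytes) -> dict:
        return next(self._script, {"kind": "done"})


def run_agent(model: Model, computer: Computer, goal: str, max_steps: int = 10) -> int:
    """Drive the computer until the model reports done; return steps taken."""
    for step in range(max_steps):
        action = model.next_action(goal, computer.screenshot())
        if action["kind"] == "done":
            return step
        if action["kind"] == "click":
            computer.click(action["x"], action["y"])
        elif action["kind"] == "type":
            computer.type_text(action["text"])
    return max_steps


desktop = FakeComputer()
model_a = ScriptedModel([
    {"kind": "click", "x": 10, "y": 20},
    {"kind": "type", "text": "hello"},
])
steps = run_agent(model_a, desktop, goal="say hello")
print(steps, desktop.log)
```

Because `run_agent` only knows the two protocols, replacing `ScriptedModel` with a different model wrapper leaves the loop and the desktop side untouched, which is the durability claim in the paragraph above.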
How can AI agents interact with operating systems, desktop applications, and browsers without jeopardizing security or sacrificing performance?
Market
The runtime layer is structurally larger than the framework layer.
Near-term ICP is every team shipping a computer-use agent: foundation labs running evals (HUD, on the customer list), legacy automation startups (Fira), academic research, YC AI cohort companies, and the 50k+ engineers already building on Cua.[2] The buyer is a technical founder writing a research preview, or a Series-A team scaling fleet ops — both want OSS by default and managed when production demands it.
Longer-term, the category is agent infrastructure as a line item. RPA is a ~$22B (2024) market growing toward $90B+ by 2030, driven by enterprise digitization.[6] The CUA-class agent stack is the AI-native successor — the segment where workflows RPA can't reach (legacy ERP, design tools, CAD, native apps with no API) finally become automatable. Browserbase has proven a real business sits under browser agents; the OS-level fleet is structurally a larger surface.
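The RPA figures above imply a compound annual growth rate that the text does not state. A minimal sketch of the arithmetic follows, assuming $90B as the 2030 value; the source says "$90B+", so the result is a lower bound of roughly a quarter per year.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1 / years) - 1


start_value = 22.0   # USD billions, 2024 (figure cited in the text)
end_value = 90.0     # USD billions, 2030 (lower bound of "$90B+")
years = 2030 - 2024

cagr = implied_cagr(start_value, end_value, years)
print(f"Implied CAGR: {cagr:.1%}")
```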
Every team building a computer-use agent has to solve the same desktop problem. Cua should be the answer by default — and that's how the next generation of agent infrastructure gets written.
Competitive landscape
Four neighbors. None of them ship the OSS desktop container.
The frontier labs are upstream. Browser infra is adjacent. Dev sandboxes are a different shape. Agent frameworks are downstream consumers. Cua's position — OSS desktop runtime, multi-OS, multi-LLM — has no direct equivalent.
The model layer is shipping the brains. Someone has to ship the body. Cua is the open-source default for the runtime — and the OSS default usually wins infrastructure.
Founder deep dive
The person who wrote the benchmark is now writing the runtime.
Founder & team
Risks & mitigations
What we're watching
References
- [1] GitHub: trycua/cua (MIT, 17.8k stars, 1.1k forks)
- [2] Cua: Product homepage (50k+ engineers, <1s hot-start, multi-OS)
- [3] Windows Agent Arena: Evaluating multi-modal OS agents at scale (arXiv 2409.08264, Francesco Bonacci et al.)
- [4] OpenAI: Introducing Operator (Jan 23, 2025)
- [5] Browserbase: Cloud browsers for AI agents (~$40M raised, infra layer reference)
- [6] Grand View Research: RPA market $22B (2024) → $90B+ by 2030
- [7] Anthropic: Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku (Oct 22, 2024)
- [8] Y Combinator: Cua launch, Docker container for computer-use agents
- [9] Y Combinator: Cua company profile (P25, Diana Hu)
- [10] Model Context Protocol: Open standard for connecting AI assistants to tools (Anthropic, Nov 2024)
- [11] Theta Labs: Browser agent market map (X, 2025)
- [12] OSWorld: Benchmarking multimodal agents for open-ended tasks on real computer environments
- [13] Browser Use: Open-source agent framework for web (50k+ stars)
- [14] Cua Discord: 600+ developer community

