Halluminate — Investment Memo · Orange Collective

Thesis

Agent quality now comes from training environments, not bigger base models.^[12] ^[14] Halluminate builds simulated worlds for knowledge-work AI — licensed reconstructions of real professional software, with expert-built tasks and verifiers on top. Labs buy the graded tasks; the simulator carries into every next contract.^[29] This market consolidates to a handful of winners with one shape — infrastructure over labor, depth over breadth, embedded with the labs — and Halluminate is that shape, pointed at finance.^[33] Frontier financial superintelligence will be trained inside someone's simulation of the financial world. Halluminate is building that world.^[30] ^[42]

01
Capability gains keep coming from post-training, not bigger pre-training runs. Returns from scaling parameters and tokens are flattening. The recent agentic capability jumps — o-series reasoning, ChatGPT Agent, Claude Computer Use — were unlocked by reinforcement learning against purpose-built environments, not by another order of magnitude on the base model.^[3] ^[4] The receipts are on the benchmark: OSWorld success went from 12% to 84% in two years, crossing the human baseline in 2026 — almost entirely on RL post-training.^[23] ^[28]
02
Every lab is short on environments. Every lab is building computer-use agents; every lab is short on environments to train and grade them on. Live web training rate-limits, breaks when sites change, and resets state in ways that destroy training signal.^[1] Halluminate ships the missing piece as its product, not as a side project some lab team will get to next quarter.
03
Contracts fund the catalog; the catalog is the company. Labs buy tasks and verifiers; the simulators underneath amortize across every subsequent contract. Every agent run produces a labeled trajectory; every environment seeds the next; every checker doubles as a reward signal. The defensible asset is the catalog — reusable modules, verifiable tasks, and the trajectory corpus nobody can buy off the shelf.
04
The customer is every frontier lab and every serious agent company. Hyperscalers spent $344B on AI in 2025 alone.^[5] Anthropic's leadership has discussed putting more than $1B into RL environments over a single year.^[16] The same dollars that funded pre-training are now flowing into post-training stacks — and the post-training stack is inseparable from environment infrastructure. This is the most capital-rich, urgency-pressed customer set in software.

Problem

Computer-use agents work in the demo. They break in production. The gap is environment quality.

Browser and computer-use AI is the most-watched capability in the model lab roadmap — and the most fragile in deployment. When OSWorld launched in 2024, the best model completed 12.24% of routine desktop tasks against a human baseline of 72.36%.^[23] WebArena showed the same gap on multi-app web workflows.^[1] a16z's framing is identical: "current agent offerings more closely resemble advanced RPA tools than true autonomous systems."^[13] The gap has since closed at exactly the rate labs bought environments — which is the point.

The reason isn't model capacity. The reason is that labs train agents on whatever data they can scrape — public web recordings, OSS web benchmarks, synthetic prompts — and then fine-tune them on a few thousand hand-curated trajectories. None of that mirrors the production stack the agent will actually run against. The Salesforce instance, the Slack workspace, the ServiceNow ticket queue, the QuickBooks reconciliation flow — these are the surfaces the customer cares about, and they look nothing like the public web.

Live-web training makes it worse. Real websites can't be reset. They rate-limit aggressively, ban automation, and punish exploration. They change layout overnight. They cost real money when an agent buys the wrong flight. Every lab knows this; every lab has tried to build a stand-in internally; every lab has discovered that environment authoring is a product problem masquerading as a research problem.

$344B · AI capex (2025) · Hyperscaler spend now spilling into training, eval, and env infra

$14.3B · Meta → Scale AI · Single deal: labs paying premium prices for training data

$15B · Applied Intuition · AV simulation infra: proves the env-infra business model

LA Times AI capex 2025^[5] · NYT Meta-Scale deal^[7] · Reuters Applied Intuition valuation^[11]

Why Now

The post-training era is here. Environment quality is the next constraint.

Three trends collided in the same eighteen months: agents shipped to production, pre-training gains flatlined, and every lab woke up needing the same thing — verifiable environments to train and grade computer-use models on.

Current agents still exhibit significant limitations in capability — struggling with complex or unfamiliar interfaces — and efficiency, operating too slowly and expensively to compete effectively with human operators.

a16z^[13]

Computer-Use & Agentic Coworkers

The RL environment platforms are becoming foundational infrastructure for anyone looking to train generalist AI workers.

Felicis^[14]

Rocket Fuel for AI

The teams that win won't look like traditional tooling vendors; they'll look like thought partners embedded with frontier labs, compounding trust and research depth over time.

Wing VC^[33]

Who Will Win the RL Environment Market

Three preconditions converged in the same eighteen months.

Computer-use agents are now first-party products. OpenAI shipped ChatGPT Agent. Anthropic shipped Claude Computer Use. Google launched Project Mariner. Browserbase shipped Director.^[3] ^[4] ^[13] ^[15] The category went from research demo to flagship product in twelve months. Every one of those launches has the same next-step problem: making the agent reliable on the long tail of enterprise apps.

Post-training is where capability now comes from. The o-series, Claude 3.5 Computer Use, and ChatGPT Agent were all RL post-training stories. Agents can't learn knowledge work through real-world trial and error — they need custom environments that faithfully simulate reality and reward success.^[12] The output of post-training is no better than the environment it was trained against. The benchmark record makes the causality legible: OSWorld went from 14.9% (Claude 3.5 Sonnet, October 2024) to 38.1% (OpenAI's CUA, January 2025) to 61.4% (Sonnet 4.5, September 2025) to 84% (Opus 4.8, May 2026) — each jump an RL-against-environments release, none of them a bigger base model.^[24] ^[25] ^[26] ^[28]

The money is following the bottleneck. Hyperscaler AI capex hit $344B in 2025.^[5] ^[6] Meta paid $14.3B for Scale.^[7] Surge held funding talks at a $25B+ valuation.^[9] ^[20] Mercor quintupled to $10B in eight months on a $450M run rate.^[19] And the spend has gone explicitly environmental: The Information reported Anthropic leadership discussing more than $1B on RL environments over the next year, with typical lab contracts running six to seven figures per quarter.^[16] ^[17] Felicis calls RL environment platforms "foundational infrastructure for anyone looking to train generalist AI workers."^[14]

Computer-use agents crossed the human baseline in two years

Chart

OSWorld task success rate by flagship model release. The 2024 paper's best model scored 12.24% against a 72.36% human baseline; Claude Opus 4.8 reached 84% on OSWorld-Verified in May 2026 (Anthropic updated its evaluation harness for 2026 scores). Every step on this curve was an RL post-training release — trained against purpose-built environments.^[23] ^[24] ^[25] ^[26] ^[27] ^[28]

Source · OSWorld benchmark · Anthropic & OpenAI model announcements (2024–2026)

How It Works

Two products. One loop. Environments and the data they produce.

The loop is the product.

Design task → simulate → instrument → verify → train → review. Halluminate runs the same loop the best in-house lab teams run, packaged as infrastructure. Customers pick a workflow; Westworld, the simulation product, stages a sandbox for it; checkers score every episode; and reviewers from Athena, the human-evaluation offering, triage the failures into reward models. The loop already shows up in customer numbers: one customer reported a ~20% improvement in date-picking performance after training against Halluminate's flight-booking simulator.^[29]
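The simulate → instrument → verify portion of the loop can be illustrated with a toy example. This is a minimal sketch, not Halluminate's actual API: `FlightBookingSim`, `check_booking`, `run_episode`, and both policies are hypothetical names invented for illustration, and the date-picking task is reduced to a single action.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlightBookingSim:
    """Hypothetical toy simulator: the agent must pick a target departure date."""
    target_date: str = "2026-03-14"
    selected_date: Optional[str] = None
    log: list = field(default_factory=list)  # instrumentation: every action is recorded

    def reset(self) -> dict:
        # Unlike a live website, the sandbox returns to a known state on demand.
        self.selected_date = None
        self.log = []
        return {"task": f"Book a flight departing {self.target_date}"}

    def step(self, action: dict) -> dict:
        self.log.append(action)
        if action.get("type") == "pick_date":
            self.selected_date = action.get("value")
        return {"selected_date": self.selected_date}


def check_booking(sim: FlightBookingSim) -> float:
    """Verifier: scores the end state, so the same check can serve as a reward."""
    return 1.0 if sim.selected_date == sim.target_date else 0.0


def run_episode(sim: FlightBookingSim, policy) -> dict:
    obs = sim.reset()
    for action in policy(obs):
        sim.step(action)
    return {"trajectory": list(sim.log), "reward": check_booking(sim)}


def good_policy(obs: dict) -> list:
    # Stand-in for an agent: reads the date out of the task string.
    return [{"type": "pick_date", "value": obs["task"].split()[-1]}]


def bad_policy(obs: dict) -> list:
    # Stand-in for an agent that picks the wrong day.
    return [{"type": "pick_date", "value": "2026-03-15"}]


sim = FlightBookingSim()
good = run_episode(sim, good_policy)
bad = run_episode(sim, bad_policy)
```

Each episode yields a recorded trajectory plus a checker score, which is the pairing the memo describes as the training input. A production environment would replace the single-action task with a full application replica and a much richer verifier.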

Delivery is verification-gated. A lab first runs a smaller model against the environment, and a second model grades the runs and flags errors (for example, whether the agent invoked the right tool for the task). A larger model then stress-tests the environment again. Failures route back to Halluminate to debug before the contract is accepted and paid. Increasingly the training target is the decision layer (the reasoning that picks the right tool, not the execution of the tool itself), which makes the verifiers, not the simulator, the scarce artifact. An environment that survives acceptance is lab-grade by construction.
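The staged acceptance described above might look roughly like the following. This is a sketch under stated assumptions: the `Run` record, the `grade` stand-in for the grading model, and the zero-flag acceptance rule are all hypothetical, since the memo does not specify the actual criteria a lab applies.

```python
from dataclasses import dataclass


@dataclass
class Run:
    task_id: str
    expected_tool: str  # tool the task designer intended
    invoked_tool: str   # tool the agent actually called


def grade(run: Run) -> bool:
    """Stand-in for the grading model: did the agent invoke the right tool?"""
    return run.invoked_tool == run.expected_tool


def acceptance_gate(stages: dict[str, list[Run]]) -> dict:
    """Check stages in order and stop at the first one with flagged runs.

    Assumption: any flagged run blocks acceptance and is routed back for debugging.
    """
    for stage_name, runs in stages.items():
        flagged = [r.task_id for r in runs if not grade(r)]
        if flagged:
            return {"accepted": False, "stage": stage_name, "route_back": flagged}
    return {"accepted": True, "stage": None, "route_back": []}


stages = {
    "small_model": [Run("t1", "spreadsheet", "spreadsheet")],
    "large_model": [
        Run("t1", "spreadsheet", "spreadsheet"),
        Run("t2", "search", "email"),
    ],
}
result = acceptance_gate(stages)
```

Here the small-model pass is clean, the larger model surfaces a wrong-tool run on task `t2`, and the gate reports that task for debugging instead of accepting the environment.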

Reusable modules bend the curve. Authentication, billing, search, form fills, ticket queues, notification banners: the same primitives show up in every enterprise UI. Each new environment shares more of its scaffolding with the last, so the catalog grows faster than the engineering hours invested in it, the same way Applied Intuition's scenario library did in AV.^[11]

Interoperable with the agent stack labs already use. Westworld plugs into popular agent frameworks, browser automation infra (including Browserbase),^[15] and the standard training pipelines labs run. No new framework to adopt — Halluminate becomes the env layer beneath whatever the customer already runs.

Environments Are the New Datasets

Static data taught models to predict. Environments teach agents to act.

The shift from "sweatshop data" to simulation-as-data is the through-line of the last twelve months of frontier research — SemiAnalysis calls the winners "data foundries."^[31] Every lab is saying the same thing. The companies that build the environment layer become the data layer of the next generation.

Static datasets carried pre-training. Post-training needs environments.

Why static data ran out of room. Pre-training was about scraping the world. Post-training is about practicing in it. A static label set can teach a model what a "good" outcome looks like once. An interactive environment teaches it how to recover when something unexpected happens — and recovery is most of what agent reliability turns out to be.

The pattern the labs already ran. The earliest lab investments in environment infrastructure — Procgen, DeepMind's StarCraft sandbox, the OpenAI Gym lineage — were research bets that environments are the scarcest resource in RL.^[8] The same pattern is repeating one level up the stack: instead of toy gym tasks, the scarce environments are real enterprise workflows.

Why environments and the data they produce are sold together. Every agent episode against a Halluminate environment produces a trajectory, a checker score, an annotator review, and a reward signal — exactly the training input the labs need for the next model release. Customers buy environments today and end up paying for the resulting trajectory corpus on every retrain.

Market

The buyer set is small. The budget is enormous.

Frontier model labs. Three to five labs control the high end of post-training spend, and the line item is now public: Anthropic leadership has discussed over $1B on RL environments in a year, and typical environment contracts run six to seven figures per quarter.^[16] ^[17] Each lab is staffed with a small team trying to produce computer-use environments fast enough to keep up with the agent roadmap. Halluminate is already in active pilot with one of them, targeted to convert this quarter. Sustained AI capex creates adjacent demand for training and eval infra to help labs realize their model investments.^[5] ^[6]

Serious agent companies. Browser Use, Yutori, Manus, Browserbase Director, and the next wave of agent products all need the same thing the labs need.^[13] ^[15] Their differentiator is reliability in the customer's actual stack — which means training and grading against environments that mirror that stack. Halluminate sells the same product on the same loop.

Enterprise. The medium-term buyer is the enterprise platform team deploying internal agents. Vertical functions — marketing, finance, sales, HR — all require company-specific tuning against company-specific surfaces.^[13] The same Westworld + Athena loop powers internal agent evaluation before the agent ever touches production. Halluminate has since planted its flag on the highest-value vertical first: the company now leads with RL environments for financial services — Excel modeling, investment banking, private equity, and consulting workflows — where task value per episode is highest and domain expertise is the barrier.^[30] The demand side has gone vertical too: Anthropic now ships finance-agent templates with Excel and Moody's integrations — financial institutions are roughly 40% of its top-50 customers — and Rogo's $160M Series D at $2B (April 2026, a 2.7× step-up in under four months) priced what agentic finance is worth.^[34] ^[35] Whoever trains those models needs finance-grade worlds to train them in.

Near term — labs and agent companies

Three to five frontier labs plus the top tier of agent startups. Concentrated, technical, urgency-driven. Buying today, paying premium prices, willing to sole-source on speed of catalog growth. Revenue has crossed from pilot to delivery: paid environment contracts with frontier labs are in production, revenue is up 10× in the last nine months, and roles are publicly chartered to support an eight-figure ramp this year.^[42] Fleet showed what the demand curve looks like when a catalog crosses the lab procurement bar: ~$1M to $60M+ annualized revenue inside a year.^[18]

The training-data and environments layer keeps repricing upward

Chart

Reported valuations across the data/environments stack. Scale priced at ~$29B in Meta's June 2025 deal; Surge held talks at ~$25B; Applied Intuition — the environments business model proven in AV — sits at $15B; Mercor quintupled to $10B in October 2025; Fleet, the first env-native startup on the curve, reached $750M in June 2026 on a months-old revenue base.^[7] ^[20] ^[11] ^[19] ^[18]

Source · NYT · Bloomberg · Reuters · TechCrunch · The Information (2025–2026)

Competitive landscape

Twenty entrants, three to five winners. The fight is over who is infrastructure and who is labor.

When we invested, Halluminate was nearly alone in selling environments as a product. By mid-2026 roughly twenty funded companies sell into the category, and Wing projects consolidation to three to five winners by 2030.^[16] ^[33] The winner's test is simple: reusable infrastructure over labor, depth over breadth, embedded with the labs. Score the field against it.

Deeptune

Env-native · knowledge work

The most direct new entrant: a16z led a $43M Series A in March 2026 for high-fidelity "training gyms" simulating professional workflows — accountants, support, DevOps — across tools like Slack and Salesforce, sold to frontier labs.^[38] Claims hundreds of gyms already built. Same product shape as Westworld, brushing the financial vertical from the accounting side — the sharpest test of whether licensed-software fidelity and expert-built verifiers hold a premium over well-funded breadth.

Expert networks → environments

Handshake · Turing · micro1 · Invisible

The expert-data marketplaces are converting labor scale into environment offerings: Handshake AI (~$1B gross annualized by April 2026) bought Cleanlab explicitly to add "evaluations, AI safety, RL environments";^[39] Turing ($2.2B, ~$300M ARR) sells containerized digital-twin environments with verifiers;^[40] micro1 ($500M, $100M+ ARR) is scaling specialized RL environments;^[41] Invisible raised $100M at $2B+ to build RL gyms; Labelbox's Alignerr is hiring for sandboxed environments. Labor scales their revenue — environments are where everyone believes the margin lives.

Scale and Surge sell workforce. Fleet and Deeptune sell breadth. Mechanize sells depth on code. Wing's test for the survivors — infrastructure over labor, depth over breadth, embedded with the labs — is the test Halluminate was built to pass, pointed at the highest-value workflows in the economy.

— Orange Collective

Founder deep dive

A product-research operator and a startup data engineer building the part of the lab stack no lab has bandwidth for.

Why Jerry built it. Jerry led product and research at Capital One Labs, where he launched one of the first AI agents in production financial services and co-authored three patents on the underlying systems. Capital One was an early canary: the agent demos worked; the deployments hit a wall the moment they touched real enterprise apps. He left convinced the bottleneck wasn't the model — it was the absence of a realistic, resettable, verifiable training and eval surface for the world the agent had to operate in.

Why Wyatt built it. Wyatt is a two-time early-stage data and software engineer who has spent years shipping the unglamorous backend that makes ML systems work in production — ingestion, eval harnesses, instrumentation, telemetry. The same instinct applied to the agent stack produced Westworld. Westworld is what you build when you've spent enough cycles re-stitching together flaky training data pipelines and decided to make the data infrastructure the product.

Why this team is the right team. Jerry has lived the enterprise agent deployment problem from inside a Fortune 100. Wyatt has built the data infrastructure that production ML actually depends on. Together they cover the customer's pain (Jerry) and the product's depth (Wyatt). The two-person founding shape is exactly right for an infrastructure company that has to ship catalog content fast and sell to a small set of technical buyers.

Why velocity matters here specifically. Catalog growth rate is the metric the labs grade them on. Every additional environment is leveraged — modules generalize across tasks, checkers transfer, trajectories add to the reward-modeling corpus. The bet is that two founders moving fast on the highest-leverage env category will out-ship any lab team running the same work as one of fifty internal priorities.

The long arc. Halluminate becomes "Applied Intuition for knowledge work" — the simulation infrastructure every serious computer-use agent is trained and graded against, and the data corpus that powers every reward model for the category. The same shape of business that earned Applied Intuition a $15B valuation in AV, applied to a category with 10× the surface area.^[11]

Founder & team

Jerry Wu

Co-Founder & CEO

Led product and research at Capital One Labs, where he launched one of the first AI agents in financial services and co-authored three patents. Studied Computer Science and Economics at Cornell, where he researched model quantization methods and served as VP of the Cornell Consulting Group. Class speaker at Acton-Boxborough High School.

Wyatt Marshall

Co-Founder

Two-time early-stage startup data and software engineer. Spent years shipping large-scale data infrastructure at venture-backed startups, then turned the same instinct on the agent stack, building the environment, eval, and benchmark layer every computer-use AI now needs to train against. https://halluminate.ai/

Risks & mitigations

Risk

The services hamster wheel — the structural failure mode for every RL-environments and data company: each contract is bespoke delivery, staffed by senior people, accepted one lab at a time. Revenue scales with delivery leads instead of product, and the company wakes up as a consultancy carrying an infrastructure valuation.

Mitigation

The contract structure already separates the consumable from the asset. Labs buy tasks and verifiers; the simulator underneath carries into the next contract — each engagement leaves residue (rebuilt software, a verifier library, licensed datasets, expert workflow maps) that makes the next one cheaper to deliver. The financial-services focus is the accelerant: concentrating contracts in one domain maximizes catalog overlap, so reuse compounds instead of diluting across verticals. The numbers that prove the wheel is breaking: simulator reuse rate per contract, gross margin by contract cohort, and revenue per FTE — all should rise as the catalog deepens. Surge and Mercor show labor-heavy models can clear $1B in revenue; the infrastructure multiple goes to whoever owns the worlds.

Risk

Frontier labs internalize environment construction — OpenAI, Anthropic, and DeepMind have all built sandbox training environments in-house and have effectively unlimited GPU budgets.

Mitigation

The revealed preference says otherwise: Anthropic works with Mechanize on coding environments and its leadership has discussed spending over $1B on externally sourced RL environments in a single year. Labs have built environments for the capabilities they care about most — and have been bottlenecked on everything else. OpenAI's Procgen and DeepMind's StarCraft sandbox are narrow research artifacts, not the broad enterprise-workflow catalog computer-use agents now demand. Environment construction at catalog scale is a product-shaped problem, not a research-shaped one — and product is where labs prefer partners.

Risk

Environment authoring cost scales linearly — each new app sandbox takes weeks of engineering and the catalog has to cover thousands of surfaces to be a moat.

Mitigation

Halluminate is building environment authoring as a product, not a bespoke service. Reusable simulator modules (auth, billing, search, form fills), coding agents accelerating buildout, and an open framework for community-contributed envs all bend the curve from linear to compounding. The unit economics support it: labs pay roughly $20k for a website replica and up to $300k for a high-fidelity clone of a complex app — margins that fund catalog growth. Applied Intuition proved the same playbook in AV simulation, reaching a $15B valuation on a catalog of reusable scenario modules.

Risk

Customer concentration in frontier labs — three or four buyers control the high-value training contracts and have all the leverage in pricing.

Mitigation

Athena (the human-eval/data offering) is already pulling enterprise demand outside the lab cohort — top browser-agent startups and computer-use product teams need exactly the same env + eval loop. The bundle scales horizontally as every serious agent company hits the same wall: their agents work in the demo and break in production. Halluminate sells the bridge.

Risk

Env-native competition is no longer hypothetical — Fleet went from ~$1M to $60M+ annualized revenue in months and is raising at $750M, Mechanize is embedded with Anthropic on coding environments, and Prime Intellect is open-sourcing the long tail through its Environments Hub.

Mitigation

The competitors validate the category and segment it. Mechanize concentrates on a small number of deep coding environments; Fleet builds horizontal app replicas; Prime Intellect commoditizes research-grade environments. Halluminate's wedge is the high-stakes, hard-to-fake end: full-fidelity financial-services workflows (investment banking, private equity, consulting) bundled with human expert evaluation — surfaces where a vibe-coded 80% replica destroys training signal and where domain expertise is the gate, not engineering hours. A market where Anthropic alone discusses $1B+ of annual spend supports multiple winners; the risk is being undifferentiated, not crowded.

What we're watching

Revenue per FTE, simulator reuse rate, and gross margin by contract cohort — the three numbers that prove the catalog is compounding and the company is stepping off the services hamster wheel.
A second buyer cohort beyond the frontier labs — AI-native services firms training small models on their own workflows would expand demand for environments and data while de-risking lab concentration.
Lab relationships maturing from first paid contracts into standing, multi-quarter engagements across more than one frontier lab — the de-risking signal that matters most.
The financial-services wedge — Halluminate now leads with RL environments for investment banking, private equity, and consulting workflows. Whether that vertical focus produces the seven-figure-per-quarter lab contracts the category's leaders command is the next proof point.
Catalog velocity — environments shipped per month and the rate at which reusable modules drive cross-customer leverage.
Athena attach rate — what percentage of Westworld customers also buy human-in-the-loop evals, and what that implies for ACV expansion.
Competitive separation — Fleet's revenue ramp ($1M to $60M+ annualized inside a year) set the pace for the category; Halluminate needs a comparable inflection in its vertical to hold a premium position.
Open-environment ecosystem — whether community-contributed sandboxes (Prime Intellect's Environments Hub, Halluminate's own open Westworld framework) start meaningfully extending the catalog, and how that shapes the moat.

References