Orange Collective
Halluminate

The RL environment and data factory for computer-use AI.

Halluminate YC launch — Westworld + Athena walkthrough[2]

Thesis

Agent quality now comes from training environments, not bigger base models.[12] [14] Halluminate builds simulated worlds for knowledge-work AI — licensed reconstructions of real professional software, with expert-built tasks and verifiers on top. Labs buy the graded tasks; the simulator carries into every next contract.[29] This market consolidates to a handful of winners with one shape — infrastructure over labor, depth over breadth, embedded with the labs — and Halluminate is that shape, pointed at finance.[33] Frontier financial superintelligence will be trained inside someone's simulation of the financial world. Halluminate is building that world.[30] [42]
  1. Capability gains keep coming from post-training, not bigger pre-training runs. Returns from scaling parameters and tokens are flattening. The recent agentic capability jumps (o-series reasoning, ChatGPT Agent, Claude Computer Use) were unlocked by reinforcement learning against purpose-built environments, not by another order of magnitude on the base model.[3] [4] The receipts are on the benchmark: OSWorld success went from 12% to 84% in two years, crossing the human baseline in 2026, almost entirely on RL post-training.[23] [28]

  2. Every lab is short on environments. Every lab is building computer-use agents; every lab is short on environments to train and grade them on. Training on the live web gets rate-limited, breaks when sites change, and resets state in ways that destroy training signal.[1] Halluminate ships the missing piece as its product, not as a side project some lab team will get to next quarter.

  3. Contracts fund the catalog; the catalog is the company. Labs buy tasks and verifiers; the simulators underneath amortize across every subsequent contract. Every agent run produces a labeled trajectory; every environment seeds the next; every checker doubles as a reward signal. The defensible asset is the catalog: reusable modules, verifiable tasks, and the trajectory corpus nobody can buy off the shelf.

  4. The customer is every frontier lab and every serious agent company. Hyperscalers spent $344B on AI in 2025 alone.[5] Anthropic's leadership has discussed putting more than $1B into RL environments over a single year.[16] The same dollars that funded pre-training are now flowing into post-training stacks, and the post-training stack is inseparable from environment infrastructure. This is the most capital-rich, urgency-pressed customer set in software.

Problem

Computer-use agents work in the demo. They break in production. The gap is environment quality.

Browser and computer-use AI is the most-watched capability in the model lab roadmap — and the most fragile in deployment. When OSWorld launched in 2024, the best model completed 12.24% of routine desktop tasks against a human baseline of 72.36%.[23] WebArena showed the same gap on multi-app web workflows.[1] a16z's framing is identical: "current agent offerings more closely resemble advanced RPA tools than true autonomous systems."[13] The gap has since closed at exactly the rate labs bought environments — which is the point.

The reason isn't model capacity. The reason is that labs train agents on whatever data they can scrape — public web recordings, OSS web benchmarks, synthetic prompts — and then fine-tune them on a few thousand hand-curated trajectories. None of that mirrors the production stack the agent will actually run against. The Salesforce instance, the Slack workspace, the ServiceNow ticket queue, the QuickBooks reconciliation flow — these are the surfaces the customer cares about, and they look nothing like the public web.

Live-web training makes it worse. Real websites can't be reset. They rate-limit aggressively, ban automation, and punish exploration. They change layout overnight. They cost real money when an agent buys the wrong flight. Every lab knows this; every lab has tried to build a stand-in internally; every lab has discovered that environment authoring is a product problem masquerading as a research problem.

  • $344B: AI capex (2025). Hyperscaler spend now spilling into training, eval, and env infra.
  • $14.3B: Meta → Scale AI. A single deal, with labs paying premium prices for training data.
  • $15B: Applied Intuition. AV simulation infra; proves the env-infra business model.

LA Times AI capex 2025[5] · NYT Meta-Scale deal[7] · Reuters Applied Intuition valuation[11]

Why Now

The post-training era is here. Environment quality is the next constraint.

Three trends collided in the same eighteen months: agents shipped to production, pre-training gains flatlined, and every lab woke up needing the same thing — verifiable environments to train and grade computer-use models on.

Current agents still exhibit significant limitations in capability — struggling with complex or unfamiliar interfaces — and efficiency, operating too slowly and expensively to compete effectively with human operators.

a16z[13]

Computer-Use & Agentic Coworkers

The RL environment platforms are becoming foundational infrastructure for anyone looking to train generalist AI workers.

Felicis[14]

Rocket Fuel for AI

The teams that win won't look like traditional tooling vendors; they'll look like thought partners embedded with frontier labs, compounding trust and research depth over time.

Wing VC[33]

Who Will Win the RL Environment Market

Three preconditions converged in the same eighteen months.

Computer-use agents are now first-party products. OpenAI shipped ChatGPT Agent. Anthropic shipped Claude Computer Use. Google launched Project Mariner. Browserbase shipped Director.[3] [4] [13] [15] The category went from research demo to flagship product in twelve months. Every one of those launches has the same next-step problem: making the agent reliable on the long tail of enterprise apps.

Post-training is where capability now comes from. The o-series, Claude 3.5 Computer Use, and ChatGPT Agent were all RL post-training stories. Agents can't learn knowledge work through real-world trial and error — they need custom environments that faithfully simulate reality and reward success.[12] The output of post-training is no better than the environment it was trained against. The benchmark record makes the causality legible: OSWorld went from 14.9% (Claude 3.5 Sonnet, October 2024) to 38.1% (OpenAI's CUA, January 2025) to 61.4% (Sonnet 4.5, September 2025) to 84% (Opus 4.8, May 2026) — each jump an RL-against-environments release, none of them a bigger base model.[24] [25] [26] [28]

The money is following the bottleneck. Hyperscaler AI capex hit $344B in 2025.[5] [6] Meta paid $14.3B for Scale.[7] Surge held funding talks at a $25B+ valuation.[9] [20] Mercor quintupled to $10B in eight months on a $450M run rate.[19] And the spend has gone explicitly environmental: The Information reported Anthropic leadership discussing more than $1B on RL environments over the next year, with typical lab contracts running six to seven figures per quarter.[16] [17] Felicis calls RL environment platforms "foundational infrastructure for anyone looking to train generalist AI workers."[14]

Computer-use agents crossed the human baseline in two years

OSWorld task success rate by flagship model release. The 2024 paper's best model scored 12.24% against a 72.36% human baseline; Claude Opus 4.8 reached 84% on OSWorld-Verified in May 2026 (Anthropic updated its evaluation harness for 2026 scores). Every step on this curve was an RL post-training release — trained against purpose-built environments.[23] [24] [25] [26] [27] [28]

Source · OSWorld benchmark · Anthropic & OpenAI model announcements (2024–2026)

How It Works

Two products. One loop. Environments and the data they produce.

Step 01

Westworld — the mini-internet

High-fidelity sandbox reconstructions of the software agents need to learn. Halluminate licenses the real tool, snapshots it, and rebuilds it loaded with realistic data — a faithful, resettable replica of the actual software, not a lookalike. Deterministic where it matters, parallelizable for fleet-scale training, version-pinned so agents can be graded against the same fixture twice.
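
As a rough illustration of what "resettable, deterministic, version-pinned" can mean in practice, here is a minimal Python sketch. The `SandboxEnv` class, its methods, and the snapshot layout are hypothetical names chosen for this example; they are not Halluminate's actual API.

```python
import copy
import random


class SandboxEnv:
    """Hypothetical resettable sandbox (illustrative only, not a real API)."""

    def __init__(self, snapshot: dict, version: str):
        # Keep a private copy of the snapshot so episodes cannot mutate it.
        self._snapshot = copy.deepcopy(snapshot)
        # Version pin: graders can record which fixture an episode ran against.
        self.version = version
        self.state: dict = {}
        self.rng = random.Random()

    def reset(self, seed: int) -> dict:
        """Restore the snapshot and reseed all randomness."""
        self.state = copy.deepcopy(self._snapshot)
        self.rng = random.Random(seed)
        return self.state

    def step(self, action: dict) -> dict:
        """Toy action: create a record under a seeded pseudo-random id."""
        record_id = f"rec-{self.rng.randrange(10_000):04d}"
        self.state["records"][record_id] = action
        return self.state


env = SandboxEnv({"records": {}}, version="crm-fixture-v1")
env.reset(seed=7)
first_run = copy.deepcopy(env.step({"name": "Acme"}))
env.reset(seed=7)
second_run = env.step({"name": "Acme"})
print(first_run == second_run)
```

Because the state is restored from a private snapshot and the random generator is reseeded on every reset, two episodes with the same seed and actions end in the same state, which is what lets an agent be graded against the same fixture twice.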

Step 02

Athena — human-in-the-loop data

Expert-built tasks and the humans behind them. Halluminate acquires historical work artifacts — emails, spreadsheets, deal documents — to reconstruct point-in-time world states, then sits with domain experts to encode how a professional would actually break down and evaluate the work. Annotators score agent trajectories, build error taxonomies, and produce reward-model targets that plug straight into lab fine-tuning, DPO, and RL pipelines.

Step 03

Checkers — verifiable rewards

Every task ships with programmatic success metrics — record created with the right fields, cart total reconciled, ticket routed to the right queue. The same checker that grades the agent at eval time is the reward signal at training time. RLVR by construction.
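
A minimal Python sketch of the "checker doubles as reward" idea, using the ticket-routing example above. The `Ticket` type, the state layout, and the binary 0/1 reward are assumptions made for illustration; the memo does not describe how Halluminate's checkers are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical record in a simulated ticketing app."""
    ticket_id: str
    queue: str
    fields: dict = field(default_factory=dict)


def check_ticket_routed(final_state: dict, ticket_id: str, expected_queue: str) -> bool:
    """Eval-time checker: did the ticket end up in the expected queue?"""
    ticket = final_state.get(ticket_id)
    return ticket is not None and ticket.queue == expected_queue


def reward(final_state: dict, ticket_id: str, expected_queue: str) -> float:
    """Training-time reward: the same checker, mapped to a scalar."""
    return 1.0 if check_ticket_routed(final_state, ticket_id, expected_queue) else 0.0


routed = {"T-1": Ticket("T-1", queue="billing")}
misrouted = {"T-1": Ticket("T-1", queue="general")}
print(reward(routed, "T-1", "billing"), reward(misrouted, "T-1", "billing"))
```

The design point is that `reward` adds no logic of its own: any change to the checker changes evaluation and training together, so the two cannot drift apart.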

The loop is the product.

Design task → simulate → instrument → verify → train → review. Halluminate runs the same loop the best in-house lab teams run, packaged as infrastructure. Customers pick a workflow, Westworld stages a sandbox for it, checkers score every episode, and Athena's reviewers triage the failures into reward models. The loop already shows up in customer numbers: one customer reported a ~20% improvement in date-picking performance after training against Halluminate's flight-booking simulator.[29]

Delivery is verification-gated. A lab runs a smaller model against the environment; a second model grades the runs and flags errors (for example, whether the agent invoked the right tool for the task); then a larger model stresses the environment again. Failures route back to Halluminate to debug before the contract is accepted and paid. Increasingly the training target is the decision layer (the reasoning that picks the right tool, not the execution of the tool itself), which makes the verifiers, not the simulator, the scarce artifact. An environment that survives acceptance is lab-grade by construction.
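
The acceptance flow can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `Run` record, the rule-based `flag_errors` function standing in for the grader model, and the zero-flag acceptance threshold are hypothetical, not the labs' actual procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    episode_id: str
    expected_tool: str  # tool the task calls for
    tool_called: str    # tool the agent actually invoked


def flag_errors(runs: List[Run]) -> List[str]:
    """Stand-in for the grader model: flag episodes that used the wrong tool."""
    return [r.episode_id for r in runs if r.tool_called != r.expected_tool]


def acceptance_gate(small_runs: List[Run], large_runs: List[Run],
                    max_flagged: int = 0) -> dict:
    """Combine both stages; flagged episodes route back for debugging."""
    flagged = flag_errors(small_runs) + flag_errors(large_runs)
    return {"accepted": len(flagged) <= max_flagged, "route_back": flagged}


small = [Run("s1", "search", "search"), Run("s2", "export", "search")]
large = [Run("l1", "search", "search")]
print(acceptance_gate(small, large))
```

In this toy run the gate rejects delivery and returns episode `s2` for debugging; a real grader would be a model rather than a string comparison.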

Reusable modules bend the curve. Authentication, billing, search, form fills, ticket queues, notification banners: the same primitives show up in every enterprise UI. Each new environment shares more of its scaffolding with the last. The catalog grows superlinearly in engineering hours, the same way Applied Intuition's scenario library did in AV.[11]

Interoperable with the agent stack labs already use. Westworld plugs into popular agent frameworks, browser automation infra (including Browserbase),[15] and the standard training pipelines labs run. No new framework to adopt — Halluminate becomes the env layer beneath whatever the customer already runs.

Environments Are the New Datasets

Static data taught models to predict. Environments teach agents to act.

The shift from "sweatshop data" to simulation-as-data is the through-line of the last twelve months of frontier research — SemiAnalysis calls the winners "data foundries."[31] Every lab is saying the same thing. The companies that build the environment layer become the data layer of the next generation.

Static datasets carried pre-training. Post-training needs environments.

Why static data ran out of room. Pre-training was about scraping the world. Post-training is about practicing in it. A static label set can teach a model what a "good" outcome looks like once. An interactive environment teaches it how to recover when something unexpected happens — and recovery is most of what agent reliability turns out to be.

The pattern the labs already ran. The earliest lab investments in environment infrastructure — Procgen, DeepMind's StarCraft sandbox, the OpenAI Gym lineage — were research bets that environments are the scarcest resource in RL.[8] The same pattern is repeating one level up the stack: instead of toy gym tasks, the scarce environments are real enterprise workflows.

Why environments and the data they produce are sold together. Every agent episode against a Halluminate environment produces a trajectory, a checker score, an annotator review, and a reward signal — exactly the training input the labs need for the next model release. Customers buy environments today and end up paying for the resulting trajectory corpus on every retrain.

Market

The buyer set is small. The budget is enormous.

Frontier model labs. Three to five labs control the high end of post-training spend, and the line item is now public: Anthropic leadership has discussed over $1B on RL environments in a year, and typical environment contracts run six to seven figures per quarter.[16] [17] Each lab is staffed with a small team trying to produce computer-use environments fast enough to keep up with the agent roadmap. Halluminate is already in active pilot with one of them, targeted to convert this quarter. Sustained AI capex creates adjacent demand for training and eval infra to help labs realize their model investments.[5] [6]

Serious agent companies. Browser Use, Yutori, Manus, Browserbase Director, and the next wave of agent products all need the same thing the labs need.[13] [15] Their differentiator is reliability in the customer's actual stack — which means training and grading against environments that mirror that stack. Halluminate sells the same product on the same loop.

Enterprise. The medium-term buyer is the enterprise platform team deploying internal agents. Vertical functions — marketing, finance, sales, HR — all require company-specific tuning against company-specific surfaces.[13] The same Westworld + Athena loop powers internal agent evaluation before the agent ever touches production. Halluminate has since planted its flag on the highest-value vertical first: the company now leads with RL environments for financial services — Excel modeling, investment banking, private equity, and consulting workflows — where task value per episode is highest and domain expertise is the barrier.[30] The demand side has gone vertical too: Anthropic now ships finance-agent templates with Excel and Moody's integrations — financial institutions are roughly 40% of its top-50 customers — and Rogo's $160M Series D at $2B (April 2026, a 2.7× step-up in under four months) priced what agentic finance is worth.[34] [35] Whoever trains those models needs finance-grade worlds to train them in.

Near term — labs and agent companies

Three to five frontier labs plus the top tier of agent startups. Concentrated, technical, urgency-driven. Buying today, paying premium prices, willing to sole-source on speed of catalog growth. Revenue has crossed from pilot to delivery: paid environment contracts with frontier labs are in production, revenue is up 10× in the last nine months, and roles are publicly chartered to support an eight-figure ramp this year.[42] Fleet showed what the demand curve looks like when a catalog crosses the lab procurement bar: ~$1M to $60M+ annualized revenue inside a year.[18]

Long term — every enterprise training agents

RL is moving from static models to dynamic learners, and every Fortune 1000 will eventually need a sandbox + eval pipeline for the agents it deploys internally.[14] Applied Intuition reached a $15B valuation selling the same shape of product to the AV industry.[11] Knowledge work has 10× the surface area and 100× the deployment count.

The training-data and environments layer keeps repricing upward

Reported valuations across the data/environments stack. Scale priced at ~$29B in Meta's June 2025 deal; Surge held talks at ~$25B; Applied Intuition — the environments business model proven in AV — sits at $15B; Mercor quintupled to $10B in October 2025; Fleet, the first env-native startup on the curve, reached $750M in June 2026 on a months-old revenue base.[7] [20] [11] [19] [18]

Source · NYT · Bloomberg · Reuters · TechCrunch · The Information (2025–2026)

Every frontier lab is trying to build the environment stack in-house. Every one of them is failing to keep up with its own agent roadmap. Halluminate is the only company shipping the catalog as a product.
Orange Collective

Competitive landscape

Twenty entrants, three to five winners. The fight is over who is infrastructure and who is labor.

When we invested, Halluminate was nearly alone in selling environments as a product. By mid-2026 roughly twenty funded companies sell into the category, and Wing projects consolidation to three to five winners by 2030.[16] [33] The winner's test is simple: reusable infrastructure over labor, depth over breadth, embedded with the labs. Score the field against it.

Scale AI's RLHF arm

Data-labeling incumbent

Scale priced its labeling motion into a $14B+ deal with Meta and is now pushing into RLHF and post-training data services.[7] The strength is ops machinery and headcount; the gap is that environments are software, not headcount. A workforce that scores trajectories is necessary but not sufficient — the labs need the simulator under the trajectory, and Scale doesn't ship that.

Surge AI

$25B+ data factory

Bootstrapped past $1B revenue, then held its first funding talks at a $25B+ valuation.[9] [20] The faster-growing rival to Scale on labeled data and eval ops, Surge is now investing in RL environments to keep pace with the shift from static datasets to interactive simulation.[16] Surge still sells the expert workforce first; the simulator layer that Halluminate ships as Westworld is one Surge is retrofitting, not the layer it was built on.

Mercor

$10B expert network

Quintupled to a $10B valuation in October 2025 on a $450M run rate, paying out $1.5M+ per day to 30,000+ domain experts.[19] The expert-network model is the closest analog to Athena — but Mercor matches humans to labs; it doesn't ship the simulator the humans grade against.

Mechanize

Env-native · coding

The thesis leader — "sweatshop data is over" — and already working with Anthropic on RL environments.[12] [16] Raised $9.1M at a $500M post-money in April 2026 — the cleanest valuation comp for a thin-revenue environments specialist.[37] Deliberately concentrated on a small number of deep software-engineering environments and "replication training" rather than a broad enterprise catalog.[22] Different segment: coding agents, not computer-use workflows.

Fleet

Env-native · horizontal

The category's revenue proof point: ~$1M to $60M+ annualized inside a year, now raising at a $750M valuation with Bain Capital Ventures in talks to lead.[18] Builds replicas of popular apps (Salesforce, Excel) for lab training — the most direct Westworld competitor, competing on breadth where Halluminate is going deep on financial-services fidelity plus bundled human evals.

Deeptune

Env-native · knowledge work

The most direct new entrant: a16z led a $43M Series A in March 2026 for high-fidelity "training gyms" simulating professional workflows — accountants, support, DevOps — across tools like Slack and Salesforce, sold to frontier labs.[38] Claims hundreds of gyms already built. Same product shape as Westworld, brushing the financial vertical from the accounting side — the sharpest test of whether licensed-software fidelity and expert-built verifiers hold a premium over well-funded breadth.

Applied Compute

$1.3B · sells outcomes

Ex-OpenAI o1 researchers selling "Specific Intelligence": finished RL-trained specialist models to enterprises (DoorDash, Cognition, Mercor), building whatever environments they need internally. $80M led by Kleiner Perkins at a $1.3B post in April 2026 — $100M to $1.3B in ten months.[36] Not a bidder for the same lab budgets, but the strongest evidence that RL-environment capability monetizes best when packaged as outcomes.

Expert networks → environments

Handshake · Turing · micro1 · Invisible

The expert-data marketplaces are converting labor scale into environment offerings: Handshake AI (~$1B gross annualized by April 2026) bought Cleanlab explicitly to add "evaluations, AI safety, RL environments";[39] Turing ($2.2B, ~$300M ARR) sells containerized digital-twin environments with verifiers;[40] micro1 ($500M, $100M+ ARR) is scaling specialized RL environments;[41] Invisible raised $100M at $2B+ to build RL gyms; Labelbox's Alignerr is hiring for sandboxed environments. Labor scales their revenue — environments are where everyone believes the margin lives.

Prime Intellect

Open-source hub

Karpathy-backed; launched the Environments Hub in August 2025 as a "Hugging Face for RL environments," crowdsourcing open environments and monetizing the compute underneath.[21] Commoditizes research-grade environments from below — pressure on the low end, but open community environments are exactly the "80% solution" that fails lab-grade training.[29]

In-house lab teams

OpenAI · Anthropic · DeepMind

Frontier labs all build environments internally — and all run perpetually behind their own agent roadmaps. OpenAI shipped Procgen for research,[8] but computer-use envs at enterprise breadth are a different product. The buy-side behavior settles the debate: Anthropic both builds internally and works with Mechanize, while discussing $1B+ of external environment spend.[16] Labs outsource the catalog and keep the training run.

Scale and Surge sell workforce. Fleet and Deeptune sell breadth. Mechanize sells depth on code. Wing's test for the survivors — infrastructure over labor, depth over breadth, embedded with the labs — is the test Halluminate was built to pass, pointed at the highest-value workflows in the economy.
Orange Collective

Founder deep dive

A product-research operator and a startup data engineer building the part of the lab stack no lab has bandwidth for.

Why Jerry built it. Jerry led product and research at Capital One Labs, where he launched one of the first AI agents in production financial services and co-authored three patents on the underlying systems. Capital One was an early canary: the agent demos worked; the deployments hit a wall the moment they touched real enterprise apps. He left convinced the bottleneck wasn't the model — it was the absence of a realistic, resettable, verifiable training and eval surface for the world the agent had to operate in.

Why Wyatt built it. Wyatt is a two-time early-stage data and software engineer who has spent years shipping the unglamorous backend that makes ML systems work in production — ingestion, eval harnesses, instrumentation, telemetry. The same instinct applied to the agent stack produced Westworld. Westworld is what you build when you've spent enough cycles re-stitching together flaky training data pipelines and decided to make the data infrastructure the product.

Why this team is the right team. Jerry has lived the enterprise agent deployment problem from inside a Fortune 100. Wyatt has built the data infrastructure that production ML actually depends on. Together they cover the customer's pain (Jerry) and the product's depth (Wyatt). The two-person founding shape is exactly right for an infrastructure company that has to ship catalog content fast and sell to a small set of technical buyers.

Why velocity matters here specifically. Catalog growth rate is the metric the labs grade Halluminate on. Every additional environment is leveraged: modules generalize across tasks, checkers transfer, trajectories add to the reward-modeling corpus. The bet is that two founders moving fast on the highest-leverage env category will out-ship any lab team running the same work as one of fifty internal priorities.

The long arc. Halluminate becomes "Applied Intuition for knowledge work" — the simulation infrastructure every serious computer-use agent is trained and graded against, and the data corpus that powers every reward model for the category. The same shape of business that earned Applied Intuition a $15B valuation in AV, applied to a category with 10× the surface area.[11]

Founder & team

Jerry Wu

Co-Founder & CEO

Led product and research at Capital One Labs, where he launched one of the first AI agents in financial services and co-authored three patents. Studied Computer Science and Economics at Cornell, where he researched model quantization methods and served as VP of the Cornell Consulting Group. Class speaker at Acton-Boxborough.

Wyatt Marshall

Co-Founder

Two-time early-stage startup data and software engineer. Spent years shipping large-scale data infrastructure at venture-backed startups, then turned the same instinct on the agent stack — building the environment, eval, and benchmark layer every computer-use AI now needs to train against.

Risks & mitigations

Risk

The services hamster wheel — the structural failure mode for every RL-environments and data company: each contract is bespoke delivery, staffed by senior people, accepted one lab at a time. Revenue scales with delivery leads instead of product, and the company wakes up as a consultancy carrying an infrastructure valuation.

Mitigation

The contract structure already separates the consumable from the asset. Labs buy tasks and verifiers; the simulator underneath carries into the next contract — each engagement leaves residue (rebuilt software, a verifier library, licensed datasets, expert workflow maps) that makes the next one cheaper to deliver. The financial-services focus is the accelerant: concentrating contracts in one domain maximizes catalog overlap, so reuse compounds instead of diluting across verticals. The numbers that prove the wheel is breaking: simulator reuse rate per contract, gross margin by contract cohort, and revenue per FTE — all should rise as the catalog deepens. Surge and Mercor show labor-heavy models can clear $1B in revenue; the infrastructure multiple goes to whoever owns the worlds.

Risk

Frontier labs internalize environment construction — OpenAI, Anthropic, and DeepMind have all built sandbox training environments in-house and have effectively unlimited GPU budgets.

Mitigation

The revealed preference says otherwise: Anthropic works with Mechanize on coding environments and its leadership has discussed spending over $1B on externally sourced RL environments in a single year. Labs have built environments for the capabilities they care about most — and have been bottlenecked on everything else. OpenAI's Procgen and DeepMind's StarCraft sandbox are narrow research artifacts, not the broad enterprise-workflow catalog computer-use agents now demand. Environment construction at catalog scale is a product-shaped problem, not a research-shaped one — and product is where labs prefer partners.

Risk

Environment authoring cost scales linearly — each new app sandbox takes weeks of engineering and the catalog has to cover thousands of surfaces to be a moat.

Mitigation

Halluminate is building environment authoring as a product, not a bespoke service. Reusable simulator modules (auth, billing, search, form fills), coding agents accelerating buildout, and an open framework for community-contributed envs all bend the curve from linear to compounding. The unit economics support it: labs pay roughly $20k for a website replica and up to $300k for a high-fidelity clone of a complex app, price points that can fund catalog growth. Applied Intuition proved the same playbook in AV simulation, reaching a $15B valuation on a catalog of reusable scenario modules.

Risk

Fidelity drift — real apps evolve constantly, and a sandbox that disagrees with the live UI degrades training signal over time.

Mitigation

Automated regression evals run on every release; the same checker infrastructure that scores agent runs flags drift between sandbox and live. Partner sandboxes (the customer's own staging instance) provide a ground-truth backstop. Versioned environments give labs reproducible benchmarks even as the underlying app moves.
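
One way to picture a drift check is as a diff between element fingerprints captured from the sandbox and from a reference instance (the live app or a partner staging instance). This Python sketch is illustrative only: the fingerprint format, the element names, and the `detect_drift` function are hypothetical, and the memo does not say how Halluminate's regression evals work internally.

```python
from typing import Dict, List


def detect_drift(sandbox: Dict[str, str], live: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare UI element fingerprints keyed by a stable element name."""
    shared = sandbox.keys() & live.keys()
    return {
        # Elements the reference instance has that the sandbox lacks.
        "missing_in_sandbox": sorted(live.keys() - sandbox.keys()),
        # Elements the sandbox still has that the reference instance dropped.
        "stale_in_sandbox": sorted(sandbox.keys() - live.keys()),
        # Elements present in both whose fingerprints differ.
        "changed": sorted(k for k in shared if sandbox[k] != live[k]),
    }


def has_drift(report: Dict[str, List[str]]) -> bool:
    """True if any category of drift is non-empty."""
    return any(report.values())


sandbox_ui = {"login.submit": "button:Sign in", "cart.total": "span:Total"}
live_ui = {
    "login.submit": "button:Log in",
    "cart.total": "span:Total",
    "cart.promo": "input:Promo code",
}
drift_report = detect_drift(sandbox_ui, live_ui)
print(drift_report, has_drift(drift_report))
```

A report like this could gate each release: a non-empty result means the pinned sandbox version no longer matches the reference and needs a new version rather than a silent patch.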

Risk

Customer concentration in frontier labs — three or four buyers control the high-value training contracts and have all the leverage in pricing.

Mitigation

Athena (the human-eval/data offering) is already pulling enterprise demand outside the lab cohort — top browser-agent startups and computer-use product teams need exactly the same env + eval loop. The bundle scales horizontally as every serious agent company hits the same wall: their agents work in the demo and break in production. Halluminate sells the bridge.

Risk

Env-native competition is no longer hypothetical — Fleet went from ~$1M to $60M+ annualized revenue in months and is raising at $750M, Mechanize is embedded with Anthropic on coding environments, and Prime Intellect is open-sourcing the long tail through its Environments Hub.

Mitigation

The competitors validate the category and segment it. Mechanize concentrates on a small number of deep coding environments; Fleet builds horizontal app replicas; Prime Intellect commoditizes research-grade environments. Halluminate's wedge is the high-stakes, hard-to-fake end: full-fidelity financial-services workflows (investment banking, private equity, consulting) bundled with human expert evaluation — surfaces where a vibe-coded 80% replica destroys training signal and where domain expertise is the gate, not engineering hours. A market where Anthropic alone discusses $1B+ of annual spend supports multiple winners; the risk is being undifferentiated, not crowded.

What we're watching

  • Revenue per FTE, simulator reuse rate, and gross margin by contract cohort — the three numbers that prove the catalog is compounding and the company is stepping off the services hamster wheel.
  • A second buyer cohort beyond the frontier labs — AI-native services firms training small models on their own workflows would expand demand for environments and data while de-risking lab concentration.
  • Lab relationships maturing from first paid contracts into standing, multi-quarter engagements across more than one frontier lab — the de-risking signal that matters most.
  • The financial-services wedge — Halluminate now leads with RL environments for investment banking, private equity, and consulting workflows. Whether that vertical focus produces the seven-figure-per-quarter lab contracts the category's leaders command is the next proof point.
  • Catalog velocity — environments shipped per month and the rate at which reusable modules drive cross-customer leverage.
  • Athena attach rate — what percentage of Westworld customers also buy human-in-the-loop evals, and what that implies for ACV expansion.
  • Competitive separation — Fleet's revenue ramp ($1M to $60M+ annualized inside a year) set the pace for the category; Halluminate needs a comparable inflection in its vertical to hold a premium position.
  • Open-environment ecosystem — whether community-contributed sandboxes (Prime Intellect's Environments Hub, Halluminate's own open Westworld framework) start meaningfully extending the catalog, and how that shapes the moat.

References

  [1] WebArena. Realistic web environment for autonomous agents
  [2] Y Combinator. Halluminate company profile
  [3] OpenAI. Introducing ChatGPT agent (computer-use launch)
  [4] Anthropic. Computer Use documentation
  [5] LA Times. Big Tech AI spending to reach $344B in 2025
  [6] New York Times. AI spending and the real economy (2025)
  [7] New York Times. Meta invests $14.3B in Scale AI
  [8] OpenAI. Procgen Benchmark: gym environments for generalization in RL
  [9] Reuters. Surge AI explores $1B raise at $30B+ valuation
  [10] TechCrunch. Adept raises $350M for computer-use agents
  [11] Reuters. Applied Intuition valued at $15B (AV simulation infrastructure)
  [12] Mechanize. Sweatshop data is over (RL environments thesis)
  [13] a16z. The rise of computer use and agentic coworkers
  [14] Felicis. Rocket Fuel for AI: RL environments and the RLaaS market
  [15] BuiltIn SF. Browserbase Director and $40M Series B
  [16] TechCrunch. Silicon Valley bets big on 'environments' to train AI agents (Anthropic's $1B+ RL-env plans)
  [17] Epoch AI. An FAQ on reinforcement learning environments (contract sizes, replica costs)
  [18] The Information. RL gym startup Fleet reaches $750M valuation on surging lab demand
  [19] TechCrunch. Mercor quintuples valuation to $10B with $350M Series C
  [20] Bloomberg. Scale rival Surge AI in talks for funding at $25B value
  [21] Prime Intellect. Environments Hub: a community platform to scale RL to open AGI
  [22] Mechanize. The upcoming GPT-3 moment for RL (replication training)
  [23] OSWorld. Benchmarking multimodal agents in real computer environments (human baseline 72.36%)
  [24] Anthropic. Introducing computer use with Claude 3.5 Sonnet (14.9% on OSWorld)
  [25] OpenAI. Computer-Using Agent / Operator (38.1% on OSWorld)
  [26] Anthropic. Claude Sonnet 4.5 (61.4% on OSWorld)
  [27] Anthropic. Claude Opus 4.6 (72.7% on OSWorld)
  [28] Anthropic. Claude Opus 4.8 (84% on OSWorld-Verified, May 2026)
  [29] Hacker News. Launch HN: Halluminate (YC S25), simulating the internet to train computer use
  [30] Halluminate. RL environments for financial services (company site, 2026)
  [31] SemiAnalysis. RL environments and RL for science: data foundries and multi-agent architectures
  [32] Epoch AI. An FAQ on Reinforcement Learning Environments: contract sizes, replica pricing, exclusivity premiums (Jan 2026)
  [33] Wing Venture Capital. Who Will Win the RL Environment Market—and Why (Jan 2026)
  [34] Fortune. Anthropic deepens Wall Street push: finance agents, Microsoft 365 integration, Moody's partnership (May 2026)
  [35] PR Newswire. Rogo raises $160M Series D at $2B to scale the agentic platform for finance (Apr 2026)
  [36] Applied Compute. The Advantage You Own: $80M led by Kleiner Perkins at $1.3B (Apr 2026)
  [37] Mechanize (X). $9.1M raised at a $500M post-money valuation (Apr 2026)
  [38] SiliconANGLE. Deeptune raises $43M to accelerate AI learning through virtual training gyms (Mar 2026)
  [39] Handshake. Handshake acquires Cleanlab: evaluations, AI safety, RL environments (Jan 2026)
  [40] Turing. RL environments for agent training and evaluation (product page, 2026)
  [41] TechCrunch. micro1, a Scale AI competitor, touts crossing $100M ARR (Dec 2025)
  [42] Halluminate. Careers: help us train financial superintelligence; roles supporting $MMs in 2026 revenue