Halluminate

The RL environment and data factory for computer-use AI.

Halluminate YC launch — Westworld + Athena walkthrough[2]

Seed

Round

YC S25 · revenue-generating

2.8×

MoM revenue growth

Excluding active frontier-lab PoC · mid-5-figure MRR base

$344B

Hyperscaler AI capex (2025)

Spilling into training, eval, and env infra

In production at

Browser Use

OSS browser agents · YC W25

Yutori

Personal AI agents

Frontier lab pilot

Top-tier model lab · undisclosed

Enterprise pilots

CRM, finance, ticketing sandboxes

Two of the largest browser-agent companies plus an active frontier-lab pilot targeted to convert this quarter.[2]

Thesis

The next wave of agent quality won't come from bigger models — it will come from better training environments.[12] [14] OpenAI's Procgen, DeepMind's StarCraft sandbox, and Anthropic's Computer Use all proved that post-training environments are the moat.[3] [4] [8] Halluminate is building the canonical RL environment and data factory for computer-use AI — the training ground every frontier lab and serious agent company already needs and almost no one has the bandwidth to build at catalog scale.
  1. Post-training is the new pre-training. Frontier model gains from scaling parameters and tokens are flatlining. The capability gains that matter now (agentic planning, long-horizon tool use, computer use) are coming from reinforcement learning in real-world environments. Every recent jump (o-series reasoning, ChatGPT Agent, Claude Computer Use) is a post-training story, not a pre-training one.[3] [4]

  2. Environment scarcity is the bottleneck. Every lab is building computer-use agents; every lab is starving for environments to train and grade them on. Live web training is unsafe, unreliable, and unscalable: websites rate-limit, change weekly, and cannot be reset to a known state, all of which destroys training signal.[1] Halluminate ships the missing piece as a product, not a side project.

  3. The data flywheel compounds. Every agent run produces a labeled trajectory. Every environment seeds the next. Every checker becomes a reward model. The lead Halluminate accumulates is not the individual sandbox; it is the catalog of reusable modules, the library of verifiable tasks, and the corpus of expert trajectories that compound into a permanent advantage.

  4. The customer is every frontier lab and every serious agent company. Hyperscalers spent $344B on AI in 2025 alone.[5] The same dollars that funded pre-training are now flowing into post-training stacks, and the post-training stack is inseparable from environment infrastructure. This is the most capital-rich, urgency-pressed customer set in software.

Problem

Computer-use agents work in the demo. They break in production. The gap is environment quality.

Browser and computer-use AI is the most-watched capability in the model lab roadmap — and the most fragile in deployment. WebArena, the canonical research benchmark, still shows agents underperforming humans by enormous margins on routine multi-app workflows.[1] a16z's framing is identical: "current agent offerings more closely resemble advanced RPA tools than true autonomous systems."[13]

The reason isn't model capacity. The reason is that labs train agents on whatever data they can scrape — public web recordings, OSS web benchmarks, synthetic prompts — and then fine-tune them on a few thousand hand-curated trajectories. None of that mirrors the production stack the agent will actually run against. The Salesforce instance, the Slack workspace, the ServiceNow ticket queue, the QuickBooks reconciliation flow — these are the surfaces the customer cares about, and they look nothing like the public web.

Live-web training makes it worse. Real websites can't be reset. They rate-limit aggressively, ban automation, and punish exploration. They change layout overnight. They cost real money when an agent buys the wrong flight. Every lab knows this; every lab has tried to build a stand-in internally; every lab has discovered that environment authoring is a product problem masquerading as a research problem.

$344B

AI capex (2025)

Hyperscaler spend now spilling into training, eval, and env infra

$14.3B

Meta → Scale AI

Single deal — labs paying premium prices for training data

$15B

Applied Intuition

AV simulation infra — proves the env-infra business model

LA Times AI capex 2025[5] · NYT Meta-Scale deal[7] · Reuters Applied Intuition valuation[11]

Why Now

The post-training era is here. Environment quality is the next constraint.

Three trends collided in the same eighteen months: agents shipped to production, pre-training gains flatlined, and every lab woke up needing the same thing — verifiable environments to train and grade computer-use models on.

Until AIs can learn through real-world trial and error like humans do, we must create custom environments that can faithfully simulate reality and accurately reward AIs for skillfully navigating the simulation.

Mechanize[12]

RL environments lab

Current agents still exhibit significant limitations in capability — struggling with complex or unfamiliar interfaces — and efficiency, operating too slowly and expensively to compete effectively with human operators.

a16z[13]

Computer-Use & Agentic Coworkers

The RL environment platforms are becoming foundational infrastructure for anyone looking to train generalist AI workers.

Felicis[14]

Rocket Fuel for AI

Three preconditions converged in the same eighteen months.

Computer-use agents are now first-party products. OpenAI shipped ChatGPT Agent. Anthropic shipped Claude Computer Use. Google launched Project Mariner. Browserbase shipped Director.[3] [4] [13] [15] The category went from research demo to flagship product in twelve months. Every one of those launches has the same next-step problem: making the agent reliable on the long tail of enterprise apps.

Post-training is where capability now comes from. The o-series, Claude 3.5 Computer Use, and ChatGPT Agent were all RL post-training stories. As Mechanize put it, "until AIs can learn through real-world trial and error like humans do, we must create custom environments that can faithfully simulate reality and accurately reward AIs for skillfully navigating the simulation."[12] The output of post-training is no better than the environment it was trained against.

The money is following the bottleneck. Hyperscaler AI capex hit $344B in 2025.[5] [6] Meta paid $14.3B for Scale.[7] Surge is raising $1B at a $30B+ valuation.[9] The same labs and acquirers that bid up the pre-training data stack are turning their attention to the post-training environment stack. Felicis calls RL environment platforms "foundational infrastructure for anyone looking to train generalist AI workers."[14]

How It Works

Two products. One loop. Environments and the data they produce.

Step 01

Westworld — the mini-internet

High-fidelity sandbox clones of the enterprise apps agents need to learn — Salesforce, Slack, ticketing, CRM, finance, e-commerce, travel. Deterministic where it matters, parallelizable for fleet-scale training, fully resettable. Every environment is reproducible and version-pinned so agents can be graded against the same fixture twice.
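
As a rough illustration of what "deterministic, resettable, and version-pinned" can mean in practice, the sketch below models a toy ticketing sandbox. Everything in it (the `TicketSandbox` class, the `version` string, the seed handling) is hypothetical and is not Halluminate's actual API; it only shows that resetting with the same seed rebuilds the same fixture, so an agent can be graded against identical starting state twice.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TicketSandbox:
    """Toy stand-in for a sandboxed ticketing app (hypothetical, for illustration)."""

    version: str = "ticketing-0.1.0"  # assumed version pin, not a real release
    seed: int = 0
    tickets: list = field(default_factory=list)

    def reset(self) -> list:
        """Rebuild the initial fixture from the seed so every episode starts identically."""
        rng = random.Random(self.seed)  # local RNG, independent of global random state
        queues = ["billing", "support", "sales"]
        self.tickets = [
            {"id": i, "queue": rng.choice(queues), "status": "open"}
            for i in range(5)
        ]
        # Return copies so callers cannot mutate internal state by accident.
        return [dict(t) for t in self.tickets]

    def route(self, ticket_id: int, queue: str) -> None:
        """Agent action: move a ticket to another queue."""
        self.tickets[ticket_id]["queue"] = queue


env = TicketSandbox(seed=42)
first = env.reset()
env.route(0, "sales")   # an agent mutates state during an episode
second = env.reset()    # reset discards the mutation and restores the fixture
```

A second `TicketSandbox(seed=42)` instance produces the same fixture, which is the property that makes parallel, fleet-scale episodes comparable.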

Step 02

Athena — human-in-the-loop data

Expert annotators score agent trajectories, build error taxonomies, and produce reward-model training targets. The output plugs straight into lab fine-tuning, DPO, and RL pipelines. This is the layer that turns raw simulator runs into the kind of high-signal data labs would otherwise have to assemble themselves.

Step 03

Checkers — verifiable rewards

Every task ships with programmatic success metrics — record created with the right fields, cart total reconciled, ticket routed to the right queue. The same checker that grades the agent at eval time is the reward signal at training time. RLVR by construction.
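
To make the idea concrete, here is a minimal sketch of a programmatic checker for the "ticket routed to the right queue" case. The function name, the state layout, and the binary 0/1 reward are assumptions for illustration, not Halluminate's implementation. The point is that one function inspects the sandbox's final state and returns both a pass/fail grade for evaluation and a scalar reward for training.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool      # eval-time grade
    reward: float     # training-time reward derived from the same check
    failures: list    # human-readable reasons, useful for reviewer triage


def check_ticket_routed(final_state: dict, ticket_id: int, expected_queue: str) -> CheckResult:
    """Score an episode from the sandbox's final state, not from the agent's transcript."""
    failures = []
    ticket = final_state.get("tickets", {}).get(ticket_id)
    if ticket is None:
        failures.append(f"ticket {ticket_id} not found")
    else:
        if ticket.get("queue") != expected_queue:
            failures.append(f"queue is {ticket.get('queue')!r}, expected {expected_queue!r}")
        if ticket.get("status") != "open":
            failures.append("ticket was closed instead of routed")
    passed = not failures
    return CheckResult(passed=passed, reward=1.0 if passed else 0.0, failures=failures)


good = check_ticket_routed({"tickets": {7: {"queue": "billing", "status": "open"}}}, 7, "billing")
bad = check_ticket_routed({"tickets": {7: {"queue": "sales", "status": "open"}}}, 7, "billing")
```

Because the reward is computed from verifiable state rather than a learned judge, the eval score and the training signal cannot disagree with each other.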

The loop is the product.

Design task → simulate → instrument → verify → train → review. Halluminate runs the same loop the best in-house lab teams run, packaged as infrastructure. Customers pick a workflow, Westworld stages a sandbox for it, checkers score every episode, and Athena's reviewers triage the failures into reward models.

Reusable modules bend the curve. Authentication, billing, search, form fills, ticket queues, notification banners: the same primitives show up in every enterprise UI. Each new environment shares more of its scaffolding with the last, so the catalog grows superlinearly in engineering hours, the same way Applied Intuition's scenario library did in AV.[11]
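
The reuse argument can be sketched with a toy calculation. The environment names and their module lists below are invented for illustration (only the module types come from the paragraph above); the function reports, for each environment in build order, what fraction of its modules already existed in the catalog.

```python
# Hypothetical catalog: (environment name, modules it needs), in build order.
catalog = [
    ("crm", {"auth", "search", "form_fill", "contacts"}),
    ("ticketing", {"auth", "search", "form_fill", "ticket_queue", "notifications"}),
    ("finance", {"auth", "billing", "search", "form_fill", "notifications"}),
]


def reuse_fractions(envs):
    """Return {env_name: share of its modules that were already built earlier}."""
    built = set()
    fractions = {}
    for name, modules in envs:
        fractions[name] = len(modules & built) / len(modules)
        built |= modules  # everything built for this env is reusable afterwards
    return fractions


fractions = reuse_fractions(catalog)
# With these made-up inputs: crm reuses 0 of 4, ticketing 3 of 5, finance 4 of 5.
```

Under these assumed inputs, each later environment needs less new scaffolding than the one before it, which is the shape of curve the memo is describing.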

Interoperable with the agent stack labs already use. Westworld plugs into popular agent frameworks, browser automation infra (including Browserbase),[15] and the standard training pipelines labs run. No new framework to adopt — Halluminate becomes the env layer beneath whatever the customer already runs.

Environments Are the New Datasets

Static data taught models to predict. Environments teach agents to act.

The shift from "sweatshop data" to simulation-as-data is the through-line of the last twelve months of frontier research. Every lab is saying the same thing. The companies that build the environment layer become the data layer of the next generation.

Datasets won the last era. Environments win the next.

Why static data ran out of room. Pre-training was about scraping the world. Post-training is about practicing in it. A static label set can teach a model what a "good" outcome looks like once. An interactive environment teaches it how to recover when something unexpected happens — and that is the entire content of agent reliability.

Why OpenAI built Procgen. The earliest lab investments in environment infrastructure — Procgen, DeepMind's StarCraft sandbox, the OpenAI Gym lineage — were research bets that environments are the scarcest resource in RL.[8] The pattern is now repeating one level up the stack: instead of toy gym tasks, the bottleneck environments are enterprise workflows. Same shape of problem, same shape of moat.

Why the data flywheel compounds. Every agent episode against a Halluminate environment produces a trajectory, a checker score, an annotator review, and a reward signal. That data is the canonical training input for the next model release. Customers buy environments today; they end up renting the resulting data corpus for the next decade. The catalog and the corpus grow together.

Market

The buyer set is small. The budget is enormous.

Frontier model labs. Three to five labs control the high end of post-training spend. Each has only a small team trying to produce computer-use environments fast enough to keep up with the agent roadmap. Halluminate is already in an active pilot with one of them, targeted to convert this quarter. Sustained AI capex creates adjacent demand for the training and eval infrastructure labs need to realize returns on their model investments.[5] [6]

Serious agent companies. Browser Use, Yutori, Manus, Browserbase Director, and the next wave of agent products all need the same thing the labs need.[13] [15] Their differentiator is reliability in the customer's actual stack — which means training and grading against environments that mirror that stack. Halluminate sells the same product on the same loop.

Enterprise. The medium-term buyer is the enterprise platform team deploying internal agents. Vertical functions — marketing, finance, sales, HR — all require company-specific tuning against company-specific surfaces.[13] The same Westworld + Athena loop powers internal agent evaluation before the agent ever touches production.

Near term — labs and agent companies

Three to five frontier labs plus the top tier of agent startups. Concentrated, technical, urgency-driven. Buying today, paying premium prices, willing to sole-source on speed of catalog growth. Halluminate's current revenue base is mid-five-figure MRR, growing 2.8× MoM: inflection-shape revenue against a small, deep-pocketed cohort.

Long term — every enterprise training agents

RL is moving from static models to dynamic learners, and every Fortune 1000 will eventually need a sandbox + eval pipeline for the agents it deploys internally.[14] Applied Intuition reached a $15B valuation selling the same shape of product to the AV industry.[11] Knowledge work has 10× the surface area and 100× the deployment count.

Every frontier lab is trying to build the environment stack in-house. Every one of them is failing to keep up with their own agent roadmap. Halluminate is the only company shipping the catalog as a product.
Orange Collective

Competitive landscape

Data factories. In-house teams. Halluminate is the only one built for environments.

The adjacent categories all touch the same loop — data-labeling shops, RLaaS platforms, OSS research benches, and lab-internal teams — but none of them are organized around the catalog of verifiable enterprise environments. That gap is the wedge.

Scale AI's RLHF arm

Data-labeling incumbent

Scale priced its labeling motion into a $14B+ deal with Meta and is now pushing into RLHF and post-training data services.[7] The strength is ops machinery and headcount; the gap is that environments are software, not headcount. A workforce that scores trajectories is necessary but not sufficient — the labs need the simulator under the trajectory, and Scale doesn't ship that.

Surge AI

$30B+ data factory

Reportedly raising $1B at a $30B+ valuation; the faster-growing rival to Scale on labeled data and eval ops.[9] Same structural gap — Surge sells the expert workforce, not the verifiable training environment. Halluminate's Athena offering competes on the expert-eval surface; Westworld is the layer Surge doesn't have.

Snorkel / programmatic-data tooling

Weak supervision

Snorkel and the programmatic-data category solved the label-generation problem for static datasets. The world they were built for — predict-the-label supervised learning — is exactly the world post-training moved past. Reward signals from interactive sandboxes can't be generated programmatically without the sandbox.

In-house lab teams

OpenAI · Anthropic · DeepMind

Frontier labs all build environments internally — and all run perpetually behind their own agent roadmaps. OpenAI shipped Procgen for research,[8] but computer-use envs at enterprise breadth are a different product. ChatGPT Agent and Computer Use were built against environments labs cobbled together themselves; every lab we've talked to admits the env layer is the thing they wish they had outsourced two years ago.[3] [4]

Scale and Surge sell workforce. Snorkel sold labels. Labs build sandboxes between releases. Halluminate is the only company building the environment catalog as a product — and the product is exactly what the entire post-training stack is bottlenecked on.
Orange Collective

Founder deep dive

A product-research operator and a startup data engineer building the part of the lab stack no lab has bandwidth for.

Why Jerry built it. Jerry led product and research at Capital One Labs, where he launched one of the first AI agents in production financial services and co-authored three patents on the underlying systems. Capital One was an early canary: the agent demos worked; the deployments hit a wall the moment they touched real enterprise apps. He left convinced the bottleneck wasn't the model — it was the absence of a realistic, resettable, verifiable training and eval surface for the world the agent had to operate in.

Why Wyatt built it. Wyatt is a two-time early-stage startup data and software engineer who has spent years shipping the unglamorous backend that makes ML systems work in production: ingestion, eval harnesses, instrumentation, telemetry. The same instinct applied to the agent stack produced Westworld. Westworld is what you build when you've spent enough cycles re-stitching flaky training data pipelines and decided to make the data infrastructure the product.

Why this team is the right team. Jerry has lived the enterprise agent deployment problem from inside a Fortune 100. Wyatt has built the data infrastructure that production ML actually depends on. Together they cover the customer's pain (Jerry) and the product's depth (Wyatt). The two-person founding shape is exactly right for an infrastructure company that has to ship catalog content fast and sell to a small set of technical buyers.

Why velocity is the moat. Catalog growth rate is the metric labs grade Halluminate on. Every additional environment compounds: the modules generalize, the checkers transfer, the trajectories add to the reward-modeling corpus. Halluminate's bet is that two founders moving fast on the highest-leverage env category will out-ship any lab team trying to do the same work as one of fifty internal priorities.

The long arc. Halluminate becomes "Applied Intuition for knowledge work" — the simulation infrastructure every serious computer-use agent is trained and graded against, and the data corpus that powers every reward model for the category. The same shape of business that earned Applied Intuition a $15B valuation in AV, applied to a category with 10× the surface area.[11]

Founder & team

Jerry Wu

Co-Founder & CEO

Led product and research at Capital One Labs, where he launched one of the first AI agents in financial services and co-authored three patents. Studied Computer Science and Economics at Cornell, where he researched model quantization methods and served as VP of the Cornell Consulting Group. Class speaker at Acton-Boxborough.

Wyatt Marshall

Co-Founder

Two-time early-stage startup data and software engineer. Spent years shipping large-scale data infrastructure at venture-backed startups, then turned the same instinct on the agent stack — building the environment, eval, and benchmark layer every computer-use AI now needs to train against.

Risks & mitigations

Risk

Frontier labs internalize environment construction — OpenAI, Anthropic, and DeepMind have all built sandbox training environments in-house and have effectively unlimited GPU budgets.

Mitigation

Labs have built environments for the capabilities they care about most and have been bottlenecked on everything else. OpenAI's Procgen and DeepMind's StarCraft sandbox are narrow research artifacts, not the broad enterprise-workflow catalog computer-use agents now demand. One of the labs that built internal envs for code and math is already running a pilot with Halluminate for CRM, ticketing, and travel surfaces. Environment construction at catalog scale is a product-shaped problem, not a research-shaped one, and product is where labs prefer partners.

Risk

Environment authoring cost scales linearly — each new app sandbox takes weeks of engineering and the catalog has to cover thousands of surfaces to be a moat.

Mitigation

Halluminate is building environment authoring as a product, not a bespoke service. Reusable simulator modules (auth, billing, search, form fills), coding agents accelerating buildout, and an open framework for community-contributed envs all bend the curve from linear to compounding. Applied Intuition proved the same playbook in AV simulation: a $15B valuation built on a catalog of reusable scenario modules.[11]

Risk

Fidelity drift — real apps evolve constantly, and a sandbox that disagrees with the live UI degrades training signal over time.

Mitigation

Automated regression evals run on every release; the same checker infrastructure that scores agent runs flags drift between sandbox and live. Partner sandboxes (the customer's own staging instance) provide a ground-truth backstop. Versioned environments give labs reproducible benchmarks even as the underlying app moves.
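
One simple way to picture a drift check is a structural diff between what the sandbox exposes and what a staging instance exposes. The sketch below is an assumption about how such a check could look, not a description of Halluminate's tooling; the `find_drift` helper and the field maps are hypothetical.

```python
def find_drift(sandbox_fields: dict, staging_fields: dict) -> dict:
    """Compare two {field_name: field_type} maps and report where they disagree."""
    sandbox_names = set(sandbox_fields)
    staging_names = set(staging_fields)
    return {
        # Present in the real app but absent from the sandbox.
        "missing_in_sandbox": sorted(staging_names - sandbox_names),
        # Present in the sandbox but no longer in the real app.
        "extra_in_sandbox": sorted(sandbox_names - staging_names),
        # Present in both, but with a different widget or type.
        "type_changed": sorted(
            name
            for name in sandbox_names & staging_names
            if sandbox_fields[name] != staging_fields[name]
        ),
    }


# Hypothetical form schemas for a "create ticket" screen.
sandbox = {"subject": "text", "priority": "select", "assignee": "text"}
staging = {"subject": "text", "priority": "radio", "due_date": "date"}
report = find_drift(sandbox, staging)
```

A non-empty report on any release would be the trigger to cut a new environment version while leaving the old version pinned for reproducible benchmarks.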

Risk

Customer concentration in frontier labs: three to five buyers control the high-value training contracts and have all the leverage in pricing.

Mitigation

Athena (the human-eval/data offering) is already pulling enterprise demand outside the lab cohort — top browser-agent startups and computer-use product teams need exactly the same env + eval loop. The bundle scales horizontally as every serious agent company hits the same wall: their agents work in the demo and break in production. Halluminate sells the bridge.

What we're watching

  • Conversion of the active frontier-lab pilot to a paid contract — the signal that catalog quality has crossed the lab procurement bar.
  • Catalog velocity — environments shipped per month and the rate at which reusable modules drive cross-customer leverage.
  • Athena attach rate — what percentage of Westworld customers also buy human-in-the-loop evals, and what that implies for ACV expansion.
  • Open-environment ecosystem — whether community-contributed sandboxes start meaningfully extending the catalog, and how that shapes the moat.

References

  [1] WebArena. Realistic web environment for autonomous agents
  [2] Y Combinator. Halluminate company profile
  [3] OpenAI. Introducing ChatGPT agent (computer-use launch)
  [4] Anthropic. Computer Use documentation
  [5] LA Times. Big Tech AI spending to reach $344B in 2025
  [6] New York Times. AI spending and the real economy (2025)
  [7] New York Times. Meta invests $14.3B in Scale AI
  [8] OpenAI. Procgen Benchmark: gym environments for generalization in RL
  [9] Reuters. Surge AI explores $1B raise at $30B+ valuation
  [10] TechCrunch. Adept raises $350M for computer-use agents
  [11] Reuters. Applied Intuition valued at $15B (AV simulation infrastructure)
  [12] Mechanize. Sweatshop data is over (RL environments thesis)
  [13] a16z. The rise of computer use and agentic coworkers
  [14] Felicis. Rocket Fuel for AI: RL environments and the RLaaS market
  [15] BuiltIn SF. Browserbase Director and $40M Series B