

$31.2M
Seed · May 2026
Co-led by General Catalyst + Lightspeed — among YC's largest seeds
$2M
ARR within weeks of demo day
~6 weeks from launch to revenue
500K+
Contributors · 96 countries
1M+ submissions through the QA pipeline
Thesis
- 01
Rights-cleared provenance is now law, not preference. Since August 2, 2025, GPAI providers serving the EU must publish training-data summaries on a mandatory Commission template and document copyright compliance.[33] Combined with the NYT suit against OpenAI and the wave of publisher licensing deals (Stack Overflow, Reddit, Axel Springer), "where did this data come from" is a checkbox the buyer must clear before they sign.[3] [17] [18] [19] [8] Luel's bundle — consent evidence, chain-of-title, QA logs — is the artifact that clears it.
- 02
Multimodal and robotics models need data the web cannot supply. Web-scale corpora are insufficient for state-of-the-art robotics, egocentric video, and specialist dialogue. Google's RT-1 work is a public example of the alternative: large, purpose-collected teleoperation datasets gathered by hand.[5] Luel's collections go further — physics layers for embodied AI including sensor streams, device pose, and hand-object interaction — data that never existed on the web at all.[23]
- 03
Marketplace plus re-licensable catalog economics compound. Owning non-exclusive rights to a collection lets Luel re-license it at high gross margins once collection costs are amortized, which is closer to stock-media dynamics but with AI-grade QA and metadata. The licensing comps put hard numbers on willingness to pay: Google pays Reddit a reported $60M a year, OpenAI pays Reddit an estimated $70M a year, and News Corp's OpenAI deal is potentially worth more than $250M over five years.[30] [32] [31] DefinedCrowd (now Defined.ai) raised about $65M across the two cited rounds building a less rights-focused version of this thesis.[14] [15]
- 04
The Meta–Scale deal opened a neutrality window. Frontier labs are few but spend heavily: Scale sits at a $29B valuation after Meta's $14.3B stake, Surge has been in talks at ≥$25B, and Mercor raised at $10B.[24] [28] [29] But OpenAI and Google moved to pull work from Scale within days of the Meta announcement, putting roughly $200M of Google's planned 2025 spend in play, and both Surge and Mercor are annotation-first.[25] [26] The rights-cleared raw multimodal lane is still open, and the buyers are actively shopping for neutral partners.
Problem
The web has been scraped. Now labs need data the web never had.
Frontier AI labs need rights-cleared multimodal training data at scale, and the most accessible source — the public web — has been substantially exhausted for the modalities that matter most.[2] Most available datasets fail production requirements: unclear rights, weak provenance, missing consent, inconsistent metadata.[1] The shortcuts that worked five years ago no longer work.
At the same time, the cost of using shortcuts has gone up. The NYT v. OpenAI suit, the YouTube transcript controversy, and the EU AI Act have collectively pushed legal, brand, and procurement teams to demand explicit licensing and audit-ready documentation before a dataset can enter pre-training.[3] [4] [8]
The remaining option is to commission real-world data: egocentric video from licensed contributors, professional dialogues with consent, robotics teleoperation footage shot to spec. That is operationally hard. It involves recruiting and vetting contributors, capturing consent, running multi-stage QA, and shipping artifacts the procurement team can sign off on. Most labs don't want to build the operation in-house, and the vendors that built it at scale have new problems of their own. After Meta took 49% of Scale AI in June 2025, neutrality itself became a procurement question: OpenAI wound down its Scale work and Google moved to cut ties, sending labs shopping for independent partners.[24] [25] [26]
$29B
Scale AI valuation
Meta's 49% stake (Jun 2025) · neutrality now in question
≥$25B
Surge AI raise talks
Reported ~$1B round (Bloomberg, Jul 2025)
$10B
Mercor Series C
$350M round (Oct 2025) · ~$1.5M/day to contractors
Why Now
Five forces hit at once. Rights-cleared multimodal data is the convergence trade.
Exhaustion of public corpora, regulatory hardening, the post-Meta neutrality shock, the rise of multimodal / robotics, and the maturation of contributor-marketplace operations all arrive in the same 24-month window.
Public data ran out and procurement teams stopped looking the other way.
Data exhaustion is no longer hypothetical. Epoch AI's projections treat high-quality public text as a finite stock, with constraints tightening as training compute scales.[2] Multimodal demand makes the gap larger, not smaller: real-world video, audio, and robotics footage was never on the web in usable quantities to begin with.
Lawsuits and licensing deals reshaped the buyer's risk model. The NYT v. OpenAI litigation, OpenAI's deal with Stack Overflow, Google's licensing arrangement with Reddit, and Axel Springer's partnership with OpenAI together signal that the era of "scrape now, apologize later" is closing.[3] [17] [18] [19] Buyers now want a paper trail.
The EU AI Act's GPAI obligations are now in force. Final approval in May 2024 codified provenance and consent expectations; since August 2, 2025, GPAI providers must publish training-data summaries on a mandatory Commission template and maintain a copyright-compliance policy.[8] [33] Provenance paperwork moved from best practice to legal requirement — and it effectively exports globally for any lab serving European customers.
Neutrality became a buying criterion. Meta's $14.3B stake in Scale (June 2025) triggered an immediate customer exodus: OpenAI phased out its Scale work, Google planned to walk from roughly $200M of 2025 spend, and Scale's competitors reported an influx of labs seeking neutral partners.[24] [25] [26] Independent vendors are catching budgets that were locked up a year ago.
Multimodal and robotics need bespoke collection. State-of-the-art robotics systems like RT-1 rely on hand-collected teleoperation datasets, not scrapes.[5] RLHF practice has converged on the conclusion that the bottleneck is high-quality, diverse human data rather than sheer volume.[6]
How It Works
Spec in. Audit-ready dataset out. In days, not quarters.
Two product motions on one operational substrate.
Bespoke collection. The high-margin, high-touch motion: a frontier lab needs gemstone manufacturing footage, or patient-doctor dialogues, or egocentric data from cooks in a commercial kitchen. Luel turns the request around in days with the full paper trail. Customers pay for speed and for the procurement-ready artifacts.[1]
Off-the-shelf catalog. The compounding motion: collections that were funded by a bespoke job become re-licensable inventory. The customer gets a fast start; Luel gets a margin profile closer to stock-media than to services. Inigo's published Ego-Realm dataset on Hugging Face is an early demonstration of the catalog motion.
Compliance is the product, not a wrapper. Standard license templates, usage scopes, consent revocation flows, and documentation aligned to EU AI Act risk categories are not features bolted on — they are why the buyer chooses Luel.[8]
Interoperability. APIs, SDKs, and delivery formats are designed to drop into the lab's existing pre-training, fine-tuning, and evals pipelines so the dataset doesn't sit in a slow review queue.
The artifacts that ship with every dataset
Paper trail
The bundle that lets a frontier lab clear procurement, legal, and brand review without a multi-month back-and-forth.
Traction & Round
One of the largest seed rounds in YC history, weeks after demo day.
$31.2M
Seed round · May 2026
Co-led by General Catalyst + Lightspeed
1M+
Submissions through QA
40+ active dataset campaigns at any time
96
Countries in the network
500K+ vetted contributors
The round priced the wedge. The revenue arrived before the round did.
In May 2026, Luel announced a $31.2M seed co-led by General Catalyst and Lightspeed — one of the largest seed rounds in Y Combinator's history.[22] [23] Additional backers include Paul Graham, SV Angel, Human Capital, and Orange Collective.[22]
The traction preceded the capital: $2M ARR within roughly six weeks of demo day, over a million submissions processed through the QA pipeline, and 40+ dataset campaigns running concurrently.[22] The customer base already spans generative AI labs, robotics companies, speech research teams, major social platforms, universities, hospitals, and banks — broader than the frontier-lab-only wedge we expected at memo time.[22] [23]
Lightspeed's stated rationale is the "data wall": models exhausting public web data and requiring massive net-new, human-generated, rights-cleared data across modalities and geographies.[23] That is the same thesis as this memo — now underwritten at institutional size.
Market
A market that is small today and inevitable tomorrow.
AI training datasets and the broader data collection & labelling services market are both compounding at roughly 28% annually, with multimodal the fastest-growing segment.
The training dataset market is tripling inside the next five years.
The AI training datasets market sits at roughly $2.82B in 2024 and is projected to reach ~$9.58B by 2029 at ~27.7% CAGR, with multimodal as the fastest-growing segment.[9]
The broader data collection and labelling services market is roughly $3.0B in 2023 heading to ~$29.2B by 2032 at ~28.5% CAGR.[10] Publisher and platform licensing deals show what buyers will pay for rights-cleared content: Google pays Reddit a reported $60M a year, OpenAI's Reddit deal is estimated at ~$70M a year, and News Corp's OpenAI agreement is potentially worth more than $250M over five years.[30] [32] [31]
Two structural tailwinds compound on top of the headline numbers. First, modern robotics and multimodal systems require real-world egocentric and device-specific data that scraping cannot provide.[5] Second, public-data exhaustion drives a premium on curated, re-licensable corpora — and pushes more of the spend toward bespoke collection rather than pre-existing dumps.[2]
Training-data market: today vs. projected
AI training datasets: $2.82B (2024) → $9.58B (2029E), ~27.7% CAGR.[9] Data collection & labelling services: $3.0B (2023) → $29.2B (2032E), ~28.5% CAGR.[10]
Source · MarketsandMarkets · Allied Market Research
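The growth rates quoted for both markets can be sanity-checked from their endpoint figures. The following is a minimal sketch, assuming the standard compound-annual-growth-rate formula and treating each projection as a start value, an end value, and a year span. The function name `implied_cagr` is illustrative and does not come from either report.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoint values."""
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("values and year span must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Endpoint figures as quoted in this memo (USD billions).
datasets = implied_cagr(2.82, 9.58, 2029 - 2024)
services = implied_cagr(3.0, 29.2, 2032 - 2023)

print(f"AI training datasets: {datasets:.1%}")
print(f"Collection and labelling services: {services:.1%}")
print(f"Training datasets growth multiple: {9.58 / 2.82:.1f}x")
```

The first result should land close to the quoted ~27.7%. The second comes out a little above the quoted ~28.5%, which may reflect rounding or a different base year in the underlying report.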
What rights-cleared content is worth: licensing deal economics
Annualized values of disclosed/reported AI content-licensing deals. Reddit→Google: reported $60M/yr.[30] Reddit→OpenAI: estimated ~$70M/yr.[32] News Corp→OpenAI: potentially more than $250M over five years in cash and credits (roughly $50M/yr or more).[31]
Source · CBS News · Columbia Journalism Review · Variety
Competitive landscape
A $29B incumbent in a neutrality crisis, a $25B bootstrapper, and one open lane.
Labeling/RLHF incumbents (Scale, Surge, Mercor, Appen) dominate annotation. Rights-cleared marketplaces (Defined.ai) and consumer data apps (Kled, Sapien) are adjacent. The "rights-cleared raw multimodal data" lane is where Luel differentiates — and the Meta–Scale deal put the largest incumbent's neutrality in question.
Data-platform valuations, mid-2025 to late 2025
Scale AI: $29B implied by Meta's $14.3B purchase of a 49% stake (Jun 2025).[24] Surge AI: reported talks to raise ~$1B at ≥$25B (Jul 2025).[28] Mercor: $350M Series C at $10B (Oct 2025).[29]
Source · TechCrunch · Bloomberg
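The Scale AI figure is an implied valuation, not a priced round, so it helps to show where it comes from. This is a minimal sketch, assuming the simple relationship of investment divided by fractional stake. The helper name `implied_valuation` is hypothetical, and the inputs are the figures quoted in this memo.

```python
def implied_valuation(investment: float, stake_fraction: float) -> float:
    """Company valuation implied by paying `investment` for `stake_fraction` of it."""
    if investment <= 0 or not 0 < stake_fraction <= 1:
        raise ValueError("investment must be positive and stake in (0, 1]")
    return investment / stake_fraction


# Figures as quoted in this memo (USD billions).
scale_valuation = implied_valuation(14.3, 0.49)
print(f"Scale AI implied valuation: ${scale_valuation:.1f}B")
```

Under those inputs the result is about $29B, matching the headline number.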
Luel differentiates on rights trail plus speed for bespoke multimodal collections, on catalog re-licensing economics the labeling incumbents are not optimized to build — and on independence, at the exact moment labs are fleeing a Meta-owned incumbent.
Founder deep dive
A two-founder team that walked away from Berkeley to build the data layer.
Founders
Founder signal
Risks & mitigations
What we're watching
References
- [1] Y Combinator: Luel company profile
- [2] Epoch AI: Will we run out of data?
- [3] Reuters: New York Times sues OpenAI and Microsoft over copyright
- [4] The Verge: OpenAI reportedly used YouTube video transcriptions for training
- [5] Google Robotics: RT-1 Robotics Transformer (real-world data collection)
- [6] Latent Space: RLHF 201 (Nathan Lambert)
- [7] a16z Policy: AI, copyright, and fair use (submission)
- [8] Council of the EU: AI Act final approval (press release)
- [9] MarketsandMarkets: AI Training Dataset Market (press release)
- [10] Allied Market Research: Data Collection and Labelling Market
- [11] Reuters: U.S. Labor Department investigating Scale AI
- [12] Reuters: Scale AI raises $1B at ~$13.8B valuation
- [13] Reuters: Surge AI seeks up to $1B raise at >$15B valuation
- [14] TechCrunch: DefinedCrowd raises $50.5M Series B
- [15] TechCrunch: DefinedCrowd raises additional $15M
- [16] VentureBeat: Sapien raises $5M seed (Train2Earn)
- [17] Reuters: OpenAI signs deal with Stack Overflow
- [18] Reuters: Google reaches content licensing deal with Reddit
- [19] Axel Springer: Partnership with OpenAI (press)
- [20] arXiv: Nightshade: Prompt-specific poisoning attacks
- [21] MIT Technology Review: Artists use Nightshade to poison AI models
- [22] Luel: $31.2M seed round led by General Catalyst and Lightspeed (announcement)
- [23] Lightspeed: Our Investment in Luel: The Marketplace for Multimodal AI Training Data
- [24] TechCrunch: Scale AI confirms 'significant' investment from Meta, CEO Alexandr Wang departing
- [25] TechCrunch: OpenAI drops Scale AI as a data provider following Meta deal
- [26] TechCrunch: Google reportedly plans to cut ties with Scale AI
- [27] CNBC: Scale AI cuts 14% of workforce after Meta investment
- [28] Bloomberg: Scale rival Surge AI in talks for funding at $25 billion value
- [29] TechCrunch: Mercor quintuples valuation to $10B with $350M Series C
- [30] CBS News: Google strikes $60 million deal with Reddit for AI training
- [31] Variety: News Corp inks OpenAI licensing deal potentially worth more than $250 million
- [32] Columbia Journalism Review: Reddit is winning the AI licensing game
- [33] Mayer Brown: EU AI Act: GPAI rules start applying; training-data summary template finalized
- [34] TechCrunch (via AOL): 'No crying in the casino': a viral startup spat exposes tech's crazed state



