Luel

Turning everyday words and actions into usable training data.

Luel — The Marketplace for Multimodal Data[1]

$9.6B

AI dataset TAM by 2029

~27.7% CAGR · multimodal fastest-growing

$29.2B

Data collection & labelling by 2032

~28.5% CAGR (Allied Market Research)

Days

To-spec delivery target

Custom multimodal w/ audit-ready artifacts

Thesis

Public web data is largely exhausted.[2] Copyright lawsuits and the EU AI Act are turning consent and chain-of-title into procurement requirements.[3] [8] Multimodal and robotics models need real-world data that scraping cannot supply.[5] Luel is building the rights-cleared default: a marketplace and collection engine that delivers bespoke multimodal datasets to frontier labs in days, then re-licenses the resulting catalog at high margins.[1]
  1. Rights-cleared provenance is becoming a procurement prerequisite. The NYT suit against OpenAI, the wave of publisher licensing deals (Stack Overflow, Reddit, Axel Springer), and the EU AI Act have collectively turned "where did this data come from" into a question the buyer will ask before signing.[3] [17] [18] [19] [8] Luel's bundle of consent evidence, chain-of-title, and QA logs is the artifact buyers need to answer it.

  2. Multimodal and robotics models need data the web cannot supply. Web-scale corpora are insufficient for state-of-the-art robotics, egocentric video, and specialist dialogue. Google's RT-1 work is a public example of the alternative: large, purpose-collected teleoperation datasets gathered by hand.[5] Luel's edge-case focus (niche languages, robotics POV, patient-doctor conversations) targets exactly these high-pain, high-value gaps.

  3. Marketplace plus re-licensable catalog economics compound. Licensing a collection on a non-exclusive basis lets Luel re-license it at high gross margins after collection costs are amortized. The dynamics are closer to stock media, but with AI-grade QA and metadata. DefinedCrowd raised more than $65M across two rounds building a less rights-focused version of this thesis.[14] [15]

  4. A concentrated buyer set lets the winner scale fast. Frontier labs are few but spend heavily. Scale's $1B round at a $13.8B valuation and Surge's reported effort to raise up to $1B at a valuation above $15B are the market signal.[12] [13] Becoming the default rights-cleared partner for time-sensitive, compliance-gated budgets is a category-defining outcome.

Problem

The web has been scraped. Now labs need data the web never had.

Frontier AI labs need rights-cleared multimodal training data at scale, and the most accessible source — the public web — has been substantially exhausted for the modalities that matter most.[2] Most available datasets fail production requirements: unclear rights, weak provenance, missing consent, inconsistent metadata.[1] The shortcuts that worked five years ago no longer work.

At the same time, the cost of using shortcuts has gone up. The NYT v. OpenAI suit, the YouTube transcript controversy, and the EU AI Act have collectively pushed legal, brand, and procurement teams to demand explicit licensing and audit-ready documentation before a dataset can enter pre-training.[3] [4] [8]

The remaining option is to commission real-world data: egocentric video from licensed contributors, professional dialogues with consent, robotics teleoperation footage shot to spec. That is operationally hard. It involves recruiting and vetting contributors, capturing consent, running multi-stage QA, and shipping artifacts the procurement team can sign off on. Most labs don't want to build the operation in-house, and the vendors that have built it (Scale, Surge, Mercor) are not optimized for rights-cleared catalog re-licensing.

Up to $1B

Surge AI raise (reported)

>$15B valuation · frontier lab dependency

$1B

Scale AI's 2024 round

$13.8B valuation · category gravity

$65M+

DefinedCrowd total raise

Closest rights-cleared marketplace analog

Why Now

Four forces hit at once. Rights-cleared multimodal data is the convergence trade.

Exhaustion of public corpora, regulatory hardening, the rise of multimodal / robotics, and the maturation of contributor-marketplace operations all arrive in the same 24-month window.

Public data ran out and procurement teams stopped looking the other way.

Data exhaustion is no longer hypothetical. Epoch AI's projections put high-quality public text data on a finite curve, with constraints intensifying as training compute scales.[2] Multimodal demand makes the gap larger, not smaller — real-world video, audio, and robotics footage was never on the web in usable quantities to begin with.

Lawsuits and licensing deals reshaped the buyer's risk model. The NYT v. OpenAI litigation, OpenAI's deal with Stack Overflow, Google's licensing arrangement with Reddit, and Axel Springer's partnership with OpenAI together signal that the era of "scrape now, apologize later" is closing.[3] [17] [18] [19] Buyers now want a paper trail.

The EU AI Act hardens the requirement. Final approval in May 2024 codified provenance and consent expectations for high-risk modalities — health, biometric, PII — and effectively exports those expectations globally for any lab serving European customers.[8]

Multimodal and robotics need bespoke collection. State-of-the-art robotics systems like RT-1 rely on hand-collected teleoperation datasets, not scrapes.[5] RLHF practice has converged on the conclusion that high-quality, diverse human data, rather than sheer volume, is the bottleneck.[6]

Frontier AI labs need rights-cleared multimodal training data at scale, but public web data is exhausted and most available datasets fail production requirements due to unclear rights, weak provenance, missing consent, and inconsistent metadata.
Luel's positioning statement[1]

How It Works

Spec in. Audit-ready dataset out. In days, not quarters.

Step 01

Spec

Customer specifies modality, scenarios, devices, demographics, QA rules. Luel translates the spec into a collection plan, contributor profile, and acceptance criteria.

Step 02

Recruit & match

A global contributor network is matched to the spec — by hardware, geography, language, profession. Contributors are vetted, onboarded, and paid through mainstream rails.

Step 03

Collect

Multimodal capture: video, audio, image, robotics POV, niche languages. Consent is captured at the source. Provenance metadata is attached to every artifact at ingestion.

Step 04

Multi-stage QA

Quality control, provenance verification, and metadata normalization tuned for pre-training and evals ingestion. Failed artifacts are flagged before they ever reach the customer.

Step 05

Deliver with paper trail

Datasets ship with consent evidence, chain-of-title, license templates, and audit-ready documentation. The buyer's procurement team gets what they need to sign.

Step 06

Catalog re-license

Non-exclusive collections enter the off-the-shelf catalog and are re-licensed to additional customers at high gross margin — DefinedCrowd-style economics with AI-grade QA.
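Steps 03 through 05 can be sketched in code. The following is a minimal illustration under stated assumptions, not Luel's actual implementation: the `Artifact` class, its field names, the QA rules, and the sample IDs are all hypothetical. It shows provenance (a content hash) attached at ingestion, a QA pass that logs its findings, and a delivery filter that holds back failed artifacts.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Artifact:
    """One captured item. All field names here are hypothetical."""
    contributor_id: str
    modality: str                 # e.g. "video", "audio", "image"
    payload: bytes                # raw captured content
    consent_id: Optional[str]     # pointer to consent evidence captured at source
    sha256: str = field(init=False, default="")
    qa_log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Provenance attached at ingestion: a content hash that lets a
        # later audit confirm the delivered bytes are unchanged.
        self.sha256 = hashlib.sha256(self.payload).hexdigest()


def run_qa(artifact: Artifact, allowed_modalities: set) -> bool:
    """Append findings to the artifact's QA log; return True if it passes."""
    failures = []
    if not artifact.consent_id:
        failures.append("missing consent evidence")
    if artifact.modality not in allowed_modalities:
        failures.append(f"modality {artifact.modality!r} not in spec")
    if not artifact.payload:
        failures.append("empty payload")
    artifact.qa_log.extend(failures or ["passed"])
    return not failures


def deliverable(artifacts, allowed_modalities):
    """Keep only artifacts that pass QA, so failures never reach the customer."""
    return [a for a in artifacts if run_qa(a, allowed_modalities)]


# Illustrative batch: one clean artifact, one without consent, one off-spec.
batch = [
    Artifact("c-001", "video", b"frame-bytes", consent_id="consent-17"),
    Artifact("c-002", "video", b"frame-bytes", consent_id=None),
    Artifact("c-003", "lidar", b"scan-bytes", consent_id="consent-18"),
]
shipped = deliverable(batch, {"video", "audio"})
```

Here only the first artifact would ship; the other two keep a QA log entry explaining why they were held back, which is the kind of record an audit trail could draw on.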

Two product motions on one operational substrate.

Bespoke collection. The high-margin, high-touch motion: a frontier lab needs gemstone manufacturing footage, or patient-doctor dialogues, or egocentric data from cooks in a commercial kitchen. Luel turns the request around in days with the full paper trail. Customers pay for speed and for the procurement-ready artifacts.[1]

Off-the-shelf catalog. The compounding motion: collections that were funded by a bespoke job become re-licensable inventory. The customer gets a fast start; Luel gets a margin profile closer to stock media than to services. Co-founder Inigo Lenderking's published Ego-Realm dataset sample on Hugging Face is an early demonstration of the catalog motion.

Compliance is the product, not a wrapper. Standard license templates, usage scopes, consent revocation flows, and documentation aligned to EU AI Act risk categories are not features bolted on — they are why the buyer chooses Luel.[8]

Interoperability. APIs, SDKs, and delivery formats are designed to drop into the lab's existing pre-training, fine-tuning, and evals pipelines so the dataset doesn't sit in a slow review queue.

The artifacts that ship with every dataset

Paper trail
Consent evidence · Chain-of-title · License templates · QA logs · Provenance metadata · Usage scopes

The bundle that lets a frontier lab clear procurement, legal, and brand review without a multi-month back-and-forth.
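One way a buyer or vendor might verify that a delivery carries the full paper trail is a simple completeness check over the six artifacts listed above. This is a hedged sketch: the key names, the `missing_artifacts` helper, and the file names are assumptions made for illustration, not a documented Luel format.

```python
# The six artifact types named in the paper-trail list, as hypothetical keys.
REQUIRED_ARTIFACTS = (
    "consent_evidence",
    "chain_of_title",
    "license_templates",
    "qa_logs",
    "provenance_metadata",
    "usage_scopes",
)


def missing_artifacts(bundle: dict) -> list:
    """Return the required artifact keys that are absent or empty."""
    return [key for key in REQUIRED_ARTIFACTS if not bundle.get(key)]


# Illustrative bundle with made-up file names; usage scopes are left empty.
bundle = {
    "consent_evidence": ["consent-17.pdf"],
    "chain_of_title": ["title-chain.json"],
    "license_templates": ["non-exclusive-v1.md"],
    "qa_logs": ["qa-run-01.log"],
    "provenance_metadata": ["manifest.json"],
    "usage_scopes": [],
}
gaps = missing_artifacts(bundle)
```

In this example `gaps` contains only `"usage_scopes"`, so the bundle would be flagged before it reaches procurement review.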

Market

A market that is small today and inevitable tomorrow.

AI training datasets and the broader data collection & labelling services market are both compounding at roughly 27–28% annually — multimodal is the fastest-growing segment.

The training dataset market is tripling inside the next five years.

The AI training datasets market sits at roughly $2.82B in 2024 and is projected to reach ~$9.58B by 2029 at ~27.7% CAGR, with multimodal as the fastest-growing segment.[9]

The broader data collection and labelling services market is projected to grow from roughly $3.0B in 2023 to ~$29.2B by 2032 at ~28.5% CAGR.[10] Publisher and platform licensing deals (Stack Overflow, Reddit, Axel Springer) validate willingness to pay for rights-cleared content.[17] [18] [19]
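As a sanity check on the two projections, the compound-growth formula end = start × (1 + rate)^years can be applied to the cited figures. This is a minimal sketch: the function names are illustrative, and the year counts (5 for 2024 to 2029, 9 for 2023 to 2032) are read off the date ranges quoted above.

```python
def project(start_billion: float, cagr: float, years: int) -> float:
    """Compound a starting market size forward at a constant annual rate."""
    return start_billion * (1.0 + cagr) ** years


def implied_cagr(start_billion: float, end_billion: float, years: int) -> float:
    """Back out the constant annual growth rate implied by two endpoints."""
    return (end_billion / start_billion) ** (1.0 / years) - 1.0


# AI training datasets: ~$2.82B (2024) at ~27.7% for 5 years.
datasets_2029 = project(2.82, 0.277, 5)

# Data collection and labelling: ~$3.0B (2023) at ~28.5% for 9 years.
labelling_2032 = project(3.0, 0.285, 9)

# Growth rate implied by the quoted labelling endpoints.
labelling_rate = implied_cagr(3.0, 29.2, 9)
```

With these rounded inputs the first projection reproduces roughly $9.58B. The second lands somewhat below $29.2B, and the implied rate comes out slightly above 28.5%, which is consistent with rounding in the quoted starting value and growth rate.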

Two structural tailwinds compound on top of the headline numbers. First, modern robotics and multimodal systems require real-world egocentric and device-specific data that scraping cannot provide.[5] Second, public-data exhaustion drives a premium on curated, re-licensable corpora — and pushes more of the spend toward bespoke collection rather than pre-existing dumps.[2]

Near term — bespoke frontier-lab collections

Heads of data, applied research, and model training at frontier labs and advanced enterprise AI teams. Compliance-gated budgets, time-sensitive procurement. Two product motions: rapid bespoke collection and catalog re-licensing. Few buyers, large checks.[1]

Long term — the rights-cleared catalog

As the catalog compounds, the economic center of gravity shifts from bespoke services to inventory re-licensing — closer to stock-media gross margins, but with AI-grade QA and metadata. The buyer base broadens from frontier labs to every team training in regulated or rights-sensitive verticals.[14] [15]
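The margin argument can be made concrete with a small calculation. The sketch below assumes a one-time collection cost, a fixed price per license, and a small per-license delivery cost; every number is invented purely for illustration and none comes from Luel or the cited sources.

```python
def cumulative_gross_margin(
    collection_cost: float,
    price_per_license: float,
    delivery_cost_per_license: float,
    licenses_sold: int,
) -> float:
    """Gross margin across all licenses of one collection sold so far."""
    if licenses_sold <= 0:
        raise ValueError("licenses_sold must be positive")
    revenue = price_per_license * licenses_sold
    # The collection cost is paid once; delivery cost recurs per license.
    cost = collection_cost + delivery_cost_per_license * licenses_sold
    return (revenue - cost) / revenue


# Hypothetical units: collection costs 100, each license sells for 120,
# and each delivery costs 5.
margins = [
    cumulative_gross_margin(100.0, 120.0, 5.0, n) for n in (1, 2, 5, 10)
]
```

Under these made-up inputs the cumulative margin rises from 12.5% on the first (bespoke) sale to 87.5% after ten licenses, because the one-time collection cost is spread over more revenue while only the small delivery cost recurs.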

Competitive landscape

Two incumbents, two adjacents, one open lane.

Labeling/RLHF incumbents (Scale, Surge, Appen) dominate annotation. Rights-cleared marketplaces (Defined.ai) and consumer data apps (Kled, Sapien) are adjacent. The "rights-cleared raw multimodal data" lane is where Luel differentiates.

Scale AI

Incumbent · labeling

$1B raise at ~$13.8B valuation in 2024.[12] Gold-standard scale and enterprise penetration, but rights/provenance is not the core product wedge, and US Labor Department scrutiny has surfaced contributor compliance risk.[11] Less focused on owning re-licensable, rights-cleared raw multimodal data.

Surge AI

Frontier RLHF

Reported to be raising up to $1B at >$15B valuation. Elite RLHF, evals, and human data for frontier labs. Strong expert network and lab penetration, but the focus is annotation and human feedback — not raw rights-cleared multimodal pre-training corpora.[13]

Defined.ai (DefinedCrowd)

Closest adjacent

Ethical AI data marketplace plus custom collection. Raised $50.5M Series B in 2020 and $15M in 2022.[14] [15] Enterprise-ready marketplace with established licenses and QA. Broader, slower enterprise focus; Luel competes on frontier edge cases and speed.

Appen

Public incumbent

General-purpose labeling and collection at global scale. Strong contributor crowd and procurement muscle, but less emphasis on rights-cleared marketplace dynamics or edge-case multimodal collections.

Sapien

Train2Earn marketplace

Gamified labeling marketplace with blockchain incentives. $5M seed in 2024. Focused on annotation and RLHF rather than raw rights-cleared pre-training corpora.[16]

Kled / consumer data apps

Consumer supply

Consumer-first 'human data marketplace' (uploads → datasets). Strong consumer supply acquisition and crypto-native growth tactics, but a weaker enterprise procurement posture and limited bespoke to-spec delivery.

Mercor

Contractor marketplace

Large-scale contractor marketplace for labeling and data generation. Strong on-demand labor for complex data tasks, but the model is work-for-hire with limited catalog or re-licensing economics.

Luel differentiates on rights trail plus speed for bespoke multimodal collections — and on the catalog re-licensing economics that the labeling incumbents are not optimized to build.
Luel's wedge[1]

Founder deep dive

A two-founder team that walked away from Berkeley to build the data layer.

The pair. William and Inigo both attended UC Berkeley and both dropped out. The founder dynamic, a Berkeley M.E.T. dropout (William) paired with a Berkeley CS dropout (Inigo), suggests they met through the campus CS community before co-founding Luel. The split is clean: William serves as CEO, and Inigo runs operations as COO.

William's path to the problem. Before Luel, William was a founding engineer at ezML and at Relixir (where he also ran GTM), shipping ML and data products at two early-stage companies. In parallel he co-authored an NDSS 2025 poster on LLM security and privacy at Northeastern's PEACH Lab — research exposure that maps directly onto the compliance and provenance surface Luel sells into. He also founded HackBlue, a cybersecurity hackathon, organizing students and practitioners around security and tooling work.

Inigo's path to the problem. Inigo is described in public sources as a former machine learning researcher and a Berkeley CS attendee, indicating technical familiarity with ML that informs Luel's dataset product and QA practices. He maintains a Hugging Face account (Inigology) and a Luel organization presence there, and has already published the "Ego-Realm" egocentric dataset sample — demonstrating the rights-cleared, production-ready multimodal data that Luel sells. His pre-Berkeley education was at The King's School, Canterbury.

Why this team for this problem. The problem is half data engineering and half operations: sourcing contributors globally, capturing consent at the source, and shipping artifacts a Fortune 500 legal team will sign. William's prior founding-engineer roles and security research background fit the technical and compliance side. Inigo's ML research background plus chef/creative profile fit the contributor-recruitment and content-collection side. The division of labor lines up cleanly with the two halves of the company.

On their YC partner. Luel's YC group partner is Harshita Arora — a signal of the partner team's read on the founders.

Founders

William Namgyal

Repeat Founder

Co-Founder & CEO

Berkeley M.E.T. dropout and 2x founding engineer (ezML, Relixir) before co-founding Luel. Built ML and data products at early-stage startups, then served as founding engineer and GTM lead at Relixir. Co-authored an NDSS 2025 poster on LLM security and privacy as a research intern at Northeastern's PEACH Lab. Founded HackBlue, a cybersecurity hackathon. Now leads Luel as a compliance-forward marketplace and custom collection engine delivering licensed multimodal datasets to enterprise model-training teams.

Inigo Lenderking

Co-Founder & COO

Berkeley CS dropout and former machine learning researcher. As COO and co-founder of Luel, runs the two-sided marketplace and collection engine that connects AI teams to vetted contributors, delivering licensed, audit-ready video and audio datasets to spec. Active on Hugging Face (Inigology) and has published the Ego-Realm egocentric dataset sample to demonstrate Luel's rights-cleared, production-ready multimodal data. Educated at The King's School, Canterbury before Berkeley.

Founder signal

Repeat-founder DNA

William is a 2x founding engineer (ezML, Relixir) — has shipped early-stage products before, including running GTM at Relixir.

Compliance research background

William co-authored an NDSS 2025 poster on LLM security and privacy at Northeastern PEACH Lab — directly relevant to Luel's compliance product surface.

Community building

William founded HackBlue, a cybersecurity hackathon — early signal of recruiting and organizing technical communities.

Published dataset work

Inigo maintains an active Hugging Face presence (Inigology) and has shipped the "Ego-Realm" egocentric dataset sample — public demonstration of Luel's quality bar.

Shared Berkeley pedigree

Both attended UC Berkeley (William in M.E.T., Inigo in CS) before dropping out together to build Luel — clean cofounder dynamic with deep prior context.

YC group partner

Harshita Arora — a signal of partner-team conviction on the founders and the wedge.

Risks & mitigations

Risk

A regulatory shift toward broad fair use for AI training would compress the rights premium that anchors Luel's pricing power.[7]

Mitigation

Concentrate on privacy-sensitive and high-risk modalities — health, biometric, professional dialogues — and geographies where explicit consent remains mandatory regardless of fair-use outcomes. Continue to productize compliance with standardized license templates and usage scopes that buyers can drop into procurement.

Risk

Synthetic data substitution — frontier labs decide they can manufacture the multimodal data they need rather than license it.

Mitigation

Focus on the edge cases where synthetic is weakest: niche languages, real-world egocentric footage, specialist dialogues, hardware-specific robotics POV. Bundle evaluation datasets that let labs demonstrate real-world lift over synthetic baselines — making Luel the test set even if it is not always the training set.

Risk

Contributor labor compliance and reputational risk: incumbents like Scale AI have already drawn US Labor Department scrutiny.[11]

Mitigation

Implement transparent pay policies, jurisdiction-aware contractor agreements, and consent revocation flows from day one. Build auditable provenance into every collection so third-party labor audits are a feature, not a fire drill.

Risk

Incumbent response — Scale, Appen, or Defined.ai expand into rights-cleared catalog and outspend a two-person team on enterprise sales.

Mitigation

Win narrow, high-value niches first; maintain the speed advantage that incumbents structurally can't match; secure non-exclusive rights for catalog compounding; integrate tightly with lab pre-training pipelines so the switching cost grows with each delivery.

Additional integrity surface. Data poisoning and extraction risks against open-web sources continue to elevate the value of curated, traceable provenance — Nightshade and related work demonstrate why adversarial checks and closed-loop collection matter.[20] [21] Luel's closed-loop collection model is a structural answer to that surface, but adversarial integrity remains an ongoing engineering investment, not a solved problem.

What we're watching

  • First named frontier-lab logo — explicit references to delivered, in-production datasets with measurable model-quality lift.
  • Catalog SKU count and re-license velocity — the inventory side of the business is where margins compound once collection costs are amortized.
  • EU AI Act and US AG enforcement signals through 2026 — every new procurement requirement is direct tailwind.
  • Hiring around contributor ops, QA tooling, and legal — the bottleneck shifts to operational scale once the marketplace flywheel turns.

References

  [1] Y Combinator: Luel company profile
  [2] Epoch AI: Will we run out of data?
  [3] Reuters: New York Times sues OpenAI and Microsoft over copyright
  [4] The Verge: OpenAI reportedly used YouTube video transcriptions for training
  [5] Google Robotics: RT-1 Robotics Transformer (real-world data collection)
  [6] Latent Space: RLHF 201 (Nathan Lambert)
  [7] a16z Policy: AI, copyright, and fair use (submission)
  [8] Council of the EU: AI Act final approval (press release)
  [9] MarketsandMarkets: AI Training Dataset Market (press release)
  [10] Allied Market Research: Data Collection and Labelling Market
  [11] Reuters: U.S. Labor Department investigating Scale AI
  [12] Reuters: Scale AI raises $1B at ~$13.8B valuation
  [13] Reuters: Surge AI seeks up to $1B raise at >$15B valuation
  [14] TechCrunch: DefinedCrowd raises $50.5M Series B
  [15] TechCrunch: DefinedCrowd raises additional $15M
  [16] VentureBeat: Sapien raises $5M seed (Train2Earn)
  [17] Reuters: OpenAI signs deal with Stack Overflow
  [18] Reuters: Google reaches content licensing deal with Reddit
  [19] Axel Springer: Partnership with OpenAI (press)
  [20] arXiv: Nightshade: Prompt-specific poisoning attacks
  [21] MIT Technology Review: Artists use Nightshade to poison AI models