
Unsiloed AI
API for parsing multimodal unstructured data into LLM-ready JSON and Markdown.
$24k
MRR
Profitable at this scale · 14 paying customers
1M+
Pages / week
Across Fortune 150 banks + NASDAQ-listed cos.
88.0
#1 on olmOCR-Bench
Top strict pass rate of 19 systems · May 2026
Thesis
- 01
Deterministic, auditable parsing is mandatory for regulated AI. Banks, insurers, legal, and healthcare teams need reproducible, explainable outputs with layout preservation and confidence scoring. LLM-only pipelines are probabilistic, drift with model updates, and are hard to audit. Unsiloed produces deterministic, schema-aligned, citation-backed outputs — and the accuracy claim is now benchmarked: Parser v3.1 posted the top strict pass rate (88.0) on olmOCR-Bench in May 2026, ahead of GPT-5.5 (84.6), Reducto (66.0), and Unstructured (39.9).[1][20]
- 02
Vision-first beats LLM-first on speed and cost. Vision models parallelize on GPUs; 7B-scale LLMs in the extraction loop drive higher latency, nondeterminism, and per-page token cost. At ~$0.01/page, Unsiloed undercuts Reducto, whose per-page pricing is roughly 2× higher.[15]
- 03
A proprietary corpus becomes a compounding moat. 1M+ real multimodal documents and domain ontologies for finance, legal, and healthcare. Customer-specific post-training on low-confidence fields and corrections feeds back into the model — a reinforcement loop that compounds over time.
- 04
The team is the canonical technical pair for this stack. Aman (ultra low-latency trading in C++/Rust; AI copilots for Goldman / Schwab) and Adnan (MIT; multimodal models at a Fortune 10; autonomous perception at Mercedes). Both IIT Kharagpur. Already shipping into Fortune 150 banks and NASDAQ-listed enterprises.[1]
Problem
AI teams spend 6+ months building document workflows. Fewer than 10% reach production.
Generic LLM parsers and OCR collapse on multimodal documents that contain text, tables, images, and charts. Poor parsing and suboptimal chunking cripple RAG pipelines and downstream automation. The proof-of-concept demo passes; the production rollout doesn't.[1]
Financial, insurance, legal, and healthcare documents are not text-only. They frequently contain charts, infographics, styled text, footnotes, merged cells, multi-page tables, color-encoded semantics, and irregular multi-column layouts. These structures carry meaning that generic LLM parsers routinely miss, conflate, or hallucinate.
More importantly, LLMs cannot parse these multimodal elements deterministically — making them unsuitable for high-stakes, auditable extraction. A bank's reconciliation pipeline can't tolerate non-determinism; an insurance claim adjudicator can't accept "the parser sometimes flips merged cells."
80%
Of enterprise data
Is unstructured · only a fraction is analyzed
<10%
Of doc workflows
Reach production after 6+ months of build
1M+
Documents in corpus
Proprietary, multimodal, domain-tuned
Why Now
AI document extraction is an industry, not a market.
Like AI code (Codex / Cursor / Replit) and AI legal (Harvey / Eve / Crosby), document extraction will support multiple horizontal infra players and vertical specialists. Unsiloed is positioned as the horizontal ingestion layer for regulated, multimodal workflows.
Industries, not markets. AI categories like code, legal, and document extraction are not winner-take-all — they support multiple horizontal infrastructure players and vertical specialists.
Anish Acharya[19]
General Partner · Andreessen Horowitz
The unstructured-to-AI layer is becoming core infra.
The data problem is enormous. 80% of enterprise data is unstructured; only a fraction is analyzed. Turning documents into AI-ready data is becoming as fundamental as databases were in the last era.[2]
IDP is growing at 26%+ CAGR. Fortune Business Insights pegs Intelligent Document Processing at $10.6B in 2025, growing to $91B by 2034 — with North America holding ~48% share and banking and financial services leading adoption.[3]
Capital has validated the category — fast. Reducto closed a $75M Series B led by a16z in October 2025 ($108M total), reporting 6× volume growth in five months and close to a billion pages processed monthly, with customers like Harvey, Rogo, and Scale AI.[6][24] LandingAI raised a Series B in September 2025 ($57M total) behind Andrew Ng's Agentic Document Extraction.[26] LlamaParse has now processed 1B+ documents for 300k+ users.[32] When a16z, Benchmark, and Menlo all fund the same layer within 18 months, the layer is real.
The teams that bet on LLM-only extraction in 2024 are circling back. Document AI spend is rising as enterprises move from pilots to production RAG and agents, and discover that they need the deterministic infra layer underneath.[4]
Industries, not markets. AI document extraction will support multiple horizontal infrastructure players and vertical specialists — just like AI code and AI legal.
Product & Technology
Segment visually → preserve structure → decode deterministically.
The full pipeline is engineered around the principle that LLMs should not be in the extraction loop for regulated workflows.
Multimodal strengths where generic OCR + LLM parsers fail.
Charts and infographics. Unsiloed reads tables, charts, and infographics directly — extracting values from axes, legends, and series. Generic OCR collapses to raw text; generic LLM parsers hallucinate the numbers. Unsiloed treats the chart as the structured object it actually is.
Long-tail layouts. A proprietary corpus of 1M+ real multimodal documents and domain ontologies for finance, legal, and healthcare enables higher fidelity on long-tail structures — nested tables, multi-page figures, format-encoded semantics — that generic models consistently miss.
Synthetic post-training. The team also post-trains on synthetically generated multimodal datasets that mimic rare layouts, edge cases, and domain-specific templates — expanding coverage where real-world labeled data is sparse.
Forward compatibility. The architecture is model-agnostic. It can incorporate emerging OCR-free vision-RAG (ColPali) and VLM components as they mature — without abandoning the deterministic decoding and confidence scaffolding that probabilistic LLM-only stacks fundamentally lack.[14]
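To make "deterministic decoding and confidence scaffolding" concrete, here is a minimal sketch of what a citation-backed, confidence-scored output record could look like. This is an illustration only: `ExtractedField`, `output_fingerprint`, `needs_review`, the 0.9 threshold, and the sample values are all hypothetical and are not Unsiloed's actual schema or API.

```python
from dataclasses import asdict, dataclass
import hashlib
import json


@dataclass(frozen=True)
class ExtractedField:
    """One extracted value plus the evidence needed to audit it (hypothetical schema)."""
    name: str
    value: str
    confidence: float                        # 0.0-1.0, assumed to come from the decoder
    page: int                                # citation: page the value was read from
    bbox: tuple[float, float, float, float]  # citation: region on that page


def output_fingerprint(fields: list[ExtractedField]) -> str:
    """Hash a canonical serialization so two runs can be compared exactly."""
    ordered = sorted(fields, key=lambda f: f.name)
    canonical = json.dumps([asdict(f) for f in ordered], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_review(fields: list[ExtractedField], threshold: float = 0.9) -> list[ExtractedField]:
    """Return the fields whose confidence falls below the review threshold."""
    return [f for f in fields if f.confidence < threshold]


# Two runs over the same document that emit the same fields in a different order.
run_a = [
    ExtractedField("total_due", "1,204.50", 0.99, 3, (72.0, 540.0, 180.0, 556.0)),
    ExtractedField("due_date", "2026-06-30", 0.71, 3, (72.0, 560.0, 180.0, 576.0)),
]
run_b = list(reversed(run_a))

# An audit check: identical content must yield an identical fingerprint.
assert output_fingerprint(run_a) == output_fingerprint(run_b)

# Low-confidence fields are the ones routed to human correction.
print([f.name for f in needs_review(run_a)])  # ['due_date']
```

The fingerprint is what an auditor would compare across reruns, and the low-confidence list is the kind of signal the memo describes feeding back into customer-specific post-training.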
olmOCR-Bench — strict pass rate, May 2026
Chart
Unsiloed Parser v3.1 leads 19 systems at 88.0 across 1,403 PDFs and 8,413 unit tests (olmocr==0.4.27 scorer) — ahead of frontier VLMs (GPT-5.5: 84.6, Claude Opus 4.7: 81.9, Gemini 3 Pro: 77.7) and funded direct competitors (LlamaParse: 73.5, Reducto: 66.0, Extend: 64.0). Re-scoring failures with an LLM-as-judge lifts Unsiloed to 94.8. Caveat: this is a vendor-run evaluation, though the scorer is deterministic and reproducible.[20]
Source · Unsiloed AI olmOCR-Bench publication, May 2026 [20] · Ai2 olmOCR-Bench [21]
A benchmark the field actually competes on.
olmOCR-Bench (from Ai2) has become the de facto public scoreboard for document parsing — Datalab, LlamaIndex, and the model labs all publish against it.[21][22][23] Unsiloed's weakest sub-category is old scans (52.9); its strongest are exactly the ones regulated buyers care about: tables (93.2), multi-column layouts (87.9), and headers/footers (94.6).[20]
The result that matters most for the thesis: two venture-funded direct competitors, Reducto ($108M raised) and Extend ($17M raised), scored 66.0 and 64.0 on the same run. Even discounting for vendor selection effects, a 20+ point gap on a deterministic scorer is not noise. It is the kind of gap that wins bake-offs.[20][24]
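The point gaps cited above can be recomputed directly from the reported strict pass rates. The sketch below uses only the seven scores quoted in this memo (a subset of the 19 systems in the run); the variable names are illustrative.

```python
# Strict pass rates as reported in this memo (olmOCR-Bench, May 2026).
scores = {
    "Unsiloed Parser v3.1": 88.0,
    "GPT-5.5": 84.6,
    "Claude Opus 4.7": 81.9,
    "Gemini 3 Pro": 77.7,
    "LlamaParse": 73.5,
    "Reducto": 66.0,
    "Extend": 64.0,
}

# The leader is whichever system has the highest reported score.
leader = max(scores, key=scores.get)

# Gap in points between the leader and every other system, rounded to one decimal.
gaps = {
    name: round(scores[leader] - score, 1)
    for name, score in scores.items()
    if name != leader
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{leader} leads {name} by {gap:.1f} points")
```

On these inputs the lead over the frontier VLMs is single digits to about ten points, while the lead over Reducto and Extend is 22.0 and 24.0 points, which is the "20+ point gap" referred to above.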
Traction
Already shipping into the buyers most others can't access.
$24k
MRR (~$300k ARR)
Profitable at this scale
14
Paying customers
Fortune 150 bank · NASDAQ-listed cos. · 10+ YC startups
100%
Daily API use
Every paying customer uses the API every day
Pipeline depth that unlocks 6- to 7-figure ACVs.
Volume. More than 1M pages processed weekly. The API is in the hot path for production reconciliation, document review, and RAG ingestion at Fortune 150 banks and NASDAQ-listed enterprises.[1]
Pipeline. 120+ companies in pipeline. 15 ongoing pilots, including Rippling and a large public tech company. Single bottoms-up logos in finance and legal land at tens to hundreds of thousands of dollars per account; Fortune 500 deployments scale into 7-figure ACVs across business units.
Signed enterprise LOI. $500k LOI with a global bank. This is the early enterprise signal that the deterministic, auditable, air-gapped product positioning lands with the buyers Reducto and LlamaParse are chasing.[1]
Momentum since the memo was written. The company closed a $500K seed in September 2025[29], shipped Parser v3.1 to the #1 strict pass rate on olmOCR-Bench in May 2026[20], published an April 2026 head-to-head parser comparison that doubles as developer-facing GTM[33], and added a native Claude integration for parsing, extraction, classification, and splitting inside Claude document workflows.[29] All of this on four people and half a million dollars.
Market
Unstructured-to-AI is the next core infrastructure layer.
80–90% of enterprise data is unstructured, growing ~3× faster than structured data — and only a fraction is analyzed. Turning documents into AI-ready data is becoming as fundamental as databases were in the last era.[2][31] Every production RAG system, every vertical AI agent, every regulated automation pipeline needs the ingestion layer underneath.
IDP: $10.6B (2025) → $91B (2034) at a 26.2% CAGR, per Fortune Business Insights — with North America holding ~48% share and banking and financial services leading adoption, followed by healthcare and legal.[3] Document AI alone is roughly $12–13B in 2024 → ~$27B by 2030 as enterprises move from pilots to production.[4]
Intelligent Document Processing market projection
Chart
Published estimates: $10.57B (2025), $14.16B (2026), $91.02B (2034) at a 26.2% CAGR; intermediate years interpolated at the report's CAGR. North America held 47.6% share in 2025.[3]
Source · Fortune Business Insights, Intelligent Document Processing Market [3]
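The interpolation behind the chart is simple compounding, and the arithmetic can be checked with a short sketch. The three published values and the 26.2% rate are taken from the caption above; the helper `cagr` and the variable names are illustrative, and the choice to compound forward from 2026 is an assumption about how the intermediate years were filled in.

```python
# Published estimates from the report, in billions of USD.
published = {2025: 10.57, 2026: 14.16, 2034: 91.02}


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Growth rate implied by the 2026 and 2034 endpoints.
implied = cagr(published[2026], published[2034], 2034 - 2026)
print(f"Implied CAGR 2026-2034: {implied:.1%}")

# Interpolate each year by compounding forward from 2026 at the stated rate.
stated_rate = 0.262
projection = {
    year: published[2026] * (1 + stated_rate) ** (year - 2026)
    for year in range(2026, 2035)
}

for year, value in projection.items():
    print(f"{year}: ${value:.2f}B")
```

Compounding $14.16B at 26.2% for eight years lands within about 0.1% of the published $91.02B, so the stated CAGR appears to describe the 2026 to 2034 span. The 2025 to 2026 step ($10.57B to $14.16B) is a steeper one-year jump and does not sit on the same curve.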
Competitive landscape
Four categories of competition. Unsiloed wins on determinism, throughput, and air-gapped deployment.
The market splits into LLM-centric specialists, OSS / DIY toolkits, hyperscaler APIs, and — increasingly — frontier VLMs used directly. As of the May 2026 olmOCR-Bench run, Unsiloed outscores all four.
Document-AI parsing — total capital raised
Chart
Reducto $108M[24] · Unstructured $65M[25] · LandingAI $57M[26] · LlamaIndex $27.5M[27] · Extend $17M[9] · Tensorlake $8M[28] · Unsiloed $0.5M[29]. The asymmetry cuts both ways: rivals can outspend Unsiloed on GTM, but Unsiloed topping the category benchmark on roughly 1/200th of Reducto's capital ($0.5M vs. $108M) is the capital-efficiency signal we underwrite at pre-seed.
Source · Company announcements, Tracxn, PitchBook — see refs [9] [24] [25] [26] [27] [28] [29]
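For readers who want the capital asymmetry as numbers, the sketch below expresses each company's total raised as a multiple of Unsiloed's $0.5M. The figures are the ones listed in the chart caption; the variable names are illustrative.

```python
# Total capital raised, in millions of USD, as listed in the chart caption.
raised_musd = {
    "Reducto": 108.0,
    "Unstructured": 65.0,
    "LandingAI": 57.0,
    "LlamaIndex": 27.5,
    "Extend": 17.0,
    "Tensorlake": 8.0,
    "Unsiloed": 0.5,
}

baseline = raised_musd["Unsiloed"]

# How many times Unsiloed's capital each rival has raised.
multiples = {
    name: amount / baseline
    for name, amount in raised_musd.items()
    if name != "Unsiloed"
}

for name, multiple in sorted(multiples.items(), key=lambda item: -item[1]):
    print(f"{name}: {multiple:.0f}x Unsiloed's capital")
```

Reducto comes out at 216x, which is where the "1/200th" shorthand comes from; even the smallest rival on the list, Tensorlake, has raised 16x as much.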
Our APIs are already parsing hundreds of thousands of documents for startups and NASDAQ-listed enterprises, powering vertical AI solutions across industries.
Strategic advantages
Moat
- Deterministic + confidence-scored. Matches the audit and governance posture regulated buyers actually require.
- Benchmark leadership. #1 strict pass rate on olmOCR-Bench (88.0, May 2026) — 22 points clear of Reducto on the same deterministic scorer.[20]
- Vision-first cost / throughput. ~$0.01/page, against roughly 2× that at Reducto; the unit economics compound with volume.[15]
- Proprietary corpus + domain decoders. 1M+ real multimodal documents and finance / legal / healthcare ontologies create a data moat that compounds with every customer.
- Air-gapped on-prem. The deployment posture that unlocks BFSI procurement — one that Reducto and LlamaParse don't lead with.
Founder deep dive
The canonical technical pair for vision-first document AI.
Founders
Risks & mitigations
What we're watching
References
- [1] YC Launch — Unsiloed AI: Make Unstructured Data LLM-Ready
- [2] Data Dynamics — Unstructured Data: The Blind Spot CISOs and CIOs Must Solve
- [3] Fortune Business Insights — Intelligent Document Processing Market
- [4] MarketsandMarkets — Document AI Market
- [5] SiliconANGLE — Unstructured raises $40M to make raw data LLM-ready
- [6] PR Newswire — Reducto raises $75M Series B
- [7] Reducto — Compare: Reducto vs Google Document AI
- [8] Reducto — Compare: Reducto vs LlamaParse
- [9] Extend — Raises $17M to build the document processing cloud
- [10] Y Combinator — Extend company profile
- [11] LlamaIndex — LlamaCloud / LlamaParse
- [12] AIM Media House — LlamaIndex is building AI agents that understand your data
- [13] IBM — Docling's rise: the IBM toolkit turning unstructured documents into LLM-ready data
- [14] Microsoft Azure — Introduction to OCR-free Vision RAG using ColPali
- [15] Mindee — LLM vs. OCR API: Cost comparison for document processing in 2025
- [16] Google Cloud — Document AI overview
- [17] AWS — Textract product documentation
- [18] Microsoft — Azure AI Document Intelligence (Form Recognizer) overview
- [19] Anish Acharya (a16z) — Industries, Not Markets (X)
- [20] Unsiloed AI — Unsiloed Achieves #1 Rank on olmOCR-Bench (May 2026)
- [21] Ai2 — olmOCR 2: Unit test rewards for document OCR
- [22] Datalab — Saturating the olmOCR Benchmark
- [23] LlamaIndex — olmOCR-Bench Review: Insights and Pitfalls on an OCR Benchmark
- [24] Reducto — Reducto Announces $108M in Funding to Define the Future of AI Document Intelligence
- [25] The SaaS News — Unstructured Raises $40 Million in Series B ($65M total raised)
- [26] Tracxn — LandingAI company profile ($57M raised; Series B, Sep 2025)
- [27] PR Newswire — LlamaIndex Secures $19M Series A ($27.5M total raised)
- [28] StartupHub.ai — Tensorlake company profile ($8M raised)
- [29] Tracxn — Unsiloed AI company profile ($500K seed, Sep 2025)
- [30] Launch YC — Chunkr: Open Source Document Parsing You Can Own (since pivoted to floatingpoint)
- [31] CIO Dive — Most unstructured enterprise data is siloed, report finds (Box / IDC)
- [32] LlamaIndex — LlamaParse: AI document parsing (1B+ documents processed, 300k+ users)
- [33] Unsiloed AI — What's the Best PDF Parser for RAG Pipelines? (April 2026)



