Orange Collective
Unsiloed AI

API for parsing multimodal unstructured data into LLM-ready JSON and Markdown.

Unsiloed AI — Vision-First Document Extraction[1]

$24k

MRR

Profitable at this scale · 14 paying customers

1M+

Pages / week

Across Fortune 150 banks + NASDAQ-listed cos.

88.0

#1 on olmOCR-Bench

Top strict pass rate of 19 systems · May 2026

Thesis

Unsiloed AI is an API that converts complex multimodal documents (PDFs, scans, slides, charts) into structured, LLM-ready data with a vision-first pipeline — not an LLM-first one — built for speed, determinism, and enterprise deployment (incl. air-gapped on-prem).[1] The market is shaping up like AI code (Codex / Cursor / Replit) and AI legal (Harvey / Eve / Crosby) — an industry, not a winner-take-all market. Unsiloed is the horizontal ingestion layer for regulated, multimodal workflows.[19]
  1. Deterministic, auditable parsing is mandatory for regulated AI. Banks, insurers, legal, and healthcare teams need reproducible, explainable outputs with layout preservation and confidence scoring. LLM-only pipelines are probabilistic, drift with model updates, and are hard to audit. Unsiloed produces deterministic, schema-aligned, citation-backed outputs, and the accuracy claim is now benchmarked: Parser v3.1 posted the top strict pass rate (88.0) on olmOCR-Bench in May 2026, ahead of GPT-5.5 (84.6), Reducto (66.0), and Unstructured (39.9).[1][20]

  2. Vision-first beats LLM-first on speed and cost. Vision models parallelize on GPUs; 7B-scale LLMs in the extraction loop drive higher latency, nondeterminism, and per-page token cost. At ~$0.01/page, Unsiloed undercuts Reducto's per-page pricing by roughly 2×.[15]

  3. A proprietary corpus becomes a compounding moat. The corpus spans 1M+ real multimodal documents and domain ontologies for finance, legal, and healthcare. Customer-specific post-training on low-confidence fields and corrections feeds back into the model, a reinforcement loop that compounds over time.

  4. The team is the canonical technical pair for this stack. Aman (ultra low-latency trading in C++/Rust; AI copilots for Goldman / Schwab) and Adnan (MIT; multimodal models at a Fortune 10; autonomous perception at Mercedes) are both IIT Kharagpur alumni. They are already shipping into Fortune 150 banks and NASDAQ-listed enterprises.[1]

Problem

AI teams spend 6+ months building document workflows. Fewer than 10% reach production.

Generic LLM parsers and OCR collapse on multimodal documents that contain text, tables, images, and charts. Poor parsing and suboptimal chunking cripple RAG pipelines and downstream automation. The proof-of-concept demo passes; the production rollout doesn't.[1]

Financial, insurance, legal, and healthcare documents are not text-only. They frequently contain charts, infographics, styled text, footnotes, merged cells, multi-page tables, color-encoded semantics, and irregular multi-column layouts. These structures carry meaning that generic LLM parsers routinely miss, conflate, or hallucinate.

More importantly, LLMs cannot parse these multimodal elements deterministically — making them unsuitable for high-stakes, auditable extraction. A bank's reconciliation pipeline can't tolerate non-determinism; an insurance claim adjudicator can't accept "the parser sometimes flips merged cells."

80%

Of enterprise data

Is unstructured · only a fraction is analyzed

<10%

Of doc workflows

Reach production after 6+ months of build

1M+

Documents in corpus

Proprietary, multimodal, domain-tuned

Why Now

AI document extraction is an industry, not a market.

Like AI code (Codex / Cursor / Replit) and AI legal (Harvey / Eve / Crosby), document extraction will support multiple horizontal infra players and vertical specialists. Unsiloed is positioned as the horizontal ingestion layer for regulated, multimodal workflows.

Industries, not markets. AI categories like code, legal, and document extraction are not winner-take-all — they support multiple horizontal infrastructure players and vertical specialists.

Anish Acharya[19]

General Partner · Andreessen Horowitz

The unstructured-to-AI layer is becoming core infra.

The data problem is enormous. 80% of enterprise data is unstructured; only a fraction is analyzed. Turning documents into AI-ready data is becoming as fundamental as databases were in the last era.[2]

IDP is growing at 26%+ CAGR. Fortune Business Insights pegs Intelligent Document Processing at $10.6B in 2025, growing to $91B by 2034 — with North America holding ~48% share and banking and financial services leading adoption.[3]

Capital has validated the category — fast. Reducto closed a $75M Series B led by a16z in October 2025 ($108M total), reporting 6× volume growth in five months and close to a billion pages processed monthly, with customers like Harvey, Rogo, and Scale AI.[6][24] LandingAI raised a Series B in September 2025 ($57M total) behind Andrew Ng's Agentic Document Extraction.[26] LlamaParse has now processed 1B+ documents for 300k+ users.[32] When a16z, Benchmark, and Menlo all fund the same layer within 18 months, the layer is real.

The teams that bet on LLM-only extraction in 2024 are circling back. Document AI spend is rising as enterprises move from pilots to production RAG and agents — and discovering they need the deterministic infra layer underneath.[4]

Industries, not markets. AI document extraction will support multiple horizontal infrastructure players and vertical specialists — just like AI code and AI legal.
Anish Acharya, General Partner · a16z[19]

Product & Technology

Segment visually → preserve structure → decode deterministically.

The full pipeline is engineered around the principle that LLMs should not be in the extraction loop for regulated workflows.

Layer 01

Region segmentation & layout

Specialized vision models detect text blocks, multi-page tables, figures, and charts. Heatmap-based chunking keeps semantically related content together across page breaks — preserving meaning the way a human reader would.

Layer 02

Dual-stream representation

Parallel streams preserve (a) textual content and (b) layout / format (hierarchy, indentation, alignment). This captures cues such as right-aligned subtotals and merged cells that are critical to finance and legal extraction.

Layer 03

Domain-tuned decoders with RL

Domain-specific decoders, trained with RL pipelines, output normalized JSON and Markdown with citations and per-field confidence scores. The output is designed for downstream RAG and agent pipelines that need verifiable, schema-aligned results.

Layer 04

Deployment & security

Cloud API or fully air-gapped on-prem, with a SOC 2-aligned posture. No human-in-the-loop on the vendor side; confidence-gated human review on the customer side provides the audit trail regulated buyers require.
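
The layers above describe schema-aligned output with citations, per-field confidence scores, and confidence-gated review on the customer side. A minimal Python sketch of that gating step follows. The field layout (`ExtractedField`, `page`, `bbox`), the `route_fields` helper, and the 0.90 threshold are hypothetical illustrations, not Unsiloed's actual API or schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ExtractedField:
    """One extracted value plus the evidence needed to audit it (hypothetical schema)."""
    name: str
    value: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    page: int  # citation: page the value was read from
    bbox: Tuple[float, float, float, float]  # citation: region on that page


def route_fields(
    fields: List[ExtractedField], threshold: float = 0.90
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split fields into auto-accepted and human-review queues by confidence."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Illustrative values only; not real extraction output.
sample = [
    ExtractedField("invoice_total", "1,204.50", 0.98, 3, (0.62, 0.81, 0.90, 0.84)),
    ExtractedField("due_date", "2026-06-30", 0.71, 1, (0.10, 0.12, 0.35, 0.15)),
]
accepted, needs_review = route_fields(sample)
print([f.name for f in accepted])      # fields at or above the threshold
print([f.name for f in needs_review])  # fields queued for a human reviewer
```

Because the routing is a pure function of the stored confidence, the same input always yields the same review queue, which is the property an audit trail depends on.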

Multimodal strengths where generic OCR + LLM parsers fail.

Charts and infographics. Unsiloed reads tables, charts, and infographics directly — extracting values from axes, legends, and series. Generic OCR collapses to raw text; generic LLM parsers hallucinate the numbers. Unsiloed treats the chart as the structured object it actually is.

Long-tail layouts. A proprietary corpus of 1M+ real multimodal documents and domain ontologies for finance, legal, and healthcare enables higher fidelity on long-tail structures — nested tables, multi-page figures, format-encoded semantics — that generic models consistently miss.

Synthetic post-training. The team also post-trains on synthetically generated multimodal datasets that mimic rare layouts, edge cases, and domain-specific templates — expanding coverage where real-world labeled data is sparse.

Forward compatibility. The architecture is model-agnostic. It can incorporate emerging OCR-free vision-RAG (ColPali) and VLM components as they mature — without abandoning the deterministic decoding and confidence scaffolding that probabilistic LLM-only stacks fundamentally lack.[14]

olmOCR-Bench — strict pass rate, May 2026

[Chart: strict pass rate by system on olmOCR-Bench]

Unsiloed Parser v3.1 leads 19 systems at 88.0 across 1,403 PDFs and 8,413 unit tests (olmocr==0.4.27 scorer) — ahead of frontier VLMs (GPT-5.5: 84.6, Claude Opus 4.7: 81.9, Gemini 3 Pro: 77.7) and funded direct competitors (LlamaParse: 73.5, Reducto: 66.0, Extend: 64.0). Re-scoring failures with an LLM-as-judge lifts Unsiloed to 94.8. Caveat: this is a vendor-run evaluation, though the scorer is deterministic and reproducible.[20]

Source · Unsiloed AI olmOCR-Bench publication, May 2026 [20] · Ai2 olmOCR-Bench [21]

A benchmark the field actually competes on.

olmOCR-Bench (from Ai2) has become the de facto public scoreboard for document parsing — Datalab, LlamaIndex, and the model labs all publish against it.[21][22][23] Unsiloed's weakest sub-category is old scans (52.9); its strongest are exactly the ones regulated buyers care about: tables (93.2), multi-column layouts (87.9), and headers/footers (94.6).[20]

The result that matters most for the thesis: the two best-funded direct competitors — Reducto ($108M raised) and Extend — scored 66.0 and 64.0 on the same run. Even discounting for vendor selection effects, a 20+ point gap on a deterministic scorer is not noise. It is the kind of gap that wins bake-offs.[20][24]
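
The gaps quoted in this section are simple differences between strict pass rates. The sketch below recomputes them from the scores cited in this memo. The dictionary and variable names are illustrative, and only the ten systems whose scores appear in the memo are included, not all 19.

```python
# Strict pass rates as quoted in this memo (May 2026 olmOCR-Bench run, vendor-reported).
SCORES = {
    "Unsiloed Parser v3.1": 88.0,
    "GPT-5.5": 84.6,
    "Datalab Marker": 83.2,
    "Claude Opus 4.7": 81.9,
    "Gemini 3 Pro": 77.7,
    "LlamaParse": 73.5,
    "LandingAI ADE": 69.5,
    "Reducto": 66.0,
    "Extend": 64.0,
    "Unstructured": 39.9,
}

LEADER = "Unsiloed Parser v3.1"

# Point gap between the leader and every other system, rounded to one decimal
# to avoid floating-point noise in the subtraction.
gaps = {
    name: round(SCORES[LEADER] - score, 1)
    for name, score in SCORES.items()
    if name != LEADER
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:<18} {gap:>5.1f} points behind")
```

On these inputs the two best-funded direct competitors sit 22.0 and 24.0 points back, while the closest frontier VLM is 3.4 points back.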

Traction

Already shipping into the buyers most others can't access.

$24k

MRR (~$300k ARR)

Profitable at this scale

14

Paying customers

Fortune 150 bank · NASDAQ-listed cos. · 10+ YC startups

100%

Daily API use

Every paying customer uses the API every day
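
The headline figures relate by simple arithmetic. As a rough check, assuming ARR is MRR annualized over 12 months and that revenue is spread evenly across customers (a simplification the memo does not state), the numbers work out as follows.

```python
MRR_USD = 24_000       # monthly recurring revenue quoted above
PAYING_CUSTOMERS = 14  # paying customers quoted above

arr_usd = MRR_USD * 12                             # annualized run rate
avg_mrr_per_customer = MRR_USD / PAYING_CUSTOMERS  # naive even split (assumption)

print(f"ARR: ${arr_usd:,}")  # ARR: $288,000
print(f"Average MRR per customer: ${avg_mrr_per_customer:,.0f}")
```

The annualized figure is $288,000, which the memo rounds to ~$300k.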

Pipeline depth that unlocks 6- to 7-figure ACVs.

Volume. 1M+ pages processed weekly. The API is in the hot path for production reconciliation, document review, and RAG ingestion at Fortune 150 banks and NASDAQ-listed enterprises.[1]

Pipeline. 120+ companies in pipeline and 15 ongoing pilots, including Rippling and a large public tech company. Single bottoms-up logos in finance and legal land at tens to hundreds of thousands of dollars per account; Fortune 500 deployments scale into 7-figure ACVs across business units.

Signed enterprise LOI. $500k LOI with a global bank. This is the early enterprise signal that the deterministic, auditable, air-gapped product positioning lands with the buyers Reducto and LlamaParse are chasing.[1]

Momentum since the memo was written. The company closed a $500K seed in September 2025[29], shipped Parser v3.1 to the #1 strict pass rate on olmOCR-Bench in May 2026[20], published an April 2026 head-to-head parser comparison that doubles as developer-facing GTM[33], and added a native Claude integration for parsing, extraction, classification, and splitting inside Claude document workflows.[29] All of this on four people and half a million dollars.

Market

Unstructured-to-AI is the next core infrastructure layer.

80–90% of enterprise data is unstructured, growing ~3× faster than structured data — and only a fraction is analyzed. Turning documents into AI-ready data is becoming as fundamental as databases were in the last era.[2][31] Every production RAG system, every vertical AI agent, every regulated automation pipeline needs the ingestion layer underneath.

IDP: $10.6B (2025) → $91B (2034) at a 26.2% CAGR, per Fortune Business Insights — with North America holding ~48% share and banking and financial services leading adoption, followed by healthcare and legal.[3] Document AI alone is roughly $12–13B in 2024 → ~$27B by 2030 as enterprises move from pilots to production.[4]

Intelligent Document Processing market projection

[Chart: projected Intelligent Document Processing market size by year]

Published estimates: $10.57B (2025), $14.16B (2026), $91.02B (2034) at a 26.2% CAGR; intermediate years interpolated at the report's CAGR. North America held 47.6% share in 2025.[3]

Source · Fortune Business Insights, Intelligent Document Processing Market [3]
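
The interpolation described in the caption can be reproduced with the standard compound-growth formula. The sketch below is an independent check on the three published figures, not the report's own methodology; `cagr` and `interpolate` are illustrative helper names. On these inputs, the quoted 26.2% rate matches the 2026 to 2034 span, while the 2025 to 2034 span implies a slightly higher rate.

```python
# Published market-size estimates in USD billions, as quoted in the caption.
PUBLISHED = {2025: 10.57, 2026: 14.16, 2034: 91.02}


def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


def interpolate(start_year: int, start_value: float, rate: float, end_year: int) -> dict:
    """Grow `start_value` at a constant annual `rate` through `end_year`."""
    return {
        year: start_value * (1 + rate) ** (year - start_year)
        for year in range(start_year, end_year + 1)
    }


cagr_2025_2034 = cagr(PUBLISHED[2025], PUBLISHED[2034], 2034 - 2025)
cagr_2026_2034 = cagr(PUBLISHED[2026], PUBLISHED[2034], 2034 - 2026)
print(f"Implied CAGR 2025-2034: {cagr_2025_2034:.1%}")
print(f"Implied CAGR 2026-2034: {cagr_2026_2034:.1%}")

# Intermediate years at the stated 26.2% rate, starting from the 2026 estimate.
series = interpolate(2026, PUBLISHED[2026], 0.262, 2034)
for year, value in series.items():
    print(year, f"${value:.2f}B")
```

Compounding the 2026 estimate at 26.2% lands within a fraction of a billion dollars of the published 2034 figure, so the interpolated years are consistent with the report's endpoints.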

Bottoms-up wedge — AI startups + data-heavy mid-market

Finance and legal teams adopt first, at tens to hundreds of thousands of dollars per account. YC startup customers integrate the API in days. The bottoms-up motion compounds into reference accounts that warm the top-down enterprise sales motion.

Top-down — Fortune 500 logos at 7-figure ACVs

When scaled across business units, single Fortune 500 logos reach 7-figure ACVs. The $500k LOI is the early signal.[1] Determinism, on-prem deployment, and confidence scoring are exactly the requirements that get an Unsiloed contract through procurement.

Competitive landscape

Four categories of competition. Unsiloed wins on determinism, throughput, and air-gapped deployment.

The market splits into LLM-centric specialists, OSS / DIY toolkits, hyperscaler APIs, and — increasingly — frontier VLMs used directly. As of the May 2026 olmOCR-Bench run, Unsiloed outscores all four.

Reducto

$108M · Series B (Oct '25)

Enterprise document intelligence with hybrid CV+VLM; ~1B pages/month; customers include Harvey, Rogo, Scale AI. a16z-led Series B. LLM-heavy extraction increases cost and latency, and pricing is premium: at ~$0.01/page, Unsiloed undercuts it by roughly 2×. Reducto scored 66.0 vs. Unsiloed's 88.0 on the May 2026 olmOCR-Bench run.[24]

Extend

$17M · Seed+A

Full-stack document processing cloud — sandbox UI, eval / annotation, fine-tuning workflows. 95%+ accuracy claims. Earlier-stage; breadth over deep financial verticalization; on-prem maturity TBD.[9]

Unstructured.io

$65M · Series B ($40M round)

GenAI data layer; broad connectors and format coverage. OSS adoption, gov / defense inroads. General-purpose accuracy on complex layouts lower than specialized stacks like Unsiloed.[5]

LlamaParse (LlamaIndex)

$27.5M · Series A

Parser integrated with LlamaIndex RAG / agents; 1B+ documents processed for 300k+ users; great DX, low cost, tight RAG integration. Generalist accuracy on edge cases lags specialized vendors (73.5 on olmOCR-Bench) — exactly the regulated finance / legal documents Unsiloed targets.[27]

LandingAI (ADE)

$57M · Series B (Sep '25)

Andrew Ng's Agentic Document Extraction — strong brand and distribution into financial services. Agentic, LLM-in-the-loop architecture; scored 69.5 on the May 2026 olmOCR-Bench run, well behind specialized deterministic parsers.[26]

Datalab (Marker) / Tensorlake / Chunkr

Long tail

Datalab's Marker is the strongest independent challenger (83.2 on olmOCR-Bench). Tensorlake ($8M seed) pitches reliability-focused parsing.[28] Chunkr, the open-source YC entrant, pivoted to post-training data (floatingpoint), an early signal that standalone OSS parsing is hard to build a venture business on.[30]

IBM Docling (OSS)

Linux Foundation

Open toolkit for local, layout-aware extraction. Free, runs locally, fast iteration, strong community. DIY integration and maintenance burden; limited domain-specific tuning and support.[13]

Hyperscalers — Google Document AI / AWS Textract / Azure Form Recognizer

Cloud APIs

Deep platform integration and global reach. General-purpose; struggle on complex multimodal finance / legal documents. Not specialized for regulated edge cases — and unable to ship air-gapped on-prem the way Unsiloed does.[7]

Document-AI parsing — total capital raised

[Chart: total capital raised by company]

Reducto $108M[24] · Unstructured $65M[25] · LandingAI $57M[26] · LlamaIndex $27.5M[27] · Extend $17M[9] · Tensorlake $8M[28] · Unsiloed $0.5M[29]. The asymmetry cuts both ways: rivals can outspend Unsiloed on GTM, but Unsiloed topping the category benchmark on 1/200th of Reducto's capital is the capital-efficiency signal we underwrite at pre-seed.

Source · Company announcements, Tracxn, PitchBook — see refs [9] [24] [25] [26] [27] [28] [29]
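
The "1/200th" comparison follows directly from the totals listed above. A quick check, using only the figures in this memo (variable names are illustrative):

```python
# Total capital raised in USD millions, as listed in the chart caption.
RAISED_USD_M = {
    "Reducto": 108.0,
    "Unstructured": 65.0,
    "LandingAI": 57.0,
    "LlamaIndex": 27.5,
    "Extend": 17.0,
    "Tensorlake": 8.0,
    "Unsiloed": 0.5,
}

# How many times Unsiloed's total each rival has raised.
multiples = {
    name: amount / RAISED_USD_M["Unsiloed"]
    for name, amount in RAISED_USD_M.items()
    if name != "Unsiloed"
}

for name, multiple in multiples.items():
    print(f"{name}: {multiple:.0f}x Unsiloed's capital")
```

Reducto's multiple comes out at 216x, so "1/200th" is a fair round figure.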

Our APIs are already parsing hundreds of thousands of documents for startups and NASDAQ-listed enterprises, powering vertical AI solutions across industries.
Unsiloed AI launch post[1]

Strategic advantages

Moat
  • Deterministic + confidence-scored. Matches the audit and governance posture regulated buyers actually require.
  • Benchmark leadership. #1 strict pass rate on olmOCR-Bench (88.0, May 2026) — 22 points clear of Reducto on the same deterministic scorer.[20]
  • Vision-first cost / throughput. ~$0.01/page vs. LLM-centric incumbents — unit economics compound with volume.[15]
  • Proprietary corpus + domain decoders. 1M+ real multimodal documents and finance / legal / healthcare ontologies create a data moat that compounds with every customer.
  • Air-gapped on-prem. The deployment posture that unlocks BFSI procurement — one that Reducto and LlamaParse don't lead with.

Founder deep dive

The canonical technical pair for vision-first document AI.

The shared foundation. Both Aman and Adnan are IIT Kharagpur alumni — one of the densest concentrations of systems and ML engineering talent in the world. They bring complementary skill sets to a problem that requires both extreme-performance systems engineering and applied multimodal ML research.

Aman: low-latency systems + AI copilots in regulated finance. After IIT Kharagpur, he started out building software systems for high-frequency trading at Teesta Investment: multi-threaded C++ and Rust optimized for ultra-low-latency execution, moving billions on crypto exchanges. He then became founding engineer (#1) at a stealth SF startup building AI copilots for institutions like Goldman Sachs and Charles Schwab, exactly the regulated-finance design constraints Unsiloed now serves. He has lived inside the requirements: deterministic, auditable, integrated with legacy compliance flows.

Adnan: multimodal ML at Fortune 10 scale + autonomous perception. IIT Kharagpur → MIT Masters. He built multi-modal models deployed at a Fortune 10 company, then worked on autonomous navigation systems at Mercedes-Benz R&D, building perception pipelines that must hold up under real-time, safety-critical constraints. This is the exact skill stack Unsiloed needs: vision-first models that are layout-aware, domain-tuned, and deployable in regulated environments.

The thesis they bring. Unsiloed's published positioning emphasizes that LLMs cannot deterministically parse multimodal documents — and that the right answer is specialized vision models combined with OCR-based models, dual-stream representation (data + layout), and domain-specific decoders trained with RL. This is not a wrapper company. It is a vision-model and infrastructure company, with founders whose careers were already pointed at this problem.[1]

Why now — and why them. Aman is publicly writing about vision models for enterprise documents (Forbes Business Council). Adnan is building the technical strategy for accuracy-sensitive deployments in finance, legal, and healthcare — including on-prem and air-gapped options that match the realities of BFSI procurement. Together they are the canonical pair for this exact category at this exact moment.

Founders

Aman Mishra

Co-founder & CEO

IIT Kharagpur (B.Tech, Industrial & Systems Engineering, CS minor). Previously built ultra low-latency C++/Rust trading systems moving billions at a hedge fund. Founding Engineer (#1) at an SF-based stealth AI copilot startup serving Goldman Sachs and Charles Schwab. Launched a P2P rental platform from his dorm room, scaling it to thousands of orders within 2 months. Forbes Business Council contributor on vision models for enterprise documents.

Adnan Abbas

Co-founder & CTO

IIT Kharagpur (B.Tech) → MIT (Masters). Built multi-modal models deployed at a Fortune 10 company. Worked on autonomous navigation systems at Mercedes-Benz R&D. Launched India's first Web 3.0 audio app while in college, scaling it to thousands of users within a month. Leads technical strategy for vision-first, layout-aware multimodal models, including on-prem / air-gapped enterprise options.

Risks & mitigations

Risk

Reducto's scale and GTM outspend in enterprise — $108M raised through its October 2025 Series B (a16z), ~1B pages/month, and marquee customers like Harvey, Rogo, and Scale AI.

Mitigation

Win bake-offs in finance and legal via deterministic accuracy, chart and table fidelity, and on-prem deployment; the May 2026 olmOCR-Bench gap (88.0 vs. 66.0) is the wedge. Leverage the cost and throughput edge for high-volume deals: at ~$0.01/page, Unsiloed undercuts Reducto's per-page pricing by roughly 2×.

Risk

Open-source commoditization from Unstructured.io and IBM Docling — DIY teams may settle for 'good enough' free tools.

Mitigation

Offer SLA'd, air-gapped enterprise deployments and maintain an accuracy lead on long-tail documents with proprietary data and domain ontologies (Unstructured scored 39.9 on the same olmOCR-Bench run). Chunkr's pivot away from OSS parsing suggests free tooling alone isn't holding the regulated segment. Reduce integration effort vs. DIY: Unsiloed gets teams to production in hours, not months.

Risk

Frontier VLMs are closing the gap — GPT-5.5 scored 84.6 and Claude Opus 4.7 scored 81.9 on olmOCR-Bench, within ~3–6 points of Unsiloed's 88.0. Datalab claims the benchmark is approaching saturation.

Mitigation

Keep the architecture model-agnostic and integrate emerging OCR-free vision-RAG and VLM components as they mature. The durable moat isn't the raw score — it's deterministic decoding, schema enforcement, per-field confidence, per-page cost at volume, and air-gapped deployment, none of which a frontier-model API call provides for regulated buyers.

Risk

Enterprise procurement friction — security reviews, compliance certifications, support SLAs, and global support coverage take months at Fortune 500 buyers.

Mitigation

Expand certifications (SOC 2, ISO 27001, HIPAA), build 24/7 support tiers and field engineering, and showcase on-prem success and references in BFSI and legal. Use the 15 active pilots and the $500k LOI to seed reference accounts inside reluctant procurement orgs.

What we're watching

  • Conversion of the $500k LOI with a global bank into a production contract — and the speed at which it expands across business units.
  • Whether the 15 ongoing pilots (including Rippling and a large public tech co.) convert at 6- to 7-figure ACVs.
  • Vertical expansion beyond finance — early signal of healthcare or legal logos at parity accuracy.
  • Reducto's response: does it cut pricing, deepen on-prem, or push deeper into a specific vertical?
  • Third-party replication of the May 2026 olmOCR-Bench result — an independent run (Ai2, Datalab, or a customer bake-off) would convert a vendor benchmark into a category fact.
  • Whether frontier VLM gains (GPT-5.5 at 84.6 and climbing) compress the specialized-parser premium faster than Unsiloed converts determinism, cost, and on-prem into contracted revenue.

References

  [1] YC Launch. Unsiloed AI: Make Unstructured Data LLM-Ready
  [2] Data Dynamics. Unstructured Data: The Blind Spot CISOs and CIOs Must Solve
  [3] Fortune Business Insights. Intelligent Document Processing Market
  [4] MarketsandMarkets. Document AI Market
  [5] SiliconANGLE. Unstructured raises $40M to make raw data LLM-ready
  [6] PR Newswire. Reducto raises $75M Series B
  [7] Reducto. Compare: Reducto vs Google Document AI
  [8] Reducto. Compare: Reducto vs LlamaParse
  [9] Extend. Raises $17M to build the document processing cloud
  [10] Y Combinator. Extend company profile
  [11] LlamaIndex. LlamaCloud / LlamaParse
  [12] AIM Media House. LlamaIndex is building AI agents that understand your data
  [13] IBM. Docling's rise: the IBM toolkit turning unstructured documents into LLM-ready data
  [14] Microsoft Azure. Introduction to OCR-free Vision RAG using ColPali
  [15] Mindee. LLM vs. OCR API: Cost comparison for document processing in 2025
  [16] Google Cloud. Document AI overview
  [17] AWS. Textract product documentation
  [18] Microsoft. Azure AI Document Intelligence (Form Recognizer) overview
  [19] Anish Acharya (a16z). Industries, Not Markets (X)
  [20] Unsiloed AI. Unsiloed Achieves #1 Rank on olmOCR-Bench (May 2026)
  [21] Ai2. olmOCR 2: Unit test rewards for document OCR
  [22] Datalab. Saturating the olmOCR Benchmark
  [23] LlamaIndex. olmOCR-Bench Review: Insights and Pitfalls on an OCR Benchmark
  [24] Reducto. Reducto Announces $108M in Funding to Define the Future of AI Document Intelligence
  [25] The SaaS News. Unstructured Raises $40 Million in Series B ($65M total raised)
  [26] Tracxn. LandingAI company profile ($57M raised; Series B, Sep 2025)
  [27] PR Newswire. LlamaIndex Secures $19M Series A ($27.5M total raised)
  [28] StartupHub.ai. Tensorlake company profile ($8M raised)
  [29] Tracxn. Unsiloed AI company profile ($500K seed, Sep 2025)
  [30] Launch YC. Chunkr: Open Source Document Parsing You Can Own (since pivoted to floatingpoint)
  [31] CIO Dive. Most unstructured enterprise data is siloed, report finds (Box / IDC)
  [32] LlamaIndex. LlamaParse: AI document parsing (1B+ documents processed, 300k+ users)
  [33] Unsiloed AI. What's the Best PDF Parser for RAG Pipelines? (April 2026)