
<200ms · Inference latency · ~10× faster than existing platforms
1,000+ · Developers · At Feb 2026 YC launch
$58B · Computer vision TAM · By 2030, Grand View Research
Thesis
- 01. Real-time vision is a whole-stack constraint, not a model toggle. Latency budgets are fragile and cumulative across transport, sampling, inference throughput, and output length. Production systems must explicitly manage these budgets. Overshoot's SDK surfaces the levers (sampling modes, output token caps, latency callbacks) and ties them to perceived latency.[4] [5]
- 02. VLM-first UX compresses iteration vs. bespoke CV. Traditional computer vision demanded labeling and training per use case. Vision-Language Models collapse most of that to a prompt with a schema. Overshoot's "prompt-as-program" interface aligns with how agentic, interactive applications are actually built today.[1] [2]
- 03. Hyperscalers validate the category but stop at model I/O. Gemini Live and OpenAI Realtime ship the model interface; developers still need transport orchestration, sampling policy, model routing, and reliability/observability to hit SLAs. Overshoot packages those missing layers specifically for vision.[1] [2] [3]
- 04. The Hjouji brothers are the canonical team for this stack. Zakaria built GPU kernels at Meta AI and low-latency surge pricing systems at Uber. Younes was a founding engineer at Cosmonio (acquired by Intel) where he built a CV training and serving platform from scratch. Inference engines, low-latency systems, and applied computer vision, assembled in one cofounding pair.
Problem
AI can now see and understand the physical world. Building on top of it is still painful.
Vision-Language Models have unlocked real applications in physical security, safety, gaming, robotics, and consumer products. Soon, video agents will watch your home and your pet when you're away. The model side of that future is already here.
The infrastructure side is not. Developers building real-time vision applications face three compounding problems: slow inference, limited model availability, and infrastructure that breaks at scale.[13] Existing inference platforms were designed around text — they treat image and video as awkward attachments rather than first-class modalities, and they leave transport, sampling, and stream lifecycle entirely to the developer.
The result: every team building a video agent ends up re-inventing the same stack from WebRTC ingest down through sampling policy and output budgeting. The work isn't novel and it isn't differentiated. It's plumbing — and it's preventing the applications from ever shipping.
<200ms · Overshoot end-to-end · Live video → VLM response
10× · Latency improvement · Vs. existing inference platforms
3 lines of code · To connect live video to a VLM
Why Now
The model layer just got real. The infra layer hasn't caught up.
Three converging shifts make the whole-stack vision problem solvable for the first time — and create a discrete window before hyperscalers extend their APIs downward.
The base models work. The transport doesn't.
Hyperscalers standardized low-latency model I/O. Gemini Live[1] and OpenAI Realtime[2] shipped streaming multimodal interfaces in the last 12 months. The model boundary is now a solved problem with documented latency targets.
WebRTC is the de facto transport. WebRTC[4] and LiveKit[3] have hardened into the default real-time media stack. Battle-tested SFUs, agent frameworks,[14] and reconnect semantics now exist, but they're general-purpose, not vision-specific.
The layer in between is still missing. What hasn't been built is the piece that takes a live feed, samples it intelligently against a latency budget, routes it to the right VLM, enforces a schema on the output, and survives jitter. That's the gap Overshoot fills.[5] [13]
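To make "a latency budget" concrete, here is a minimal sketch of how per-stage latencies add up against a 200ms target, and how output length eats into it. The stage names follow the thesis (transport, sampling, inference, output length); every number and every function name below is an illustrative assumption, not a measurement or an API from Overshoot.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    name: str
    ms: float


def decode_ms(output_tokens: int, ms_per_token: float) -> float:
    """Decode time grows linearly with the number of output tokens."""
    return output_tokens * ms_per_token


def check_budget(stages: list[StageLatency], budget_ms: float) -> tuple[float, float]:
    """Return (total latency, remaining headroom) for one request."""
    total = sum(stage.ms for stage in stages)
    return total, budget_ms - total


# Illustrative numbers only; none are measurements of any real system.
stages = [
    StageLatency("transport", 40.0),
    StageLatency("sampling", 20.0),
    StageLatency("prefill", 60.0),
    StageLatency("decode", decode_ms(output_tokens=20, ms_per_token=3.0)),
]
total, headroom = check_budget(stages, budget_ms=200.0)
print(f"total={total:.0f}ms headroom={headroom:.0f}ms")
```

Under these assumed numbers, doubling the output cap from 20 to 40 tokens alone would push the request past the budget, which is why an output token cap is a latency lever and not only a cost lever.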
The window is dated — and it's open now.
Both ends of the stack went GA without the middle. OpenAI took its Realtime API to general availability on August 28, 2025 with gpt-realtime — adding image input alongside audio, but still no managed video pipeline.[16] Google made the Gemini Live API generally available on Vertex AI for continuous audio and video streams — again, at the model boundary only.[25] On the transport side, LiveKit closed a $100M Series C at a $1B valuation in January 2026, led by Index Ventures — institutional confirmation that real-time AI media infrastructure is a venture-scale layer.[24] The vision-specific orchestration layer between them remains unbuilt.
Inference economics turned in vision's favor. Equivalent-capability inference cost has fallen roughly 10× per year: GPT-3-class capability dropped from $60 per million tokens in late 2021 to $0.06 by late 2024 — 1,000× in three years — and GPT-4-class capability fell ~62× from its March 2023 launch price.[17] Epoch AI measured declines up to 40× per year for GPT-4-level performance on some task classes.[18] Continuous sampled video inference was cost-prohibitive at 2023 prices; at 2026 prices it's a line item. That's exactly when the layer that meters and budgets the spend becomes the natural buying point.
Chart: Cost of equivalent-capability inference is collapsing
$ per million tokens at fixed capability tiers, log scale. GPT-3-class: $60 (Nov 2021) → $0.06 (late 2024). GPT-4-class: $30 at launch (Mar 2023) → ~$0.50 (late 2024, ~62× cheaper). Sampled VLM calls on live video inherit the same curve.[17][18]
Source · a16z LLMflation analysis · Epoch AI inference price trends
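The arithmetic behind those figures can be checked in a few lines. This sketch uses only the prices quoted above; the helper names are hypothetical. With the rounded $0.50 end price the GPT-4-class decline comes out at 60×, so the ~62× figure presumably reflects an unrounded price in the underlying analysis.

```python
def fold_decline(start_price: float, end_price: float) -> float:
    """How many times cheaper the end price is than the start price."""
    return start_price / end_price


def annualized_factor(total_fold: float, years: float) -> float:
    """Constant yearly decline factor that compounds to total_fold over `years`."""
    return total_fold ** (1.0 / years)


# Prices in $ per million tokens, as quoted in the text.
gpt3_fold = fold_decline(60.0, 0.06)
gpt3_per_year = annualized_factor(gpt3_fold, 3.0)
gpt4_fold = fold_decline(30.0, 0.50)

print(f"GPT-3-class: {gpt3_fold:.0f}x total, about {gpt3_per_year:.1f}x per year")
print(f"GPT-4-class: {gpt4_fold:.0f}x total (rounded end price)")
```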
How It Works
Three lines of code. Sub-200ms responses. Schema-checked output.
Latency is a first-class API primitive.
Sampled inference by design. Most real-time vision workloads are event-driven, not continuous. Overshoot exposes targetFps, clip length, clip delay, and interval_seconds as explicit parameters — so developers trade thoroughness for latency budget at the API surface rather than discovering the limits in production.[13]
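A rough sketch of the trade-off those parameters control. The field names echo the parameters named above, but the semantics assigned to them here (frames sampled within a clip, one inference call per interval) are assumptions for illustration, not the SDK's documented behavior.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    target_fps: float     # frames sampled per second within a clip (assumed meaning)
    clip_length_s: float  # seconds of video sent per inference call (assumed meaning)
    interval_s: float     # seconds between successive inference calls (assumed meaning)

    def frames_per_clip(self) -> int:
        # At least one frame is always sent, even for very short clips.
        return max(1, math.ceil(self.target_fps * self.clip_length_s))

    def calls_per_minute(self) -> float:
        return 60.0 / self.interval_s

    def frames_per_minute(self) -> float:
        return self.frames_per_clip() * self.calls_per_minute()


thorough = SamplingConfig(target_fps=10, clip_length_s=2.0, interval_s=1.0)
lean = SamplingConfig(target_fps=2, clip_length_s=1.0, interval_s=2.0)

for name, cfg in (("thorough", thorough), ("lean", lean)):
    print(name, cfg.frames_per_clip(), cfg.calls_per_minute(), cfg.frames_per_minute())
```

The point of the comparison is that frame volume, and with it inference work per minute, is set by a handful of explicit knobs that can be tuned before deployment.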
Model surface and routing. Overshoot hosts the largest collection of Vision-Language Models behind a single API, with a "gemini" passthrough backend when direct model access is preferred.[1] Schema enforcement supports structured outputs for downstream systems — no parsing, no half-formed JSON, no retries.[13]
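To illustrate what "schema-checked" output means for a downstream consumer, here is a minimal standard-library sketch. The schema, field names, and `parse_structured` helper are hypothetical; the source describes enforcement happening inside the platform, whereas this only shows the kind of check that enforcement makes unnecessary on the client.

```python
import json
from typing import Any

# Hypothetical schema: field name -> required Python type.
SCHEMA: dict[str, type] = {"person_present": bool, "count": int, "summary": str}


def parse_structured(raw: str, schema: dict[str, type]) -> dict[str, Any]:
    """Parse raw model output and reject anything that does not match the schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it where an int is required.
        if expected is int and isinstance(value, bool):
            raise ValueError(f"wrong type for {field}")
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for {field}")
    extra = set(data) - set(schema)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    return data


result = parse_structured(
    '{"person_present": true, "count": 2, "summary": "two people at the door"}', SCHEMA
)
print(result["count"])
```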
Reliability primitives. Stream lifecycle and reconnect semantics, observability hooks, and latency-aware callbacks keep developers inside their latency budgets even on imperfect networks — the "last mile" work that vision teams otherwise own end-to-end.[5]
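One common shape for reconnect semantics is exponential backoff with jitter. The sketch below shows that general technique, not Overshoot's implementation; `connect_with_retry` and `backoff_delays` are hypothetical helpers, and `connect` stands in for whatever call re-establishes the stream.

```python
import random
import time
from typing import Callable, Optional, Sequence


def backoff_delays(
    base_s: float, cap_s: float, attempts: int, rng: Optional[random.Random] = None
) -> list[float]:
    """Full-jitter backoff: delay i is uniform in [0, min(cap_s, base_s * 2**i)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** i)) for i in range(attempts)]


def connect_with_retry(
    connect: Callable[[], bool],
    delays: Sequence[float],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect() until it succeeds; return the number of attempts used."""
    for attempt, delay in enumerate(delays, start=1):
        if connect():
            return attempt
        sleep(delay)  # wait before the next attempt
    raise ConnectionError("stream did not reconnect within the retry budget")
```

Passing `sleep` as a parameter keeps the retry loop testable without real waiting, and capping the delay keeps worst-case reconnect time bounded, which matters when the stream itself has a latency budget.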
Zero infra headache. Developers connect live video feeds to VLMs with 3 lines of code and get responses in under 200ms, which the company puts at roughly 10× faster than existing inference platforms. The interface itself is a tell: it exposes the real production constraints, not the demo path.[13]
Market
Enterprises already spend tens of billions turning video into operational signals.
Video analytics software is on track from ~$12.7B (2024) to ~$37.8B (2030).[6] Video surveillance — hardware, software, and services — moves from ~$73.8B to ~$147.7B over the same period.[7] Computer vision overall: ~$19.8B to ~$58.3B.[8]
These markets are already monetized. What's changing is how the applications get built. The previous generation required custom models and bespoke deployments per camera per use case. VLMs collapse that to a prompt, but only if the infrastructure underneath can handle live streams. Overshoot is the developer-infrastructure layer that makes the next generation of these applications buildable in days instead of quarters.
Chart: Vision markets grow two- to threefold by 2030
Market size in $B, 2024 actual vs. 2030 projection. Video analytics: $12.7B → $37.8B.[6] Computer vision: $19.8B → $58.3B.[8] Video surveillance (hardware + software + services): $73.8B → $147.7B.[7] All three were built on the previous, bespoke-model generation of CV.
Source · Grand View Research market reports (2024)
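The growth multiples and implied annual growth rates follow directly from the endpoints quoted above. This is a quick check using only those figures; the implied CAGR is computed over the six years from 2024 to 2030 and may differ from the rates published in the underlying reports, which can use different base years.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by two endpoints."""
    return (end / start) ** (1.0 / years) - 1.0


# Market sizes in $B (2024, 2030), as quoted in the text.
markets = {
    "video analytics": (12.7, 37.8),
    "computer vision": (19.8, 58.3),
    "video surveillance": (73.8, 147.7),
}

for name, (start, end) in markets.items():
    print(f"{name}: {end / start:.2f}x, implied CAGR {cagr(start, end, 6):.1%}")
```

Video analytics and computer vision roughly triple over the period, while video surveillance, the largest of the three and the only one that includes hardware, roughly doubles.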
The right pricing shape already exists.
Streaming workloads are event/sampling-driven, not continuous 24/7, even when the camera is always on. AWS's Rekognition Streaming Video Events architecture and its per-minute pricing[9] are the existence proof: revenue scales with minutes analyzed, not wall-clock stream time. Overshoot's event-driven sampling design lines up directly with that billing shape, which means margin discipline is built into the product, not bolted on.
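A small sketch of why that billing shape matters. The event count, clip length, and per-minute price below are placeholders chosen for illustration; they are not AWS's or Overshoot's actual numbers.

```python
WALL_CLOCK_MINUTES_PER_DAY = 24 * 60
HYPOTHETICAL_PRICE_PER_MINUTE = 0.01  # placeholder $/minute, not a real vendor quote


def analyzed_minutes(events_per_day: int, clip_seconds: float) -> float:
    """Minutes of video actually sent for analysis in one day."""
    return events_per_day * clip_seconds / 60.0


def daily_cost(minutes: float, price_per_minute: float) -> float:
    return minutes * price_per_minute


# Assumed workload: 120 triggered events per day, 10 seconds analyzed per event.
event_minutes = analyzed_minutes(events_per_day=120, clip_seconds=10.0)
duty_cycle = event_minutes / WALL_CLOCK_MINUTES_PER_DAY

event_cost = daily_cost(event_minutes, HYPOTHETICAL_PRICE_PER_MINUTE)
always_on_cost = daily_cost(WALL_CLOCK_MINUTES_PER_DAY, HYPOTHETICAL_PRICE_PER_MINUTE)
print(f"duty cycle {duty_cycle:.2%}: ${event_cost:.2f}/day vs ${always_on_cost:.2f}/day always-on")
```

Under these assumptions the camera is on all day but only a small fraction of its minutes are billed, so cost tracks the number of events rather than the number of cameras.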
Initial ICPs: physical security and monitoring, QA and inspection, robotics and tele-operations, and interactive consumer products. The common thread: latency is a hard requirement, cameras or WebRTC sources already exist, and the application logic is "show the VLM what's happening, structure the response, act on it."[1] [2]
Competitive landscape
Eight adjacent players. None purpose-built for live VLM inference.
Each adjacent category solves a real problem — but none of them solves Overshoot's. Transport without inference, batch without live, prebuilt detectors without prompting, models without a platform, model APIs without lifecycle.
Chart: Capital converged on every layer except live VLM inference
Most recent disclosed round per adjacent player, $M. Training-loop platforms (Roboflow[19]), batch video understanding (Twelve Labs[20], Coactive[21]), edge models (Moondream[22]), verticalized monitoring (Groundlight[23]), and transport (LiveKit[24]) raised roughly $235M combined in those rounds, none of it for a platform purpose-built for live VLM inference. The hyperscalers' own roadmaps (image input added, managed video pipeline still absent[16][25]) suggest that gap is the market's, not the thesis's.
Source · Company funding announcements, 2023–2026
Our moat is focus. Image and video are fundamentally different modalities from text. By focusing on them, we are able to make strong technical leaps, from codecs and streaming protocols to inference engines.
Traction
Developer pull at W26 launch.
1,000+ developers on the platform. Roughly 10× faster than existing inference platforms.
Overshoot came out of stealth with its YC launch on February 12, 2026, citing 1,000+ developers using the platform to ship video agents in gaming, robotics, and security.[15] Public docs lead with the "3 lines to live video → VLM" demo and LiveKit room ingest.[3] Responses arrive in under 200ms — roughly 10× faster than existing inference platforms for comparable workloads.[13]
Early adoption is concentrated in the ICPs the product is built for: video agents in gaming, robotics, and physical security. The interface itself is a tell — sampling knobs, output token budgets, latency callbacks. This is what an API designed by people who have run video inference at scale looks like, not what a demo path looks like.[13]
Founder deep dive
The exact two backgrounds you would assemble to build this.
Founders & team
Strategic advantages & gaps
Where the moat compounds — and where it has to keep being earned.
Advantages
Video-native architecture
Samples and clips are first-class objects, output tokens are explicitly budgeted, and transport is integrated for real-time SLAs.
Reliability & ergonomics
Stream lifecycle handling and latency-aware callbacks absorb the "last mile" work developers otherwise own end-to-end.
Model flexibility
Hosted VLMs plus Gemini-class passthrough preserve developer choice without losing the reliability and observability layer.
Gaps to earn
Continuous latency optimization
Keeping the 10× lead requires constant work on GPU packing, fairness, and head-of-line blocking. There is no resting state.
Hyperscaler overlap
As base APIs add video features, Overshoot has to stay clearly better on reliability, observability, and workflow shape — not just speed.
Enterprise deployment surfaces
Many security and inspection buyers require on-prem or edge. A hybrid story has to develop alongside the cloud-first SDK.
Risks & mitigations
What we're watching
References
- [1] Google. Gemini API: Live (Multimodal streaming)
- [2] OpenAI. Realtime API Guide
- [3] LiveKit. Docs (WebRTC real-time media platform)
- [4] WebRTC. Overview
- [5] Latent Space. The Realtime AI Playbook (latency budgets)
- [6] Grand View Research. Video Analytics Market
- [7] Grand View Research. Video Surveillance Market
- [8] Grand View Research. Computer Vision Market
- [9] AWS. Rekognition Streaming Video Events (architecture/pricing model)
- [10] Roboflow. Inference (deployment + streaming)
- [11] Twelve Labs. Product Overview
- [12] Coactive AI. Platform Overview
- [13] Overshoot. Website / Docs
- [14] LiveKit. Agents Framework
- [15] Y Combinator. Launch: Overshoot, AI Vision in Real-time (Feb 2026)
- [16] OpenAI. Introducing gpt-realtime and Realtime API GA (Aug 2025)
- [17] a16z. LLMflation: LLM inference cost is going down fast
- [18] Epoch AI. LLM inference prices have fallen rapidly but unequally across tasks
- [19] Roboflow. $40M Series B to Invest in Enterprise and Open Source Vision AI (Nov 2024)
- [20] Twelve Labs. $50M Series A co-led by NEA and NVIDIA's NVentures (Jun 2024)
- [21] Coactive AI. Series B Funding Round ($30M, May 2024)
- [22] FinSMEs. Moondream Raises $4.5M in Pre-Seed Funding (Oct 2024)
- [23] The Robot Report. Groundlight raises $10M for natural-language-powered computer vision (Apr 2023)
- [24] LiveKit. Series C: $100M at a $1B valuation (Jan 2026)
- [25] Google Cloud. Gemini Live API generally available on Vertex AI



