Overshoot — Investment Memo · Orange Collective

Thesis

Real-time vision is a whole-stack problem, not a model toggle. Hyperscalers standardize low-latency multimodal model I/O^[1]^[2] but stop at the API boundary — the remaining work of video transport, sampling, scheduling, and reliability for live feeds is its own infrastructure layer.^[3]^[5] Overshoot is building the video-native stack underneath every live-vision agent, from WebRTC ingest up through inference scheduling and schema-checked outputs.^[13]

01
Real-time vision is a whole-stack constraint, not a model toggle. Latency budgets are fragile and cumulative across transport, sampling, inference throughput, and output length. Production systems must explicitly manage these budgets. Overshoot's SDK surfaces the levers — sampling modes, output token caps, latency callbacks — and ties them to perceived latency.^[4] ^[5]
02
VLM-first UX compresses iteration vs. bespoke CV. Traditional computer vision demanded labeling and training per use case. Vision-Language Models collapse most of that to a prompt with a schema. Overshoot's "prompt-as-program" interface aligns with how agentic, interactive applications are actually built today.^[1] ^[2]
03
Hyperscalers validate the category but stop at model I/O. Gemini Live and OpenAI Realtime ship the model interface; developers still need transport orchestration, sampling policy, model routing, and reliability/observability to hit SLAs. Overshoot packages those missing layers — specifically for vision.^[1] ^[2] ^[3]
04
The El hjouji cofounders are the canonical team for this stack. Zakaria built GPU kernels at Meta AI and low-latency surge pricing systems at Uber. Younes was a founding engineer at Cosmonio (acquired by Intel), where he built a CV training and serving platform from scratch. Inference engines, low-latency systems, and applied computer vision, assembled in one cofounding pair.

Problem

AI can now see and understand the physical world. Building on top of it is still painful.

Vision-Language Models have unlocked real applications in physical security, safety, gaming, robotics, and consumer products. Soon, video agents will watch your home and your pet when you're away. The model side of that future is already here.

The infrastructure side is not. Developers building real-time vision applications face three compounding problems: slow inference, limited model availability, and infrastructure that breaks at scale.^[13] Existing inference platforms were designed around text — they treat image and video as awkward attachments rather than first-class modalities, and they leave transport, sampling, and stream lifecycle entirely to the developer.

The result: every team building a video agent ends up re-inventing the same stack, from WebRTC ingest up through sampling policy and output budgeting. The work isn't novel and it isn't differentiated. It's plumbing, and it keeps applications from shipping.

<200ms · Overshoot end-to-end · Live video → VLM response

10× · Latency improvement · Vs. existing inference platforms

3 lines · Of code · To connect live video to a VLM

Why Now

The model layer just got real. The infra layer hasn't caught up.

Three converging shifts make the whole-stack vision problem solvable for the first time — and create a discrete window before hyperscalers extend their APIs downward.

The base models work. The transport doesn't.

Hyperscalers standardized low-latency model I/O. Gemini Live^[1] and OpenAI Realtime^[2] shipped streaming multimodal interfaces in the last 12 months. The model boundary is now a solved problem with documented latency targets.

WebRTC is the de facto transport. WebRTC^[4] and LiveKit^[3] have hardened into the default real-time media stack. Battle-tested SFUs, agent frameworks,^[14] and reconnect semantics now exist, but they're general-purpose, not vision-specific.

The layer in between is still missing. What hasn't been built is the piece that takes a live feed, samples it intelligently against a latency budget, routes it to the right VLM, enforces a schema on the output, and survives jitter. That's the gap Overshoot fills.^[5] ^[13]

The window is dated — and it's open now.

Both ends of the stack went GA without the middle. OpenAI took its Realtime API to general availability on August 28, 2025 with gpt-realtime — adding image input alongside audio, but still no managed video pipeline.^[16] Google made the Gemini Live API generally available on Vertex AI for continuous audio and video streams — again, at the model boundary only.^[25] On the transport side, LiveKit closed a $100M Series C at a $1B valuation in January 2026, led by Index Ventures — institutional confirmation that real-time AI media infrastructure is a venture-scale layer.^[24] The vision-specific orchestration layer between them remains unbuilt.

Inference economics turned in vision's favor. Equivalent-capability inference cost has fallen roughly 10× per year: GPT-3-class capability dropped from $60 per million tokens in late 2021 to $0.06 by late 2024 — 1,000× in three years — and GPT-4-class capability fell ~62× from its March 2023 launch price.^[17] Epoch AI measured declines up to 40× per year for GPT-4-level performance on some task classes.^[18] Continuous sampled video inference was cost-prohibitive at 2023 prices; at 2026 prices it's a line item. That's exactly when the layer that meters and budgets the spend becomes the natural buying point.
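The "roughly 10× per year" figure follows directly from the GPT-3-class numbers quoted above. A minimal check of that arithmetic, assuming a constant yearly rate of decline over the three years (the function name is illustrative):

```python
def annualized_decline(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly factor by which price must fall to get from start to end."""
    return (start_price / end_price) ** (1.0 / years)


# Prices quoted in the paragraph above, in dollars per million tokens.
START_PRICE = 60.0
END_PRICE = 0.06
YEARS = 3

total_factor = START_PRICE / END_PRICE
yearly_factor = annualized_decline(START_PRICE, END_PRICE, YEARS)

print(f"total decline: {total_factor:.0f}x over {YEARS} years")
print(f"implied decline per year: {yearly_factor:.1f}x")
```

A 1,000× drop over three years is the cube of a 10× yearly drop, so the two statements in the paragraph describe the same curve.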

Image and video are fundamentally different modalities from text. By focusing on them, we are able to make strong technical leaps from codec, streaming protocols to inference engines.

— Overshoot launch post^[13]

How It Works

Three lines of code. Sub-200ms responses. Schema-checked output.

Latency is a first-class API primitive.

Sampled inference by design. Most real-time vision workloads are event-driven, not continuous. Overshoot exposes targetFps, clip length, clip delay, and interval_seconds as explicit parameters — so developers trade thoroughness for latency budget at the API surface rather than discovering the limits in production.^[13]
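To make the trade-off concrete, here is a rough sketch of how those four knobs interact. The class, its field names, and the assumed meaning of each parameter (in particular clip delay) are illustrative guesses for this memo, not Overshoot's actual SDK schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingConfig:
    # Hypothetical fields modeled on the knobs named above; semantics are assumed.
    target_fps: float           # frames sampled per second inside a clip
    clip_length_seconds: float  # how much video each inference call sees
    clip_delay_seconds: float   # assumed: wait between a clip closing and being sent
    interval_seconds: float     # assumed: time between the starts of consecutive clips

    def frames_per_clip(self) -> int:
        # Always send at least one frame per call.
        return max(1, round(self.target_fps * self.clip_length_seconds))

    def calls_per_minute(self) -> float:
        return 60.0 / self.interval_seconds

    def worst_case_staleness_seconds(self) -> float:
        # An event at the start of a clip waits for the clip to fill, then the delay.
        return self.clip_length_seconds + self.clip_delay_seconds


thorough = SamplingConfig(target_fps=10, clip_length_seconds=2.0,
                          clip_delay_seconds=0.5, interval_seconds=2.0)
fast = SamplingConfig(target_fps=2, clip_length_seconds=0.5,
                      clip_delay_seconds=0.0, interval_seconds=1.0)

for name, cfg in (("thorough", thorough), ("fast", fast)):
    print(name, cfg.frames_per_clip(), cfg.calls_per_minute(),
          cfg.worst_case_staleness_seconds())
```

Under these assumptions the thorough profile sends 20 frames per call and can be up to 2.5 seconds stale before inference even starts, while the fast profile sends a single frame and is at most 0.5 seconds stale. That is the thoroughness-for-latency trade the paragraph describes.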

Model surface and routing. Overshoot hosts the largest collection of Vision-Language Models behind a single API, with a "gemini" passthrough backend when direct model access is preferred.^[1] Schema enforcement supports structured outputs for downstream systems — no parsing, no half-formed JSON, no retries.^[13]
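For a sense of what "schema-checked" spares the developer, the sketch below shows the kind of validation a downstream system would otherwise write by hand. The schema, field names, and function are hypothetical, and how Overshoot enforces schemas on its side is not described in the sources cited here.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"person_detected": bool, "count": int, "description": str}


def parse_checked(raw: str, schema: dict) -> dict:
    """Parse model output; raise ValueError unless it matches the schema exactly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if set(data) != set(schema):
        raise ValueError(f"fields {sorted(data)} do not match {sorted(schema)}")
    for name, expected in schema.items():
        # Exact type match: bool is a subclass of int, so isinstance would be too lax.
        if type(data[name]) is not expected:
            raise ValueError(f"{name!r} should be {expected.__name__}")
    return data


ok = parse_checked(
    '{"person_detected": true, "count": 2, "description": "two people at the door"}',
    SCHEMA,
)
print(ok["count"])
```

Truncated JSON, a missing field, or a boolean where a count belongs all raise `ValueError` here. With enforcement in the platform, that failure handling moves out of application code.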

Reliability primitives. Stream lifecycle and reconnect semantics, observability hooks, and latency-aware callbacks keep developers inside their latency budgets even on imperfect networks — the "last mile" work that vision teams otherwise own end-to-end.^[5]
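A latency-aware callback can be pictured as a budget check over the stages that add up to perceived latency. The sketch below is an illustration only: the class, the stage names, and every millisecond figure are made-up inputs, not Overshoot's API or measured numbers.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def stage_latencies_ms(transport_ms: float, sampling_ms: float, first_token_ms: float,
                       output_tokens: int, tokens_per_second: float) -> Dict[str, float]:
    """Break one response into additive stages; generation time grows with output length."""
    return {
        "transport": transport_ms,
        "sampling": sampling_ms,
        "first_token": first_token_ms,
        "generation": 1000.0 * output_tokens / tokens_per_second,
    }


@dataclass
class LatencyBudget:
    budget_ms: float
    on_over_budget: Callable[[float, Dict[str, float]], None]

    def check(self, stages: Dict[str, float]) -> float:
        total = sum(stages.values())
        if total > self.budget_ms:
            # Hand the caller the total and the per-stage breakdown.
            self.on_over_budget(total, stages)
        return total


alerts = []
budget = LatencyBudget(
    budget_ms=200.0,
    on_over_budget=lambda total, stages: alerts.append((total, max(stages, key=stages.get))),
)

# Illustrative inputs: identical pipeline, only the output token cap differs.
short_total = budget.check(stage_latencies_ms(40, 30, 80, output_tokens=20, tokens_per_second=500))
long_total = budget.check(stage_latencies_ms(40, 30, 80, output_tokens=60, tokens_per_second=500))
print(short_total, long_total, alerts)
```

With these inputs the 20-token response lands at 190 ms and stays inside the budget, while the 60-token response reaches 270 ms and fires the callback with generation as the slowest stage. Output length alone can break a budget that transport and sampling had respected.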

Zero infra headache. Developers connect live video feeds to VLMs with 3 lines of code and get responses in less than 200ms — roughly 10× faster than any existing inference platform. The interface itself is a tell: it exposes the real production constraints, not the demo path.^[13]

Market

Enterprises already spend tens of billions turning video into operational signals.

Video analytics software is on track to grow from ~$12.7B (2024) to ~$37.8B (2030).^[6] Video surveillance (hardware, software, and services) moves from ~$73.8B to ~$147.7B over the same period.^[7] Computer vision overall grows from ~$19.8B to ~$58.3B.^[8]
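The implied annual growth rates can be backed out of those endpoints. A quick calculation, assuming all three ranges span 2024 to 2030 (six years of compounding) and that growth is smooth over the period:

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two endpoint values."""
    return (end / start) ** (1.0 / years) - 1.0


# Endpoints in $B, as quoted above; the six-year span is an assumption.
MARKETS = {
    "video analytics software": (12.7, 37.8),
    "video surveillance": (73.8, 147.7),
    "computer vision": (19.8, 58.3),
}
YEARS = 6

rates = {name: cagr(start, end, YEARS) for name, (start, end) in MARKETS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} per year")
```

On these figures, video analytics and computer vision both compound at roughly 20% a year, and video surveillance, which includes hardware, at roughly 12%.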

These markets are already monetized. What's changing is how the applications get built. The previous generation required custom models and bespoke deployments per camera per use case. VLMs collapse that to a prompt — but only if the infrastructure underneath can handle live streams. Overshoot is the developer-infrastructure category that makes the next generation of these applications buildable in days instead of quarters.

The right pricing shape already exists.

Streaming workloads are event- and sampling-driven, not continuous 24/7, even when the camera is always on. AWS's Rekognition Streaming Video Events architecture, with its per-minute pricing,^[9] is the existence proof: revenue scales with minutes analyzed, not wall-clock stream time. Overshoot's event-driven sampling design lines up directly with that billing shape, which means margin discipline is built into the product, not bolted on.
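The gap between minutes analyzed and wall-clock stream time is easy to quantify. In the sketch below, the sampling pattern and the per-minute rate are made-up placeholders chosen for illustration, not AWS or Overshoot pricing.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def analyzed_minutes_per_day(clip_length_seconds: float, interval_seconds: float) -> float:
    """Minutes of video actually sent for analysis by an always-on camera."""
    clips_per_day = SECONDS_PER_DAY / interval_seconds
    return clips_per_day * clip_length_seconds / 60.0


def daily_cost(minutes: float, rate_per_minute: float) -> float:
    return minutes * rate_per_minute


# Hypothetical pattern: a 2-second clip every 30 seconds, around the clock.
minutes = analyzed_minutes_per_day(clip_length_seconds=2.0, interval_seconds=30.0)
wall_clock_minutes = SECONDS_PER_DAY / 60.0
duty_cycle = minutes / wall_clock_minutes

PLACEHOLDER_RATE = 0.01  # made-up $/minute, for illustration only

print(f"analyzed: {minutes:.0f} of {wall_clock_minutes:.0f} minutes ({duty_cycle:.1%})")
print(f"billed on analyzed minutes: ${daily_cost(minutes, PLACEHOLDER_RATE):.2f}")
print(f"billed on wall-clock minutes: ${daily_cost(wall_clock_minutes, PLACEHOLDER_RATE):.2f}")
```

In this example the camera streams for 1,440 minutes but only 96 minutes are analyzed, so a per-minute-analyzed bill is about one fifteenth of a wall-clock bill at the same rate.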

Initial ICPs: physical security and monitoring, QA and inspection, robotics and tele-operations, and interactive consumer products. The common thread: latency is a hard requirement, cameras or WebRTC sources already exist, and the application logic is "show the VLM what's happening, structure the response, act on it."^[1] ^[2]

Competitive landscape

Eight adjacent players. None purpose-built for live VLM inference.

Each adjacent category solves a real problem — but none of them solves Overshoot's. Transport without inference, batch without live, prebuilt detectors without prompting, models without a platform, model APIs without lifecycle.

Capital converged on every layer except live VLM inference

Most recent disclosed round per adjacent player, $M. Training-loop platforms (Roboflow^[19]), batch video understanding (Twelve Labs^[20], Coactive^[21]), edge models (Moondream^[22]), verticalized monitoring (Groundlight^[23]), and transport (LiveKit^[24]) have together raised ~$230M+ — none of it for a platform purpose-built for live VLM inference. The hyperscalers' own roadmaps (image input added, managed video pipeline still absent^[16]^[25]) suggest that gap is the market's, not the thesis's.

Source · Company funding announcements, 2023–2026

Our moat is focus. Image and video are fundamentally different modalities from text. By focusing on them, we are able to make strong technical leaps from codec, streaming protocols to inference engines.

— Overshoot launch post^[13]

Traction

Developer pull at W26 launch.

1,000+ developers on the platform. 10× faster than anything else.

Overshoot came out of stealth with its YC launch on February 12, 2026, citing 1,000+ developers using the platform to ship video agents in gaming, robotics, and security.^[15] Public docs lead with the "3 lines to live video → VLM" demo and LiveKit room ingest.^[3] Responses arrive in under 200ms — roughly 10× faster than existing inference platforms for comparable workloads.^[13]

Early adoption is concentrated in the ICPs the product is built for: video agents in gaming, robotics, and physical security. The interface itself is a tell — sampling knobs, output token budgets, latency callbacks. This is what an API designed by people who have run video inference at scale looks like, not what a demo path looks like.^[13]

Founder deep dive

The exact two backgrounds you would assemble to build this.

Zakaria — the inference and low-latency systems half. Zakaria left Morocco for the UK at 18 to study at the London School of Economics, where he graduated top of his class in the Department of Mathematics. He went on to MIT and is a co-author on academic work in social-bot detection and using neural networks to measure opinions in social networks. After academia, he spent seven years building low-latency, high-throughput systems at two of the hardest places to do that work: surge pricing algorithms at Uber and GPU kernel work at Meta AI. He has built and sold a software product before, won three prominent AI hackathons, and writes about indie product-making on Substack and Medium. The Chrome extension "Explain AI" is his.

Younes — the applied computer vision half. Younes earned a B.Sc. in Computer Science from Al Akhawayn University, graduating Summa Cum Laude and as an Honors Program Scholar. He was a founding engineer at Cosmonio, where he built a Computer Vision training and serving platform from scratch — the company was acquired by Intel, and Younes stayed on as a Computer Vision / AI Frameworks Engineer leading architectural transformations on scalable backend and cloud systems. He watched customers abandon traditional CV firsthand because it lacked the "general" intelligence that LLMs have today. That insight is the founding wedge for Overshoot.

Why this pairing matters. Building Overshoot requires depth across three areas that almost never coexist in one team: streaming media transport, low-latency inference engineering, and applied computer vision. The El hjouji cofounders cover all three between them. Zakaria's surge work at Uber and GPU kernel work at Meta map directly to inference engines, sampling, and scheduling. Younes's Cosmonio/Intel work on a CV training and serving platform from scratch maps directly to the vision side. They have shipped large-scale systems and know where they break.

Family and trust. Zakaria and Younes are cousins from the same family in Morocco, sharing the same last name (El hjouji). That kind of pre-existing trust is a meaningful structural advantage in a domain where the technical surface is large and decisions need to be made fast.

Why now. Both founders' careers track the maturity curve of vision and inference. Younes watched traditional CV hit its ceiling at Cosmonio. Zakaria watched LLM inference economics get rewritten at Meta. The moment when VLMs got good enough to replace bespoke CV — and when WebRTC/LiveKit transport got mature enough to ride underneath — is the moment they started Overshoot.

Founders & team

Zakaria El hjouji

Co-founder & CEO

Co-founder and CEO of Overshoot. Spent seven years building pricing algorithms at Uber (surge — low-latency, high-throughput) and writing GPU kernels at Meta AI. Published researcher in bot detection and influence ML methods. Built and sold a software product while in grad school at MIT, won three prominent AI hackathons, and writes about indie product work on Substack and Medium. Graduated top of his class at the London School of Economics before MIT. Likes to nerd out on anything inference and VLM / video understanding.

Younes El hjouji

Co-founder & CTO

Co-founder and CTO of Overshoot. Previously a Computer Vision / AI Frameworks Engineer at Intel, and before that led architectural transformations and scalable backend/cloud work at COSMONiO — a computer vision platform acquired by Intel. Built a Computer Vision training and serving platform from scratch and watched customers abandon traditional CV because it lacked the general intelligence LLMs now have. B.Sc. in Computer Science from Al Akhawayn University, Summa Cum Laude. Active in developer tooling and open source (NPM Leaderboard among other projects).

Strategic advantages & gaps

Where the moat compounds — and where it has to keep being earned.

Advantages

Video-native architecture

Samples and clips are first-class objects, output tokens are explicitly budgeted, and transport is integrated for real-time SLAs.

Reliability & ergonomics

Stream lifecycle handling and latency-aware callbacks absorb the "last mile" work developers otherwise own end-to-end.

Model flexibility

Hosted VLMs plus Gemini-class passthrough preserve developer choice without losing the reliability and observability layer.

Gaps to earn

Continuous latency optimization

Keeping the 10× lead requires constant work on GPU packing, fairness, and head-of-line blocking. There is no resting state.

Hyperscaler overlap

As base APIs add video features, Overshoot has to stay clearly better on reliability, observability, and workflow shape — not just speed.

Enterprise deployment surfaces

Many security and inspection buyers require on-prem or edge. A hybrid story has to develop alongside the cloud-first SDK.

Risks & mitigations

What we're watching

Conversion from the 1,000+ developer waitlist/early-access cohort into paid production workloads with measurable minutes-analyzed.
First marquee design wins outside the YC W26 cohort — particularly in physical security, robotics tele-ops, and interactive consumer products where latency is a hard requirement.
How the SDK surface evolves as Gemini Live and OpenAI Realtime expand video — does Overshoot stay clearly better on reliability and observability, or does the gap compress?
Pricing model maturity — moving from waitlist/usage credits to per-minute-analyzed SLA-backed pricing comparable to Rekognition Streaming Video Events.

References