Overshoot

AI infrastructure for real-time vision applications.

Overshoot — Vision in Real Time[13]

<200ms

Inference latency

~10× faster than existing platforms

1,000+

Developers

Engaged at W26 launch

$58B

Computer vision TAM

By 2030 · Grand View Research

Thesis

Real-time vision is a whole-stack problem, not a model toggle. Hyperscalers standardize low-latency multimodal model I/O[1][2] but stop at the API boundary — the remaining work of video transport, sampling, scheduling, and reliability for live feeds is its own infrastructure layer.[3][5] Overshoot is building the video-native stack underneath every live-vision agent, from WebRTC ingest up through inference scheduling and schema-checked outputs.[13]
  1. Real-time vision is a whole-stack constraint, not a model toggle. Latency budgets are fragile and cumulative across transport, sampling, inference throughput, and output length, so production systems must manage them explicitly. Overshoot's SDK surfaces the levers (sampling modes, output token caps, latency callbacks) and ties them to perceived latency.[4] [5]

  2. VLM-first UX compresses iteration vs. bespoke CV. Traditional computer vision demanded labeling and training per use case. Vision-Language Models collapse most of that to a prompt with a schema. Overshoot's "prompt-as-program" interface aligns with how agentic, interactive applications are actually built today.[1] [2]

  3. Hyperscalers validate the category but stop at model I/O. Gemini Live and OpenAI Realtime ship the model interface; developers still need transport orchestration, sampling policy, model routing, and reliability/observability to hit SLAs. Overshoot packages those missing layers specifically for vision.[1] [2] [3]

  4. Zakaria and Younes El hjouji are the canonical team for this stack. Zakaria built GPU kernels at Meta AI and low-latency surge pricing systems at Uber. Younes was a founding engineer at Cosmonio (acquired by Intel), where he built a CV training and serving platform from scratch. Inference engines, low-latency systems, and applied computer vision, assembled in one cofounding pair.

Problem

AI can now see and understand the physical world. Building on top of it is still painful.

Vision-Language Models have unlocked real applications in physical security, safety, gaming, robotics, and consumer products. Soon, video agents will watch your home and your pet when you're away. The model side of that future is already here.

The infrastructure side is not. Developers building real-time vision applications face three compounding problems: slow inference, limited model availability, and infrastructure that breaks at scale.[13] Existing inference platforms were designed around text — they treat image and video as awkward attachments rather than first-class modalities, and they leave transport, sampling, and stream lifecycle entirely to the developer.

The result: every team building a video agent ends up re-inventing the same stack from WebRTC ingest down through sampling policy and output budgeting. The work isn't novel and it isn't differentiated. It's plumbing — and it's preventing the applications from ever shipping.

<200ms

Overshoot end-to-end

Live video → VLM response

10×

Latency improvement

Vs. existing inference platforms

3 lines

Of code

To connect live video to a VLM

Why Now

The model layer just got real. The infra layer hasn't caught up.

Three converging shifts make the whole-stack vision problem solvable for the first time — and create a discrete window before hyperscalers extend their APIs downward.

The base models work. The transport doesn't.

Hyperscalers standardized low-latency model I/O. Gemini Live[1] and OpenAI Realtime[2] shipped streaming multimodal interfaces in the last 12 months. The model boundary is now a solved problem with documented latency targets.

WebRTC is the de facto transport. WebRTC[4] and LiveKit[3] have hardened into the default real-time media stack. Battle-tested SFUs, agent frameworks,[14] and reconnect semantics now exist, but they're general-purpose, not vision-specific.

The layer in between is still missing. What hasn't been built is the piece that takes a live feed, samples it intelligently against a latency budget, routes it to the right VLM, enforces a schema on the output, and survives jitter. That's the gap Overshoot fills.[5] [13]

Image and video are fundamentally different modalities from text. By focusing on them, we are able to make strong technical leaps from codec, streaming protocols to inference engines.
Overshoot launch post[13]

How It Works

Three lines of code. Sub-200ms responses. Schema-checked output.

Step 01

Connect a live source

WebRTC, RTSP, cameras, screens, or a LiveKit room. Overshoot can provision rooms and tokens automatically or attach to your own LiveKit deployment.

Step 02

Choose a sampling mode

Clip mode samples short windows at a tunable targetFps with configurable clip length and delay. Frame mode samples at an interval_seconds cadence. Event-driven by design, not brute-force full-rate analysis.

Step 03

Prompt the VLM

{source, model, prompt, onResult} as primitives. Output token budgets trade response speed vs. verbosity. Schema enforcement guarantees structured downstream consumption.
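The three steps above can be sketched as plain data plus a callback. This is a minimal illustration, not Overshoot's SDK: `ClipMode`, `FrameMode`, `VisionRequest`, `run_with_stub`, the RTSP URL, and the model name are all hypothetical, and whitespace-split words stand in for tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

# Hypothetical types that mirror the concepts named in the steps above.
# They are NOT taken from Overshoot's SDK.

@dataclass(frozen=True)
class ClipMode:
    target_fps: float      # frames sampled per second inside each clip
    clip_seconds: float    # length of each sampled window
    delay_seconds: float   # assumed gap between consecutive clips

@dataclass(frozen=True)
class FrameMode:
    interval_seconds: float  # one frame every N seconds

@dataclass
class VisionRequest:
    source: str                        # e.g. an RTSP URL or a room name
    model: str                         # placeholder model identifier
    prompt: str
    on_result: Callable[[str], None]   # called once per model response
    sampling: Union[ClipMode, FrameMode]
    max_output_tokens: int = 64        # output budget: fewer tokens, faster reply

def run_with_stub(request: VisionRequest, stub_outputs: List[str]) -> int:
    """Deliver canned outputs to the callback, truncated to the token cap.

    Stands in for a real streaming session so the example runs offline.
    """
    delivered = 0
    for text in stub_outputs:
        words = text.split()[: request.max_output_tokens]
        request.on_result(" ".join(words))
        delivered += 1
    return delivered

results: List[str] = []
request = VisionRequest(
    source="rtsp://camera.example/stream",
    model="example-vlm",
    prompt="Is a person at the door? Answer yes or no.",
    on_result=results.append,
    sampling=ClipMode(target_fps=4, clip_seconds=2.0, delay_seconds=1.0),
    max_output_tokens=3,
)
count = run_with_stub(request, ["yes a person is at the door", "no"])
print(count, results)
```

The point of the shape is that the sampling mode and the output budget sit next to the prompt in one request object, so the latency trade-offs are visible where the call is made.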

Latency is a first-class API primitive.

Sampled inference by design. Most real-time vision workloads are event-driven, not continuous. Overshoot exposes targetFps, clip length, clip delay, and interval_seconds as explicit parameters — so developers trade thoroughness for latency budget at the API surface rather than discovering the limits in production.[13]
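To make the thoroughness-versus-budget trade concrete, the arithmetic can be sketched as frames sent to the model per minute under each mode. The function names are illustrative, and treating clip delay as the gap between consecutive clips is an assumption, not documented Overshoot behavior.

```python
def clip_mode_frames_per_minute(target_fps: float, clip_seconds: float,
                                delay_seconds: float) -> float:
    """Frames per minute when sampling clips separated by a delay.

    Assumes one cycle is clip_seconds of sampling followed by
    delay_seconds of idle time.
    """
    cycle_seconds = clip_seconds + delay_seconds
    clips_per_minute = 60.0 / cycle_seconds
    return clips_per_minute * target_fps * clip_seconds

def frame_mode_frames_per_minute(interval_seconds: float) -> float:
    """Frames per minute when sampling one frame every interval_seconds."""
    return 60.0 / interval_seconds

def full_rate_frames_per_minute(camera_fps: float) -> float:
    """Frames per minute if every camera frame were analyzed."""
    return 60.0 * camera_fps

clip = clip_mode_frames_per_minute(target_fps=4, clip_seconds=2.0, delay_seconds=1.0)
frame = frame_mode_frames_per_minute(interval_seconds=0.5)
full = full_rate_frames_per_minute(camera_fps=30)
print(clip, frame, full, clip / full)
```

With these example settings, clip mode sends 160 frames per minute and frame mode 120, against 1,800 for brute-force analysis of a 30 fps camera.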

Model surface and routing. Overshoot hosts the largest collection of Vision-Language Models behind a single API, with a "gemini" passthrough backend when direct model access is preferred.[1] Schema enforcement supports structured outputs for downstream systems — no parsing, no half-formed JSON, no retries.[13]
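What a schema check buys downstream code can be illustrated with a small validator. This is a sketch of the idea only: the flat field-to-type schema, the field names, and `parse_structured` are hypothetical, and how Overshoot enforces schemas on its side is not described in the source.

```python
import json
from typing import Any, Dict

# Hypothetical flat schema: field name -> required Python type.
SCHEMA: Dict[str, type] = {
    "person_present": bool,
    "count": int,
    "description": str,
}

def parse_structured(raw: str, schema: Dict[str, type]) -> Dict[str, Any]:
    """Parse model text as JSON and check it against a flat schema.

    Raises ValueError if the text is not a JSON object, if fields are
    missing or unexpected, or if a value has the wrong type.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != set(schema):
        raise ValueError(f"fields {sorted(data)} do not match {sorted(schema)}")
    for name, expected in schema.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        if expected is int and isinstance(value, bool):
            raise ValueError(f"{name}: expected int, got bool")
        if not isinstance(value, expected):
            raise ValueError(f"{name}: expected {expected.__name__}")
    return data

good = parse_structured(
    '{"person_present": true, "count": 2, "description": "two people"}', SCHEMA
)
print(good["count"])
```

A half-formed response such as `{"person_present": true, "count":` fails at the JSON step, which is the failure class that enforcement at the platform is meant to keep away from application code.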

Reliability primitives. Stream lifecycle and reconnect semantics, observability hooks, and latency-aware callbacks keep developers inside their latency budgets even on imperfect networks — the "last mile" work that vision teams otherwise own end-to-end.[5]
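Two of these primitives are easy to sketch: a capped exponential backoff schedule for reconnects and a latency-aware callback that fires when a response exceeds its budget. The names `backoff_delays` and `LatencyMonitor` are hypothetical, the 200 ms budget is borrowed from the headline figure, and production backoff usually adds random jitter, which is left out here to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, List

def backoff_delays(base_seconds: float, factor: float,
                   cap_seconds: float, attempts: int) -> List[float]:
    """Reconnect delays: base, base*factor, base*factor^2, ... capped."""
    return [min(cap_seconds, base_seconds * factor ** i) for i in range(attempts)]

@dataclass
class LatencyMonitor:
    budget_ms: int
    on_breach: Callable[[int], None]          # called with the offending latency
    samples_ms: List[int] = field(default_factory=list)

    def record(self, sent_ms: int, received_ms: int) -> int:
        """Record one round trip and notify if it exceeded the budget."""
        latency_ms = received_ms - sent_ms
        self.samples_ms.append(latency_ms)
        if latency_ms > self.budget_ms:
            self.on_breach(latency_ms)
        return latency_ms

breaches: List[int] = []
monitor = LatencyMonitor(budget_ms=200, on_breach=breaches.append)
monitor.record(sent_ms=1_000, received_ms=1_150)   # within budget
monitor.record(sent_ms=2_000, received_ms=2_260)   # over budget
delays = backoff_delays(base_seconds=0.5, factor=2.0, cap_seconds=8.0, attempts=6)
print(breaches, delays)
```

An application could react to a breach by lowering the sampling rate or the output token cap, which ties the reliability layer back to the latency levers described earlier.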

Zero infra headache. Developers connect live video feeds to VLMs with 3 lines of code and get responses in less than 200ms — roughly 10× faster than any existing inference platform. The interface itself is a tell: it exposes the real production constraints, not the demo path.[13]

Market

Enterprises already spend tens of billions turning video into operational signals.

Video analytics software is on track from ~$12.7B (2024) to ~$37.8B (2030).[6] Video surveillance — hardware, software, and services — moves from ~$73.8B to ~$147.7B over the same period.[7] Computer vision overall: ~$19.8B to ~$58.3B.[8]
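The growth rate implied by each pair of figures can be checked with a short calculation. This sketch assumes simple annual compounding over the six years from 2024 to 2030; the cited reports may define their forecast windows differently, so their published rates can differ slightly.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0

# Market sizes in billions of US dollars, as quoted above (2024, 2030).
MARKETS = {
    "video analytics": (12.7, 37.8),
    "video surveillance": (73.8, 147.7),
    "computer vision": (19.8, 58.3),
}
YEARS = 2030 - 2024

cagr = {name: implied_cagr(start, end, YEARS) for name, (start, end) in MARKETS.items()}
for name, rate in cagr.items():
    print(f"{name}: {rate:.1%} per year")
```

Under that assumption, video analytics and computer vision each compound at roughly 20% a year, and video surveillance at roughly 12%.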

These markets are already monetized. What's changing is how the applications get built. The previous generation required custom models and bespoke deployments per camera per use case. VLMs collapse that to a prompt — but only if the infrastructure underneath can handle live streams. Overshoot is the developer-infrastructure category that makes the next generation of these applications buildable in days instead of quarters.

Video analytics

$12.7B → $37.8B

2024 → 2030[6]

Video surveillance

$73.8B → $147.7B

2024 → 2030[7]

Computer vision

$19.8B → $58.3B

2024 → 2030[8]

The right pricing shape already exists.

Streaming workloads are event/sampling-driven, not continuous 24/7, even when the camera is always on. AWS's Rekognition Streaming Video Events architecture and per-minute pricing[9] are the existence proof: revenue scales with minutes analyzed, not wall-clock stream time. Overshoot's event-driven sampling design lines up directly with that billing shape, which means margin discipline is built into the product, not bolted on.
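The difference between minutes analyzed and wall-clock stream time is easiest to see with numbers. Everything in this sketch is a made-up illustration: the price per minute, the three analyzed minutes per hour, and the function names are assumptions, not AWS or Overshoot pricing.

```python
def analyzed_minutes(stream_hours: int, analyzed_minutes_per_hour: int) -> int:
    """Minutes actually sent for analysis over a period of streaming."""
    return stream_hours * analyzed_minutes_per_hour

def cost(minutes: int, price_per_minute: float) -> float:
    """Cost under a per-minute-analyzed billing shape."""
    return minutes * price_per_minute

HOURS_PER_MONTH = 24 * 30          # an always-on camera for 30 days
PRICE_PER_MINUTE = 0.01            # hypothetical price, for illustration only

wall_clock = analyzed_minutes(HOURS_PER_MONTH, 60)   # every minute analyzed
event_driven = analyzed_minutes(HOURS_PER_MONTH, 3)  # 3 minutes per hour analyzed

wall_clock_cost = cost(wall_clock, PRICE_PER_MINUTE)
event_driven_cost = cost(event_driven, PRICE_PER_MINUTE)
print(wall_clock, event_driven, wall_clock_cost, event_driven_cost)
```

In this example the always-on camera streams for 43,200 minutes a month but only 2,160 are analyzed, so both the bill and the compute behind it scale with the smaller number.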

Initial ICPs: physical security and monitoring, QA and inspection, robotics and tele-operations, and interactive consumer products. The common thread: latency is a hard requirement, cameras or WebRTC sources already exist, and the application logic is "show the VLM what's happening, structure the response, act on it."[1] [2]

Soon, video agents will watch your home and your pet when you're away. AI can see and understand the physical world. This unlocks new applications in physical security, safety, gaming, robotics and general consumer products.
Overshoot launch post[13]

Competitive landscape

Six categories of competition. Overshoot is the only one purpose-built for live VLM inference.

Each adjacent category solves a real problem — but none of them solves Overshoot's. Transport without inference, batch without live, prebuilt detectors without prompting, model APIs without lifecycle.

Roboflow

Build/train/deploy

Broad CV developer platform with labeling, training, and serverless video streaming inference. Strong upstream tooling and community — but optimized for the traditional CV build/train/deploy loop rather than live VLM prompting with schema-checked outputs.[10]

LiveKit

Transport layer

Battle-tested open-source WebRTC SFU and an Agents framework for multimodality. Composable runtime — but transport and runtime, not inference orchestration. Overshoot integrates LiveKit as one of its supported ingests rather than competing with it.[3]

Twelve Labs

Batch video

Video understanding over stored media: semantic search, summarization, and Q&A across video libraries. Strong enterprise integrations — but the async/batch shape is fundamentally different from sub-second live stream inference.[11]

Coactive AI

Visual data platform

Enterprise visual data platform for indexing, searching, and understanding large image and video corpora. Indexing and analytics over batch corpora rather than low-latency live streams.[12]

AWS Rekognition

Managed detectors

Prebuilt detectors with a streaming video events API and transparent per-minute pricing. Excellent scale and integration — but task-specific detection rather than general VLM prompting, and limited dev ergonomics for live agents.[9]

OpenAI Realtime / Gemini Live

Model boundary

State-of-the-art base models with low-latency streaming interfaces for audio, images, and text. They standardize the model boundary but stop at model I/O: no stream lifecycle management, no sampling policy, no observability for video apps.[1] [2]

Our moat is focus. Image and video are fundamentally different modalities from text. By focusing on them, we are able to make strong technical leaps from codec, streaming protocols to inference engines.
Overshoot launch post[13]

Traction

Developer pull at W26 launch.

1,000+ developers on the platform. 10× faster than anything else.

Company materials cite 1,000+ developers engaged with the platform at YC W26 launch.[13] Public docs lead with the "3 lines to live video → VLM" demo and LiveKit room ingest.[3] The company reports responses in under 200ms, roughly 10× faster than existing inference platforms for comparable workloads.[13]

Early adoption is concentrated in the ICPs the product is built for: video agents in gaming, robotics, and physical security. The interface itself is a tell — sampling knobs, output token budgets, latency callbacks. This is what an API designed by people who have run video inference at scale looks like, not what a demo path looks like.[13]

Founder deep dive

The exact two backgrounds you would assemble to build this.

Zakaria: the inference and low-latency systems half. Zakaria left Morocco for the UK at 18 to study at the London School of Economics, where he graduated top of his class in the Department of Mathematics. He went on to MIT, where he co-authored academic work on social-bot detection and on using neural networks to measure opinions in social networks. After academia, he spent seven years building low-latency, high-throughput systems at two of the hardest places to do that work: surge pricing algorithms at Uber and GPU kernel work at Meta AI. He has built and sold a software product before, won three prominent AI hackathons, and writes about indie product-making on Substack and Medium. The Chrome extension "Explain AI" is his.

Younes — the applied computer vision half. Younes earned a B.Sc. in Computer Science from Al Akhawayn University, graduating Summa Cum Laude and as an Honors Program Scholar. He was a founding engineer at Cosmonio, where he built a Computer Vision training and serving platform from scratch — the company was acquired by Intel, and Younes stayed on as a Computer Vision / AI Frameworks Engineer leading architectural transformations on scalable backend and cloud systems. He watched customers abandon traditional CV firsthand because it lacked the "general" intelligence that LLMs have today. That insight is the founding wedge for Overshoot.

Why this pairing matters. Building Overshoot requires depth across three areas that almost never coexist in one team: streaming media transport, low-latency inference engineering, and applied computer vision. The two cofounders cover all three between them. Zakaria's surge pricing work at Uber and GPU kernel work at Meta AI map directly to inference engines, sampling, and scheduling. Younes's Cosmonio/Intel work building a CV training and serving platform from scratch maps directly to the vision side. They have shipped large-scale systems and know where they break.

Family and trust. Zakaria and Younes are cousins from the same family in Morocco, sharing the same last name (El hjouji). That kind of pre-existing trust is a meaningful structural advantage in a domain where the technical surface is large and decisions need to be made fast.

Why now. Both founders' careers track the maturity curve of vision and inference. Younes watched traditional CV hit its ceiling at Cosmonio. Zakaria watched LLM inference economics get rewritten at Meta. The moment when VLMs got good enough to replace bespoke CV — and when WebRTC/LiveKit transport got mature enough to ride underneath — is the moment they started Overshoot.

Founders & team

Zakaria El hjouji

Co-founder & CEO

Co-founder and CEO of Overshoot. Spent seven years building pricing algorithms at Uber (surge — low-latency, high-throughput) and writing GPU kernels at Meta AI. Published researcher in bot detection and influence ML methods. Built and sold a software product while in grad school at MIT, won three prominent AI hackathons, and writes about indie product work on Substack and Medium. Graduated top of his class at the London School of Economics before MIT. Likes to nerd out on anything inference and VLM / video understanding.

Younes El hjouji

Co-founder & CTO

Co-founder and CTO of Overshoot. Previously a Computer Vision / AI Frameworks Engineer at Intel, and before that led architectural transformations and scalable backend/cloud work at COSMONiO — a computer vision platform acquired by Intel. Built a Computer Vision training and serving platform from scratch and watched customers abandon traditional CV because it lacked the general intelligence LLMs now have. B.Sc. in Computer Science from Al Akhawayn University, Summa Cum Laude. Active in developer tooling and open source (NPM Leaderboard among other projects).

Strategic advantages & gaps

Where the moat compounds — and where it has to keep being earned.

Advantages

Video-native architecture

Samples and clips are first-class objects, output tokens are explicitly budgeted, and transport is integrated for real-time SLAs.

Reliability & ergonomics

Stream lifecycle handling and latency-aware callbacks absorb the "last mile" work developers otherwise own end-to-end.

Model flexibility

Hosted VLMs plus Gemini-class passthrough preserve developer choice without losing the reliability and observability layer.

Gaps to earn

Continuous latency optimization

Keeping the 10× lead requires constant work on GPU packing, fairness, and head-of-line blocking. There is no resting state.

Hyperscaler overlap

As base APIs add video features, Overshoot has to stay clearly better on reliability, observability, and workflow shape — not just speed.

Enterprise deployment surfaces

Many security and inspection buyers require on-prem or edge. A hybrid story has to develop alongside the cloud-first SDK.

Risks & mitigations

Risk

Model commoditization compresses the latency advantage — VLMs themselves get faster on every API.

Mitigation

The differentiation is not the model but the layer around it: transport, sampling, scheduling, and observability. Overshoot maintains a multi-model surface (hosted VLMs plus Gemini passthrough), so it benefits from base-model improvements rather than getting squeezed by them.

Risk

Hyperscaler encroachment — Gemini Live, OpenAI Realtime, or AWS roll out fully video-native inference APIs.

Mitigation

Win developer mindshare now while the missing layers are most visible. Offer reliability SLAs, better tooling, and multi-model portability that single-vendor APIs structurally can't. Hyperscaler announcements have so far validated the category at the model boundary while leaving the streaming/lifecycle layer underspecified.

Risk

Infrastructure margin pressure — GPU costs and egress bandwidth at scale.

Mitigation

Event-driven sampling, output token budgeting, and adaptive scheduling cap the cost per minute analyzed. AWS Rekognition's per-minute-analyzed model is the existence proof that this billing shape works at scale.

Risk

Enterprise deployments often require on-prem or edge — pure cloud APIs hit a ceiling.

Mitigation

Roadmap support for hybrid and on-prem agents; WebRTC/LiveKit patterns already accommodate edge co-processing. The cofounders' Intel/Meta background gives them depth on packaging for diverse runtime environments.

What we're watching

  • Conversion from the 1,000+ developer waitlist/early-access cohort into paid production workloads with measurable minutes-analyzed.
  • First marquee design wins outside the YC W26 cohort — particularly in physical security, robotics tele-ops, and interactive consumer products where latency is a hard requirement.
  • How the SDK surface evolves as Gemini Live and OpenAI Realtime expand video — does Overshoot stay clearly better on reliability and observability, or does the gap compress?
  • Pricing model maturity — moving from waitlist/usage credits to per-minute-analyzed SLA-backed pricing comparable to Rekognition Streaming Video Events.

References

  [1] Google. Gemini API: Live (Multimodal streaming)
  [2] OpenAI. Realtime API Guide
  [3] LiveKit. Docs (WebRTC real-time media platform)
  [4] WebRTC. Overview
  [5] Latent Space. The Realtime AI Playbook (latency budgets)
  [6] Grand View Research. Video Analytics Market
  [7] Grand View Research. Video Surveillance Market
  [8] Grand View Research. Computer Vision Market
  [9] AWS. Rekognition Streaming Video Events (architecture/pricing model)
  [10] Roboflow. Inference (deployment + streaming)
  [11] Twelve Labs. Product Overview
  [12] Coactive AI. Platform Overview
  [13] Overshoot. Website / Docs
  [14] LiveKit. Agents Framework