StreamReady: Learning What to Answer and When in Long Streaming Videos

1 Center for Research in Computer Vision, University of Central Florida; 2 Microsoft Research.

CVPR 2026

Readiness-aware streaming video understanding. Left: In proactive streaming settings, questions can precede their supporting evidence, requiring the model to monitor the evolving video and answer once the relevant cues appear. Right: Under our readiness-aware formulation, effective accuracy jointly reflects answer correctness and timing via the Answer Readiness Score (ARS). Although all models achieve similar raw accuracy on this example, ARS reveals sharp performance drops for early (hallucinatory) or late (delayed) answers. In contrast, StreamReady responds within the evidence window, preserving high effective accuracy by answering at the appropriate moment.

Abstract

Streaming video understanding often involves time-sensitive scenarios where models need to answer exactly when the supporting visual evidence appears: answering before the evidence reflects speculation, while answering after it has passed reduces real-time utility. To capture this behavior, we introduce a readiness-aware formulation of streaming video understanding with the Answer Readiness Score (ARS), a timing-aware objective with asymmetric early and late penalties. When combined with correctness, ARS defines an effective accuracy that measures not just whether a model is right, but whether it answers at the appropriate moment. Building on this formulation, we introduce StreamReady, a framework that unifies temporal reasoning with on-time answering through a lightweight readiness mechanism that decides whether sufficient evidence has been observed before responding. To evaluate this capability, we further introduce ProReady-QA, a benchmark with annotated answer evidence windows and proactive multi-turn questions across local and global contexts. StreamReady achieves superior performance on ProReady-QA and consistently outperforms prior methods across eight additional streaming and offline long-video benchmarks, demonstrating robust and broadly generalizable video understanding capability.
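To make the idea of a timing-aware score concrete, here is a minimal illustrative sketch. The abstract does not give the ARS formula, so everything below is an assumption: the linear decay, the function names `answer_readiness_score` and `effective_accuracy`, the rate parameters and their default values, and the choice to combine correctness and timing as a product. It only shows how separate early and late rates yield an asymmetric penalty around an annotated evidence window.

```python
def answer_readiness_score(t_answer, win_start, win_end,
                           early_rate=0.5, late_rate=0.1):
    """Hypothetical timing score in [0, 1]; NOT the paper's definition.

    Returns 1.0 when the answer time falls inside the evidence window
    [win_start, win_end], and decays linearly outside it. The two rates
    are independent, which is what makes the penalty asymmetric.
    """
    if t_answer < win_start:
        # Answered before the evidence appeared (early).
        return max(0.0, 1.0 - early_rate * (win_start - t_answer))
    if t_answer > win_end:
        # Answered after the evidence had passed (late).
        return max(0.0, 1.0 - late_rate * (t_answer - win_end))
    return 1.0


def effective_accuracy(is_correct, t_answer, win_start, win_end, **rates):
    """Assumed combination: correctness (0 or 1) times the timing score."""
    score = answer_readiness_score(t_answer, win_start, win_end, **rates)
    return float(is_correct) * score


if __name__ == "__main__":
    # Evidence window from t=10 to t=15 (arbitrary time units).
    for t in (8, 12, 20):
        print(t, effective_accuracy(True, t, 10, 15))
```

With this toy form, a correct answer inside the window keeps full credit, while the same correct answer given too early or too late is discounted by its distance from the window.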

Architecture

Framework Overview. StreamReady encodes streaming videos into a visual memory tree and reasons through short and long-term branches. A learnable token, guided by a readiness head, gates the reasoning output until sufficient evidence is observed. Once ready, the long-term representation, enriched with contextual information from past QA pairs, is sent to the LLM for answering, enabling readiness-aware streaming behavior.
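The gating behavior described above can be sketched as a simple streaming loop. This is a schematic under stated assumptions, not the StreamReady implementation: a plain list stands in for the visual memory tree, and `readiness_fn`, `answer_fn`, `gated_answer`, and the `threshold` value are hypothetical placeholders for the readiness head, the LLM call, and the gate.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def gated_answer(
    frames: Iterable[Any],
    readiness_fn: Callable[[List[Any]], float],
    answer_fn: Callable[[List[Any]], Any],
    threshold: float = 0.5,
) -> Optional[Tuple[int, Any]]:
    """Withhold the answer until the readiness score reaches the threshold.

    Returns (frame_index, answer) at the first ready step, or None if the
    stream ends before the gate opens.
    """
    memory: List[Any] = []  # stand-in for the visual memory (assumption)
    for t, frame in enumerate(frames):
        memory.append(frame)
        # Hypothetical readiness head: scores the evidence seen so far.
        if readiness_fn(memory) >= threshold:
            # Gate opens: only now is the answering model invoked.
            return t, answer_fn(memory)
    return None


if __name__ == "__main__":
    stream = ["street", "door", "key on table", "hallway"]
    is_ready = lambda mem: 1.0 if any("key" in f for f in mem) else 0.0
    respond = lambda mem: f"answered after {len(mem)} frames"
    print(gated_answer(stream, is_ready, respond))
```

The point of the sketch is the control flow: the answering step is never called while the readiness score stays below the threshold, so the response time is determined by when evidence arrives rather than by when the question was asked.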

BibTeX

@inproceedings{azad2026streamready,
  title={{StreamReady}: Learning What to Answer and When in Long Streaming Videos},
  author={Azad, Shehreen and Vineet, Vibhav and Rawat, Yogesh S},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={40494--40504},
  year={2026}
}