HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

1 Center for Research in Computer Vision, University of Central Florida; 2 Microsoft Research.

CVPR 2025

HierarQ is a task-aware Hierarchical Q-Former-based framework that processes videos auto-regressively without frame sampling. This design preserves efficient processing while allowing HierarQ to maintain a task-focused, human-like cognitive approach, dynamically emphasizing relevant details according to task requirements.

Effectiveness of HierarQ in capturing task-relevant information. HierarQ adaptively focuses on task-relevant video segments, achieving a task-aware, comprehensive understanding. Here, color-coded frames are shown to demonstrate how entity-focused information complements the broader prompt-relevant context, enhancing overall video relevance and understanding.

Abstract

Despite advancements in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former-based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding the LLM's context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness into video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks, which enable our proposed Hierarchical Querying transformer (HierarQ) to effectively capture short- and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, proving its robustness and efficiency for comprehensive video analysis.

Architecture

(Left) Overview of our framework, which sequentially processes video frames, modulating task-relevant entity and scene features with a two-stream feature modulator. The proposed HierarQ (Hierarchical Q-Former) with dedicated memory banks integrates these features, producing a refined understanding that is passed to an LLM for the final response. (Right) HierarQ models the hierarchical relationship between the Entity-level Q-Former and Scene-level Q-Former, using dedicated memory banks to integrate short-term details with long-term context for enhanced video understanding.
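The sequential, memory-bank-driven pipeline described above can be sketched in pure Python. This is an illustrative reconstruction of the control flow only, not the authors' implementation: the feature extractor, the two Q-Formers, and the LLM are stubbed out, and all class and function names (`MemoryBank`, `modulate`, `hierarq_forward`) are assumptions for the sketch.

```python
# Control-flow sketch of HierarQ's sequential processing (illustrative only).
# Frames are represented as plain feature vectors (lists of floats).

from collections import deque

class MemoryBank:
    """Fixed-capacity FIFO store of past features (stand-in for the
    dedicated entity/scene memory banks)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, feat):
        self.buffer.append(feat)

    def context(self):
        return list(self.buffer)

def modulate(frame_feat, task_emb):
    """Language-guided feature modulation stub: scale frame features
    element-wise by a task-relevance embedding."""
    return [f * t for f, t in zip(frame_feat, task_emb)]

def hierarq_forward(frames, task_emb, short_cap=4, long_cap=16):
    """Process frames one at a time (no frame sampling), maintaining a
    short-term entity bank and a long-term scene bank."""
    entity_bank = MemoryBank(short_cap)   # short-term, frame-level entities
    scene_bank = MemoryBank(long_cap)     # long-term, scene-level context
    for frame_feat in frames:             # sequential, auto-regressive pass
        entity_feat = modulate(frame_feat, task_emb)   # entity stream
        entity_bank.add(entity_feat)
        # Scene stream aggregates the entity bank (a mean pool standing in
        # for the hierarchical Entity->Scene Q-Former interaction).
        ctx = entity_bank.context()
        scene_feat = [sum(vals) / len(ctx) for vals in zip(*ctx)]
        scene_bank.add(scene_feat)
    # The scene bank's contents are the refined context handed to the LLM.
    return scene_bank.context()
```

Because both banks are bounded deques, memory stays constant regardless of video length, which is what lets the model avoid frame sampling and the LLM's context limit in this sketch.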

Quantitative Results

Medium to long video understanding performance on LVU (Left) and Breakfast, COIN (Right).

Long video question answering on MovieChat-1K (Left). Short video question answering performance on MSRVTT-QA (denoted by MSR-QA), MSVD-QA and ActivityNet-QA (denoted by ANet-QA) (Right).

Qualitative Results

Qualitative analysis of long-video question answering on MovieChat-1K. Here, HierarQ adaptively focuses on task-relevant video segments, achieving a task-aware, comprehensive understanding. Color-coded frames are shown to demonstrate how entity-focused information complements the broader prompt-relevant context, enhancing overall video relevance and understanding.

BibTeX

@article{azad2025hierarq,
  title={HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding},
  author={Azad, Shehreen and Vineet, Vibhav and Rawat, Yogesh Singh},
  journal={arXiv preprint arXiv:2503.08585},
  year={2025}
}