HLV-1K

A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

Introduction

Multimodal large language models (MLLMs) have become a popular topic in deep visual understanding thanks to many promising real-world applications. However, hour-long video understanding, which involves videos spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) the difficulty of long-term video analysis, 2) the inefficiency of large-model approaches on such long inputs, and 3) the lack of large-scale benchmark datasets. In this paper, we focus on building a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1,009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark with existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and across diverse tasks. This includes promoting future fine-grained long video understanding applications, such as deep understanding of long live videos, meeting recordings, and movies.
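To make the annotation format concrete, below is a hypothetical sketch of what a single time-aware HLV-1K-style entry might look like. All field names here (`video_id`, `reasoning_level`, `query_time`, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical sketch of one time-aware MCQA annotation; field names are
# assumptions for illustration, not HLV-1K's released schema.
example_entry = {
    "video_id": "example_0001",
    "duration_sec": 4120,                    # an hour-long video
    "question_type": "MCQA",                 # or "QA"
    "reasoning_level": "cross-event-level",  # one of the four task levels
    "query_time": [1200.0, 1860.0],          # time-specific query window (s)
    "question": "Between 20:00 and 31:00, what does the presenter do "
                "after finishing the demonstration?",
    "options": ["A. ...", "B. ...", "C. ...", "D. ..."],
    "answer": "B",
}

def filter_by_level(entries, level):
    """Select annotations belonging to one of the four reasoning levels."""
    return [e for e in entries if e["reasoning_level"] == level]
```

Grouping entries by `reasoning_level` in this way is what allows accuracy to be reported separately per level, as in the leaderboard below.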

Leaderboard

Accuracy scores (%) on HLV-1K are reported at the frame, within-event, cross-event, and long-term levels.

| # | Model | LLM Params | Frames | Date | Frame-level | Within-event-level | Cross-event-level | Long-term-level | Overall |
|---|-------|------------|--------|------|-------------|--------------------|-------------------|-----------------|---------|
| 1 | LLaVA-Video | 72B | 120 | 2025-01-03 | 84.41 | 78.43 | 80.10 | 75.65 | 78.93 |
| 2 | LLaVA-OneVision | 72B | 120 | 2025-01-03 | 80.33 | 75.06 | 77.25 | 68.74 | 74.01 |
| 3 | Qwen2-VL | 72B | 120 | 2025-01-03 | 61.44 | 66.83 | 66.96 | 67.17 | 65.78 |
| 4 | Kangaroo | 8B | 120 | 2025-01-03 | 75.23 | 63.57 | 65.04 | 54.60 | 62.71 |
| 5 | Gemini 1.5 Pro | - | 120 | 2025-01-03 | 60.39 | 64.46 | 63.08 | 62.37 | 62.41 |
| 6 | LongVA | 7B | 120 | 2025-01-03 | 67.89 | 59.12 | 61.37 | 59.67 | 61.74 |
| 7 | InternVL2.5 | 8B | 120 | 2025-01-03 | 60.72 | 65.02 | 62.73 | 59.34 | 61.24 |
| 8 | GPT-4o | - | 120 | 2025-01-03 | 53.88 | 59.08 | 56.64 | 54.37 | 55.48 |
| 9 | Claude 3.5 Sonnet | - | 20 | 2025-01-03 | 26.21 | 23.98 | 27.73 | 28.89 | 27.24 |

A green date indicates newly added or updated models; "-" in the LLM Params column indicates closed-source models.
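Note that the Overall column is not a plain mean of the four level scores (e.g., LLaVA-Video's four levels average 79.65, not 78.93), which suggests it is computed over all questions, with each level weighted by its question count. A minimal sketch of this aggregation, assuming per-question correctness records tagged with their reasoning level:

```python
from collections import defaultdict

def level_and_overall_accuracy(records):
    """records: iterable of (reasoning_level, is_correct) pairs.

    Returns (per-level accuracy dict, question-weighted overall accuracy),
    both as percentages. Levels with more questions contribute more to the
    overall score, so overall need not equal the mean of the level scores.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, ok in records:
        total[level] += 1
        correct[level] += int(ok)
    per_level = {lvl: 100.0 * correct[lvl] / total[lvl] for lvl in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return per_level, overall
```

For example, one correct answer out of two frame-level questions plus one correct long-term question gives 50.0 and 100.0 at the respective levels, but an overall of 66.67, not 75.0.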

Benchmark

Data Examples


Benchmark construction and examples.

Benchmark Statistics


HLV-1K: (a) Video category distribution, (b) Video duration distribution, and (c) Duration distribution of time-specific query.


HLV-1K: Distribution of benchmark annotations.

Benchmark Comparison

Comparison of HLV-1K with existing benchmarks.

Experiment Results

Different Question Types


Evaluation results of four representative MLLMs.

Citation


    @article{zou2025hlv1k,
      title={HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding},
      author={Heqing Zou and Tianze Luo and Guiyang Xie and Victor (Xiao Jie) Zhang and Fengmao Lv and Guangcong Wang and Junyang Chen and Zhuochen Wang and Hansheng Zhang and Huaijian Zhang},
      year={2024}
    }