LVBench

An Extreme Long Video Understanding Benchmark

Introduction

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos, including TV series, sports broadcasts, and everyday surveillance footage, and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. By leveraging a combination of manual annotations and model-assisted techniques, we have created a robust video understanding question-answer dataset. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations of various baseline models reveal that current multimodal large language models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension.

Leaderboard

Accuracy scores on LVBench.

| # | Model | Throughput | LLM Params | Date | Overall (%) | ER (%) | EU (%) | KIR (%) | TG (%) | Rea (%) | Sum (%) |
|---|-------|------------|------------|------|-------------|--------|--------|---------|--------|---------|---------|
| 1 | Gemini 1.5 Pro | 3600 | - | 2024-06-11 | 33.1 | 32.1 | 30.9 | 39.3 | 31.8 | 27.0 | 32.8 |
| 2 | LLaVA-NeXT-Video-DPO (34B) | 32 | 34B | 2024-06-11 | 32.2 | 30.1 | 31.2 | 34.1 | 31.4 | 35.0 | 27.6 |
| 3 | GPT-4o | 10 | - | 2024-06-11 | 27.0 | 26.5 | 23.7 | 28.3 | 21.4 | 28.0 | 32.8 |
| 4 | PLLaVA 34B | 16 | 34B | 2024-06-11 | 26.1 | 25.0 | 24.9 | 26.2 | 21.4 | 30.0 | 25.9 |
| 5 | LWM | >3600 | 7B | 2024-06-11 | 25.5 | 24.7 | 24.8 | 26.5 | 28.6 | 30.5 | 22.4 |
| 6 | LLaMA-VID | >10800 | 13B | 2024-06-11 | 23.9 | 25.4 | 21.7 | 23.4 | 26.4 | 26.5 | 17.2 |
| 7 | MovieChat | >10000 | 7B | 2024-06-11 | 22.5 | 21.3 | 23.1 | 25.9 | 22.3 | 24.0 | 17.2 |
| 8 | TimeChat | >96 | 7B | 2024-06-11 | 22.3 | 21.9 | 21.7 | 25.9 | 22.7 | 25.0 | 24.1 |

Task abbreviations: ER = entity recognition, EU = event understanding, KIR = key information retrieval, TG = temporal grounding, Rea = reasoning, Sum = summarization.

LVBench

Example

Statistics


(Left) Video categories. Our dataset contains 6 major categories and 21 subcategories.
(Right) Performance radar chart of different models on LVBench.

Benchmark Comparison


Comparison of different datasets. Open-domain indicates whether the videos are drawn from diverse sources. Multi-type indicates whether the benchmark covers more than two question categories.

Experimental Results

Answer Distribution


Distribution of answers generated by different models.

Model vs Human

LVBench evaluation results across different video categories.

Citation

@misc{wang2024lvbench,
      title={LVBench: An Extreme Long Video Understanding Benchmark},
      author={Weihan Wang and Zehai He and Wenyi Hong and Yean Cheng and Xiaohan Zhang and Ji Qi and Shiyu Huang and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2024},
      eprint={2406.08035},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}