
# Awesome-LLMs-for-Video-Understanding

Yunlong Tang1,\*, Jing Bi1,\*, Siting Xu2,\*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Pinxin Liu1, Mingqian Feng1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,† (\*Core Contributors, †Corresponding Authors)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong


## 📢 News

[07/23/2024]

📢 We've recently updated our survey: “Video Understanding with Large Language Models: A Survey”!

✨ This comprehensive survey covers video understanding techniques powered by large language models (Vid-LLMs), training strategies, relevant tasks, datasets, benchmarks, and evaluation methods, and discusses the applications of Vid-LLMs across various domains.

🚀 What's New in This Update:
✅ Updated to include around 100 additional Vid-LLMs and 15 new benchmarks as of June 2024.
✅ Introduced a novel taxonomy for Vid-LLMs based on video representation and LLM functionality.
✅ Added a Preliminary chapter, reclassifying video understanding tasks from the perspectives of granularity and language involvement, and enhanced the LLM Background section.
✅ Added a new Training Strategies chapter, removing adapters as a factor for model classification.
✅ All figures and tables have been redesigned.

Several minor updates will follow this major one, and the GitHub repository will be updated gradually. We welcome your reading and feedback ❤️

## Table of Contents

## Why do we need Vid-LLMs?


## 😎 Vid-LLMs: Models


## 📑 Citation

If you find our survey useful for your research, please cite the following paper:

```bibtex
@article{vidllmsurvey,
  title={Video Understanding with Large Language Models: A Survey},
  author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
  journal={arXiv preprint arXiv:2312.17432},
  year={2023}
}
```

## 🗒️ Taxonomy

### 🕹️ Video Analyzer × LLM

#### LLM as Summarizer

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| Seeing the Unseen: Visual Metaphor Captioning for Videos | GIT-LLaVA | 06/2024 | code | arXiv |
| Zero-shot Long-form Video Understanding through Screenplay | MM-Screenplayer | 06/2024 | project page | CVPR |
| MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | MoReVQA | 04/2024 | project page | CVPR |
| An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | IG-VLM | 03/2024 | code | arXiv |
| Language Repository for Long Video Understanding | LangRepo | 03/2024 | code | arXiv |
| Understanding Long Videos in One Multimodal Language Model Pass | MVU | 03/2024 | code | arXiv |
| Video ReCap: Recursive Captioning of Hour-Long Videos | Video ReCap | 02/2024 | code | CVPR |
| A Simple LLM Framework for Long-Range Video Question-Answering | LLoVi | 12/2023 | code | arXiv |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | code | arXiv |
| Learning Object State Changes in Videos: An Open-World Perspective | VIDOSC | 12/2023 | code | CVPR |
| AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? | AntGPT | 07/2023 | code | ICLR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |

#### LLM as Manager

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| DrVideo: Document Retrieval Based Long Video Understanding | DrVideo | 06/2024 | code | arXiv |
| OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer | OmAgent | 06/2024 | code | arXiv |
| Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | LVNet | 06/2024 | code | arXiv |
| VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | VideoTree | 05/2024 | code | arXiv |
| Hawk: Learning to Understand Open-World Video Anomalies | Hawk | 05/2024 | code | arXiv |
| Harnessing Large Language Models for Training-free Video Anomaly Detection | LAVAD | 04/2024 | code | CVPR |
| TraveLER: A Multi-LMM Agent Framework for Video Question-Answering | TraveLER | 04/2024 | code | arXiv |
| GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features | GPTSee | 03/2024 | code | arXiv |
| Reframe Anything: LLM Agent for Open World Video Reframing | RAVA | 03/2024 | code | arXiv |
| SCHEMA: State CHangEs MAtter for Procedure Planning in Instructional Videos | SCHEMA | 03/2024 | code | ICLR |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |
| VideoAgent: Long-form Video Understanding with Large Language Model as Agent | VideoAgent | 03/2024 | code | arXiv |
| VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding | VURF | 03/2024 | code | arXiv |
| Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos | KEPP | 03/2024 | code | CVPR |
| TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning | TV-TREES | 02/2024 | code | arXiv |
| DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models | DoraemonGPT | 01/2024 | code | arXiv |
| LifelongMemory: Leveraging LLMs for Answering Queries in Long-form Egocentric Videos | LifelongMemory | 12/2023 | code | arXiv |
| Zero-Shot Video Question Answering with Procedural Programs | ProViQ | 12/2023 | code | arXiv |
| AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn | AssistGPT | 06/2023 | code | arXiv |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| ViperGPT: Visual Inference via Python Execution for Reasoning | ViperGPT | 03/2023 | code | arXiv |

### 🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | project page | RSS |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |

### 👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

### 👀 Vid-LLM Instruction Tuning

#### Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| AutoAD III: The Prequel - Back to the Pixels | AutoAD III | 04/2024 | - | CVPR |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |
| VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | VideoLLaMA2 | 06/2024 | code | arXiv |

#### Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

#### Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

### 🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | CVPR |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

### 🦾 Training-free Methods

| Title | Model | Date | Code | Venue |
|---|---|---|---|---|
| SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | SlowFast-LLaVA | 07/2024 | - | arXiv |

## Tasks, Datasets, and Benchmarks

### Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

### Captioning and Description

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing Web Videos Using Titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans. Multimedia |
| Multi-Source Video Captioning (MSVC) | VideoLLaMA2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024 | Link | arXiv |

### Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

### Question Answering

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |
| MAD-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |
| Ego-QA | Encoding and Controlling Global Semantics for Long-form Video Question Answering | 2024 | Link | EMNLP |

### Video Instruction Tuning

#### Pretraining Dataset

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |

#### Fine-tuning Dataset

| Name | Paper | Date | Link | Venue |
|---|---|---|---|---|
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |
| TimeIT | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | 2023 | Link | CVPR |

### Video-based Large Language Models Benchmark

| Title | Date | Code | Venue |
|---|---|---|---|
| LVBench: An Extreme Long Video Understanding Benchmark | 06/2024 | code | - |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | ACL 2024 |
| Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis | 06/2024 | code | - |
| VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models | 06/2024 | code | - |

## Contributing

Contributions are welcome! You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors you find. Please make sure your pull requests follow the `Title|Model|Date|Code|Venue` column format of the existing tables. Thank you for your valuable contributions!
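For illustration, a new entry added to one of the model tables would be a single markdown table row in that column order (the title and link below are taken from an entry already in this list; replace them with your own paper's details):

```markdown
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
```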

## 🌟 Star History

Star History Chart

## ♥️ Contributors