The code and data of Paper: Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

OpenGVLab/PhyGenBench

PhyGenBench

Quick Start | HomePage | arXiv | Citation

This repository is the official implementation of PhyGenBench.

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng*, Jiaqi Liao*, Xinyu Tan, Wenqi Shao#, Quanfeng Lu, Kaipeng Zhang, Cheng Yu, Dianqi Li, Yu Qiao, Ping Luo#
* MFQ and LJQ contribute equally.
# SWQ ([email protected]) and LP are corresponding authors.

💡 News

  • 2024/10/10: We release our paper at https://arxiv.org/abs/2410.05363

  • 2024/10/07: We release our homepage, which has more video examples for demonstration.

  • 2024/10/07: We have released the codes and data.

🎩Introduction

We introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which together comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure that uses appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic physical phenomena). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications.

[Figure: overview]

📖PhyGenEval

[Figure: overview]

We design a progressive strategy that starts with key physical phenomena, then moves through the sequence of several key phenomena, and finally evaluates the overall naturalness of the entire video. This hierarchical, refined approach is less difficult than existing methods that directly use VLMs to evaluate physical commonsense, enabling PhyGenEval to achieve results closely aligned with human judgements.

🏆 Leaderboard

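| Model          | Size | Mechanics(↑) | Optics(↑) | Thermal(↑) | Material(↑) | Average(↑) | Human(↑) |
|----------------|------|--------------|-----------|------------|-------------|------------|----------|
| CogVideoX      | 2B   | 0.38         | 0.43      | 0.34       | 0.39        | 0.37       | 0.31     |
| CogVideoX      | 5B   | 0.43         | 0.55      | 0.40       | 0.42        | 0.45       | 0.37     |
| Open-Sora V1.2 | 1.1B | 0.43         | 0.50      | 0.44       | 0.37        | 0.44       | 0.35     |
| Lavie          | 860M | 0.40         | 0.44      | 0.38       | 0.32        | 0.36       | 0.30     |
| Vchitect 2.0   | 2B   | 0.41         | 0.56      | 0.44       | 0.37        | 0.45       | 0.36     |
| Pika           | -    | 0.35         | 0.56      | 0.43       | 0.39        | 0.44       | 0.36     |
| Gen-3          | -    | 0.45         | 0.57      | 0.49       | 0.51        | 0.51       | 0.48     |
| Kling          | -    | 0.45         | 0.58      | 0.50       | 0.40        | 0.49       | 0.44     |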
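
As a rough sanity check on how the automated Average column tracks the Human column, the leaderboard rows can be compared directly. The sketch below copies the two columns from the leaderboard and computes a Pearson correlation across the eight models. This is only an illustration on the published aggregate scores, not the human-alignment protocol used in the paper, and the helper names `pearson` and `top_model` are hypothetical.

```python
from math import sqrt

# model -> (PhyGenEval Average, Human score), copied from the leaderboard above.
LEADERBOARD = {
    "CogVideoX 2B": (0.37, 0.31),
    "CogVideoX 5B": (0.45, 0.37),
    "Open-Sora V1.2": (0.44, 0.35),
    "Lavie": (0.36, 0.30),
    "Vchitect 2.0": (0.45, 0.36),
    "Pika": (0.44, 0.36),
    "Gen-3": (0.51, 0.48),
    "Kling": (0.49, 0.44),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def top_model(column):
    """Return the model with the highest score in column 0 (Average) or 1 (Human)."""
    return max(LEADERBOARD, key=lambda name: LEADERBOARD[name][column])


if __name__ == "__main__":
    averages = [scores[0] for scores in LEADERBOARD.values()]
    humans = [scores[1] for scores in LEADERBOARD.values()]
    print("Pearson r (Average vs Human):", round(pearson(averages, humans), 3))
    print("Top by Average:", top_model(0), "| Top by Human:", top_model(1))
```

With only eight data points the coefficient is indicative at best; it shows the direction of agreement rather than a statistically robust estimate.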

🚀 Quick Start

File Structure

  • PhyGenBench includes the test prompts (prompts.json) and explicit_captions.json, which is used to evaluate prompt rewriting in the Discussion (Appendix D)
  • PhyGenBench also includes the example questions we showcase for the three stages of PhyGenEval: single_question.json, multi_question.json, and video_question.json
  • PhyGenEval includes the code for the semantic evaluation method, as well as the code for the three-stage physical commonsense evaluation methods: single, multi, video
  • result contains the evaluation results of Kling on PhyGenBench: kelingall.json
  • PhyVideos contains the videos to be tested. Please generate the videos according to prompts.json and place them here.
    • For example, for Kling, name the files as output_video_{index+1}.mp4, where index corresponds to the prompt number in prompts.json
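
Before running the evaluation, it can help to check that every prompt has a video with the expected name. The following is a minimal sketch, assuming prompts.json is a top-level JSON array with one entry per prompt (its actual schema is not described here). The helpers `expected_video_names`, `missing_videos`, and `count_prompts` are hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def expected_video_names(num_prompts):
    """Return the file names expected for prompt indices 0..num_prompts-1."""
    # The convention is output_video_{index+1}.mp4, so file names are 1-based.
    return [f"output_video_{index + 1}.mp4" for index in range(num_prompts)]


def missing_videos(video_dir, num_prompts):
    """List expected video files that are not present in video_dir."""
    video_dir = Path(video_dir)
    return [
        name
        for name in expected_video_names(num_prompts)
        if not (video_dir / name).is_file()
    ]


def count_prompts(prompts_path):
    """Count prompts, ASSUMING the file holds a top-level JSON array."""
    with open(prompts_path, encoding="utf-8") as handle:
        return len(json.load(handle))
```

For example, `missing_videos("PhyVideos", count_prompts("PhyGenBench/prompts.json"))` would list any videos still to be generated, if the assumption about the file layout holds.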

Environment

git clone https://github.com/OpenGVLab/PhyGenBench
cd PhyGenBench
  • First stage: Key Physical Phenomena Detection: Whether you use GPT-4o or the open-source models, the evaluation starts with this stage, which uses VQAScore.

    You can refer to the official repo for details.

  • Second stage: Physics Order Verification: For the closed-source model, we use GPT-4o (only requires API configuration), and for the open-source model, we use LLava-Interleave-dpo-7B.

    Please refer to the official repo for details.

  • Third stage: Overall Naturalness Evaluation: For the closed-source model, we use GPT-4o (only requires API configuration), and for the open-source model, we use InternVideo2, which is the same model used by ChronoMagic-Bench.

    We use the same environment as ChronoMagic-Bench. The model checkpoint is on Hugging Face.

If you only want to use the closed-source model for testing, you only need to configure the VQAScore environment. If you want to perform an ensemble of both closed-source and open-source models, you need to configure VQAScore, LLava-Interleave, and InternVideo2 environments, and download the models.

Question Generation

First, we generate corresponding questions for Key Physical Phenomena Detection, Physics Order Verification, and Overall Naturalness Evaluation. To simplify the expression, we refer to them as the single stage, multi stage, and video stage based on the VLM used.

# single
python PhyGenEval/single/generate_question.py

# multi
python PhyGenEval/multi/generate_question.py

# video
python PhyGenEval/video/generate_question.py

PhyGenBench/single_question.json, PhyGenBench/multi_question.json, and PhyGenBench/video_question.json are questions we generated at different stages.

Three-tier Evaluation

All of our evaluations run on a single A100-80G. The Python files that need to be run are listed below for each stage. Please write the appropriate launch script for your system (Slurm or other).

Key Physical Phenomena Detection:

python PhyGenEval/single/vqascore.py

Physics Order Verification:

# the vqascore environment may conflict with the llava-interleave environment,
# so we first retrieve the keyframes and then do the multi-image QA

# first do the retrieval and record the retrieval score
# the environment is the same as for vqascore

python PhyGenEval/multi/multiimage_clip.py

# then do the multi-image qa
# for gpt-4o
python PhyGenEval/multi/GPT4o.py

# for llava
cd PhyGenEval/multi/LLaVA-NeXT-interleave_inference
python llava/eval/model_vqa_multi.py

Overall Naturalness Evaluation:

# for gpt4o
python PhyGenEval/video/GPT4o.py

# for internvideo2
cd PhyGenEval/video/MTScore
python InternVideo_physical.py

Overall Score Calculation

python PhyGenEval/overall.py
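
The commands above run in a fixed order, so the sequence can be written down once. The sketch below only assembles and prints the commands by default; `build_steps` and `run_steps` are hypothetical helpers, not part of the repository. It assumes it is launched from the repository root and that `python` resolves to an interpreter in the right environment. Because the open-source steps need their own environments, printing the plan and running each step by hand may be the safer option there.

```python
import subprocess


def build_steps(backend="gpt4o", generate_questions=False):
    """Return ordered (working_dir, argv) pairs for one evaluation run.

    backend: "gpt4o" for the closed-source path, "open" for
    LLaVA-Interleave + InternVideo2.
    """
    if backend not in ("gpt4o", "open"):
        raise ValueError(f"unknown backend: {backend!r}")

    steps = []
    if generate_questions:
        # Optional: regenerate the questions for each of the three stages.
        for stage in ("single", "multi", "video"):
            steps.append((".", ["python", f"PhyGenEval/{stage}/generate_question.py"]))

    # Stage 1: key physical phenomena detection (VQAScore), shared by both paths.
    steps.append((".", ["python", "PhyGenEval/single/vqascore.py"]))
    # Stage 2a: keyframe retrieval, run in the VQAScore environment.
    steps.append((".", ["python", "PhyGenEval/multi/multiimage_clip.py"]))

    if backend == "gpt4o":
        steps.append((".", ["python", "PhyGenEval/multi/GPT4o.py"]))
        steps.append((".", ["python", "PhyGenEval/video/GPT4o.py"]))
    else:
        steps.append((
            "PhyGenEval/multi/LLaVA-NeXT-interleave_inference",
            ["python", "llava/eval/model_vqa_multi.py"],
        ))
        steps.append(("PhyGenEval/video/MTScore", ["python", "InternVideo_physical.py"]))

    # Final step: combine the stage scores.
    steps.append((".", ["python", "PhyGenEval/overall.py"]))
    return steps


def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for working_dir, argv in steps:
        print(f"[{working_dir}] {' '.join(argv)}")
        if not dry_run:
            subprocess.run(argv, cwd=working_dir, check=True)


if __name__ == "__main__":
    run_steps(build_steps(backend="gpt4o"))
```

Keeping the plan as data makes it easy to translate into a shell or Slurm script for a particular cluster.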

🎬Qualitative Analysis

overview

📒Note

  • The version of GPT-4o is gpt4o-0513
  • If using GPT-4o for testing, you only need to configure VQAScore and prepare the API key. This method also provides results that are highly consistent with human feedback.

📧 Contact

If you have any questions, feel free to contact Fanqing Meng at [email protected]
