diff --git a/index.html b/index.html new file mode 100644 index 0000000..a24ec9b --- /dev/null +++ b/index.html @@ -0,0 +1,475 @@ + + + + + + + + + + MantisScore + + + + + + + + + + + + + + + + + + + + +
+
+
+
+
+

MantisScore: A Reliable Fine-grained Metric for Video Generation

+
+ + 1,2Xuan He*, + + + 1Dongfu Jiang*, + + + 1Ge Zhang, + + + 1Max Ku, + + + 1Achint Soni, + + + 1Sherman Siu, + + + 1Haonan Chen, + + + 1Abhranil Chandra, + + + + 1Ziyan Jiang, + + + 1Aaran Arulraj, + + + 3Kai Wang, + + + 1Quy Duc Do, + + + 1Yuansheng Ni, + + + 2Bohan Lyu, + + + 1Yaswanth Narsupalli, + + + 1Rongqi Fan, + + + 1Zhiheng Lyu, + + + 4Bill Yuchen Lin, + + + 1Wenhu Chen + + +
+ +
+

+ + + *Equal Contribution + +
+ +
+ + 1University of Waterloo, + 2Tsinghua University, + 3University of Toronto, + 4AI2 + +
+ + + + +
+ +
+ +
+ +
+
+
+
+
+ +
+
+ +
+
+

Abstract

+
+

+ Recent years have witnessed great advances in text-to-video generation. However, video evaluation metrics have lagged significantly behind and fail to produce an accurate and holistic measure of generated video quality. The main barrier is the lack of high-quality human rating data. In this paper, we release VideoEval, the first large-scale multi-aspect video evaluation dataset. VideoEval consists of high-quality human-provided ratings for 5 video evaluation aspects on 37.6K videos generated by 11 existing popular video generative models. We train MantisScore on VideoEval to enable automatic video quality assessment. Experiments show that the Spearman correlation between MantisScore and humans reaches 77.1 on the VideoEval test set, beating the prior best metrics by about 50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that MantisScore is highly generalizable and still beats the prior best metrics by a remarkable margin. We observe that using Mantis as the base model consistently beats using Idefics2 and VideoLLaVA, and that the regression-based model achieves better results than the generative one. Due to its high reliability, we believe MantisScore can serve as a valuable tool for accelerating video generation research.

    + +
  1. VideoEval Dataset. We release the first large-scale multi-dimension video evaluation dataset, consisting of 37.6K text-to-video pairs with human-annotated scores.
  2. MantisScore. We introduce MantisScore, a video quality evaluator trained on the VideoEval dataset with Mantis-8B-Idefics2 as the base model.
  3. Evaluation. We test our model on VideoEval-test and three other benchmarks: EvalCrafter, GenAI-Bench, and VBench, comparing against both feature-based metrics and MLLM prompting methods.
+

+ +
+
+
+ +
+
+ + +
+ +
+
+

VideoEval Dataset

+
+
+
+ +
+
+
+ +

+

    + VideoEval contains a total of 37.6K text-to-video pairs from 11 popular video generative models, with some real-world videos added as data augmentation. The videos are annotated by raters on five evaluation dimensions: Visual Quality, Temporal Consistency, Dynamic Degree, Text-to-Video Alignment, and Factual Consistency, each on a 1-4 scoring scale. Below we show two annotated video examples and a detailed description of the VideoEval dataset, followed by a minimal loading sketch. Please check out 🤗 Video-Eval on Hugging Face Datasets for usage.

    + +
    + +
    +
    +

    + +
    + +
    +
    +
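    + For reference, here is a minimal sketch of loading the dataset with the 🤗 datasets library. The repository id and field names below are placeholders, not the official identifiers; consult the Video-Eval dataset card for the actual schema.
<pre><code class="language-python">
from datasets import load_dataset

# "TIGER-Lab/VideoEval" is a hypothetical repository id; see the
# 🤗 Video-Eval dataset card for the real identifier and column names.
ds = load_dataset("TIGER-Lab/VideoEval", split="train")

example = ds[0]
# Each record pairs a generated video with its prompt and five 1-4 human
# ratings (visual quality, temporal consistency, dynamic degree,
# text-to-video alignment, factual consistency).
print(example.keys())
</code></pre>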
+ + +
+ +
+
+

MantisScore

+
+
+
+ +
+
+
+

+ MantisScore is finetuned on the 37K training set of VideoEval, taking Mantis-8B-Idefics2 as the base model. We try both a generation scoring method and a regression scoring method: the former asks the model to answer in a template predefined for video quality evaluation, while the latter outputs 5 logits as the evaluation scores for the 5 dimensions. Besides, we also run an ablation on the base model, finetuning Mantis-8B-Idefics2, Idefics2-8B, and VideoLLaVA-7B; Mantis-8B-Idefics2 turns out to give the best performance on video quality evaluation.
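+ To make the two scoring modes concrete, below is an illustrative sketch; the class and function names are our own assumptions, not the released MantisScore implementation. Regression mode maps the MLLM's final hidden state to 5 aspect scores, while generation mode parses scores back out of a templated text answer.
<pre><code class="language-python">
import re
import torch
import torch.nn as nn

class RegressionScoringHead(nn.Module):
    """Regression mode: map the MLLM's final hidden state to 5 aspect scores."""
    def __init__(self, hidden_size: int, num_aspects: int = 5):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_aspects)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Summarize the (video, prompt) pair with the last token's hidden state
        # and regress one score per evaluation dimension.
        return self.proj(last_hidden_state[:, -1, :])

def parse_generated_scores(text: str) -> dict:
    """Generation mode: extract per-aspect scores from a templated answer,
    e.g. one line per dimension such as 'visual quality: 3'."""
    scores = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*([0-9.]+)\s*$", line)
        if match:
            scores[match.group(1).lower()] = float(match.group(2))
    return scores
</code></pre>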

+
+
+
+ + +
+ + + + +
+ +
+
+

Evaluation Results

+
+
+
+ + +
+
+

VideoEval-test

+ +

+ We test our video evaluator MantisScore on the VideoEval-test set. Below are the results for feature-based metrics such as PIQE, CLIP-sim, and X-CLIP-Score, MLLM prompting methods such as GPT-4o and Gemini-1.5-Pro, and our MantisScore. As seen in the table below, MantisScore surpasses the best baseline by 54.1 points on average across the 5 aspects.
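+ The correlation with human ratings is computed per aspect with Spearman's rho (reported values such as 77.1 correspond to rho scaled by 100). A minimal sketch with SciPy, using toy numbers in place of the real predictions:
<pre><code class="language-python">
import numpy as np
from scipy.stats import spearmanr

# Toy arrays standing in for one aspect's scores on VideoEval-test;
# the actual evaluation repeats this for all five aspects.
model_scores = np.array([2.7, 3.1, 1.9, 3.8, 2.2])
human_scores = np.array([3, 3, 2, 4, 2])

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3g})")
</code></pre>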

+ +
+ +
+
+ +
+
+ + +
+
+

EvalCrafter Benchmark

+ +

+ We select 3 dimensions from EvalCrafter that match our evaluation aspects and collect 2,500+ videos for testing. MantisScore surpasses all the baselines on the 3 aspects, and beats EvalCrafter (GPT-4V) on Text-to-Video Alignment.

+ +
+ +
+
+ +
+
+ + +
+
+

GenAI-Bench and VBench

+ +

+ GenAI-Bench is a multimodal benchmark that tests MLLMs' capability at preference comparison for tasks such as text-to-video generation and image editing, while VBench is a comprehensive multi-aspect benchmark suite for video generative models. For GenAI-Bench we collect 2,100+ videos; for VBench we select a subset covering 5 aspects (e.g., technical quality and subject consistency) and subsample 100 unique prompts (2,000 videos in total) for testing. For the MLLM prompting baselines and our models, we average the scores over the five aspects to produce a preference and report pairwise accuracy as the performance indicator.
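+ As a rough sketch of this preference evaluation (array shapes and names are illustrative, not the benchmarks' actual format): average each video's 5 aspect scores, prefer the video in each pair with the higher mean, and measure agreement with the human preference labels.
<pre><code class="language-python">
import numpy as np

def pairwise_accuracy(scores_a, scores_b, human_prefs):
    """scores_a / scores_b: per-aspect scores of shape (num_pairs, 5) for the two
    videos in each pair; human_prefs: 'A' or 'B' for each pair.
    The model prefers whichever video has the higher mean of its 5 aspect scores."""
    pred = np.where(scores_a.mean(axis=1) >= scores_b.mean(axis=1), "A", "B")
    return float((pred == np.asarray(human_prefs)).mean())

# Toy example with made-up numbers (the real test uses 2,100+ GenAI-Bench pairs).
a = np.array([[3, 3, 2, 4, 3], [2, 2, 2, 1, 2]], dtype=float)
b = np.array([[2, 2, 2, 3, 2], [3, 3, 3, 3, 3]], dtype=float)
print(pairwise_accuracy(a, b, ["A", "B"]))  # -> 1.0
</code></pre>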

+ +
+ +
+
+ +
+
+ +
+ + +
+
+

BibTeX

+ +
+ + + + +