diff --git a/index.html b/index.html new file mode 100644 index 0000000..a24ec9b --- /dev/null +++ b/index.html @@ -0,0 +1,475 @@ + + + + +
+ + + + ++ + Recent years have witnessed great advances
in text-to-video generation. However, video
evaluation metrics have lagged significantly behind, failing to provide an accurate and
holistic measure of the quality of generated videos. The main barrier is the lack of high-quality
human rating data. In this paper, we release
VideoEval, the first large-scale multi-aspect
video evaluation dataset. VideoEval consists
of high-quality human-provided ratings on 5
video evaluation aspects for the 37.6K videos
generated by 11 existing popular video generative models. We train MantisScore
on VideoEval to enable automatic video quality assessment. Experiments show that the
Spearman correlation between MantisScore
and human ratings reaches 77.1 on the VideoEval
test set, beating the prior best metrics by about
50 points. Further results on the held-out EvalCrafter, GenAI-Bench, and VBench show that
MantisScore is highly generalizable and
still beats the prior best metrics by a remarkable margin. We observe that using Mantis as
the base model consistently beats using
Idefics2 or VideoLLaVA, and that the regression-based model achieves better results than the
generative one. Given its high reliability, we
believe MantisScore can serve as a valuable
tool for accelerating video generation research.
+
+ + MantisScore is finetuned on the 37K-video training set of VideoEval, taking
Mantis-8B-Idefics2 as the base model.
We try two scoring methods: generation scoring, where the model's answer follows a template
predefined for video quality evaluation, and regression scoring, where the model outputs 5 logits as evaluation scores along the 5 dimensions.

Besides, we also ablate the base model, finetuning Mantis-8B-Idefics2, Idefics2-8B, and VideoLLaVA-7B.
Mantis-8B-Idefics2 turns out to perform best on video quality evaluation.
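To make the contrast between the two scoring methods concrete, here is a minimal sketch in plain Python. The template wording, aspect names, and function names are hypothetical illustrations, not the exact ones used in training:

```python
# Generation scoring: the model emits free text following a predefined
# template (e.g. "aspect: score" per line), which must then be parsed.
def parse_generated_scores(answer):
    """Parse a templated text answer into {aspect: score}. Illustrative only."""
    scores = {}
    for line in answer.strip().splitlines():
        aspect, _, value = line.partition(":")
        scores[aspect.strip()] = float(value)
    return scores

# Regression scoring: the model directly outputs one logit per aspect,
# which is used as the score with no text-parsing step in between.
def regression_scores(logits, aspects):
    """Pair raw per-aspect logits with aspect names. Illustrative only."""
    return dict(zip(aspects, logits))

demo = parse_generated_scores("visual quality: 3\ntemporal consistency: 2")
# demo == {"visual quality": 3.0, "temporal consistency": 2.0}
```

The regression interface avoids parsing failures on malformed generations, which is one plausible reason it can be the more reliable of the two.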
++ We test our video evaluator MantisScore on the VideoEval-test set.
Below are the results for feature-based metrics such as PIQE, CLIP-sim, and X-CLIP-Score,
MLLM-prompting methods such as GPT-4o and Gemini-1.5-Pro, and our MantisScore.
As shown in the table below, MantisScore surpasses the best baseline by 54.1 on average across the 5 aspects.

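The headline numbers above are Spearman correlations between model scores and human ratings. For readers unfamiliar with the metric, here is a self-contained sketch (a scipy.stats.spearmanr-style computation written out in plain Python, with tie-aware average ranks):

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks over tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        avg = (i + j) / 2 + 1  # average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(xs), rank(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

spearman([1, 2, 3], [1, 2, 3])   # perfectly concordant -> 1.0
spearman([1, 2, 3], [3, 2, 1])   # perfectly discordant -> -1.0
```

A value of 77.1 in the table corresponds to a correlation of 0.771 scaled by 100.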
++ We select 3 dimensions from EvalCrafter that match our evaluation aspects
and collect 2500+ videos for testing. MantisScore surpasses all the baselines on the 3 aspects,
as well as EvalCrafter (GPT-4V) on Text-to-Video Alignment.

++ GenAI-Bench is a multimodal benchmark for MLLMs' capability in preference comparison
on tasks such as text-to-video generation and image editing, while
VBench is a comprehensive multi-aspect benchmark suite for
video generative models. For GenAI-Bench we collect 2100+ videos, and
for VBench we select a subset covering 5 of its aspects, such as technical
quality and subject consistency, then subsample 100 unique prompts (2,000 videos in total) for testing.
For the MLLM-prompting baselines and our models, we average the scores over the five aspects to
decide the preference, and report the pairwise accuracy as the performance indicator.

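The preference protocol described above can be sketched as follows; this is an illustrative reconstruction (function names and the tie convention are assumptions, not the exact evaluation code):

```python
def preference(scores_a, scores_b):
    """Prefer the video whose five aspect scores have the higher mean."""
    avg_a = sum(scores_a) / len(scores_a)
    avg_b = sum(scores_b) / len(scores_b)
    if avg_a > avg_b:
        return "A"
    if avg_b > avg_a:
        return "B"
    return "tie"

def pairwise_accuracy(pairs, human_labels):
    """Fraction of pairs where the model's preference matches the human label."""
    correct = sum(preference(a, b) == label
                  for (a, b), label in zip(pairs, human_labels))
    return correct / len(pairs)

# One pair: video A scores higher on every aspect, human also prefers A.
pairwise_accuracy([([3, 3, 3, 3, 3], [2, 2, 2, 2, 2])], ["A"])  # -> 1.0
```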
+