
v1.0

@innat innat released this 30 Sep 08:40
· 69 commits to main since this release

This is a Keras implementation of the VideoMAE model (VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training). The pre-trained and fine-tuned weights are ported from the official PyTorch model. The available models, in .h5 format, are listed below; the list includes both pre-trained and fine-tuned models.

The naming style for these models is TFVideoMAE_{size}_{dataset}_{input_frame}x{input_size}_FT/PT. Here, size is one of small, base, large, or huge, abbreviated to a single letter in the file names (S, B, L in the list below). PT (pre-trained) is the video masked autoencoder model, trained in a self-supervised manner; FT (fine-tuned) is the encoder part of the PT model plus a task-specific classification head. For the downstream task, benchmark datasets are used: Kinetics-400, Something-Something-V2, and UCF101.
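As an illustration of the naming style above, here is a minimal sketch that splits a checkpoint file name into its parts. The helper `parse_checkpoint_name` and the `VideoMAEName` tuple are hypothetical names introduced for this example; they are not part of this repository.

```python
import re
from typing import NamedTuple


class VideoMAEName(NamedTuple):
    """Fields encoded in a checkpoint file name (hypothetical helper type)."""
    size: str         # size code as written in the file name, e.g. "S", "B", "L"
    dataset: str      # dataset code, e.g. "K400", "SSv2", "UCF"
    input_frame: int  # number of input frames, e.g. 16
    input_size: int   # spatial input size, e.g. 224
    stage: str        # "PT" (pre-trained) or "FT" (fine-tuned)


# Pattern for TFVideoMAE_{size}_{dataset}_{input_frame}x{input_size}_FT/PT,
# with an optional ".h5" extension.
_NAME_PATTERN = re.compile(
    r"^TFVideoMAE_(?P<size>[A-Za-z]+)_(?P<dataset>[A-Za-z0-9]+)"
    r"_(?P<frames>\d+)x(?P<res>\d+)_(?P<stage>FT|PT)(?:\.h5)?$"
)


def parse_checkpoint_name(name: str) -> VideoMAEName:
    """Parse a checkpoint name; raise ValueError if it does not match."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint name: {name!r}")
    return VideoMAEName(
        size=match["size"],
        dataset=match["dataset"],
        input_frame=int(match["frames"]),
        input_size=int(match["res"]),
        stage=match["stage"],
    )


info = parse_checkpoint_name("TFVideoMAE_B_SSv2_16x224_PT.h5")
print(info.size, info.dataset, info.input_frame, info.input_size, info.stage)
```

A parser like this can be handy for choosing the input clip shape (number of frames and resolution) from the file name before loading the weights.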

In the Keras implementation, these models are available in SavedModel and h5 formats; check the release page of v1.1 for other checkpoints. Note that the official repository also provides a huge variant for Kinetics-400. However, the official PT version is corrupted (MCG-NJU/VideoMAE#89), and the FT version is larger than 2 GB, which makes it impossible to upload here, but it can be found here.

| Model Name | arch | params |
| --- | --- | --- |
| TFVideoMAE_S_K400_16x224_FT.h5 | encoder | 22 MB |
| TFVideoMAE_S_K400_16x224_PT.h5 | encoder + decoder | 24 MB |
| TFVideoMAE_B_K400_16x224_FT.h5 | encoder | 87 MB |
| TFVideoMAE_B_K400_16x224_PT.h5 | encoder + decoder | 94 MB |
| TFVideoMAE_L_K400_16x224_FT.h5 | encoder | 304 MB |
| TFVideoMAE_L_K400_16x224_PT.h5 | encoder + decoder | 343 MB |
| TFVideoMAE_S_SSv2_16x224_FT.h5 | encoder | 22 MB |
| TFVideoMAE_S_SSv2_16x224_PT.h5 | encoder + decoder | 24 MB |
| TFVideoMAE_B_SSv2_16x224_FT.h5 | encoder | 86 MB |
| TFVideoMAE_B_SSv2_16x224_PT.h5 | encoder + decoder | 94 MB |
| TFVideoMAE_B_UCF_16x224_FT.h5 | encoder | 86 MB |
| TFVideoMAE_B_UCF_16x224_PT.h5 | encoder + decoder | 94 MB |