Release v1.0 · innat/VideoMAE

This is a Keras implementation of the model from "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training". The pre-trained and fine-tuned weights are ported from the official PyTorch model. The following is a list of all available models in .h5 format, covering both pre-trained and fine-tuned variants.

The naming style for these models is TFVideoMAE_{size}_{dataset}_{input_frame}x{input_size}_FT/PT. Here, size stands for small, base, large, or huge, depending on the available models. PT (pre-trained) is the video masked autoencoder model, trained in a self-supervised manner; FT (fine-tuned) is the encoder part of PT plus a task-specific classification head. For the downstream task, the benchmark datasets Kinetics-400, Something-Something-V2, and UCF101 are used.
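The naming scheme above can be unpacked mechanically. The following is a minimal sketch, not part of the release itself: the function name `parse_checkpoint_name`, the `CheckpointInfo` type, and the mapping of single-letter codes to sizes are assumptions made for illustration (the codes S, B, and L appear in the file names of this release; H for huge is a guess).

```python
import re
from typing import NamedTuple

# Single-letter size codes. S, B and L appear in this release's file names;
# "H" for huge is an assumption and is not confirmed by the release notes.
SIZE_NAMES = {"S": "small", "B": "base", "L": "large", "H": "huge"}

# TFVideoMAE_{size}_{dataset}_{input_frame}x{input_size}_FT/PT, optional .h5
NAME_PATTERN = re.compile(
    r"^TFVideoMAE_(?P<size>[A-Z])_(?P<dataset>[A-Za-z0-9]+)_"
    r"(?P<frames>\d+)x(?P<resolution>\d+)_(?P<stage>FT|PT)(?:\.h5)?$"
)


class CheckpointInfo(NamedTuple):
    size: str        # e.g. "small"
    dataset: str     # e.g. "K400", "SSv2", "UCF"
    frames: int      # number of input frames
    resolution: int  # spatial input size in pixels
    stage: str       # "PT" (pre-trained) or "FT" (fine-tuned)


def parse_checkpoint_name(name: str) -> CheckpointInfo:
    """Split a checkpoint file name into its naming-scheme fields."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a TFVideoMAE checkpoint name: {name!r}")
    size_code = match["size"]
    if size_code not in SIZE_NAMES:
        raise ValueError(f"unknown size code: {size_code!r}")
    return CheckpointInfo(
        size=SIZE_NAMES[size_code],
        dataset=match["dataset"],
        frames=int(match["frames"]),
        resolution=int(match["resolution"]),
        stage=match["stage"],
    )


info = parse_checkpoint_name("TFVideoMAE_S_K400_16x224_FT.h5")
print(info.size, info.dataset, info.frames, info.resolution, info.stage)
```

A helper like this is mainly useful for selecting a checkpoint by dataset or stage without hard-coding file names.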

In the Keras implementation, these models are available in SavedModel and h5 formats; check the release page of v1.1 for other checkpoints. Note that officially there is also a huge model variant for Kinetics-400. However, the official PT version is corrupted (MCG-NJU/VideoMAE#89), and the FT version is larger than 2 GB, so it cannot be uploaded to this release, but it can be found here.

Model Name	arch	params
TFVideoMAE_S_K400_16x224_FT.h5	encoder	22M
TFVideoMAE_S_K400_16x224_PT.h5	encoder + decoder	24M
TFVideoMAE_B_K400_16x224_FT.h5	encoder	87M
TFVideoMAE_B_K400_16x224_PT.h5	encoder + decoder	94M
TFVideoMAE_L_K400_16x224_FT.h5	encoder	304M
TFVideoMAE_L_K400_16x224_PT.h5	encoder + decoder	343M
TFVideoMAE_S_SSv2_16x224_FT.h5	encoder	22M
TFVideoMAE_S_SSv2_16x224_PT.h5	encoder + decoder	24M
TFVideoMAE_B_SSv2_16x224_FT.h5	encoder	86M
TFVideoMAE_B_SSv2_16x224_PT.h5	encoder + decoder	94M
TFVideoMAE_B_UCF_16x224_FT.h5	encoder	86M
TFVideoMAE_B_UCF_16x224_PT.h5	encoder + decoder	94M
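
For scripted downloads, the file names in the table can be turned into URLs. This sketch relies on GitHub's general release-asset URL pattern (`/releases/download/{tag}/{filename}`); the helper name `release_asset_url` is hypothetical, and it is an assumption that each checkpoint is attached to the v1.0 release under exactly the name listed above.

```python
from urllib.parse import quote


def release_asset_url(
    filename: str,
    owner: str = "innat",
    repo: str = "VideoMAE",
    tag: str = "v1.0",
) -> str:
    """Build a download URL following GitHub's release-asset URL pattern.

    Assumes the asset is attached to the given release tag under `filename`;
    the function does not check that the file actually exists.
    """
    return (
        f"https://github.com/{owner}/{repo}/releases/download/"
        f"{quote(tag)}/{quote(filename)}"
    )


print(release_asset_url("TFVideoMAE_B_K400_16x224_FT.h5"))
```

The resulting URL can be passed to any HTTP download tool; the function itself makes no network request.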
