
# Model Zoo

Some notes:

- `#Frame` = input_frame x crop x clip (see the evaluation sketch after this list)
- input_frame is the number of frames fed to the model per inference
- crop is the number of spatial crops (e.g., 3 for left/right/center)
- clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices)
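For intuition, below is a minimal sketch (not code from this repo) of how such multi-crop / multi-clip views are typically aggregated at test time: the model is run once per view and the logits are averaged. `sample_clip`, `spatial_crop`, and `evaluate_video` are hypothetical helpers standing in for the real data pipeline.

```python
import numpy as np

def sample_clip(video, input_frame, clip_idx, num_clips):
    # Evenly spaced start indices over the video, then input_frame consecutive frames.
    start = clip_idx * max(len(video) - input_frame, 0) // max(num_clips - 1, 1)
    return video[start:start + input_frame]

def spatial_crop(clip, crop_idx, size=224):
    # Three crops along the longer spatial side: left/top, center, right/bottom.
    t, h, w, c = clip.shape
    if w >= h:
        y, x = (h - size) // 2, [0, (w - size) // 2, w - size][crop_idx]
    else:
        y, x = [0, (h - size) // 2, h - size][crop_idx], (w - size) // 2
    return clip[:, y:y + size, x:x + size, :]

def evaluate_video(model, video, input_frame=8, num_crops=3, num_clips=4):
    # Average logits over num_crops x num_clips views (e.g. 3 x 4 = 12 views for "8x3x4").
    logits = []
    for clip_idx in range(num_clips):
        clip = sample_clip(video, input_frame, clip_idx, num_clips)
        for crop_idx in range(num_crops):
            view = spatial_crop(clip, crop_idx)  # (input_frame, 224, 224, 3)
            logits.append(model(view[None, ...], training=False).numpy())
    return np.mean(logits, axis=0)
```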

## K710

| Model | #Frame | Checkpoints | Config |
| :--- | :---: | :---: | :---: |
| UniFormerV2-B/16 | 8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | 8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | 8 | SavedModel/h5 | cfg |

## K400

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 84.4 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 85.6 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 88.8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.7 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.0 | SavedModel/h5 | cfg |

## K600

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 85.0 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 86.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 89.0 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.4 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.9 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.1 | SavedModel/h5 | cfg |

## K700

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 75.8 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 76.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 80.8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 81.2 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 81.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 82.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 82.7 | SavedModel/h5 | cfg |

## Moments in Time V1

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M+K710+K400 | 8x3x4 | 42.6 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 8x3x4 | 47.0 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710+K400 | 8x3x4 | 47.8 | SavedModel/h5 | cfg |

## Something-Something V2

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 16x3x1 | 69.5 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M | 32x3x1 | 70.7 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M | 16x3x1 | 72.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M | 32x3x1 | 73.0 | SavedModel/h5 | cfg |

## ActivityNet

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 94.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 94.7 | SavedModel/h5 | cfg |

## HACS

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 95.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 95.4 | SavedModel/h5 | cfg |

## Weight Comparison

The torch UniFormerV2 model can be loaded from the official repo. The following is a quick test of both implementations showing that their logits match.

```python
import numpy as np
import torch

# model_torch (PyTorch) and model_keras (Keras) are assumed to be loaded beforehand.

# Random video batch in channel-last layout: (batch, frames, height, width, channels)
input = np.random.rand(4, 16, 224, 224, 3).astype('float32')
inputs = torch.tensor(input)
inputs = torch.einsum('nthwc->ncthw', inputs)  # torch expects channel-first
# inputs.shape: torch.Size([4, 3, 16, 224, 224])

# torch model
model_torch.eval()
x = model_torch(inputs.float())
x = x.detach().numpy()
x.shape  # (4, 339) (Moments in Time dataset)

# keras model
y = model_keras(input, training=False)
y = y.numpy()
y.shape  # (4, 339) (Moments in Time dataset)

np.testing.assert_allclose(x, y, 1e-4, 1e-4)
np.testing.assert_allclose(x, y, 1e-5, 1e-5)
# OK
```
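
To run a check like the one above, the Keras model can be restored from the released SavedModel/h5 checkpoints linked in the tables. A minimal sketch follows; the file name is made up, and depending on how a checkpoint was exported you may need the SavedModel directory or `custom_objects` instead of a plain h5 file.

```python
import numpy as np
import tensorflow as tf

# Hypothetical file name; substitute the checkpoint downloaded from the tables above.
model_keras = tf.keras.models.load_model("uniformerv2_b16_8x224_k400.h5", compile=False)

dummy = np.random.rand(1, 8, 224, 224, 3).astype("float32")  # (batch, frames, H, W, C)
logits = model_keras(dummy, training=False)
print(logits.shape)  # (1, num_classes), e.g. (1, 400) for Kinetics-400
```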