Some notes:

- Frame = `input_frame` x `crop` x `clip`
- `input_frame` is the number of frames fed to the model per inference
- `crop` is the number of spatial crops (e.g., 3 for left/right/center)
- `clip` is the number of temporal clips (e.g., 4 means sampling four clips with different start indices)

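To make the notation concrete, here is a minimal sketch that parses a `#Frame` entry such as `8x3x4` and lists clip start indices. The helper names `parse_frame_spec` and `clip_start_indices`, the `stride` parameter, and the even spacing of the start indices are illustrative assumptions, not part of this repository; the sampling used for the reported numbers may differ.

```python
import numpy as np


def parse_frame_spec(spec):
    """Split a '#Frame' entry like '8x3x4' into (input_frame, crop, clip)."""
    input_frame, crop, clip = (int(part) for part in spec.split("x"))
    return input_frame, crop, clip


def clip_start_indices(total_frames, input_frame, clip, stride=1):
    """Evenly spaced start indices for `clip` temporal clips (hypothetical scheme).

    Each clip covers `input_frame` frames taken every `stride` frames.
    """
    span = (input_frame - 1) * stride + 1     # frames covered by one clip
    last_start = max(total_frames - span, 0)  # latest valid start index
    return np.linspace(0, last_start, clip).round().astype(int).tolist()


input_frame, crop, clip = parse_frame_spec("8x3x4")
views = crop * clip                # forward passes per video
frames_seen = input_frame * views  # frames processed per video
starts = clip_start_indices(total_frames=300, input_frame=input_frame,
                            clip=clip, stride=4)
print(views, frames_seen, starts)
```

For `8x3x4` this gives 12 views and 96 frames per video, which is why entries with more frames or more clips cost proportionally more at inference time.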
Model | Frame | Checkpoints | Config |
---|---|---|---|
UniFormerV2-B/16 | 8 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | 8 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | 8 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 84.4 | SavedModel/h5 | cfg |
UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 85.6 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 88.8 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.1 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.3 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.7 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.0 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 85.0 | SavedModel/h5 | cfg |
UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 86.1 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 89.0 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.4 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.5 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.9 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.1 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 75.8 | SavedModel/h5 | cfg |
UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 76.3 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 80.8 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 81.2 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 81.5 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 82.1 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 82.7 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-B/16 | CLIP-400M+K710+K400 | 8x3x4 | 42.6 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710+K400 | 8x3x4 | 47.0 | SavedModel/h5 | cfg |
UniFormerV2-L/14@336 | CLIP-400M+K710+K400 | 8x3x4 | 47.8 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-B/16 | CLIP-400M | 16x3x1 | 69.5 | SavedModel/h5 | cfg |
UniFormerV2-B/16 | CLIP-400M | 32x3x1 | 70.7 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M | 16x3x1 | 72.1 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M | 32x3x1 | 73.0 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 94.3 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 94.7 | SavedModel/h5 | cfg |

Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
---|---|---|---|---|---|
UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 95.5 | SavedModel/h5 | cfg |
UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 95.4 | SavedModel/h5 | cfg |
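
The torch `uniformerv2` model can be loaded from the official repo. The following quick test runs both implementations on the same input and checks that their logits match.

    import numpy as np
    import torch

    # Random batch in the Keras layout: (batch, frames, height, width, channels)
    input = np.random.rand(4, 16, 224, 224, 3).astype('float32')

    # Rearrange to the torch layout: (batch, channels, frames, height, width)
    inputs = torch.tensor(input)
    inputs = torch.einsum('nthwc->ncthw', inputs)
    # inputs.shape: torch.Size([4, 3, 16, 224, 224])

    # torch model (model_torch is assumed to be already loaded)
    model_torch.eval()
    x = model_torch(inputs.float())
    x = x.detach().numpy()
    x.shape # (4, 339) (Moments in Time dataset)

    # keras model (model_keras is assumed to be already loaded)
    y = model_keras(input, training=False)
    y = y.numpy()
    y.shape # (4, 339) (Moments in Time dataset)

    # The logits agree at both tolerance levels
    np.testing.assert_allclose(x, y, rtol=1e-4, atol=1e-4)
    np.testing.assert_allclose(x, y, rtol=1e-5, atol=1e-5)
    # OK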
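
If the assertion ever fails, it helps to see how far apart the two outputs are. The following is a minimal sketch of such a report using only NumPy. The helper name `compare_logits` and the random stand-in arrays are assumptions for illustration; in practice `x` and `y` would be the torch and Keras logits from the test above.

```python
import numpy as np


def compare_logits(x, y, rtol=1e-5, atol=1e-5):
    """Summarize the agreement between two logit arrays of the same shape."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    abs_diff = np.abs(x - y)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        # Fraction of samples whose predicted class is identical.
        "top1_agreement": float(np.mean(x.argmax(axis=-1) == y.argmax(axis=-1))),
        # Same element-wise criterion that np.testing.assert_allclose applies.
        "allclose": bool(np.allclose(x, y, rtol=rtol, atol=atol)),
    }


# Stand-in data: random "logits" plus a tiny perturbation.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 339)).astype("float32")
y = x + 1e-6
report = compare_logits(x, y)
print(report)
```

Reporting the top-1 agreement alongside the raw differences is a deliberate choice: small numerical drift between frameworks matters most when it changes the predicted class.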