
# Model Zoo

Some notes:

- `#Frame` = input_frame x crop x clip (see the evaluation sketch after this list)
- input_frame is the number of frames fed to the model per inference
- crop is the number of spatial crops (e.g., 3 for left/right/center)
- clip is the number of temporal clips (e.g., 4 means repeatedly sampling four clips with different start indices)
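For intuition, below is a minimal sketch (not code from this repo) of how such multi-crop / multi-clip views are typically aggregated at test time: the model is run once per view and the logits are averaged. `sample_clip`, `spatial_crop`, and `evaluate_video` are hypothetical helpers standing in for the real data pipeline.

```python
import numpy as np

def sample_clip(video, input_frame, clip_idx, num_clips):
    # Evenly spaced start indices over the video, then input_frame consecutive frames.
    start = clip_idx * max(len(video) - input_frame, 0) // max(num_clips - 1, 1)
    return video[start:start + input_frame]

def spatial_crop(clip, crop_idx, size=224):
    # Three crops along the longer spatial side: left/top, center, right/bottom.
    t, h, w, c = clip.shape
    if w >= h:
        y, x = (h - size) // 2, [0, (w - size) // 2, w - size][crop_idx]
    else:
        y, x = [0, (h - size) // 2, h - size][crop_idx], (w - size) // 2
    return clip[:, y:y + size, x:x + size, :]

def evaluate_video(model, video, input_frame=8, num_crops=3, num_clips=4):
    # Average logits over num_crops x num_clips views (e.g. 3 x 4 = 12 views for "8x3x4").
    logits = []
    for clip_idx in range(num_clips):
        clip = sample_clip(video, input_frame, clip_idx, num_clips)
        for crop_idx in range(num_crops):
            view = spatial_crop(clip, crop_idx)  # (input_frame, 224, 224, 3)
            logits.append(model(view[None, ...], training=False).numpy())
    return np.mean(logits, axis=0)
```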

## K710

| Model | #Frame | Checkpoints | Config |
| :--- | :---: | :---: | :---: |
| UniFormerV2-B/16 | 8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | 8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | 8 | SavedModel/h5 | cfg |

## K400

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 84.4 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 85.6 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 88.8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.7 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.0 | SavedModel/h5 | cfg |

## K600

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 85.0 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 86.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 89.0 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 89.4 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 89.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 89.9 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 90.1 | SavedModel/h5 | cfg |

## K700

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 8x3x4 | 75.8 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M+K710 | 8x3x4 | 76.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 8x3x4 | 80.8 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 16x3x4 | 81.2 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710 | 32x3x2 | 81.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 32x3x2 | 82.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710 | 64x3x2 | 82.7 | SavedModel/h5 | cfg |

## Moments in Time V1

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M+K710+K400 | 8x3x4 | 42.6 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 8x3x4 | 47.0 | SavedModel/h5 | cfg |
| UniFormerV2-L/14@336 | CLIP-400M+K710+K400 | 8x3x4 | 47.8 | SavedModel/h5 | cfg |

## Something-Something V2

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-B/16 | CLIP-400M | 16x3x1 | 69.5 | SavedModel/h5 | cfg |
| UniFormerV2-B/16 | CLIP-400M | 32x3x1 | 70.7 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M | 16x3x1 | 72.1 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M | 32x3x1 | 73.0 | SavedModel/h5 | cfg |

## ActivityNet

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 94.3 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 94.7 | SavedModel/h5 | cfg |

## HACS

| Model | Pretraining | #Frame | Top-1 | Checkpoints | Config |
| :--- | :---: | :---: | :---: | :---: | :---: |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 16x3x10 | 95.5 | SavedModel/h5 | cfg |
| UniFormerV2-L/14 | CLIP-400M+K710+K400 | 32x3x10 | 95.4 | SavedModel/h5 | cfg |

## Weight Comparison

The torch UniFormerV2 model can be loaded from the official repo. The following is a quick test of both implementations showing that their logits match.

```python
import numpy as np
import torch

# model_torch (PyTorch) and model_keras (Keras) are assumed to be loaded beforehand.

# Random video batch in channel-last layout: (batch, frames, height, width, channels)
input = np.random.rand(4, 16, 224, 224, 3).astype('float32')
inputs = torch.tensor(input)
inputs = torch.einsum('nthwc->ncthw', inputs)  # torch expects channel-first
# inputs.shape: torch.Size([4, 3, 16, 224, 224])

# torch model
model_torch.eval()
x = model_torch(inputs.float())
x = x.detach().numpy()
x.shape  # (4, 339) (Moments in Time dataset)

# keras model
y = model_keras(input, training=False)
y = y.numpy()
y.shape  # (4, 339) (Moments in Time dataset)

np.testing.assert_allclose(x, y, 1e-4, 1e-4)
np.testing.assert_allclose(x, y, 1e-5, 1e-5)
# OK
```
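
To run a check like the one above, the Keras model can be restored from the released SavedModel/h5 checkpoints linked in the tables. A minimal sketch follows; the file name is made up, and depending on how a checkpoint was exported you may need the SavedModel directory or `custom_objects` instead of a plain h5 file.

```python
import numpy as np
import tensorflow as tf

# Hypothetical file name; substitute the checkpoint downloaded from the tables above.
model_keras = tf.keras.models.load_model("uniformerv2_b16_8x224_k400.h5", compile=False)

dummy = np.random.rand(1, 8, 224, 224, 3).astype("float32")  # (batch, frames, H, W, C)
logits = model_keras(dummy, training=False)
print(logits.shape)  # (1, num_classes), e.g. (1, 400) for Kinetics-400
```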