This file documents a collection of models reported in our paper. The training time was measured on Big Basin servers with 8 NVIDIA V100 GPUs & NVLink.
The "Name" column contains a link to the config file. To train a model, run
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml
To evaluate a trained or pretrained model, run

```
python train_net.py --num-gpus 8 --config-file /path/to/config/name.yaml --eval-only MODEL.WEIGHTS /path/to/weight.pth
```
Our paper uses ImageNet-21K pretrained models that are not part of Detectron2 (ResNet-50-21K from MIIL and SwinB-21K from Swin-Transformer). Before training, please download the models, place them under `DETIC_ROOT/models/`, and convert them to the Detectron2 format with the provided conversion tool (a minimal sketch of what the conversion does is shown below).
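In essence, the conversion repacks the third-party checkpoint into a pickled dict that Detectron2's checkpointer can load. The snippet below is a minimal sketch of that idea, not the actual tool; the input filename and the `"model"` nesting key are assumptions about the third-party checkpoint.

```python
# Minimal sketch: repack a third-party ImageNet checkpoint into a
# Detectron2-loadable .pkl. Filenames and nesting keys are assumptions.
import pickle

import torch

ckpt = torch.load("models/swin_base_patch4_window7_224_22k.pth", map_location="cpu")
state = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

out = {
    "model": {k: v.numpy() for k, v in state.items()},
    "__author__": "third_party",
    "matching_heuristics": True,  # lets Detectron2 fuzzy-match parameter names
}
with open("models/swin_base_patch4_window7_224_22k.pkl", "wb") as f:
    pickle.dump(out, f)
```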
## Open-vocabulary LVIS

Name | Training time | mask mAP | mask mAP_novel | Download |
---|---|---|---|---|
Box-Supervised_C2_R50_640_4x | 17h | 30.2 | 16.4 | model |
Detic_C2_IN-L_R50_640_4x | 22h | 32.4 | 24.9 | model |
Detic_C2_CCimg_R50_640_4x | 22h | 31.0 | 19.8 | model |
Detic_C2_CCcapimg_R50_640_4x | 22h | 31.0 | 21.3 | model |
Box-Supervised_C2_SwinB_896_4x | 43h | 38.4 | 21.9 | model |
Detic_C2_IN-L_SwinB_896_4x | 47h | 40.7 | 33.8 | model |
- The open-vocabulary LVIS setup is LVIS without rare class annotations in training. We evaluate rare classes as novel classes in testing.
- The models with `C2` are trained using our improved LVIS baseline (Appendix D of the paper), including the CenterNet2 detector, Federated Loss, large-scale jittering, etc.
- All models use CLIP embeddings as classifiers; this is why the box-supervised models achieve non-zero mAP on novel classes. A sketch of how such a classifier can be built is shown after this list.
- The models with `IN-L` use the overlap classes between ImageNet-21K and LVIS as image-labeled data.
- The models with `CC` use Conceptual Captions. `CCimg` uses image labels extracted from the captions (using a naive text-match) as image-labeled data; `CCcapimg` additionally uses the raw captions (Appendix C of the paper).
- The Detic models are finetuned from the corresponding Box-Supervised models above (indicated by `MODEL.WEIGHTS` in the config files). Please train or download the Box-Supervised models and place them under `DETIC_ROOT/models/` before training the Detic models.
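As a rough illustration of the CLIP classifier, the snippet below encodes a vocabulary with CLIP text embeddings, in the spirit of the `datasets/metadata/*_clip_a+cname.npy` files. The prompt format, normalization, output file name, and array orientation are assumptions, not the repo's exact recipe.

```python
# Hypothetical sketch: encode class names with CLIP to use as a classifier.
# Prompt format and array orientation are assumptions.
import clip  # pip install git+https://github.com/openai/CLIP.git
import numpy as np
import torch

class_names = ["person", "bicycle", "traffic cone"]  # your test vocabulary
model, _ = clip.load("ViT-B/32", device="cpu")

with torch.no_grad():
    tokens = clip.tokenize([f"a {c}" for c in class_names])
    emb = model.encode_text(tokens).float()
    emb = emb / emb.norm(dim=1, keepdim=True)  # unit-norm rows, one per class

np.save("datasets/metadata/custom_clip_a+cname.npy", emb.numpy())
```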
## Standard LVIS

Name | Training time | mask mAP | mask mAP_rare | Download |
---|---|---|---|---|
Box-Supervised_C2_R50_640_4x | 17h | 31.5 | 25.6 | model |
Detic_C2_R50_640_4x | 22h | 33.2 | 29.7 | model |
Box-Supervised_C2_SwinB_896_4x | 43h | 40.7 | 35.9 | model |
Detic_C2_SwinB_896_4x | 47h | 41.7 | 41.7 | model |
Name | Training time | box mAP | box mAP_rare | Download |
---|---|---|---|---|
Box-Supervised_DeformDETR_R50_4x | 31h | 31.7 | 21.4 | model |
Detic_DeformDETR_R50_4x | 47h | 32.5 | 26.2 | model |
- All Detic models use the overlap classes between ImageNet-21K and LVIS as image-labeled data.
- The models with `C2` are trained using our improved LVIS baseline in the paper, including the CenterNet2 detector, Federated Loss, large-scale jittering, etc. A sketch of the Federated Loss idea is shown after this list.
- The models with `DeformDETR` are Deformable DETR models. We train these models with Federated Loss as well.
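To make the Federated Loss reference concrete: because LVIS images are not exhaustively annotated for all classes, the per-image classification loss only touches the classes annotated in that image plus a sampled subset of negatives. The sketch below is a simplification of that idea; the uniform negative sampling and the normalization are our assumptions, not the exact implementation.

```python
# Simplified sketch of Federated Loss (not the exact implementation):
# restrict the per-image classification loss to annotated classes plus a
# random subset of negative classes, ignoring all other (unverified) classes.
import torch
import torch.nn.functional as F

def federated_bce(logits, targets, annotated_classes, num_negatives=50):
    """logits, targets: (N, C) detection logits and 0/1 targets;
    annotated_classes: class ids annotated for this image."""
    num_classes = logits.shape[1]
    active = torch.zeros(num_classes, device=logits.device)
    active[annotated_classes] = 1.0                   # always keep positives
    sampled = torch.randperm(num_classes, device=logits.device)[:num_negatives]
    active[sampled] = 1.0                             # plus sampled negatives
    per_class = F.binary_cross_entropy_with_logits(
        logits, targets, reduction="none")
    return (per_class * active).sum() / active.sum().clamp(min=1.0)
```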
## Open-vocabulary COCO

Name | Training time | box mAP50 | box mAP50_novel | Download |
---|---|---|---|---|
BoxSup_CLIP_R50_1x | 12h | 39.3 | 1.3 | model |
Detic_CLIP_R50_1x_image | 13h | 44.7 | 24.1 | model |
Detic_CLIP_R50_1x_caption | 16h | 43.8 | 21.0 | model |
Detic_CLIP_R50_1x_caption-image | 16h | 45.0 | 27.8 | model |
- All models are trained with a ResNet50-C4 backbone without multi-scale augmentation. All models use CLIP embeddings as the classifier.
- We extract class names from COCO captions as image labels. `Detic_CLIP_R50_1x_image` uses the max-size loss (sketched after this list); `Detic_CLIP_R50_1x_caption` directly uses the CLIP caption embedding within each mini-batch for classification; `Detic_CLIP_R50_1x_caption-image` uses both losses.
- We report box mAP50 under the "generalized" open-vocabulary setting.
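For intuition, the max-size loss supervises only the largest proposal with the image-level label, on the premise that the labeled object is most likely covered by the biggest box. The snippet below is our own minimal reading of that idea, not the repo's loss code.

```python
# Minimal reading of the max-size loss (not the repo's implementation):
# apply the image-level label only to the largest proposal's classifier output.
import torch.nn.functional as F

def max_size_loss(proposal_boxes, class_logits, image_label):
    """proposal_boxes: (N, 4) as (x1, y1, x2, y2); class_logits: (N, C);
    image_label: class id known only at the image level."""
    widths = proposal_boxes[:, 2] - proposal_boxes[:, 0]
    heights = proposal_boxes[:, 3] - proposal_boxes[:, 1]
    biggest = (widths * heights).argmax()  # proposal most likely to contain it
    target = class_logits.new_zeros(class_logits.shape[1])
    target[image_label] = 1.0
    return F.binary_cross_entropy_with_logits(class_logits[biggest], target)
```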
## Cross-dataset evaluation

Name | Training time | Objects365 box mAP | OpenImages box mAP50 | Download |
---|---|---|---|---|
Box-Supervised_C2_SwinB_896_4x | 43h | 19.1 | 46.2 | model |
Detic_C2_SwinB_896_4x | 47h | 21.2 | 53.0 | model |
Detic_C2_SwinB_896_4x_IN-21K | 47h | 21.4 | 55.2 | model |
Box-Supervised_C2_SwinB_896_4x+COCO | 43h | 19.7 | 46.4 | model |
Detic_C2_SwinB_896_4x_IN-21K+COCO | 47h | 21.6 | 54.6 | model |
- `Box-Supervised_C2_SwinB_896_4x` and `Detic_C2_SwinB_896_4x` are the same models as in the Standard LVIS section, but evaluated with the Objects365/OpenImages vocabulary (i.e., CLIP embeddings of the corresponding class names as the classifier). To run the evaluation on Objects365/OpenImages, run

```
python train_net.py --num-gpus 8 --config-file configs/Detic_C2_SwinB_896_4x.yaml --eval-only DATASETS.TEST "('oid_val_expanded','objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/oid_clip_a+cname.npy','datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(500,365)" MODEL.MASK_ON False
```

- `Detic_C2_SwinB_896_4x_IN-21K` trains on the full ImageNet-21K. We additionally use a dynamic class sampling ("Modified Federated Loss" in Section 4.4) and a larger data sampling ratio for ImageNet images (1:16 instead of 1:4).
- `Detic_C2_SwinB_896_4x_IN-21K+COCO` is a model trained on combined LVIS-COCO and ImageNet-21K for better demo purposes. LVIS models do not detect persons well due to LVIS's federated annotation protocol; LVIS+COCO models give better visual results.
## Real-time models

Name | Run time (ms) | LVIS box mAP | Download |
---|---|---|---|
Detic_C2_SwinB_896_4x_IN-21K+COCO (800x1333, no threshold) | 115 | 44.4 | model |
Detic_C2_SwinB_896_4x_IN-21K+COCO | 46 | 35.0 | model |
Detic_C2_ConvNeXtT_896_4x_IN-21K+COCO | 26 | 30.7 | model |
Detic_C2_R5021k_896_4x_IN-21K+COCO | 23 | 29.0 | model |
Detic_C2_R18_896_4x_IN-21K+COCO | 18 | 22.1 | model |
- `Detic_C2_SwinB_896_4x_IN-21K+COCO (800x1333, no threshold)` is the entry in the [Cross-dataset evaluation](#cross-dataset-evaluation) section without the mask head. All other entries use a maximum test size of 640 and an output score threshold of 0.3, using the following command (e.g., with R50):

```
python train_net.py --config-file configs/Detic_LCOCOI21k_CLIP_R5021k_640b32_4x_ft4x_max-size.yaml --num-gpus 2 --eval-only DATASETS.TEST "('lvis_v1_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/lvis_v1_clip_a+cname.npy',)" MODEL.TEST_NUM_CLASSES "(1203,)" MODEL.MASK_ON False MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_R5021k_640b32_4x_ft4x_max-size.pth INPUT.MIN_SIZE_TEST 640 INPUT.MAX_SIZE_TEST 640 MODEL.ROI_HEADS.SCORE_THRESH_TEST 0.3
```
- All models are trained using the same training recipe except for different backbones.
- The ConvNeXt-T and ResNet-50 models are initialized from their corresponding ImageNet-21K pretrained models. The ResNet-18 model is initialized from its ImageNet-1K pretrained model.
- The run times are measured on a local workstation with a Titan RTX GPU, along the lines of the sketch after this list.
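A run-time measurement could look like the sketch below; the warm-up count, iteration count, and one-image-per-forward protocol are our assumptions, not the exact measurement script.

```python
# Rough sketch of per-image latency measurement (protocol details are
# assumptions): average a model's forward-pass time with CUDA synchronization.
import time

import torch

def time_model_ms(model, inputs, warmup=10, iters=100):
    """inputs: a list of Detectron2-style input dicts, one per image."""
    with torch.no_grad():
        for x in inputs[:warmup]:            # warm up kernels / cudnn autotune
            model([x])
        torch.cuda.synchronize()
        start = time.perf_counter()
        for i in range(iters):
            model([inputs[i % len(inputs)]])
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0  # milliseconds
```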