Fine-tuning YOLO-World is straightforward, and we provide samples for COCO object detection as a simple guide.

Fine-tuning YOLO-World is cheap:

- It does not require 32 GPUs for multi-node distributed training; 8 GPUs, or even 1 GPU, is enough.
- It does not require a long schedule, e.g., the 300 or 500 epochs used to train YOLOv5 or YOLOv8. 80 epochs or fewer is enough, since we provide good pre-trained weights.
The fine-tuning dataset should have a format similar to that of the pre-training datasets. We suggest you refer to docs/data for more details about how to build the datasets:

- If you fine-tune YOLO-World for close-set / custom-vocabulary object detection, using `MultiModalDataset` with a text JSON is preferred.
- If you fine-tune YOLO-World for open-vocabulary detection with rich texts or grounding tasks, using `MixedGroundingDataset` is preferred.
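A minimal sketch of building such a text JSON for a custom vocabulary. The two classes below are hypothetical examples, and the assumed format is a list of per-class lists (each inner list holding the class name and optional synonyms), as in the provided `coco_class_texts.json`:

```python
import json

# Hypothetical two-class custom vocabulary: one inner list per class;
# each inner list holds the class name plus optional synonyms.
class_texts = [
    ["safety helmet"],
    ["reflective vest", "safety vest"],
]

# Write the text JSON; its path is then passed to `class_text_path`
# in the dataset config.
with open("custom_class_texts.json", "w") as f:
    json.dump(class_texts, f)
```

The class order must match the category indices of your detection annotations.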
Please refer to the config for fine-tuning YOLO-World-L on COCO for more details.
- Basic config file:

  If the fine-tuning dataset contains mask annotations:

  ```python
  _base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_mask-refine_syncbn_fast_8xb16-500e_coco.py')
  ```

  If the fine-tuning dataset doesn't contain mask annotations:

  ```python
  _base_ = ('../../third_party/mmyolo/configs/yolov8/yolov8_l_syncbn_fast_8xb16-500e_coco.py')
  ```
- Training schemes: reduce the epochs and adjust the learning rate.

  ```python
  max_epochs = 80
  base_lr = 2e-4
  weight_decay = 0.05
  train_batch_size_per_gpu = 16
  close_mosaic_epochs = 10

  train_cfg = dict(
      max_epochs=max_epochs,
      val_interval=5,
      dynamic_intervals=[((max_epochs - close_mosaic_epochs),
                          _base_.val_interval_stage2)])
  ```
- Datasets:

  ```python
  coco_train_dataset = dict(
      _delete_=True,
      type='MultiModalDataset',
      dataset=dict(
          type='YOLOv5CocoDataset',
          data_root='data/coco',
          ann_file='annotations/instances_train2017.json',
          data_prefix=dict(img='train2017/'),
          filter_cfg=dict(filter_empty_gt=False, min_size=32)),
      class_text_path='data/texts/coco_class_texts.json',
      pipeline=train_pipeline)
  ```
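The `dynamic_intervals` entry switches to a denser validation interval once mosaic augmentation closes. A minimal sketch of the resulting schedule, assuming `val_interval_stage2 = 1` (this value is inherited from the base config in practice; check your own config) and approximating MMEngine's interval logic:

```python
# Sketch of the validation schedule implied by the training scheme above.
max_epochs = 80
close_mosaic_epochs = 10
val_interval = 5
val_interval_stage2 = 1  # assumed; inherited from the base config in practice

switch_epoch = max_epochs - close_mosaic_epochs  # mosaic closes at epoch 70

def runs_validation(epoch):
    # Validate every `val_interval` epochs, then every `val_interval_stage2`
    # epochs after mosaic augmentation is closed.
    interval = val_interval if epoch <= switch_epoch else val_interval_stage2
    return epoch % interval == 0

val_epochs = [e for e in range(1, max_epochs + 1) if runs_validation(e)]
# every 5 epochs up to epoch 70, then every epoch for the last 10
```

Validating every epoch in the final stage helps catch the best checkpoint, since accuracy often peaks after mosaic augmentation is disabled.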
For further efficiency and simplicity, we can fine-tune an efficient version of YOLO-World without RepVL-PAN and the text encoder. The efficient version of YOLO-World has an architecture similar to the original YOLOv8, but we provide weights pre-trained on large-scale datasets. The pre-trained YOLO-World has strong generalization capabilities and is more robust than YOLOv8 trained only on the COCO dataset.
You can refer to the config for Efficient YOLO-World for more details. The efficient YOLO-World adopts `EfficientCSPLayerWithTwoConv`, and the text encoder can be removed during inference or when exporting models.
```python
model = dict(
    type='YOLOWorldDetector',
    mm_neck=True,
    neck=dict(type='YOLOWorldPAFPN',
              guide_channels=text_channels,
              embed_channels=neck_embed_channels,
              num_heads=neck_num_heads,
              block_cfg=dict(type='EfficientCSPLayerWithTwoConv')))
```
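Removing the text encoder at inference time works because, with a fixed vocabulary, the text branch only needs to run once. A conceptual sketch of caching the vocabulary embeddings offline (not the repository's actual export API; `encode_texts` here is a random stand-in for the real text encoder):

```python
import numpy as np

def encode_texts(texts, dim=512):
    # Stand-in for the real text encoder: returns one embedding per class.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), dim)).astype(np.float32)

vocab = ["person", "bicycle", "car"]
text_embeddings = encode_texts(vocab)          # shape: (3, 512)
np.save("vocab_text_embeddings.npy", text_embeddings)

# At inference, load the cached embeddings instead of running the text
# encoder, so the text branch adds no runtime cost.
cached = np.load("vocab_text_embeddings.npy")
```

The same idea underlies exporting the efficient model: the class embeddings become constant weights rather than an online text-encoder forward pass.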
Launching fine-tuning is simple:

```bash
./dist_train.sh <path/to/config> <NUM_GPUS> --amp
```
| model | efficient neck | AP | AP50 | AP75 | weights |
| :---- | :------------: | :--: | :--: | :--: | :-----: |
| YOLO-World-S | ✖️ | 45.7 | 62.3 | 49.9 | coming |
| YOLO-World-M | ✖️ | 50.7 | 67.2 | 55.1 | coming |
| YOLO-World-L | ✖️ | 53.3 | 70.3 | 58.1 | coming |
| YOLO-World-S | ✔️ | 45.9 | 62.3 | 50.1 | coming |
| YOLO-World-M | ✔️ | 51.2 | 68.1 | 55.9 | coming |
| YOLO-World-L | ✔️ | 53.3 | 70.1 | 58.2 | coming |