By Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang.
This repository is a PyTorch implementation of "Multimodal Token Fusion for Vision Transformers", in CVPR 2022.
For semantic segmentation task on NYUDv2 (official dataset), we provide a link to download the dataset here. The provided dataset is originally preprocessed in this repository, and we add depth data in it.
For image-to-image translation task, we use the sample dataset of Taskonomy, where a link to download the sample dataset is here.
Please modify the data paths in the codes, where we add comments 'Modify data path'.
python==3.6
pytorch==1.7.1
torchvision==0.8.2
numpy==1.19.2
First,
cd semantic_segmentation
Training script for segmentation with RGB and Depth input, the default setting uses RefineNet (ResNet101),
python main.py --backbone mit_b3 -c exp_name --lamda 1e-6 --gpu 0 1 2
Evaluation script,
python main.py --gpu 0 --resume path_to_pth --evaluate # optionally use --save-img to visualize results
Checkpoint models, training logs, mask ratios and the single-scale performance on NYUDv2 are provided as follows:
Method | Backbone | Pixel Acc. (%) | Mean Acc. (%) | Mean IoU (%) | Download |
---|---|---|---|---|---|
CEN | ResNet101 | 76.2 | 62.8 | 51.1 | Google Drive |
CEN | ResNet152 | 77.0 | 64.4 | 51.6 | Google Drive |
Ours | SegFormer-B3 | 78.7 | 67.5 | 54.8 | Google Drive |
First,
cd image2image_translation
Training script, from Shade and Texture to RGB,
python main.py --gpu 0 -c exp_name
This script will auto-evaluate on the validation dataset every 5 training epochs.
Predicted images will be automatically saved during training, in the following folder structure:
code_root/ckpt/exp_name/results
├── input0 # 1st modality input
├── input1 # 2nd modality input
├── fake0 # 1st branch output
├── fake1 # 2nd branch output
├── fake2 # ensemble output
├── best # current best output
│ ├── fake0
│ ├── fake1
│ └── fake2
└── real # ground truth output
Method | Task | FID | KID | Download |
---|---|---|---|---|
CEN | Texture+Shade->RGB | 62.6 | 1.65 | - |
Ours | Texture+Shade->RGB | 45.5 | 1.00 | Google Drive |
If you find our work useful for your research, please consider citing the following paper.
@inproceedings{wang2022tokenfusion,
title={Multimodal Token Fusion for Vision Transformers},
author={Wang, Yikai and Chen, Xinghao and Cao, Lele and Huang, Wenbing and Sun, Fuchun and Wang, Yunhe},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}
}