SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model [Arxiv]
With the development of large language models, many remarkable linguistic systems like ChatGPT have thrived and achieved astonishing success on many tasks, showing the incredible power of foundation models. In the spirit of unleashing the capability of foundation models on vision tasks, the Segment Anything Model (SAM), a vision foundation model for image segmentation, has been proposed recently and presents strong zero-shot ability on many downstream 2D tasks. However, whether SAM can be adapted to 3D vision tasks, especially 3D object detection, is still unknown.
We explore adapting the zero-shot ability of SAM to 3D object detection in this project, and the project is still in progress.
We use pytorch==1.12.1 and cuda==11.3. We build this project based on MMDetection3D (ver. 1.1.0rc3) and segment-anything (commit 6fdee8f).
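Before running anything, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of this repository: the helper name `find_mismatches` is hypothetical, and the distribution names `torch` and `mmdet3d` are assumed to be the names under which the packages were installed.

```python
from importlib import metadata

# Versions stated in this README. The distribution names are assumptions
# (the usual PyPI names), not something this repository defines.
EXPECTED = {
    "torch": "1.12.1",
    "mmdet3d": "1.1.0rc3",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "1.12.1+cu113" still count as a match.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means both packages report the expected versions; the CUDA version is not checked here.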
- install waymo-open-dataset:
pip install waymo-open-dataset-tf-2-6-0
- install MMDetection3D:
pip install -U openmim
mim install 'mmengine==0.7.2'
mim install 'mmcv==2.0.0'
mim install 'mmdet==3.0.0'
git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout 341ff99 # mmdet3d 1.1.0rc3
pip install -v -e .
- install segment-anything:
pip install git+https://github.com/facebookresearch/segment-anything.git
- install other dependencies:
pip install -r requirements.txt
Since our project explores the zero-shot setting, we do not need to pre-process the training data. We roughly follow the data preparation setup in the MMDetection3D Data Preparation Guide, with some minor modifications.
- organize the raw data like:
.
└── data
    └── waymo
        └── waymo_format
            ├── validation
            │   └── *.tfrecord
            └── gt.bin (optional)
- run command:
CUDA_VISIBLE_DEVICES=-1 python tools/create_data.py waymo --root-path ./data/waymo/ --out-dir ./data/waymo/ --workers 128 --extra-tag waymo
Note: Since evaluation on the Waymo dataset needs the ground-truth bin file for the validation set, you need to put the .bin file into data/waymo/waymo_format. If you do not have access to it, you can add the --gen-gt-bin argument to the above command:
CUDA_VISIBLE_DEVICES=-1 python tools/create_data.py waymo --root-path ./data/waymo/ --out-dir ./data/waymo/ --workers 128 --extra-tag waymo --gen-gt-bin
This will automatically generate the gt.bin file (which may differ from the official version in some respects) in data/waymo/waymo_format.
- after the pre-processing, the data folder will be organized as:
.
└── data
    └── waymo
        ├── kitti_format
        │   ├── ImageSets
        │   ├── training
        │   └── waymo_infos_val.pkl
        └── waymo_format
            ├── gt.bin
            └── validation
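A quick way to confirm that pre-processing produced the layout above is to check for each expected entry. This is a small sketch using only the standard library; the helper name `missing_paths` is hypothetical, and the paths are copied from the tree above, relative to the repository root.

```python
from pathlib import Path

# Entries taken from the tree above, relative to the repository root.
EXPECTED_PATHS = [
    "data/waymo/kitti_format/ImageSets",
    "data/waymo/kitti_format/training",
    "data/waymo/kitti_format/waymo_infos_val.pkl",
    "data/waymo/waymo_format/gt.bin",
    "data/waymo/waymo_format/validation",
]


def missing_paths(root="."):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    print("layout OK" if not missing else f"missing: {missing}")
```

It only checks that the entries exist; it does not validate their contents.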
Because it is time-consuming to evaluate on the whole Waymo validation set, we modified create_data.py to support pre-processing a partial validation set. You can put any number of *.tfrecord files into data/waymo/waymo_format/validation/ and run the command above; it will automatically generate ImageSets/val.txt and the corresponding gt.bin.
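If the full set of validation records is stored elsewhere, a subset can be staged into the validation folder before running create_data.py. The sketch below is one possible way to do that and is not part of this repository: the function name `stage_subset` and the `source_dir` argument are hypothetical, and "first N files by sorted name" is an arbitrary selection rule.

```python
import shutil
from pathlib import Path


def stage_subset(source_dir, num_files,
                 target_dir="data/waymo/waymo_format/validation"):
    """Copy the first `num_files` *.tfrecord files (sorted by name) into target_dir.

    `source_dir` is a hypothetical folder holding the downloaded validation
    records; the default `target_dir` is the folder named in this README.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    chosen = sorted(Path(source_dir).glob("*.tfrecord"))[:num_files]
    for src in chosen:
        # copy2 also preserves file timestamps.
        shutil.copy2(src, target / src.name)
    return [p.name for p in chosen]
```

After staging, run the create_data.py command as described above so that ImageSets/val.txt and gt.bin are regenerated for the selected files.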
We use the pre-trained SAM in our project, so go to the segment-anything model checkpoints to download the weights and put them into projects/pretrain_weights.
- generate the fake weights for loading (this is only a trick so that test.py can run with a fake weights file, and it only needs to be run once):
python projects/generate_fake_pth.py
- run the command to run inference and evaluate the method:
python tools/test.py projects/configs/sam3d_intensity_bev_waymo_car.py fake.pth
- Quantitative results: tested on a single NVIDIA GeForce RTX 4090 with pytorch==1.12.1 and cuda==11.3 (log).

| Metric | mAP | mAPH |
| --- | --- | --- |
| RANGE_TYPE_VEHICLE_[0, 30)_LEVEL_1 | 19.51 | 13.30 |
| RANGE_TYPE_VEHICLE_[0, 30)_LEVEL_2 | 19.05 | 12.98 |
Although our method is only an initial attempt, we believe it shows the great possibility and opportunity to unleash the potential of foundation models like SAM on 3D vision tasks, especially on 3D object detection. With technologies like few-shot learning and prompt engineering, we can take advantage of vision foundation models more effectively to better solve 3D tasks, especially considering the vast difference between scales of 2D and 3D data.
@article{zhang2023sam3d,
title={SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model},
author={Zhang, Dingyuan and Liang, Dingkang and Yang, Hongcheng and Zou, Zhikang and Ye, Xiaoqing and Liu, Zhe and Bai, Xiang},
journal={Science China Information Sciences},
year={2023}
}