This repository implements sound event detection. The data used in our experiments comes from DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events), specifically DCASE2021 Task 4 and DCASE2022 Task 4. Some code is borrowed from pb_sed. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model (MLFL-CNN).
You are welcome to use it for research purposes or DCASE participation.
The former is an event-wise model: it learns event-level representations and predicts sound event categories and boundaries directly. The latter follows the widely adopted frame-classification scheme, in which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing.
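The thresholding-and-smoothing post-processing of the frame-classification scheme can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name, the 0.5 threshold, the median-filter length, and the hop size are all assumptions for the example.

```python
import numpy as np
from scipy.ndimage import median_filter

def frame_probs_to_events(probs, threshold=0.5, filter_len=7, hop_secs=0.064):
    """Convert per-frame class probabilities (shape [T, C]) into
    (class, onset_sec, offset_sec) events via thresholding and smoothing."""
    events = []
    for c in range(probs.shape[1]):
        # threshold, then smooth the binary activity with a median filter
        active = median_filter((probs[:, c] > threshold).astype(np.float32),
                               size=filter_len) > 0.5
        # pad with zeros so diff() marks onsets (+1) and offsets (-1)
        padded = np.concatenate(([0], active.astype(np.int8), [0]))
        change = np.diff(padded)
        for onset, offset in zip(np.where(change == 1)[0], np.where(change == -1)[0]):
            events.append((c, onset * hop_secs, offset * hop_secs))
    return events
```

The median filter suppresses isolated spurious frames, so short threshold-crossing glitches do not become separate events.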
For SEDT, self-supervised pre-training on unlabeled data is applied, and semi-supervised learning is adopted via an online teacher, which is updated from the student model with an exponential moving average (EMA) and generates pseudo-labels for the weakly-labeled and unlabeled data.
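The teacher's EMA update amounts to moving each teacher weight slightly toward the corresponding student weight after every training step. A minimal sketch in plain NumPy, using assumed parameter dictionaries rather than the system's actual network objects:

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """In-place exponential moving average of model parameters:
    teacher <- decay * teacher + (1 - decay) * student."""
    for name, t in teacher_params.items():
        t *= decay
        t += (1.0 - decay) * student_params[name]
```

With a decay close to 1, the teacher changes slowly, which makes its pseudo-labels more stable than the student's own predictions.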
For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. It incorporates techniques such as focal loss and metric learning into a CRNN model to form the MLFL model, adopts the mean-teacher method for semi-supervised learning, and uses a tag-conditioned CNN model to predict the final results from the output of MLFL.
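For reference, the focal loss mentioned above down-weights frames the model already classifies confidently, focusing training on hard examples. A minimal NumPy sketch of the standard binary formulation (the default gamma and alpha here are the common values from the original focal-loss paper, not necessarily the ones used in MLFL):

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Focal loss (Lin et al., 2017): scales binary cross-entropy by
    (1 - p_t)^gamma so well-classified frames contribute less."""
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)
    p_t = np.where(targets == 1, probs, 1.0 - probs)      # prob of the true class
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)  # class-balance weight
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

Setting gamma to 0 and alpha to 1 recovers plain binary cross-entropy.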
Please don't hesitate to contact us should you have any questions. You can email me at [email protected].
If you use our code, you are encouraged to cite the following papers, especially the first one.
- A Hybrid System of Sound Event Detection Transformer and Frame-Wise Model for DCASE 2022 Task 4, Yiming Li, Zhifang Guo, Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan and Kazushige Ouchi.
- Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection, Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan and Kazushige Ouchi.
- SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer, Zhirong Ye, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan and Kazushige Ouchi.
- Sound Event Detection Using Metric Learning and Focal Loss for DCASE 2021 Task 4, Gangyi Tian, Yuxin Huang, Zhirong Ye, Shuo Ma, Xiangdong Wang, Hong Liu, Yueliang Qian, Rui Tao, Long Yan, Kazushige Ouchi, Janek Ebbers and Reinhold Haeb-Umbach.
@techreport{Li2022d,
author = "Li, Yiming and Guo, Zhifang and Ye, Zhirong and Wang, Xiangdong and Liu, Hong and Qian, Yueliang and Tao, Rui and Yan, Long and Ouchi, Kazushige",
title = "A Hybrid System of Sound Event Detection Transformer and Frame-Wise Model for {DCASE} 2022 Task 4",
institution = "DCASE2022 Challenge",
url = "https://dcase.community/documents/challenge2022/technical_reports/DCASE2022_Li_98_t4.pdf",
year = "2022",
month = "June"
}
@article{Ye2021,
author = "Ye, Zhirong and Wang, Xiangdong and Liu, Hong and Qian, Yueliang and Tao, Rui and Yan, Long and Ouchi, Kazushige",
title = "Sound Event Detection Transformer: An Event-based End-to-End Model for Sound Event Detection",
journal = "arXiv preprint arXiv:2110.02011",
year = "2021",
url = "https://arxiv.org/abs/2110.02011"
}
@article{Ye2021a,
author = "Ye, Zhirong and Wang, Xiangdong and Liu, Hong and Qian, Yueliang and Tao, Rui and Yan, Long and Ouchi, Kazushige",
title = "{SP-SEDT}: Self-supervised Pre-training for Sound Event Detection Transformer",
journal = "arXiv preprint arXiv:2111.15222",
year = "2021",
url = "https://arxiv.org/abs/2111.15222"
}
@techreport{Tian2021,
author = "Tian, Gangyi and Huang, Yuxin and Ye, Zhirong and Ma, Shuo and Wang, Xiangdong and Liu, Hong and Qian, Yueliang and Tao, Rui and Yan, Long and Ouchi, Kazushige and Ebbers, Janek and Haeb-Umbach, Reinhold",
title = "Sound Event Detection Using Metric Learning and Focal Loss for {DCASE} 2021 Task 4",
institution = "DCASE2021 Challenge",
url = "https://dcase.community/documents/challenge2021/technical_reports/DCASE2021_Tian_130_t4.pdf",
year = "2021",
month = "June",
}