This repo is a collection of resources on dataset refinement for computer vision (CV), compiled as a supplement to our survey. If you find any work missing or have suggestions (papers, implementations, or other resources), feel free to open a pull request.
Citation

If you find this repo or our paper helpful for your research, please consider citing:
@article{wan2023survey,
author = {Wan, Zhijing and Wang, Zhixiang and Chung, CheukTing and Wang, Zheng},
title = {A Survey of Dataset Refinement for Problems in Computer Vision Datasets},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {0360-0300},
url = {https://doi.org/10.1145/3627157},
doi = {10.1145/3627157},
journal = {ACM Comput. Surv.},
month = {oct}
}
News
[2023/09/26] Our survey has been accepted to ACM Computing Surveys! 😆
[2022/10/14] We have submitted our Dataset-Refinement-for-Computer-Vision survey to arXiv: A Survey of Dataset Refinement for Problems in Computer Vision Datasets. 😆 We will continue to polish this work. 💪
FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation.
Yuhang Zang, Chen Huang, Chen Change Loy.
ICCV 2021. [PDF] [Github] [Project]
VideoLT: Large-scale long-tailed video recognition.
Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, Larry Davis.
ICCV 2021. [PDF] [Github] [Project]
Influence-Balanced Loss for Imbalanced Visual Classification.
Seulki Park, Jongin Lim, Younghan Jeon, Jin Young Choi.
ICCV 2021. [PDF] [Github]
Distribution Alignment: A Unified Framework for Long-tail Visual Recognition.
Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, Jian Sun.
CVPR 2021. [PDF] [Github]
The devil is in classification: A simple framework for long-tail instance segmentation.
Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, Jiashi Feng.
ECCV 2020. [PDF] [Github]
Balanced Meta-Softmax for Long-Tailed Visual Recognition.
Jiawei Ren, Cunjun Yu, Shunan Sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, Hongsheng Li.
NeurIPS 2020. [PDF] [Github]
Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng.
NeurIPS 2019. [PDF] [Github]
Focal loss for dense object detection.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar.
ICCV 2017. [PDF]
C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling.
Chris Drummond, Robert C. Holte.
Workshop on learning from imbalanced datasets II 2003. [PDF]
SlimML: Removing non-critical input data in large-scale iterative machine learning.
Rui Han, Chi Harold Liu, Shilin Li, Lydia Y. Chen, Guoren Wang, Jian Tang, Jieping Ye.
TKDE 2019. [PDF]
A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty.
Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, Xueqi Cheng.
CVPR 2022. [PDF]
Dualgraph: A graph-based method for reasoning about label noise.
HaiYang Zhang, XiMing Xing, Liang Liu.
CVPR 2021. [PDF]
Learning to Reweight Examples for Robust Deep Learning.
Mengye Ren, Wenyuan Zeng, Bin Yang, Raquel Urtasun.
ICML 2018. [PDF]
CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
ECCV 2018. [PDF] [Github]
Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples.
Haw-Shiuan Chang, Erik Learned-Miller, Andrew McCallum.
NeurIPS 2017. [PDF]
Multiclass Learning With Partially Corrupted Labels.
Ruxin Wang, Tongliang Liu, Dacheng Tao.
TNNLS 2017. [PDF]
Self-Paced Learning for Latent Variable Models.
M. Kumar, Benjamin Packer, Daphne Koller.
NeurIPS 2010. [PDF]
Mutual Quantization for Cross-Modal Search with Noisy Labels.
Erkun Yang, Dongren Yao, Tongliang Liu, Cheng Deng.
CVPR 2022. [PDF]
Unicon: Combating label noise through uniform selection and contrastive learning.
Nazmul Karim, Mamshad Nayeem Rizve, Nazanin Rahnavard, Ajmal Mian, Mubarak Shah.
CVPR 2022. [PDF] [Github]
Is your data relevant?: Dynamic selection of relevant data for federated learning.
Lokesh Nagalapatti, Ruhi Sharma Mittal, Ramasuri Narayanam.
AAAI 2022. [PDF]
NGC: a unified framework for learning with open-world noisy data.
Zhi-Fan Wu, Tong Wei, Jianwen Jiang, Chaojie Mao, Mingqian Tang, Yu-Feng Li.
ICCV 2021. [PDF]
Towards Understanding Deep Learning from Noisy Labels with Small-Loss Criterion.
Xian-Jin Gui, Wei Wang, Zhang-Hao Tian.
IJCAI 2021. [PDF]
Robust learning by self-transition for handling noisy labels.
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee.
SIGKDD 2021. [PDF] [Github]
Confident Learning: Estimating Uncertainty in Dataset Labels.
Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang.
Journal of Artificial Intelligence Research 2021. [PDF] [Github] [Project] [Blog]
CRSSC: salvage reusable samples from noisy data for robust learning.
Zeren Sun, Xian-Sheng Hua, Yazhou Yao, Xiu-Shen Wei, Guosheng Hu, Jian Zhang.
ACM MM 2020. [PDF]
Less is better: Unweighted data subsampling via influence function.
Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang.
AAAI 2020. [PDF] [Github]
Curriculum Loss: Robust Learning and Generalization against Label Corruption.
Yueming Lyu, Ivor W. Tsang.
ICLR 2020. [PDF]
A topological filter for learning with label noise.
Pengxiang Wu, Songzhu Zheng, Mayank Goswami, Dimitris N. Metaxas, Chao Chen.
NeurIPS 2020. [PDF] [Github]
Identifying mislabeled data using the area under the margin ranking.
Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, Kilian Q. Weinberger.
NeurIPS 2020. [PDF] [Github] [Project]
O2U-Net: A simple noisy label detection approach for deep neural networks.
Jinchi Huang, Lie Qu, Rongfei Jia, Binqiang Zhao.
ICCV 2019. [PDF]
Understanding and utilizing deep neural networks trained with noisy labels.
Pengfei Chen, Benben Liao, Guangyong Chen, Shengyu Zhang.
ICML 2019. [PDF] [Github]
Learning with bad training data via iterative trimmed loss minimization.
Yanyao Shen, Sujay Sanghavi.
ICML 2019. [PDF]
How does disagreement help generalization against label corruption?
Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W. Tsang, Masashi Sugiyama.
ICML 2019. [PDF]
SELFIE: Refurbishing Unclean Samples for Robust Deep Learning.
Hwanjun Song, Minseok Kim, Jae-Gil Lee.
ICML 2019. [PDF] [Github]
SELF: Learning to filter noisy labels with self-ensembling.
Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox.
arXiv 2019. [PDF]
Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels.
Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, Li Fei-Fei.
ICML 2018. [PDF]
Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels.
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, Masashi Sugiyama.
NeurIPS 2018. [PDF]
Decoupling "when to update" from "how to update".
Eran Malach, Shai Shalev-Shwartz.
NeurIPS 2017. [PDF] [Github]
Learning with confident examples: Rank pruning for robust classification with noisy labels.
Curtis G. Northcutt, Tailin Wu, Isaac L. Chuang.
arXiv 2017. [PDF] [Github]
A domain robust approach for image dataset construction.
Yazhou Yao, Xian-sheng Hua, Fumin Shen, Jian Zhang, Zhenmin Tang.
ACM MM 2016. [PDF]
Adversarial filters of dataset biases.
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, Yejin Choi.
ICML 2020. [PDF] [Github]
Repair: Removing representation bias by dataset resampling.
Yi Li, Nuno Vasconcelos.
CVPR 2019. [PDF] [Github]
Resound: Towards action recognition without representation bias.
Yingwei Li, Yi Li, Nuno Vasconcelos.
ECCV 2018. [PDF]
Variance reduced training with stratified sampling for forecasting models.
Yucheng Lu, Youngsuk Park, Lifan Chen, Yuyang Wang, Christopher De Sa, Dean Foster.
ICML 2021. [PDF]
Curriculum learning by dynamic instance hardness.
Tianyi Zhou, Shengjie Wang, Jeffrey Bilmes.
NeurIPS 2020. [PDF] [Github]
Ordered SGD: A new stochastic optimization framework for empirical risk minimization.
Kenji Kawaguchi, Haihao Lu.
AISTATS 2020. [PDF]
Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling.
Xinyu Peng, Li Li, Fei-Yue Wang.
TNNLS 2019. [PDF]
Accelerating deep learning by focusing on the biggest losers.
Angela H. Jiang, Daniel Lin-Kit Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai.
arXiv 2019. [PDF]
Variance reduction in SGD by distributed importance sampling.
Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio.
arXiv 2015. [PDF]
Accelerating minibatch stochastic gradient descent using stratified sampling.
Peilin Zhao, Tong Zhang.
arXiv 2014. [PDF]
Adaptive second order coresets for data-efficient machine learning.
Omead Pooladzandi, David Davini, Baharan Mirzasoleiman.
ICML 2022. [PDF] [Github]
Online coreset selection for rehearsal-based continual learning.
Jaehong Yoon, Divyam Madaan, Eunho Yang, Sung Ju Hwang.
ICLR 2022. [PDF]
AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning.
Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer.
arXiv 2022. [PDF] [Github]
Glister: Generalization based data subset selection for efficient and robust learning.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer.
AAAI 2021. [PDF] [Github] [Project]
Grad-match: Gradient matching based data subset selection for efficient deep model training.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer.
ICML 2021. [PDF] [Github] [Project]
Retrieve: Coreset selection for efficient and robust semi-supervised learning.
Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, Rishabh Iyer.
NeurIPS 2021. [PDF] [Github] [Project]
Deep learning on a data diet: Finding important examples early in training.
Mansheej Paul, Surya Ganguli, Gintare Karolina Dziugaite.
NeurIPS 2021. [PDF]
Coresets for data-efficient training of machine learning models.
Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec.
ICML 2020. [PDF]
Selection via proxy: Efficient data selection for deep learning.
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia.
ICLR 2020. [PDF] [Github]
Coresets via bilevel optimization for continual learning and streaming.
Zalán Borsos, Mojmir Mutny, Andreas Krause.
NeurIPS 2020. [PDF] [Github]
Coresets for robust training of deep neural networks against noisy labels.
Baharan Mirzasoleiman, Kaidi Cao, Jure Leskovec.
NeurIPS 2020. [PDF] [Github]
Data shapley: Equitable valuation of data for machine learning.
Amirata Ghorbani, James Zou.
ICML 2019. [PDF] [Github]
An empirical study of example forgetting during deep neural network learning.
Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon.
ICLR 2019. [PDF] [Github]
Gradient based sample selection for online continual learning.
Rahaf Aljundi, Min Lin, Baptiste Goujaud, Yoshua Bengio.
NeurIPS 2019. [PDF] [Github]
E2-Train: Training state-of-the-art CNNs with over 80% energy savings.
Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, Zhangyang Wang.
NeurIPS 2019. [PDF] [Project]
Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need.
Vighnesh Birodkar, Hossein Mobahi, Samy Bengio.
arXiv 2019. [PDF]
Selective experience replay for lifelong learning.
David Isele, Akansel Cosgun.
AAAI 2018. [PDF]
Learning what data to learn.
Yang Fan, Fei Tian, Tao Qin, Jiang Bian, Tie-Yan Liu.
arXiv 2017. [PDF]
Training region-based object detectors with online hard example mining.
Abhinav Shrivastava, Abhinav Gupta, Ross Girshick.
CVPR 2016. [PDF]
Coresets for nonparametric estimation-the case of DP-means.
Olivier Bachem, Mario Lucic, Andreas Krause.
ICML 2015. [PDF]
Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories.
Fait Poms, Vishnu Sarukkai, Ravi Teja Mullapudi, Nimit S. Sohoni, William R. Mark, Deva Ramanan, Kayvon Fatahalian.
ICCV 2021. [PDF]
Glister: Generalization based data subset selection for efficient and robust learning.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer.
AAAI 2021. [PDF] [Github] [Project]
Deep batch active learning by diverse, uncertain gradient lower bounds.
Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, Alekh Agarwal.
ICLR 2020. [PDF]
Single shot active learning using pseudo annotators.
Yazhou Yang, Marco Loog.
Pattern Recognition 2019. [PDF]
The power of ensembles for active learning in image classification.
William H. Beluch, Tim Genewein, Andreas Nürnberger, Jan Mathias Köhler.
CVPR 2018. [PDF]
Active learning for convolutional neural networks: A core-set approach.
Ozan Sener, Silvio Savarese.
ICLR 2018. [PDF]
Meta-learning for batch mode active learning.
Sachin Ravi, Hugo Larochelle.
ICLR Workshop 2018. [PDF]
Optimization as a model for few-shot learning.
Sachin Ravi, Hugo Larochelle.
ICLR 2017. [PDF] [Github]
Learning how to active learn: A deep reinforcement learning approach.
Meng Fang, Yuan Li, Trevor Cohn.
EMNLP 2017. [PDF] [Github]
Deep active learning for image classification.
Hiranmayi Ranganathan, Hemanth Venkateswara, Shayok Chakraborty, Sethuraman Panchanathan.
ICIP 2017. [PDF]
A meta-learning approach to one-step active learning.
Gabriella Contardo, Ludovic Denoyer, Thierry Artieres.
arXiv 2017. [PDF]
Submodularity in data subset selection and active learning.
Kai Wei, Rishabh Iyer, Jeff Bilmes.
ICML 2015. [PDF]
A convex optimization framework for active learning.
Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, S. Shankar Sastry.
ICCV 2013. [PDF]
Active instance sampling via matrix partition.
Yuhong Guo.
NeurIPS 2010. [PDF]
Batch-mode active-learning methods for the interactive classification of remote sensing images.
Begüm Demir, Claudio Persello, Lorenzo Bruzzone.
IEEE Trans Geosci Remote Sens 2010. [PDF]
Multi-class active learning for image classification.
Ajay J. Joshi, Fatih Porikli, Nikolaos Papanikolopoulos.
CVPR 2009. [PDF]
Active learning using pre-clustering.
Hieu T. Nguyen, Arnold Smeulders.
ICML 2004. [PDF]
Dataset distillation by matching training trajectories.
George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, Jun-Yan Zhu.
CVPR 2022. [PDF] [Github] [Project]
Dataset Condensation with Gradient Matching.
Bo Zhao, Konda Reddy Mopuri, Hakan Bilen.
ICLR 2021. [PDF]
Dataset distillation.
Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, Alexei A. Efros.
arXiv 2018. [PDF]
A survey on curriculum learning.
Xin Wang, Yudong Chen, Wenwu Zhu.
TPAMI 2022. [PDF]
A Survey on Active Deep Learning: From Model Driven to Data Driven.
Peng Liu, Lizhe Wang, Rajiv Ranjan, Guojin He, Lei Zhao.
ACM Comput. Surv. 2022. [PDF]
Learning from noisy labels with deep neural networks: A survey.
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee.
TNNLS 2022. [PDF] [Github]
A survey on bias in visual datasets.
Simone Fabbrizzi, Symeon Papadopoulos, Eirini Ntoutsi, Ioannis Kompatsiaris.
CVIU 2022. [PDF]
DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning.
Chengcheng Guo, Bo Zhao, Yanbing Bai.
arXiv 2022. [PDF] [Github]
Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective.
Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee.
arXiv 2021. [PDF]
Deep long-tailed learning: A survey.
Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, Jiashi Feng.
arXiv 2021. [PDF] [Github]
A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations?
Filipe R. Cordeiro, Gustavo Carneiro.
SIBGRAPI 2020. [PDF]
A review of instance selection methods.
J. Arturo Olvera-López, J. Ariel Carrasco-Ochoa, J. Francisco Martínez-Trinidad, Josef Kittler.
Artif Intell Rev 2010. [PDF]