A Survey of Dataset Refinement for Problems in Computer Vision Datasets

DatasetRefinement-CV


[arXiv] [PDF] [Project Page]


This repo collects resources on dataset refinement for computer vision (CV), as a supplement to our survey. If you find any work missing or have suggestions (papers, implementations, and other resources), feel free to open a pull request.

Citation

If you find this repo or our paper helpful for your research, please consider citing:
@article{wan2023survey,
author = {Wan, Zhijing and Wang, Zhixiang and Chung, CheukTing and Wang, Zheng},
title = {A Survey of Dataset Refinement for Problems in Computer Vision Datasets},
year = {2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {0360-0300},
url = {https://doi.org/10.1145/3627157},
doi = {10.1145/3627157},
journal = {ACM Comput. Surv.},
month = {oct}
}
Table of Contents

News
Dataset Refinement for Robust Learning
Dataset Refinement for Fair Learning
Dataset Refinement for Data-efficient Learning
Dataset Refinement for Label-efficient Learning
Related Research

News 🎉

[2023/09/26] Our survey has been accepted by ACM Computing Surveys! 😆

[2022/10/14] We have posted our survey on arXiv: A Survey of Dataset Refinement for Problems in Computer Vision Datasets. 😆 We will continue to polish this work. 💪

Dataset Refinement for Robust Learning

On the Class-imbalanced Dataset

Data Sampling

FASA: Feature Augmentation and Sampling Adaptation for Long-Tailed Instance Segmentation.
Yuhang Zang, Chen Huang, Chen Change Loy.
ICCV 2021. [PDF] [Github] [Project]

VideoLT: Large-scale long-tailed video recognition.
Xing Zhang, Zuxuan Wu, Zejia Weng, Huazhu Fu, Jingjing Chen, Yu-Gang Jiang, Larry Davis.
ICCV 2021. [PDF] [Github] [Project]

Influence-Balanced Loss for Imbalanced Visual Classification.
Seulki Park, Jongin Lim, Younghan Jeon, Jin Young Choi.
ICCV 2021. [PDF] [Github]

Distribution Alignment: A Unified Framework for Long-tail Visual Recognition.
Songyang Zhang, Zeming Li, Shipeng Yan, Xuming He, Jian Sun.
CVPR 2021. [PDF] [Github]

The devil is in classification: A simple framework for long-tail instance segmentation.
Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, Jiashi Feng.
ECCV 2020. [PDF] [Github]

Balanced Meta-Softmax for Long-Tailed Visual Recognition.
Jiawei Ren, Cunjun Yu, Shunan Sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, Hongsheng Li.
NeurIPS 2020. [PDF] [Github]

Meta-Weight-Net: Learning an Explicit Mapping For Sample Weighting.
Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, Deyu Meng.
NeurIPS 2019. [PDF] [Github]

Focal loss for dense object detection.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar.
ICCV 2017. [PDF]

C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling.
Chris Drummond, Robert C. Holte.
Workshop on learning from imbalanced datasets II 2003. [PDF]
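Many of the sampling methods above share one primitive: re-weighting each example's draw probability so that rare classes are seen as often as frequent ones. A minimal stdlib-Python sketch of that inverse-frequency idea (the helper name is ours, not from any particular paper):

```python
import random
from collections import Counter

def balanced_sample_weights(labels):
    """Per-example draw weights inversely proportional to class frequency,
    so every class receives the same total sampling mass."""
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[y]) for y in labels]

# Toy long-tailed label set: class 0 has 8 examples, class 1 only 2.
labels = [0] * 8 + [1] * 2
weights = balanced_sample_weights(labels)

# Drawing with these weights re-balances the classes in expectation:
# each minority example is 4x more likely to be drawn than a majority one.
batch = random.choices(labels, weights=weights, k=6)
```

Over-sampling the tail this way is the mirror image of under-sampling the head (as compared by Drummond and Holte above); both equalize the effective class priors.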

Subset Selection

SlimML: Removing non-critical input data in large-scale iterative machine learning.
Rui Han, Chi Harold Liu, Shilin Li, Lydia Y. Chen, Guoren Wang, Jian Tang, Jieping Ye.
TKDE 2019. [PDF]

On the Dataset with Noisy Labels

Data Sampling

A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty.
Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, Xueqi Cheng.
CVPR 2022. [PDF]

DualGraph: A graph-based method for reasoning about label noise.
HaiYang Zhang, XiMing Xing, Liang Liu.
CVPR 2021. [PDF]

Learning to Reweight Examples for Robust Deep Learning.
Mengye Ren, Wenyuan Zeng, Bin Yang, Raquel Urtasun.
ICML 2018. [PDF]

CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images.
Sheng Guo, Weilin Huang, Haozhi Zhang, Chenfan Zhuang, Dengke Dong, Matthew R. Scott, Dinglong Huang.
ECCV 2018. [PDF] [Github]

Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples.
Haw-Shiuan Chang, Erik Learned-Miller, Andrew McCallum.
NeurIPS 2017. [PDF]

Multiclass Learning With Partially Corrupted Labels.
Ruxin Wang, Tongliang Liu, Dacheng Tao.
TNNLS 2017. [PDF]

Self-Paced Learning for Latent Variable Models.
M. Kumar, Benjamin Packer, Daphne Koller.
NeurIPS 2010. [PDF]

Subset Selection

Mutual Quantization for Cross-Modal Search with Noisy Labels.
Erkun Yang, Dongren Yao, Tongliang Liu, Cheng Deng.
CVPR 2022. [PDF]

UniCon: Combating label noise through uniform selection and contrastive learning.
Nazmul Karim, Mamshad Nayeem Rizve, Nazanin Rahnavard, Ajmal Mian, Mubarak Shah.
CVPR 2022. [PDF] [Github]

Is your data relevant?: Dynamic selection of relevant data for federated learning.
Lokesh Nagalapatti, Ruhi Sharma Mittal, Ramasuri Narayanam.
AAAI 2022. [PDF]

NGC: a unified framework for learning with open-world noisy data.
Zhi-Fan Wu, Tong Wei, Jianwen Jiang, Chaojie Mao, Mingqian Tang, Yu-Feng Li.
ICCV 2021. [PDF]

Towards Understanding Deep Learning from Noisy Labels with Small-Loss Criterion.
Xian-Jin Gui, Wei Wang, Zhang-Hao Tian.
IJCAI 2021. [PDF]

Robust learning by self-transition for handling noisy labels.
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee.
SIGKDD 2021. [PDF] [Github]

Confident Learning: Estimating Uncertainty in Dataset Labels.
Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang.
Journal of Artificial Intelligence Research 2021. [PDF] [Github] [Project] [Blog]

CRSSC: Salvage reusable samples from noisy data for robust learning.
Zeren Sun, Xian-Sheng Hua, Yazhou Yao, Xiu-Shen Wei, Guosheng Hu, Jian Zhang.
ACM MM 2020. [PDF]

Less is better: Unweighted data subsampling via influence function.
Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang.
AAAI 2020. [PDF] [Github]

Curriculum Loss: Robust Learning and Generalization against Label Corruption.
Yueming Lyu, Ivor W. Tsang.
ICLR 2020. [PDF]

A topological filter for learning with label noise.
Pengxiang Wu, Songzhu Zheng, Mayank Goswami, Dimitris N. Metaxas, Chao Chen.
NeurIPS 2020. [PDF] [Github]

Identifying mislabeled data using the area under the margin ranking.
Geoff Pleiss, Tianyi Zhang, Ethan Elenberg, Kilian Q. Weinberger.
NeurIPS 2020. [PDF] [Github] [Project]

O2U-Net: A simple noisy label detection approach for deep neural networks.
Jinchi Huang, Lie Qu, Rongfei Jia, Binqiang Zhao.
ICCV 2019. [PDF]

Understanding and utilizing deep neural networks trained with noisy labels.
Pengfei Chen, Benben Liao, Guangyong Chen, Shengyu Zhang.
ICML 2019. [PDF] [Github]

Learning with bad training data via iterative trimmed loss minimization.
Yanyao Shen, Sujay Sanghavi.
ICML 2019. [PDF]

How does disagreement help generalization against label corruption?
Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor W. Tsang, Masashi Sugiyama.
ICML 2019. [PDF]

SELFIE: Refurbishing Unclean Samples for Robust Deep Learning.
Hwanjun Song, Minseok Kim, Jae-Gil Lee.
ICML 2019. [PDF] [Github]

SELF: Learning to filter noisy labels with self-ensembling.
Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, Thomas Brox.
arXiv 2019. [PDF]

MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels.
Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, Li Fei-Fei.
ICML 2018. [PDF]

Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels.
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, Masashi Sugiyama.
NeurIPS 2018. [PDF]

Decoupling "when to update" from "how to update".
Eran Malach, Shai Shalev-Shwartz.
NeurIPS 2017. [PDF] [Github]

Learning with confident examples: Rank pruning for robust classification with noisy labels.
Curtis G. Northcutt, Tailin Wu, Isaac L. Chuang.
arXiv 2017. [PDF] [Github]

A domain robust approach for image dataset construction.
Yazhou Yao, Xian-sheng Hua, Fumin Shen, Jian Zhang, Zhenmin Tang.
ACM MM 2016. [PDF]
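Several of the selection methods above (e.g. Co-teaching, iterative trimmed loss minimization, and the small-loss criterion paper) build on one empirical observation: deep networks fit clean labels before noisy ones, so low-loss examples are more likely to be clean. A minimal sketch of that small-loss filter, assuming the noise rate is known (helper name is ours):

```python
def small_loss_selection(losses, noise_rate):
    """Keep the (1 - noise_rate) fraction of examples with the smallest
    per-example loss; the rest are treated as likely mislabeled."""
    n_keep = round(len(losses) * (1.0 - noise_rate))
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:n_keep])

# Toy per-example losses partway through training: the two large values
# mimic mislabeled examples the network has not (yet) memorized.
losses = [0.10, 0.20, 3.50, 0.15, 4.00, 0.30]
clean_idx = small_loss_selection(losses, noise_rate=0.5)
# → [0, 1, 3]
```

Methods such as Co-teaching add a twist on top of this filter: two networks exchange their small-loss selections so that neither confirms its own mistakes.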

🔝

Dataset Refinement for Fair Learning

Subset Selection

Adversarial filters of dataset biases.
Ronan Le Bras, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal, Yejin Choi.
ICML 2020. [PDF] [Github]

REPAIR: Removing representation bias by dataset resampling.
Yi Li, Nuno Vasconcelos.
CVPR 2019. [PDF] [Github]

RESOUND: Towards action recognition without representation bias.
Yingwei Li, Yi Li, Nuno Vasconcelos.
ECCV 2018. [PDF]

🔝

Dataset Refinement for Data-efficient Learning

Data Sampling

Variance reduced training with stratified sampling for forecasting models.
Yucheng Lu, Youngsuk Park, Lifan Chen, Yuyang Wang, Christopher De Sa, Dean Foster.
ICML 2021. [PDF]

Curriculum learning by dynamic instance hardness.
Tianyi Zhou, Shengjie Wang, Jeffrey Bilmes.
NeurIPS 2020. [PDF] [Github]

Ordered SGD: A new stochastic optimization framework for empirical risk minimization.
Kenji Kawaguchi, Haihao Lu.
AISTATS 2020. [PDF]

Accelerating Minibatch Stochastic Gradient Descent Using Typicality Sampling.
Xinyu Peng, Li Li, Fei-Yue Wang.
TNNLS 2019. [PDF]

Accelerating deep learning by focusing on the biggest losers.
Angela H. Jiang, Daniel Lin-Kit Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminksy, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai.
arXiv 2019. [PDF]

Variance reduction in SGD by distributed importance sampling.
Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio.
arXiv 2015. [PDF]

Accelerating minibatch stochastic gradient descent using stratified sampling.
Peilin Zhao, Tong Zhang.
arXiv 2014. [PDF]
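A recurring trick in the sampling papers above (most directly in "focusing on the biggest losers") is to spend backward passes only on the hardest part of each minibatch. A stdlib sketch of that selective-backprop filter (function name is ours):

```python
def biggest_losers(batch_losses, keep_frac=0.5):
    """Return indices of the highest-loss examples in a minibatch;
    only these would be used for the backward pass."""
    n_keep = max(1, round(len(batch_losses) * keep_frac))
    ranked = sorted(range(len(batch_losses)),
                    key=lambda i: batch_losses[i], reverse=True)
    return sorted(ranked[:n_keep])

# Toy minibatch losses: two examples are already well fit.
losses = [0.05, 2.00, 0.10, 1.50]
hard_idx = biggest_losers(losses, keep_frac=0.5)
# → [1, 3]
```

Note the tension with the noisy-label section above: under label noise, the highest-loss examples are often the mislabeled ones, so this heuristic assumes reasonably clean annotations.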

Subset Selection

Adaptive second order coresets for data-efficient machine learning.
Omead Pooladzandi, David Davini, Baharan Mirzasoleiman.
ICML 2022. [PDF] [Github]

Online coreset selection for rehearsal-based continual learning.
Jaehong Yoon, Divyam Madaan, Eunho Yang, Sung Ju Hwang.
ICLR 2022. [PDF]

AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning.
Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer.
arXiv 2022. [PDF] [Github]

GLISTER: Generalization based data subset selection for efficient and robust learning.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer.
AAAI 2021. [PDF] [Github] [Project]

GRAD-MATCH: Gradient matching based data subset selection for efficient deep model training.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Abir De, Rishabh Iyer.
ICML 2021. [PDF] [Github] [Project]

RETRIEVE: Coreset selection for efficient and robust semi-supervised learning.
Krishnateja Killamsetty, Xujiang Zhao, Feng Chen, Rishabh Iyer.
NeurIPS 2021. [PDF] [Github] [Project]

Deep learning on a data diet: Finding important examples early in training.
Mansheej Paul, Surya Ganguli, Gintare Karolina Dziugaite.
NeurIPS 2021. [PDF]

Coresets for data-efficient training of machine learning models.
Baharan Mirzasoleiman, Jeff Bilmes, Jure Leskovec.
ICML 2020. [PDF]

Selection via proxy: Efficient data selection for deep learning.
Cody Coleman, Christopher Yeh, Stephen Mussmann, Baharan Mirzasoleiman, Peter Bailis, Percy Liang, Jure Leskovec, Matei Zaharia.
ICLR 2020. [PDF] [Github]

Coresets via bilevel optimization for continual learning and streaming.
Zalán Borsos, Mojmir Mutny, Andreas Krause.
NeurIPS 2020. [PDF] [Github]

Coresets for robust training of deep neural networks against noisy labels.
Baharan Mirzasoleiman, Kaidi Cao, Jure Leskovec.
NeurIPS 2020. [PDF] [Github]

Data shapley: Equitable valuation of data for machine learning.
Amirata Ghorbani, James Zou.
ICML 2019. [PDF] [Github]

An empirical study of example forgetting during deep neural network learning.
Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon.
ICLR 2019. [PDF] [Github]

Gradient based sample selection for online continual learning.
Rahaf Aljundi, Min Lin, Baptiste Goujaud, Yoshua Bengio.
NeurIPS 2019. [PDF] [Github]

E2-Train: Training state-of-the-art CNNs with over 80% energy savings.
Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, Zhangyang Wang.
NeurIPS 2019. [PDF] [Project]

Semantic Redundancies in Image-Classification Datasets: The 10% You Don't Need.
Vighnesh Birodkar, Hossein Mobahi, Samy Bengio.
arXiv 2019. [PDF]

Selective experience replay for lifelong learning.
David Isele, Akansel Cosgun.
AAAI 2018. [PDF]

Learning what data to learn.
Yang Fan, Fei Tian, Tao Qin, Jiang Bian, Tie-Yan Liu.
arXiv 2017. [PDF]

Training region-based object detectors with online hard example mining.
Abhinav Shrivastava, Abhinav Gupta, Ross Girshick.
CVPR 2016. [PDF]

Coresets for nonparametric estimation-the case of DP-means.
Olivier Bachem, Mario Lucic, Andreas Krause.
ICML 2015. [PDF]
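Several selection criteria above reduce to a cheap per-example statistic collected during training; the forgetting-events score of Toneva et al. is a representative one: count how often an example flips from correctly to incorrectly classified across epochs, and prune the never-forgotten (redundant) examples first. A stdlib sketch (helper name is ours):

```python
def forgetting_counts(correct_history):
    """Per-example count of forgetting events: transitions from
    correctly classified (True) to incorrectly classified (False)
    between consecutive training epochs."""
    return [
        sum(1 for prev, cur in zip(hist, hist[1:]) if prev and not cur)
        for hist in correct_history
    ]

# Rows: one example each, correctness over five epochs.
history = [
    [True, True, True, True, True],     # never forgotten: prune first
    [False, True, False, True, False],  # forgotten twice: keep
    [False, False, True, True, True],   # learned late, never forgotten
]
scores = forgetting_counts(history)
# → [0, 2, 0]
```

The original paper reports that a sizable fraction of common benchmarks consists of such never-forgotten examples, which connects this score to the "Semantic Redundancies" entry above.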

🔝

Dataset Refinement for Label-efficient Learning

Active Learning

Low-Shot Validation: Active Importance Sampling for Estimating Classifier Performance on Rare Categories.
Fait Poms, Vishnu Sarukkai, Ravi Teja Mullapudi, Nimit S. Sohoni, William R. Mark, Deva Ramanan, Kayvon Fatahalian.
ICCV 2021. [PDF]

GLISTER: Generalization based data subset selection for efficient and robust learning.
Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer.
AAAI 2021. [PDF] [Github] [Project]

Deep batch active learning by diverse, uncertain gradient lower bounds.
Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, Alekh Agarwal.
ICLR 2020. [PDF]

Single shot active learning using pseudo annotators.
Yazhou Yang, Marco Loog.
Pattern Recognition 2019. [PDF]

The power of ensembles for active learning in image classification.
William H. Beluch, Tim Genewein, Andreas Nürnberger, Jan Mathias Köhler.
CVPR 2018. [PDF]

Active learning for convolutional neural networks: A core-set approach.
Ozan Sener, Silvio Savarese.
ICLR 2018. [PDF]

Meta-learning for batch mode active learning.
Sachin Ravi, Hugo Larochelle.
ICLR Workshop 2018. [PDF]

Optimization as a model for few-shot learning.
Sachin Ravi, Hugo Larochelle.
ICLR 2017. [PDF] [Github]

Learning how to active learn: A deep reinforcement learning approach.
Meng Fang, Yuan Li, Trevor Cohn.
EMNLP 2017. [PDF] [Github]

Deep active learning for image classification.
Hiranmayi Ranganathan, Hemanth Venkateswara, Shayok Chakraborty, Sethuraman Panchanathan.
ICIP 2017. [PDF]

A meta-learning approach to one-step active learning.
Gabriella Contardo, Ludovic Denoyer, Thierry Artieres.
arXiv 2017. [PDF]

Submodularity in data subset selection and active learning.
Kai Wei, Rishabh Iyer, Jeff Bilmes.
ICML 2015. [PDF]

A convex optimization framework for active learning.
Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, S. Shankar Sasrty.
ICCV 2013. [PDF]

Active instance sampling via matrix partition.
Yuhong Guo.
NeurIPS 2010. [PDF]

Batch-mode active-learning methods for the interactive classification of remote sensing images.
Begüm Demir, Claudio Persello, Lorenzo Bruzzone.
IEEE Trans Geosci Remote Sens 2010. [PDF]

Multi-class active learning for image classification.
Ajay J. Joshi, Fatih Porikli, Nikolaos Papanikolopoulos.
CVPR 2009. [PDF]

Active learning using pre-clustering.
Hieu T. Nguyen, Arnold Smeulders.
ICML 2004. [PDF]
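A large fraction of the pool-based methods above start from the same baseline: query the unlabeled examples the current model is least sure about, e.g. by predictive entropy. A stdlib sketch of that maximum-entropy acquisition step (function names are ours):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(predictions, budget):
    """Indices of the `budget` unlabeled examples whose predicted
    distribution has the highest entropy: the next labeling queries."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]), reverse=True)
    return sorted(ranked[:budget])

# Softmax outputs of the current model on three unlabeled images.
preds = [
    [0.98, 0.01, 0.01],  # confident: little to gain from a label
    [0.34, 0.33, 0.33],  # near-uniform: most informative to label
    [0.70, 0.20, 0.10],
]
query = most_uncertain(preds, budget=1)
# → [1]
```

Pure uncertainty sampling tends to pick redundant near-duplicates, which is exactly the gap the diversity-aware methods above (core-set selection, clustering-based querying, BADGE) address.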

🔝

Related Research

Dataset distillation/condensation

Dataset distillation by matching training trajectories.
George Cazenavette, Tongzhou Wang, Antonio Torralba, Alexei A. Efros, Jun-Yan Zhu.
CVPR 2022. [PDF] [Github] [Project]

Dataset Condensation with Gradient Matching.
Bo Zhao, Konda Reddy Mopuri, Hakan Bilen.
ICLR 2021. [PDF]

Dataset distillation.
Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, Alexei A. Efros.
arXiv 2018. [PDF]

Other Surveys

A survey on curriculum learning.
Xin Wang, Yudong Chen, Wenwu Zhu.
TPAMI 2022. [PDF]

A Survey on Active Deep Learning: From Model Driven to Data Driven.
Peng Liu, Lizhe Wang, Rajiv Ranjan, Guojin He, Lei Zhao.
ACM Comput. Surv. 2022. [PDF]

Learning from noisy labels with deep neural networks: A survey.
Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, Jae-Gil Lee.
TNNLS 2022. [PDF] [Github]

A survey on bias in visual datasets.
Simone Fabbrizzi, Symeon Papadopoulos, Eirini Ntoutsi, Ioannis Kompatsiaris.
CVIU 2022. [PDF]

DeepCore: A Comprehensive Library for Coreset Selection in Deep Learning.
Chengcheng Guo, Bo Zhao, Yanbing Bai.
arXiv 2022. [PDF] [Github]

Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective.
Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee.
arXiv 2021. [PDF]

Deep long-tailed learning: A survey.
Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, Jiashi Feng.
arXiv 2021. [PDF] [Github]

A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations?
Filipe R. Cordeiro, Gustavo Carneiro.
SIBGRAPI 2020. [PDF]

A review of instance selection methods.
J. Arturo Olvera-López, J. Ariel Carrasco-Ochoa, J. Francisco Martínez-Trinidad, Josef Kittler.
Artif Intell Rev 2010. [PDF]


🔝
