Develop Best Practices Guidelines for Data Cleaning #7

Open
reblake opened this issue Jun 5, 2018 · 1 comment

Comments

@reblake

reblake commented Jun 5, 2018

Using existing data for re-analysis or new synthesis research is challenging because cleaning the data takes a large amount of time and effort. Best-practice guidelines would streamline this process for data users and reduce the time spent cleaning data before analysis can proceed. Ideally, these guidelines would include a script that addresses the most common data cleaning tasks and problems, as well as a document (or maybe another script?) that gives more detail on specific problems (e.g., scraping data from the web, using NetCDF files, extracting data from Excel workbooks). This would all be done in R.

Common issues include:
Inconsistent capitalization, misspellings, stray white space, dots, missing values not represented by NA, abbreviated text, different names for the same sites/species, metadata in the data table, etc.
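
As a starting point for discussion, here is a rough sketch in base R of what a script for a few of these issues might look like (capitalization, white space, dots in column names, missing-value codes, and alternate names for the same species). Everything in it is an assumption for illustration: the example table, the `clean_table()` helper, the list of missing-value codes, and the synonym lookup are all made up and would need to be adapted to each dataset.

```r
# Hypothetical example data; column names and values are invented for illustration.
raw <- data.frame(
  Site.Name = c(" Kodiak", "kodiak ", "KODIAK", "Homer"),
  Species   = c("P. ochraceus", "Pisaster ochraceus", "pisaster ochraceus", "N/A"),
  Count     = c("12", "-999", "", "7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the default na_codes and synonyms are assumptions,
# not an agreed-upon standard.
clean_table <- function(df,
                        na_codes = c("", "N/A", "NA", "-999"),
                        synonyms = c("p. ochraceus" = "pisaster ochraceus")) {
  # Column names: trim, lower-case, and turn runs of dots/spaces into underscores
  names(df) <- gsub("[. ]+", "_", tolower(trimws(names(df))))

  # Apply the text fixes to character columns only
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) {
    x <- tolower(trimws(x))              # white space and capitalization
    x[x %in% tolower(na_codes)] <- NA    # missing-value codes become NA
    hit <- x %in% names(synonyms)        # abbreviated or alternate names
    x[hit] <- unname(synonyms[x[hit]])
    x
  })
  df
}

clean <- clean_table(raw)
clean$count <- as.numeric(clean$count)   # convert only after NA codes are handled
```

Lower-casing every text column is a blunt choice that may not suit all data (for example, where case carries meaning), and misspellings or metadata embedded in the data table are not handled here; those probably need dataset-specific checks.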

@jhp7e

jhp7e commented Jun 11, 2018

https://academic.oup.com/bioscience/article/63/7/574/289294 may be helpful in identifying issues.
