This is a tool created as a project for the Spring 2016 GSLIS Data Cleaning course.
The output of this tool sits somewhere between auto-documentation and data profiling.
This tool was presented at PyData Chicago 2016. Talk recording: https://www.youtube.com/watch?v=Hb7nvHbwNAw&t=4s
Point the tool at a folder of files and it will create a markdown file with basic statistics about each column, along with template areas where you can write a narrative about each one. You can then render that into HTML or simply include it in your data package as documentation.
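To give a concrete idea of what "basic statistics about each column" could involve, here is a minimal sketch of profiling a single CSV. This is not the tool's actual code; the `profile_csv` name, the `missing_code` parameter, and the choice of statistics are illustrative assumptions.

```python
# Minimal sketch (not the tool's actual implementation) of the kind of
# per-column summary a profiler gathers from one CSV file: row count,
# missing values, and distinct values.
import csv
from collections import Counter, defaultdict

def profile_csv(path, missing_code=""):
    stats = defaultdict(lambda: {"count": 0, "missing": 0, "values": Counter()})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for column, value in row.items():
                col = stats[column]
                col["count"] += 1
                if value == missing_code:
                    col["missing"] += 1
                else:
                    col["values"][value] += 1
    # Reduce the raw tallies to simple per-column statistics.
    return {
        column: {
            "count": col["count"],
            "missing": col["missing"],
            "unique": len(col["values"]),
            "top_values": col["values"].most_common(5),
        }
        for column, col in stats.items()
    }
```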
data_profilepy3.py is the version updated for Python 3 and contains the most up-to-date code. It is used the same way as the original. Consider the Python 2 version deprecated.
This is still mostly a proof of concept. There were path issues on Windows, which have hopefully been fixed, and there may be other unknown bugs.
The original script was written using Python 2.7; the Python 3 version above is the one to use now. Either way, run it on the command line.
General use:
python data_profilepy3.py source output missing_code
Working example that will run within this directory:
python data_profilepy3.py vagrants/ vagrant-profiles/ [missing]
This works out to:
python data_profilepy3.py
- runs the script
vagrants/
- Provide single file path or folder with many files
- Currently only built to work with CSV data
vagrant-profiles/
- This is the destination folder for the profile files
- Will either create the folder or overwrite the named contents
- Will create:
    - one JSON file with all profile data
    - one md file per source file with profile data
[missing]
- this is the missing code; use '' for empty (optional; an empty string is presumed if not provided)
- cannot currently specify multiple missing values for a single file
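As a rough illustration of the outputs described above (one JSON file with all profile data plus one markdown stub per source file), here is a sketch that builds on the hypothetical `profile_csv` helper from the earlier example. The file naming, the `write_profiles` name, and the markdown layout are assumptions, not the tool's actual format.

```python
# Rough sketch of the output step: one JSON file with all profile data
# plus one markdown stub (with a template area for narrative) per source
# CSV. Relies on the illustrative profile_csv() defined earlier.
import json
import os
from glob import glob

def write_profiles(source_dir, output_dir, missing_code=""):
    os.makedirs(output_dir, exist_ok=True)
    all_profiles = {}
    for path in glob(os.path.join(source_dir, "*.csv")):
        name = os.path.splitext(os.path.basename(path))[0]
        profile = profile_csv(path, missing_code)
        all_profiles[name] = profile
        # One markdown file per source file, with a place to describe each column.
        lines = ["# Profile: %s" % name, ""]
        for column, stats in profile.items():
            lines += ["## %s" % column,
                      "- rows: %d" % stats["count"],
                      "- missing: %d" % stats["missing"],
                      "- unique values: %d" % stats["unique"],
                      "",
                      "*Describe this column here.*",
                      ""]
        with open(os.path.join(output_dir, name + ".md"), "w") as f:
            f.write("\n".join(lines))
    # One JSON file with all profile data.
    with open(os.path.join(output_dir, "profiles.json"), "w") as f:
        json.dump(all_profiles, f, indent=2)
```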
CC-BY
Fork, whack, republish, whatever. Just cite.
Feel free to work on functions or add-ons that would work with your kind of data or another format.
This is GitHub, after all. Feel free to put in requests or issues and I'll take them into consideration. Let me know if you'd like to collaborate on the project as well. This is my first formal tool, so there are obvious limitations, etc.
Keep in mind, however, that this tool is meant for an average researcher who just wants to download something and run it. They wouldn't necessarily want to use pip or conda to install anything. This tool is in proof-of-concept mode, so criticisms are expected to be substantive and move the conversation forward.
The vagrant data used as an example is from:
Crymble, Adam, et al. (2015). Vagrant Lives: 14,789 Vagrants Processed by Middlesex County, 1777-1786 (version 1.1). Zenodo. doi:10.5281/zenodo.31026.