Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial scripts to cleanup data #1

Merged
merged 11 commits into from
May 9, 2020
Merged

Add initial scripts to cleanup data #1

merged 11 commits into from
May 9, 2020

Conversation

mrcnc
Copy link
Member

@mrcnc mrcnc commented Apr 15, 2020

I've added 2 scripts that will clean data for 2 of the VIA LINK dashboards. It's rough, but it's a start. The main goal is to create something runnable that gets to the format they are looking for from the input files they are starting with.

Also, I used black to format the code for now so I don't have to decide on style...but I'm not committed to the style just yet so I didn't check it in as a dependency

cleanup_all_covid_calls.py Outdated Show resolved Hide resolved
@nihonjinrxs
Copy link
Member

Fine with whatever formatter & standard you're comfortable with. I just follow what the linter/formatter says.

@nihonjinrxs
Copy link
Member

Overall, looks really good. I left a few comments, and will wait to do a full review and testing until you feel like it's ready. Thanks for all the work on this!

@mrcnc
Copy link
Member Author

mrcnc commented Apr 18, 2020

Great suggestion on using click! I hadn't used it before and it was a little easier than argparse imo.

@mrcnc mrcnc marked this pull request as ready for review April 18, 2020 18:42
utils.py Outdated Show resolved Hide resolved
@mrcnc
Copy link
Member Author

mrcnc commented Apr 25, 2020

@nihonjinrxs I think this is sufficient for the initial version. If you can follow the readme to get it setup and running locally, I believe they will be able to as well.

@nihonjinrxs
Copy link
Member

Sorry it's taken me so long to get to this. I'll look at this first thing in the morning Thursday 4/30.

@mrcnc
Copy link
Member Author

mrcnc commented May 1, 2020

No apology necessary! Any help is greatly appreciated any time 😄

Also I'm thinking we'll probably do some "real" QA with "real" data soon enough and more PRs will follow.

Copy link
Member

@nihonjinrxs nihonjinrxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, so far, this seems like it's doing the correct things, based on my testing with dummy data. I'd like to test it on real data and have VIA LINK folks validate that it produces what they expect.

I had a bit of trouble with the setup/installation since I'm using Anaconda, and apparently the uszipcode library is not resolvable via conda's installer. Not sure how to go about fixing that, but may be worth adding a note to the README regarding that. The workaround there looks like:

  • remove the uszipcode==0.2.4 line from the requirements.txt file (temporarily)
  • create the conda environment: conda create -f=./requirements.txt -n c4nola-211-data-wrangling
  • activate the environment: conda activate c4nola-211-data-wrangling
  • install uszipcode via pip: pip install uszipcode==0.2.4
  • restore requirements.txt, either manually, or via git checkout -- requirements.txt

The only thing that makes me a bit nervous is the lack of tests. Do you think you'd have time to add some tests? I can also add some tomorrow and this weekend if that helps.

For the Excel file, I've found that mocking the file with an in-memory Excel file works best for testing these kinds of things. An example I found that shows pretty clearly how to do that is here: https://coderbook.com/@marcus/how-to-mock-and-unit-test-with-pandas/. See the "Mocking Pandas in Unit Tests" > "Mock Files with In-Memory File Objects" section.

That said, I think we're close to ready to merge. And I'm also OK with merging this to master now, and adding tests in a separate PR if you would prefer that.

@mrcnc mrcnc mentioned this pull request May 2, 2020
@mrcnc
Copy link
Member Author

mrcnc commented May 2, 2020

I wasn't going to add tests for this initial PR but think it's a good idea to add them so I created #3.

@mrcnc
Copy link
Member Author

mrcnc commented May 2, 2020

With respect to Conda, I'm thinking it's best to use Python's built in virtual environment capabilities instead of introducing another tool. Then the end users only have to install 1 thing. I definitely see the value in using Conda for other projects but I don't think it would benefit this project....especially since it seems you have to still use pip to download the uszipcode dependency 😉

cleanup_all_covid_calls.py Outdated Show resolved Hide resolved
cleanup_all_covid_calls.py Outdated Show resolved Hide resolved
@mrcnc mrcnc merged commit 7dc3c60 into master May 9, 2020
@mrcnc mrcnc deleted the develop branch May 9, 2020 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants