Add initial scripts to cleanup data #1
Conversation
Fine with whatever formatter & standard you're comfortable with. I just follow what the linter/formatter says.
Overall, looks really good. I left a few comments, and will wait to do a full review and testing until you feel like it's ready. Thanks for all the work on this!
Great suggestion on using click! I hadn't used it before and it was a little easier than argparse imo.
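For anyone unfamiliar with `click`, a minimal sketch of the style being discussed looks roughly like this (the command and option names here are hypothetical, not the actual interface of the scripts in this PR):

```python
# Hypothetical click CLI -- option names are illustrative, not this PR's interface.
import click


@click.command()
@click.option("--input-file", type=click.Path(exists=True), required=True, help="Raw export to clean.")
@click.option("--output-file", type=click.Path(), required=True, help="Where to write the cleaned file.")
def clean(input_file, output_file):
    """Clean a raw data export and write the result."""
    click.echo(f"Cleaning {input_file} -> {output_file}")


if __name__ == "__main__":
    clean()
```

Compared with argparse, the decorators replace the explicit parser setup and argument wiring.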
@nihonjinrxs I think this is sufficient for the initial version. If you can follow the readme to get it set up and running locally, I believe they will be able to as well.
Sorry it's taken me so long to get to this. I'll look at this first thing in the morning Thursday 4/30. |
No apology necessary! Any help is greatly appreciated any time 😄 Also I'm thinking we'll probably do some "real" QA with "real" data soon enough and more PRs will follow. |
Hey, so far, this seems like it's doing the correct things, based on my testing with dummy data. I'd like to test it on real data and have VIA LINK folks validate that it produces what they expect.
I had a bit of trouble with the setup/installation since I'm using Anaconda, and apparently the `uszipcode` library is not resolvable via `conda`'s installer. Not sure how to go about fixing that, but it may be worth adding a note to the README regarding that. The workaround there looks like:
- remove the `uszipcode==0.2.4` line from the `requirements.txt` file (temporarily)
- create the conda environment: `conda create -f=./requirements.txt -n c4nola-211-data-wrangling`
- activate the environment: `conda activate c4nola-211-data-wrangling`
- install `uszipcode` via `pip`: `pip install uszipcode==0.2.4`
- restore `requirements.txt`, either manually or via `git checkout -- requirements.txt`
The only thing that makes me a bit nervous is the lack of tests. Do you think you'd have time to add some tests? I can also add some tomorrow and this weekend if that helps.
For the Excel file, I've found that mocking the file with an in-memory Excel file works best for testing these kinds of things. An example I found that shows pretty clearly how to do that is here: https://coderbook.com/@marcus/how-to-mock-and-unit-test-with-pandas/. See the "Mocking Pandas in Unit Tests" > "Mock Files with In-Memory File Objects" section.
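For illustration, a rough sketch of that approach might look like the following (the `clean_frame` function and column names are made-up stand-ins for the real cleaning code, and it assumes a reasonably recent pandas with an Excel engine such as `openpyxl` installed):

```python
# Sketch of testing Excel-reading code against an in-memory file instead of a file on disk.
# clean_frame is hypothetical; substitute the real cleaning function from this PR.
import io

import pandas as pd


def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for the real cleaning logic: drop rows with no zip code.
    return df.dropna(subset=["ZipCode"])


def test_clean_frame_drops_rows_without_zip():
    # Build a tiny workbook entirely in memory (needs an Excel engine, e.g. openpyxl).
    raw = pd.DataFrame({"Caller": ["A", "B"], "ZipCode": ["70112", None]})
    buffer = io.BytesIO()
    raw.to_excel(buffer, index=False)
    buffer.seek(0)

    # The code under test can read the buffer exactly as it would read a file path.
    loaded = pd.read_excel(buffer)
    cleaned = clean_frame(loaded)

    assert list(cleaned["Caller"]) == ["A"]
```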
That said, I think we're close to ready to merge. And I'm also OK with merging this to master now, and adding tests in a separate PR if you would prefer that.
I wasn't going to add tests for this initial PR, but I think it's a good idea to add them, so I created #3.
With respect to Conda, I'm thinking it's best to use Python's built-in virtual environment capabilities instead of introducing another tool. Then end users only have to install one thing. I definitely see the value in using Conda for other projects, but I don't think it would benefit this project...especially since it seems you still have to use pip to download the `uszipcode` library anyway.
I've added two scripts that will clean data for two of the VIA LINK dashboards. It's rough, but it's a start. The main goal is to create something runnable that gets from the input files they are starting with to the format they are looking for.
Also, I used black to format the code for now so I don't have to decide on style, but I'm not committed to that style just yet, so I didn't check it in as a dependency.
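As a rough sketch of the overall shape such a cleaning script might take (column handling and file layout here are illustrative only, not the actual code in this PR; it assumes pandas plus an Excel engine like `openpyxl`):

```python
# Illustrative only: read a raw Excel export, apply basic cleanup, write a CSV.
import pandas as pd


def clean_export(input_path: str, output_path: str) -> None:
    """Read a raw Excel export, apply basic cleanup, and write a dashboard-ready CSV."""
    raw = pd.read_excel(input_path)
    cleaned = raw.copy()
    cleaned.columns = [str(c).strip() for c in cleaned.columns]  # normalize header whitespace
    cleaned = cleaned.dropna(how="all")  # drop completely empty rows
    cleaned.to_csv(output_path, index=False)
```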