
500 error on upload w/ missing col/row values #45

Closed
rebeccabilbro opened this issue Jan 18, 2016 · 8 comments


@rebeccabilbro
Member

I'm getting an error when I attempt to upload datasets that have missing values in some of the columns/rows. I noticed this because a lot of government datasets use the first few rows of a table for metadata.

@rebeccabilbro
Member Author

good (terrible) example: https://www.ssa.gov/foia/html/FY08CSV.csv

@rebeccabilbro
Member Author

another great one: http://www.planecrashinfo.com/1920/1920.htm

@bbengfort
Member

Great examples - I definitely noticed all the error messages that came up as you were experimenting! The underlying error seems to be a Unicode decoding error, which is potentially more serious. It raises the question of whether these files are actually Unicode encoded or use some other scheme (which would make things much more difficult).
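For reference, this is a minimal sketch (not from this project's code) of how that kind of decoding error surfaces in Python when a file's bytes aren't valid UTF-8:

```python
# Hypothetical illustration: bytes from a Latin-1 encoded file are
# not valid UTF-8, so decoding them as UTF-8 raises an error.
data = "café".encode("latin-1")  # b'caf\xe9'

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

# Decoding with the correct codec works fine.
assert data.decode("latin-1") == "café"
```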

@rebeccabilbro
Member Author

Hmm, sounds like it's potentially related to my #43 then?

@bbengfort
Member

Potentially, though encoding detection is an annoying task that's tough to get right. You could use the file command in your terminal to see if your computer can identify the encoding. It's definitely something I'll take a look at.
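As an alternative to the file command, a brute-force approach in Python is to try a list of candidate codecs in order. This is just a sketch; the helper name and the candidate list are made up, not anything from this project:

```python
def sniff_encoding(raw: bytes, candidates=("utf-8", "utf-16", "latin-1")):
    """Return the first candidate codec that decodes raw without error.

    Note: latin-1 can decode any byte sequence, so it acts as a
    catch-all fallback and should come last in the list.
    """
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except (UnicodeDecodeError, UnicodeError):
            continue
    return None

print(sniff_encoding(b"plain ascii"))           # utf-8 (ascii is a subset)
print(sniff_encoding("café".encode("utf-16")))  # utf-16 (BOM breaks utf-8)
```

The obvious caveat is that "decodes without error" is not the same as "decoded correctly" - a wrong codec can silently produce garbage, which is why real detectors use statistical heuristics instead.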

@rebeccabilbro
Member Author

Found some more test data that might help with this issue, see: https://github.com/okfn/messytables/tree/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror

@lauralorenz

Ok, so specifically for the files @rebeccabilbro first linked, the unicode decode errors have been solved, presumably by the Python 3.x upgrade: the default encoding is now utf-8 instead of ascii, so these utf-8 (or utf-8-subset) encoded files no longer cause unicode errors. Today I tested the CSVs from https://www.ssa.gov/foia/html/FY08CSV.csv and https://catalog.data.gov/dataset/veterans-health-administration-2008-hospital-report-card-patient-satisfaction, and the HTML from http://www.planecrashinfo.com/1920/1920.htm, with both file storage and the S3 backend, and none of them caused an error.

How we want to deal with files that are not utf-8 encoded is a much broader question. For example, a utf-16le encoded file (e.g. https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/horror/utf-16le_encoded.csv) won't work right now, since utf-16le is not a subset of utf-8, but I'm not sure yet whether we really care. IMHO, it was unreasonable to expect ascii encoding, but it is not unreasonable to expect utf-8 (or a utf-8 subset).
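To make the subset point concrete, here is a small sketch (illustrative only, not project code) showing why a utf-16le file fails under a utf-8 default while plain ascii content does not:

```python
text = "id,name\n1,café\n"
raw = text.encode("utf-16-le")

# UTF-16LE bytes are not valid UTF-8 (0xe9, the low byte of 'é',
# is a UTF-8 lead byte with no valid continuation), so a UTF-8
# reader raises an error.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 reader fails:", exc.reason)

# ASCII bytes, by contrast, are a strict subset of UTF-8.
assert "id,name\n".encode("ascii").decode("utf-8") == "id,name\n"
```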

So, given all of that, I am going to close this issue with respect to the scope of the initial bug. However, I will make a note on the roadmap issue to consider more generally how we want to deal with non-utf-8-subset encodings in this project in the future. cc @rebeccabilbro @ojedatony1616 @bbengfort @looselycoupled
