Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly interpret imported dates #29

Open
cecilia-donnelly opened this issue Aug 11, 2015 · 1 comment
Open

Correctly interpret imported dates #29

cecilia-donnelly opened this issue Aug 11, 2015 · 1 comment
Assignees

Comments

@cecilia-donnelly
Copy link

In cc7fdb2, I added a stopgap measure for interpreting dates in files being imported. As of that commit, the code checks known date columns for any instance of / and, if one is found, assumes that the file has dates in standard Excel format of mm/dd/yyyy. If one is not found, it assumes that the dates in the file are already correctly formatted as YYYY-MM-DD. Naturally, those are not the only two options!

The correct way to do it, as I discussed with @kfogel, is to check each date to see if it's obvious what the format is. If it is, then record that. If, after checking the dates, all of them are in the same format, then change them all to YYYY-MM-DD (or leave them as they are, if they already are in that format). If, on the other hand, the dates are inconsistently formatted, then reformat each one according to our best guess. If some are still ambiguous (e.g. 01/01/2015 could be mm/dd/yyyy or dd/mm/yyyy), then assume mm/dd/yyyy, since that is the Excel standard.

@kfogel
Copy link
Member

kfogel commented Aug 11, 2015

I think that's a shorthand version of the real algorithm? Let's spell it out:

For incoming Excel files, we need to unambiguously determine the actual date represented by each date string in a date column. (Internally, we'll store dates in YYYY-MM-DD format, but that's always been the case. IOW this issue isn't about reformatting, it's just about interpreting -- we're not changing how the dates are represented in the Excel file, we're just deciding how we determine what those date strings mean.)

Loop over all the dates in the Excel file. For each date, if it is unambiguous, then obviously interpret it as that date and make a note that one more date of that format has been seen (whatever it was, YYYY/MM/DD or M/D/YYYY or D/M/YY or whatever, etc). If the date is ambiguous, then hold on to it for now -- we'll fix it later.

Once the initial loop is done, we have the following things: a bunch of dates whose meaning is known exactly (the unambiguous ones), a bunch of dates whose meaning is not yet known (the ambiguous ones), and a list of date formats we have seen (the ones signaled from the unambiguous dates) with a count of how often each format was seen.

  • If that list of seen formats associated with their counts is zero elements long, then assume M/D/YYYY, because that's the excel standard, and handle M/D/YY in the obvious way, of course. (By the way I believe Excel does not pad with leading zeros for M and D when they are a single digit, FWIW.)
  • If that list of seen formats is exactly one element long, then we know what format to use for interpreting the ambiguous cases, so use it :-).
  • If the list of seen formats is two or more elements long, then if any of those formats are M/D/YYYY (in which I also include M/D/YY, as per above), use that, because it seems to be the Excel default. If none of them are, then just pick any seen format that is not less popular than any of other seen formats, and assume that format for the ambiguous dates. This last case is highly unlikely ever to happen, but if it does, we can handle it.

@cecilia-donnelly cecilia-donnelly self-assigned this Aug 24, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants