Problem with nastring for non-numeric columns #31

Open
bkamins opened this issue Sep 10, 2018 · 4 comments

@bkamins

bkamins commented Sep 10, 2018

Because the default nastring is NA, there is the following problem:

  1. take a data structure that has, e.g., a String column with missing data in it;
  2. save it to disk using default parameters; missings get converted to NA on disk
  3. load it back and you have "NA" string where you earlier had missings
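
The three steps above can be reproduced with a minimal, library-independent sketch. It is written in Python and does not use the package's own API; `write_column` and `read_column` are hypothetical helpers that only mimic a writer using `NA` as the missing-value marker and a reader that returns strings as is.

```python
import io

NA_STRING = "NA"  # assumed default marker, as described in this issue

def write_column(values, out):
    """Write one value per line, replacing None (missing) with NA_STRING."""
    for v in values:
        out.write((NA_STRING if v is None else v) + "\n")

def read_column(src):
    """Read the lines back as plain strings; nothing is turned into missing."""
    return [line.rstrip("\n") for line in src]

# One missing value and one genuine "NA" string in the same column.
original = ["apple", None, "NA"]

buf = io.StringIO()
write_column(original, buf)
buf.seek(0)
restored = read_column(buf)

print(restored)  # ['apple', 'NA', 'NA']
```

After the round trip the missing value and the genuine "NA" string are indistinguishable, so the original column cannot be recovered from the file.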

The same problem occurs with e.g. Char data.

While NA is a sensible default for numeric columns, it is a bit confusing for non-numeric columns (and can actually lead to wrong results, as it is entirely possible to have an "NA" string in the data).

I think that it would be best to have an empty string for missings in non-numeric data.

@davidanthoff davidanthoff added this to the Backlog milestone Sep 10, 2018
@davidanthoff
Member

> I think that it would be best to have an empty string for missings in non-numeric data.

That would be for writing files, right? Do you think we need to also change something about reading?

@bkamins
Author

bkamins commented Sep 10, 2018

Frankly - for reading I would never create a missing when reading a String but leave as is and let the user decide what to do.
It is perfectly possible that the "NA" string means something and is present in a CSV file. E.g. in Polish this is a valid word.

A second best solution would be to treat empty string as missing (although I can imagine situations where "" might mean something, e.g. it is perfectly valid to have the following vector in Julia ["", missing], but at least it is not that problematic).
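
To make the `["", missing]` case concrete, here is a small sketch under the assumption that the empty string is used as the missing marker on both the write and the read side. It is plain Python, not the package's API, and `read_field` is a hypothetical helper.

```python
def read_field(text, na_string=""):
    """Map the missing marker to None; return every other field unchanged."""
    return None if text == na_string else text

# A genuine empty string next to a missing value.
column = ["", None]

# Writing with "" as the marker: both entries end up as empty fields on disk.
on_disk = ["" if v is None else v for v in column]

# Reading with "" as the marker: both entries come back as missing.
restored = [read_field(f) for f in on_disk]

print(on_disk)   # ['', '']
print(restored)  # [None, None]
```

The genuine empty string is lost here, but a real word such as "NA" is never turned into a missing value, which is why this variant is less problematic.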

However, I realize that all this is breaking so please decide what you think is best in the context of whole queryverse.

@davidanthoff
Member

Well, now is the time to break things! I haven't released the julia 1.0 version officially, and I'm willing to break things with that transition, and then hopefully not again for a long time (until we see julia 2.0).

I think my own instinct would be to only return NA in the following situation: a column is string, and uses quotation marks throughout, and then has some rows where NA appears without quotes. For the other cases, I agree with you: if NA appears inside quotes, I think there can be no question that it should just be read as "NA", and if a column generally doesn't use quotes, then it probably also is better to return it as the "NA" string...
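
The rule described above could be sketched roughly as follows. This is only an illustration in Python, not how TextParse.jl works: `is_quoted`, `unquote`, and `parse_string_column` are hypothetical names, and the input is assumed to be the raw, already split fields of a single column with their quotation marks still attached.

```python
def is_quoted(raw):
    """True if the raw field is wrapped in double quotes."""
    return len(raw) >= 2 and raw.startswith('"') and raw.endswith('"')

def unquote(raw):
    """Strip surrounding quotes and undo doubled-quote escaping."""
    return raw[1:-1].replace('""', '"') if is_quoted(raw) else raw

def parse_string_column(raw_fields, na_string="NA"):
    """Treat a bare NA as missing only if all other fields are quoted."""
    others = [f for f in raw_fields if f != na_string]
    # Assumption: a column consisting only of bare NA fields is not "quoted throughout".
    quoted_throughout = bool(others) and all(is_quoted(f) for f in others)
    result = []
    for f in raw_fields:
        if quoted_throughout and f == na_string:
            result.append(None)        # unquoted NA in a quoted column: missing
        else:
            result.append(unquote(f))  # quoted "NA", or NA in an unquoted column: a string
    return result

print(parse_string_column(['"a"', 'NA', '"NA"']))  # ['a', None, 'NA']
print(parse_string_column(['a', 'NA', 'b']))       # ['a', 'NA', 'b']
```

The first call shows the only case that yields a missing value; in the second the column does not use quotes, so NA is kept as the "NA" string.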

All of the reading logic is actually handled in TextParse.jl, so I'll have to figure out what the defaults there are...

@bkamins
Author

bkamins commented Sep 10, 2018

Good point - if all is quoted and only NA is unquoted, this is a clear way to distinguish it. This is what write.csv in R does (although then read.csv reads back both of them as missing 😄).
