Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make the default string dtype not have an NA object #82

Merged
merged 3 commits into from
Aug 18, 2023

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Aug 3, 2023

@rgommers gave me some feedback on a draft of the stringdtype NEP that it would increase chances of success and avoid contentious discussions if we avoid defining NA semantics as much as possible in the NEP. I had never actually implemented a real NA object so taking that advice, I decided to remove stringdtype.NA.

I'd still like to offer some level of NA support, but I'd like to make it opt-in and not the default, and more carefully separate NA checking from the rest of the NumPy API.

As a first pass, this PR removes stringdtype.NA and makes the default StringDType instance not have an na_object member and uses empty string as the default fill value for np.empty.

Because we need to allow na_object=None having a meaningful behavior distinct from the default behavior, I needed to rearrange the initializer for StringDType so that pickling works, since pickling doesn't support reconstructing an object using keyword arguments.

Along the way I noticed a couple of bugs that this fixes:

  • Handle possible NULL entries in add and multiply ufuncs. I implemented NA-preserving semantics if an NA is detected.
  • Fixes the default coerce setting for dtypes created by doing e.g. astype(StringDType).
  • I realized the string to unicode cast wasn't using the repr of the NA object in the cast and getting that repr needs the GIL, so I had to switch the string to unicode cast to use the get loop style ufunc setup so I could populate an auxdata holding the repr.

The rest of the changes are refactorings from adapting to the above changes. Unfortunately clang-format makes the diff harder to read than it needs to be...

Next week I'm going to do a second pass so that isnan is only True if the na_object is literally np.nan and make sure that np.nan works as a missing data sentinel in the tests, and add a helper to check for null values specifically in string arrays.

@ngoldbaum
Copy link
Member Author

I think this is getting closer now. I've fleshed out the missing data handling semantics a bit. I now explicitly handle NaN-like sentinels (pandas.NA, NaN), string sentinels, and others (e.g. None). I think with these three categories we cover all downstream usages. I also expanded the tests to cover a wider gamut of possible missing data sentinels people are likely to want to use or that I've found real-world usages of.

If there are any missing values in an array, the ufunc loops and cast loops all now have a lot more branching than they used to. It would likely be faster to refactor all the loops that branch on the kind of missing data to use the get_loop style ufunc or cast setup.

@ngoldbaum ngoldbaum merged commit ebbedec into numpy:main Aug 18, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant