make the default string dtype not have an NA object #82

ngoldbaum · 2023-08-03T21:36:11Z

@rgommers gave me some feedback on a draft of the stringdtype NEP that it would increase chances of success and avoid contentious discussions if we avoid defining NA semantics as much as possible in the NEP. I had never actually implemented a real NA object so taking that advice, I decided to remove stringdtype.NA.

I'd still like to offer some level of NA support, but I'd like to make it opt-in and not the default, and more carefully separate NA checking from the rest of the NumPy API.

As a first pass, this PR removes stringdtype.NA and makes the default StringDType instance not have an na_object member and uses empty string as the default fill value for np.empty.

Because we need to allow na_object=None having a meaningful behavior distinct from the default behavior, I needed to rearrange the initializer for StringDType so that pickling works, since pickling doesn't support reconstructing an object using keyword arguments.

Along the way I noticed a couple of bugs that this fixes:

Handle possible NULL entries in add and multiply ufuncs. I implemented NA-preserving semantics if an NA is detected.
Fixes the default coerce setting for dtypes created by doing e.g. astype(StringDType).
I realized the string to unicode cast wasn't using the repr of the NA object in the cast and getting that repr needs the GIL, so I had to switch the string to unicode cast to use the get loop style ufunc setup so I could populate an auxdata holding the repr.

The rest of the changes are refactorings from adapting to the above changes. Unfortunately clang-format makes the diff harder to read than it needs to be...

Next week I'm going to do a second pass so that isnan is only True if the na_object is literally np.nan and make sure that np.nan works as a missing data sentinel in the tests, and add a helper to check for null values specifically in string arrays.

ngoldbaum · 2023-08-11T20:34:28Z

I think this is getting closer now. I've fleshed out the missing data handling semantics a bit. I now explicitly handle NaN-like sentinels (pandas.NA, NaN), string sentinels, and others (e.g. None). I think with these three categories we cover all downstream usages. I also expanded the tests to cover a wider gamut of possible missing data sentinels people are likely to want to use or that I've found real-world usages of.

If there are any missing values in an array, the ufunc loops and cast loops all now have a lot more branching than they used to. It would likely be faster to refactor all the loops that branch on the kind of missing data to use the get_loop style ufunc or cast setup.

ngoldbaum added 3 commits August 3, 2023 15:28

make the default string dtype not have an NA object

e437ff8

make StringDType_richcompare do the right thing when na_object=nan

a7724a1

flesh out NULL handling

ddce4d3

ngoldbaum merged commit ebbedec into numpy:main Aug 18, 2023
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make the default string dtype not have an NA object #82

make the default string dtype not have an NA object #82

ngoldbaum commented Aug 3, 2023 •

edited

Loading

ngoldbaum commented Aug 11, 2023

make the default string dtype not have an NA object #82

make the default string dtype not have an NA object #82

Conversation

ngoldbaum commented Aug 3, 2023 • edited Loading

ngoldbaum commented Aug 11, 2023

ngoldbaum commented Aug 3, 2023 •

edited

Loading