-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(python): Handle generalized ufunc edge cases better #16086
fix(python): Handle generalized ufunc edge cases better #16086
Conversation
thanks for doing this! numba+Polars currently really feels like playing with fire, so I'm happy to see this @deanm0000 I think you've looked at related code recently? fancy taking a look if this interests you? |
@MarcoGorelli I put a couple notes, nothing earth shattering from me. |
OK I addressed the two comments. |
result[i] = mean + i | ||
|
||
|
||
def test_generalized_ufunc() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not really testing Expr.array_ufunc if you use map_batches. I was thinking tests like df.select(gufunc_mean(pl.col('s')).alias('result')
That said, I don't think you need to put those tests in this PR as it's only tangentially related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
@MarcoGorelli I'd like to leave this one to you. I don't understand enough of numba to have an opinion here. |
Just to expand: this isn't really about Numba, Numba is merely a convenient way to write generalized ufunc for unit testing purposes. Standard ufuncs in NumPy operate on an value by value basis. So if you have missing data on Polars side, that's fine (assuming it's valid bit pattern of data, anyway): you do Generalized ufunc (https://numpy.org/doc/stable/reference/c-api/generalized-ufuncs.html) operate on the whole array though. And now missing data becomes a problem. E.g. if you calculate a mean, you don't want to include the missing data because those are garbage values (see the example in original issue). So, you shouldn't pass anything with missing data to a generalized ufunc. In addition, generalized ufuncs can have output sizes that aren't the same as the input size (normal ufuncs operate value by value, so by definition the output is same size as input). That means the trick (Technically, given a parser for the signature mini-language, you could figure out the size in many more cases, but this at least makes the code work, i.e. takes us from "wrong results or potential memory corruption" to "correct but not maximally efficient", so it's a good first step). |
As far as I can tell, this fixes two separate issues:
The first one seems quite high-priority, whereas the second one just errors when it shouldn't - good to fix, but definitely lower-prio than silently giving wrong results Is it possible to split this into two separate PRs, one for each issue which it fixes? |
The second one can lead to memory corruption, I think, which seems pretty high priority... (Polars thinks output is N length, ufunc writes out 2N values). I can split up this PR next week, if you think it's helpful. |
does it always throw a loud error like in #14811 though? If so, I'll maintain that it's lower priority than the missing values one. It's still important, but there's a lot of open issues and PRs, gotta prioritise somehow The missing values one looks much easier to review, too, so that one can be merged a lot quicker |
Going to split this into two new PRs, so closing. |
Fixes #14811