NumPy interop to do list - to_numpy #14334

Closed
13 of 14 tasks
stinodego opened this issue Feb 7, 2024 · 9 comments
Labels
A-interop-numpy Area: interoperability with NumPy accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@stinodego
Member

stinodego commented Feb 7, 2024

We've made some improvements to our native to_numpy functionality recently. Making an issue to track what's still left:

Now that our to_numpy can handle things properly and zero copy where possible, I'm not sure the NumPy array interface protocol (#14214) is still useful.

@s-banach
Contributor

On this subject, could you take another look at #7283 and decide whether it's potentially useful or a definite no-go?

@TNieuwdorp
Contributor

@stinodego I was looking at the Struct type to-dos and wondering if you guys have seen that NumPy has a similar structure for it: https://numpy.org/doc/stable/user/basics.rec.html

Being able to cast Polars struct columns to NumPy structured arrays would be helpful in our current project :)

@stinodego
Member Author

stinodego commented Apr 4, 2024

> @stinodego I was looking at the Struct type to-dos and wondering if you guys have seen that NumPy has a similar structure for it: numpy.org/doc/stable/user/basics.rec.html
>
> Being able to cast Polars struct columns to NumPy structured arrays would be helpful in our current project :)

We are aware! You can already do this from a DataFrame by passing structured=True to to_numpy. So if you want to export a struct Series, you can do s.struct.unnest().to_numpy(structured=True).

@dpinol
Contributor

dpinol commented Apr 22, 2024

It would also be nice to be able to get NumPy arrays with dtype object for

pl.DataFrame({"A": [[1,2]], "B":1}, {"A": pl.Array(pl.Int64, 2), "B": pl.Int32}).to_numpy()

instead of raising this exception: ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 2 and the array at index 1 has size 1

in the same way that this works:

pl.DataFrame({"A": "as", "B":1}).to_numpy()
Out[21]: array([['as', 1]], dtype=object)
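To make the request concrete, here is a NumPy-only sketch of the kind of result being asked for. It does not use Polars at all; a_col and b_col are hypothetical stand-ins for the two columns, and the loop is only one possible way to build such an array:

```python
import numpy as np

# Stand-ins for the two columns: "A" is Array(Int64, 2), "B" is Int32.
a_col = np.array([[1, 2]], dtype=np.int64)  # shape (1, 2): one row, width 2
b_col = np.array([1], dtype=np.int32)       # shape (1,): one row

# The shapes differ, so the columns cannot be concatenated directly.
# A 2D object array can hold each row of "A" as a single cell instead.
n_rows = len(b_col)
out = np.empty((n_rows, 2), dtype=object)
for i in range(n_rows):
    out[i, 0] = a_col[i]  # the whole fixed-size array becomes one object
    out[i, 1] = b_col[i]

print(out.shape, out.dtype)
```

The price of this layout is that the array holds Python object references, so it cannot be produced zero-copy from the underlying column buffers.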

@stinodego
Member Author

> It would also be nice to be able to get NumPy arrays with dtype object for ...

Yes, that should be part of our design for converting nested data.

@stinodego
Member Author

Regarding the design for nested types, some of my thoughts:

For converting Series to NumPy...

  • Array types become an ND array with 2 or more dimensions. This is different from how PyArrow handles them (they create a 1D object array), but it allows us to do things zero-copy and it feels correct to respect the array dimensions here. However, it's slightly surprising that converting a Series to NumPy can result in an array with more than 1 dimension (and complicates things, see below).

  • If we accept that Series can have more than 1 dimension, I think Struct types should also become an ND array with 2 (or more) dimensions. A Struct Series should be unnested and then converted with DataFrame.to_numpy. So, for example, it may become a 2D object array if it contains an Int8 and a String field, or it may become a 2D float64 array if it contains an Int32 and a Float64 field.

  • List types do not pose a problem as they become 1D object arrays.

For converting DataFrames to NumPy...

  • Array and Struct types are problematic as they can have multiple dimensions and as such are not simply stackable. If the dimensions across columns do not match, we have to convert these to 1D object arrays before stacking them. However, if the dimensions do match, it could be appropriate to create a 3D+ array. For example, if I have a DataFrame with two Array columns with shape (2, 5) with a numeric data type, we could create a 3D array with shape (2, 2, 5). But maybe this is getting too complicated and we should restrict DataFrames to produce 2D ndarrays?

Basically, I'm trying to figure out if it's worth going down the rabbit hole of multidimensional arrays, or whether we should keep it simple and have Series be 1D and DataFrames be 2D. That possibly involves changing the behavior for Array types.
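The stacking question for DataFrames can be sketched with plain NumPy. The column arrays below are hypothetical stand-ins for Array columns, and the choice of axis=1 (giving a rows, columns, width layout) is an assumption, since the comment does not say which axis order a 3D result would use:

```python
import numpy as np

# Two stand-in Array columns, each with 2 rows of width 5.
col_a = np.arange(10).reshape(2, 5)
col_b = np.arange(10, 20).reshape(2, 5)

# Matching shapes stack cleanly into a 3D array.
stacked = np.stack([col_a, col_b], axis=1)
print(stacked.shape)  # (2, 2, 5)

# A column with a different width cannot be stacked the same way.
col_c = np.arange(6).reshape(2, 3)
try:
    np.stack([col_a, col_c], axis=1)
    mismatch_ok = True
except ValueError:
    mismatch_ok = False
print(mismatch_ok)  # False
```

So the output rank of a DataFrame conversion would depend on whether every nested column happens to share a shape, which is the complication weighed here.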

@stinodego
Member Author

stinodego commented May 22, 2024

Regarding nested types, I have decided that for now it will work as follows:

  • Series Arrays will be multidimensional
  • Series Structs will be two-dimensional
  • DataFrames will always be two-dimensional. Nested Array/Struct columns are converted to 1D object arrays.

Everything on the TODO list here has been done, with the exception of masked array support. I will create a separate issue for that one.

@mcrumiller
Contributor

@stinodego an approach that makes a lot of sense to me would be to maintain a 1-D array for all Series and use a multi-element dtype. Example:

import numpy as np

a = np.array([65535, 256], dtype=np.uint16)

# construct dtype with two u8 elements
dtype = np.dtype([
    ("first", np.uint8),
    ("second", np.uint8),
])

b = a.view(dtype)
# array([(255, 255), (0, 1)], dtype=[('first', 'u1'), ('second', 'u1')])

In this case, b is 1-D, with each element a tuple of length two.

This could also work for structs with mixed types:

dtype = np.dtype([
    ("name", "U7"),  # wide enough for "Ritchie"; "U1" would truncate each name to one character
    ("value", "uint8"),
])
a = np.array([("Ritchie", 100), ("Stijn", 100)], dtype=dtype)
a.shape
# (2,)

@stinodego
Member Author

If that is the behavior you want, you can use DataFrame.to_numpy(structured=True).

This type of array is not fit for representing Array types, though. It makes sense for Structs, but I don't want it to be the default, i.e. we still need a solution for when structured=False.
