Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement cudf backend in dask_utils.dataframe_factory #107

Open
sjperkins opened this issue Feb 16, 2023 · 0 comments
Open

Implement cudf backend in dask_utils.dataframe_factory #107

sjperkins opened this issue Feb 16, 2023 · 0 comments
Assignees

Comments

@sjperkins
Copy link
Member

In various places throughout shade_ms/dask_utils.py we assume an underlying pandas Dataframe when creating a dask Dataframe.

def _create_dataframe(arrays, start, end, columns):
index = None if start is None else np.arange(start, end)
return pd.DataFrame({k: a.ravel() for k, a in zip(columns, arrays)},
index=index)

meta = pd.DataFrame(data={k: np.empty((0,), dtype=a.dtype)
for k, a in zip(columns, args)},
columns=columns)
# Create the actual Dataframe
return dd.DataFrame(graph, name, meta=meta, divisions=divisions)

It is now possible to specify a Dataframe backend when creating dask Dataframes https://medium.com/rapids-ai/easy-cpu-gpu-arrays-and-dataframes-run-your-dask-code-where-youd-like-e349d92351d?s=03.

>>> with dask.config.set({"dataframe.backend": "cudf"}):
…    data = {"a": range(10), "b": range(10)}
…    ddf = dd.from_dict(data, npartitions=2)
…    
>>> ddf
<dask_cudf.DataFrame | 2 tasks | 2 npartitions>

shadems should respect this option. The following code might be one way of doing so in dask_utils.py:

from importlib import import_module

backend = dask.config.get("dataframe.backend")
dataframe = backend.Dataframe(...)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants