[ENH] Enhancing polars support by introducing set_output #399

Open
wants to merge 111 commits into
base: main
Commits
539fd51
create test_polars.py file
julian-fong May 26, 2024
4712da9
updates
julian-fong May 28, 2024
fe5333b
initial commit
julian-fong May 30, 2024
5f578fd
added polars eager table to allowed mtypes in base regressor
julian-fong May 30, 2024
cf8a0d5
added draft version of testing fit and predict in polars dataframe
julian-fong May 30, 2024
9357486
fixed to use skpro check soft dependencies
julian-fong May 30, 2024
1a23ee0
updated tests
julian-fong Jun 2, 2024
89079f6
added test for predict_quantiles
julian-fong Jun 2, 2024
02f699f
fixed naming of pandas datafarmes
julian-fong Jun 2, 2024
c49ed0e
Merge branch 'sktime:main' into polars_support
julian-fong Jun 2, 2024
be084ef
added test for check_polars_table
julian-fong Jun 3, 2024
5c3697e
updates to pr
julian-fong Jun 7, 2024
32e700a
updated estimator to be a pytest fixture for one estimator
julian-fong Jun 10, 2024
0470817
Merge branch 'sktime:main' into polars_support
julian-fong Jun 11, 2024
497e1ef
bug fix
julian-fong Jun 11, 2024
8d3b541
update
julian-fong Jun 11, 2024
782e714
update
julian-fong Jun 11, 2024
39590f7
updates
julian-fong Jun 11, 2024
20643c5
updates
julian-fong Jun 11, 2024
05e96bf
updates
julian-fong Jun 11, 2024
ad697a3
updates
julian-fong Jun 11, 2024
00ac2bf
updates
julian-fong Jun 11, 2024
78d5d46
Merge branch 'sktime:main' into polars_support
julian-fong Jun 13, 2024
f464b7f
Merge branch 'sktime:main' into polars_support
julian-fong Jun 13, 2024
5eba103
Merge branch 'sktime:main' into polars_support
julian-fong Jun 14, 2024
227d623
updates to remove unnecessary skipifs and changed the estimator used …
julian-fong Jun 14, 2024
0b51616
Merge branch 'sktime:main' into polars_support
julian-fong Jun 21, 2024
cf227f4
write several functions
julian-fong Jun 20, 2024
895044e
added _config and a new method set_output
julian-fong Jun 20, 2024
ec36ded
added check_transform_config
julian-fong Jun 20, 2024
8b0beee
added check_transform_config
julian-fong Jun 20, 2024
109083d
added tests for set_output
julian-fong Jun 21, 2024
c8dc542
added tests
julian-fong Jun 21, 2024
2866a71
updated tests and set_output file
julian-fong Jun 21, 2024
2da52b9
fixed bug for tests with no polars dependency
julian-fong Jun 21, 2024
ca0a812
split tests into seperate test cases for polars and pandas
julian-fong Jun 21, 2024
a368f8d
Merge branch 'sktime:main' into polars_regression
julian-fong Jun 22, 2024
69deca1
added bool to check_transform_config
julian-fong Jun 25, 2024
605f65c
updated tests
julian-fong Jun 25, 2024
b9b72ce
updated tests
julian-fong Jun 25, 2024
f663379
Merge branch 'sktime:main' into polars_regression
julian-fong Jun 25, 2024
b1ddd50
Merge branch 'polars_regression' of https://github.com/julian-fong/sk…
julian-fong Jun 25, 2024
5fd28ac
fixed bool check in check_transform_config
julian-fong Jun 25, 2024
40d5728
inital commit for 2 new files, one temp, and other updates
julian-fong Jun 26, 2024
cde7ed0
Merge branch 'main' into pr/399
fkiraly Jun 26, 2024
1ae94bc
Merge branch 'polars_regression' of https://github.com/julian-fong/sk…
julian-fong Jun 26, 2024
e1dedfe
updates
julian-fong Jun 26, 2024
baa8d8d
changed None to 'default' and index_pandas to axis
julian-fong Jun 27, 2024
78e3041
updates
julian-fong Jul 1, 2024
8e94c93
created test cases for create_container
julian-fong Jul 1, 2024
6c17d12
created test cases for create_container
julian-fong Jul 1, 2024
2451415
updates
julian-fong Jul 2, 2024
25eccf9
updates to create_container and added output checks to _base
julian-fong Jul 2, 2024
587fa0c
disabling tests in create container as refactor needed
julian-fong Jul 2, 2024
b335de6
updated test container tests
julian-fong Jul 2, 2024
07b4a34
updated test container tests
julian-fong Jul 2, 2024
2d49cee
rounded values so they match for outputs
julian-fong Jul 2, 2024
6c34ac3
Merge branch 'sktime:main' into polars_regression
julian-fong Jul 16, 2024
bf05883
added tests for predict function and included output checks inside pr…
julian-fong Jul 17, 2024
5670bb8
Merge branch 'sktime:main' into polars_regression
julian-fong Jul 23, 2024
92a325f
Merge branch 'sktime:main' into polars_regression
julian-fong Jul 27, 2024
2cbb933
initial commit
julian-fong Aug 13, 2024
c94b612
intial commit
julian-fong Aug 14, 2024
17c3126
updated _convert
julian-fong Aug 14, 2024
fc46414
updated to from_pandas
julian-fong Aug 14, 2024
ae9ee7b
removed duplicative code
julian-fong Aug 14, 2024
98ea699
fixed naming convention for indices to use __index__{col_name}
julian-fong Aug 14, 2024
446e180
fixed name to only include original index name in returned dataframe
julian-fong Aug 14, 2024
c220f37
refactored current polars tests and fixed code
julian-fong Aug 15, 2024
e8cacd6
refactored lazy frames to use .collect_schema().names() to fix warning
julian-fong Aug 15, 2024
29d3f40
added conversion util for polars examples and removed commented code
julian-fong Aug 15, 2024
16e8f51
refactored check_polars_frame to ignore __index__ columns and edited …
julian-fong Aug 16, 2024
8f97233
bug fix
julian-fong Aug 16, 2024
3bf810c
updated n_features calculation
julian-fong Aug 16, 2024
2d4d2d1
added code to not include __index__ if df.index is trivial
julian-fong Aug 16, 2024
6498824
removed line
julian-fong Aug 16, 2024
2ae0cd2
updates
julian-fong Aug 18, 2024
0700f15
Merge branch 'polars_adapter_enhancements' into polars_regression
julian-fong Aug 18, 2024
d5cd1fd
Merge branch 'sktime:main' into polars_regression
julian-fong Aug 18, 2024
9ac9e86
Merge branch 'polars_regression' of https://github.com/julian-fong/sk…
julian-fong Aug 18, 2024
fa4afa7
removed create_container tests and updated set_output tests
julian-fong Aug 18, 2024
346997a
updates
julian-fong Aug 18, 2024
bb154ef
Merge branch 'sktime:main' into polars_regression
julian-fong Aug 18, 2024
caa8874
added function to convert multi-index columns to single and vice versa
julian-fong Aug 18, 2024
a427c08
updates
julian-fong Aug 18, 2024
d634a10
updates
julian-fong Aug 18, 2024
669200e
Merge branch 'polars_regression' of https://github.com/julian-fong/sk…
julian-fong Aug 18, 2024
a2ff2da
updates
julian-fong Aug 18, 2024
6f3ecf8
fixed bugs
julian-fong Aug 18, 2024
a76bbdb
fixed bugs
julian-fong Aug 18, 2024
99f898c
updated function to try and convert float strings to floats
julian-fong Aug 18, 2024
f8f7044
updates
julian-fong Aug 18, 2024
5f8849d
added temp check for predict_interval and predict_quantiles
julian-fong Aug 19, 2024
84808fd
added temp check for predict_var
julian-fong Aug 19, 2024
2c17cab
temporarily disabled tests in test_set_output
julian-fong Aug 19, 2024
b3d2243
temporarily disabled tests in test_set_output
julian-fong Aug 19, 2024
133434b
temporarily disabled tests in test_set_output
julian-fong Aug 19, 2024
70621fb
enabling tests
julian-fong Aug 19, 2024
ddbecdf
enabling tests
julian-fong Aug 19, 2024
d986683
enabling tests
julian-fong Aug 19, 2024
032b2f2
Merge branch 'sktime:main' into polars_regression
julian-fong Sep 7, 2024
a9cca9e
Merge branch 'main' into pr/399
fkiraly Sep 8, 2024
7870580
Merge branch 'sktime:main' into polars_regression
julian-fong Sep 10, 2024
4b28aaa
changed from transform to transform_output
julian-fong Sep 10, 2024
45ad110
bug fix
julian-fong Sep 10, 2024
f00a76b
updates
julian-fong Sep 10, 2024
59d747e
made utils private and decoupled _transform_output from _convert func…
julian-fong Sep 10, 2024
8a2e14d
added test support for predict_var
julian-fong Sep 10, 2024
bfb1ef6
removed support for predict_quantiles and predict_interval as they ar…
julian-fong Sep 10, 2024
572938a
commented out pred_var
julian-fong Sep 10, 2024
f985047
updates
julian-fong Sep 16, 2024
94 changes: 94 additions & 0 deletions skpro/datatypes/_adapter/polars.py
@@ -1,5 +1,8 @@
# copyright: sktime developers, BSD-3-Clause License (see LICENSE file)
"""Common utilities for polars based data containers."""
import numpy as np
import pandas as pd

from skpro.datatypes._common import _req
from skpro.datatypes._common import _ret as ret

@@ -93,6 +96,12 @@ def convert_polars_to_pandas_with_index(obj):
pd_df = pd_df.set_index(col, drop=True)
pd_df.index.name = col.split("__index__")[1]

    # check whether melted single-level polars columns need to be
    # converted back to a pandas MultiIndex; the length guard avoids
    # calling MultiIndex.from_arrays with no levels on a column-less frame
    cols = pd_df.columns
    if len(cols) > 0 and all(col.startswith("__") for col in cols):
        multi_index_columns = transform_single_column_to_multiindex_columns(pd_df)
        pd_df.columns = pd.MultiIndex.from_arrays(multi_index_columns)

    return pd_df


@@ -140,6 +149,11 @@ def convert_pandas_to_polars_with_index(
else:
obj = obj.rename(columns={"index": "__index__"})

    n_column_levels = check_n_level_of_dataframe(obj)

    if n_column_levels > 1:
        obj.columns = transform_pandas_multiindex_columns_to_single_column(obj)

pl_df = from_pandas(
data=obj,
schema_overrides=schema_overrides,
@@ -151,3 +165,83 @@ def convert_pandas_to_polars_with_index(
pl_df = pl_df.lazy()

return pl_df


def transform_pandas_multiindex_columns_to_single_column(X_input: pd.DataFrame):
    """Return a list of melted column names for a multi-index column DataFrame.

    Each column tuple is joined into one string of the form
    ``__level0__level1__...__``. Assumes a multi-index column pandas DataFrame.

    Parameters
    ----------
    X_input : pandas DataFrame
        pandas DataFrame containing a multi-index column (nlevels > 1)

    Returns
    -------
    df_cols : list of str
        the melted column names, one per column of ``X_input``
    """
    df_cols = []
    for col in X_input.columns:
        df_cols.append("__" + "__".join(str(x) for x in col if x != "") + "__")
    # in case "__index__" is in one of the tuples inside X_input
    df_cols = [col.replace("____", "__") for col in df_cols]

    return df_cols


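def transform_single_column_to_multiindex_columns(obj):
    """Return a list of per-level arrays recovered from melted column names.

    Inverse of ``transform_pandas_multiindex_columns_to_single_column``;
    the result can be passed to ``pd.MultiIndex.from_arrays``.
    """
    obj_columns = obj.columns

    df_cols = []
    for col in obj_columns:
        # "__A__Foo__One__" -> ["A", "Foo", "One"]
        items = col.split("__")
        items = [item for item in items if item]
        df_cols.append(items)

    # take the transpose of the list of lists, giving one list per level;
    # this assumes every column name splits into the same number of levels
    df_cols = np.array(df_cols).T.tolist()

    for multi_index_array in df_cols:
        # convert entries that parse as floats (e.g. quantile levels "0.05")
        # back to float; all other entries stay strings
        for i in range(len(multi_index_array)):
            try:
                multi_index_array[i] = float(multi_index_array[i])
            except ValueError:
                pass
    return df_cols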
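
Note on the naming scheme used by the two helpers above: the round trip can be sketched without pandas or numpy by working on plain tuples. The names `melt_column_names` and `unmelt_column_names` are hypothetical stand-ins, not functions from this PR, and the transpose uses `zip` instead of `np.array(...).T` purely to keep the sketch dependency-free (unlike numpy, `zip` silently truncates ragged input).

```python
def melt_column_names(columns):
    """Join each tuple of column levels into one '__'-delimited string."""
    melted = []
    for col in columns:
        name = "__" + "__".join(str(x) for x in col if x != "") + "__"
        # collapse doubled delimiters, e.g. when a level is "__index__"
        melted.append(name.replace("____", "__"))
    return melted


def unmelt_column_names(names):
    """Split melted names back into one list per level."""
    rows = [[part for part in name.split("__") if part] for name in names]
    # transpose: one list per level instead of one list per column
    levels = [list(level) for level in zip(*rows)]
    for level in levels:
        for i, item in enumerate(level):
            try:
                # entries that parse as floats come back as floats
                level[i] = float(item)
            except ValueError:
                pass
    return levels


columns = [("target", 0.9, "lower"), ("target", 0.9, "upper")]
melted = melt_column_names(columns)
levels = unmelt_column_names(melted)
```

Two limitations follow from this scheme and may be worth covering in tests: an integer level such as `1` comes back as the float `1.0`, and string levels that `float` accepts (for example `"nan"` or `"inf"`) are also converted. Level names that themselves contain `__` cannot be recovered unambiguously.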


def check_n_level_of_dataframe(X_input, axis=1):
    """Return the number of index or column levels of a pd/pl frame.

    Parameters
    ----------
    X_input : polars or pandas DataFrame
        A given polars or pandas DataFrame. Note that the polars portion of this
        code requires the soft dependencies polars and pyarrow to be installed
    axis : {0, 1}, default=1
        Specify the index or columns of a pandas DataFrame. If 0, uses the index.
        If 1, uses the columns. This parameter is ignored if X_input is not
        a pandas DataFrame.

    Returns
    -------
    levels : int or None
        The number of levels of the given DataFrame; polars frames always
        have 1 level. None if X_input is neither a pandas nor a polars frame.
    """
    if axis not in [0, 1]:
        raise ValueError(f"axis must be in [0,1], found {axis}.")

    if isinstance(X_input, pd.DataFrame):
        return X_input.index.nlevels if axis == 0 else X_input.columns.nlevels

    # import polars only when needed, so pandas input works without it
    import polars as pl

    if isinstance(X_input, (pl.DataFrame, pl.LazyFrame)):
        return 1

    return None
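
A possible alternative for the level check, offered as a sketch rather than a proposed change: duck-typing on `nlevels` avoids any dependency on polars being importable. `count_levels` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins so the example runs without pandas or polars. Unlike `check_n_level_of_dataframe`, this version returns 1 rather than `None` for unsupported inputs, which is an assumption of the sketch.

```python
from types import SimpleNamespace


def count_levels(frame, axis=1):
    """Return the number of levels along axis, defaulting to 1.

    Anything without an ``nlevels`` attribute on its index/columns
    (e.g. a plain list of column names) is treated as single-level.
    """
    if axis not in (0, 1):
        raise ValueError(f"axis must be 0 or 1, found {axis}.")
    labels = getattr(frame, "index" if axis == 0 else "columns", None)
    return getattr(labels, "nlevels", 1)


# stand-in for a pandas frame with 3 column levels and a flat index
pandas_like = SimpleNamespace(
    index=SimpleNamespace(nlevels=1),
    columns=SimpleNamespace(nlevels=3),
)
# stand-in for a polars frame: columns is a plain list, there is no index
polars_like = SimpleNamespace(columns=["column1", "column2"])

n_multi = count_levels(pandas_like)
n_index = count_levels(pandas_like, axis=0)
n_polars = count_levels(polars_like)
```

One caveat: reading `.columns` on a polars `LazyFrame` can emit a performance warning in recent polars versions, which is why an earlier commit in this PR switched to `.collect_schema().names()`; an explicit `isinstance` check avoids touching that attribute.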
165 changes: 165 additions & 0 deletions skpro/datatypes/tests/test_polars.py
@@ -12,6 +12,14 @@
import polars as pl

from skpro.datatypes import check_is_mtype, convert
from skpro.datatypes._adapter.polars import (
    check_n_level_of_dataframe,
    transform_pandas_multiindex_columns_to_single_column,
)
from skpro.datatypes._table._convert import (
    convert_pandas_to_polars_eager,
    convert_polars_to_pandas,
)

TEST_ALPHAS = [0.05, 0.1, 0.25]

@@ -46,6 +54,44 @@ def _pd_to_pl(df):
    return convert(df, from_type="pd_DataFrame_Table", to_type="polars_eager_table")


@pytest.fixture
def load_pandas_multi_index_column_fixture():
    arrays = [
        ["A", "A", "A", "A"],
        ["Foo", "Foo", "Bar", "Bar"],
        ["One", "Two", "One", "Two"],
    ]
    columns = pd.MultiIndex.from_arrays(arrays)

    # Create the DataFrame
    data = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    pd_multi_column_fixture = pd.DataFrame(data, columns=columns)

    return pd_multi_column_fixture


@pytest.fixture
def load_pandas_simple_column_fixture():
    data = {"test_target": [10, 20, 30]}

    # Create the DataFrame with a custom index
    pd_simple_column_fixture = pd.DataFrame(
        data, index=pd.Index(["row1", "row2", "row3"])
    )

    return pd_simple_column_fixture


@pytest.fixture
def load_polars_simple_fixture():
    data = {"column1": [1, 2, 3], "column2": [4, 5, 6], "column3": [7, 8, 9]}

    # Create the DataFrame
    pl_simple_fixture = pl.DataFrame(data)

    return pl_simple_fixture


@pytest.fixture
def polars_load_diabetes_polars(polars_load_diabetes_pandas):
    X_train, X_test, y_train = polars_load_diabetes_pandas
@@ -167,3 +213,122 @@ def test_polars_eager_regressor_in_predict_quantiles(
    assert y_pred_quantile.columns[0] == ("target", 0.05)
    assert y_pred_quantile.columns[1] == ("target", 0.1)
    assert y_pred_quantile.columns[2] == ("target", 0.25)


@pytest.mark.skipif(
    not run_test_module_changed("skpro.datatypes")
    or not _check_soft_dependencies(["polars", "pyarrow"], severity="none"),
    reason="skip test if polars/pyarrow is not installed in environment",
)
def test_check_column_level_of_dataframe_pandas(
    load_pandas_multi_index_column_fixture,
    load_pandas_simple_column_fixture,
):
    pd_multi_column_fixture = load_pandas_multi_index_column_fixture
    pd_simple_column_fixture = load_pandas_simple_column_fixture

    n_levels_multi_pd = check_n_level_of_dataframe(pd_multi_column_fixture)
    n_levels_simple_pd = check_n_level_of_dataframe(pd_simple_column_fixture)
    n_levels_simple_pd_index = check_n_level_of_dataframe(
        pd_simple_column_fixture, axis=0
    )

    assert n_levels_multi_pd == 3
    assert n_levels_simple_pd == 1
    assert n_levels_simple_pd_index == 1


@pytest.mark.skipif(
    not _check_soft_dependencies(["polars", "pyarrow"], severity="none"),
    reason="skip test if polars/pyarrow is not installed in environment",
)
def test_check_column_level_of_dataframe_polars(
    load_polars_simple_fixture,
):
    pl_simple_column_fixture = load_polars_simple_fixture
    n_levels_simple_pl = check_n_level_of_dataframe(pl_simple_column_fixture)
    assert n_levels_simple_pl == 1


@pytest.mark.skipif(
    not run_test_module_changed("skpro.datatypes")
    or not _check_soft_dependencies(["polars", "pyarrow"], severity="none"),
    reason="skip test if polars/pyarrow is not installed in environment",
)
def test_convert_multiindex_columns_to_single_column(
    load_pandas_multi_index_column_fixture,
):
    pd_multi_column_fixture1 = load_pandas_multi_index_column_fixture
    df_list1 = transform_pandas_multiindex_columns_to_single_column(
        pd_multi_column_fixture1
    )
    assert df_list1 == [
        "__A__Foo__One__",
        "__A__Foo__Two__",
        "__A__Bar__One__",
        "__A__Bar__Two__",
    ]

    pd_multi_column_fixture2 = load_pandas_multi_index_column_fixture
    df_list2 = transform_pandas_multiindex_columns_to_single_column(
        pd_multi_column_fixture2
    )
    assert df_list2 == [
        "__A__Foo__One__",
        "__A__Foo__Two__",
        "__A__Bar__One__",
        "__A__Bar__Two__",
    ]


@pytest.mark.skipif(
    not run_test_module_changed("skpro.datatypes")
    or not _check_soft_dependencies(["polars", "pyarrow"], severity="none"),
    reason="skip test if polars/pyarrow is not installed in environment",
)
def test_convert_single_column_to_multiindex_column(
    estimator,
    polars_load_diabetes_pandas,
):
    X_train, X_test, y_train = polars_load_diabetes_pandas
    estimator.fit(X_train, y_train)

    # test for predict
    y_pred = estimator.predict(X_test)
    assert isinstance(y_pred, pd.DataFrame)

    y_pred_pl = convert_pandas_to_polars_eager(y_pred)
    y_pred_pd = convert_polars_to_pandas(y_pred_pl)
    assert all(y_pred_pd.columns == y_pred.columns)
    assert all(y_pred_pd.index == y_pred.index)
    # reduce the element-wise comparison with .all(); the builtin all()
    # iterates over rows and is ambiguous for frames with several columns
    assert (y_pred_pd.values == y_pred.values).all()

    # test for interval
    y_pred_interval = estimator.predict_interval(X_test)
    assert isinstance(y_pred_interval, pd.DataFrame)

    y_pred_interval_pl = convert_pandas_to_polars_eager(y_pred_interval)
    y_pred_interval_pd = convert_polars_to_pandas(y_pred_interval_pl)
    assert all(y_pred_interval_pd.columns == y_pred_interval.columns)
    assert all(y_pred_interval_pd.index == y_pred_interval.index)
    assert y_pred_interval_pd.equals(y_pred_interval)

    # test for quantile
    y_pred_quantile = estimator.predict_quantiles(X_test, alpha=TEST_ALPHAS)
    assert isinstance(y_pred_quantile, pd.DataFrame)

    y_pred_quantile_pl = convert_pandas_to_polars_eager(y_pred_quantile)
    y_pred_quantile_pd = convert_polars_to_pandas(y_pred_quantile_pl)
    assert all(y_pred_quantile_pd.columns == y_pred_quantile.columns)
    assert all(y_pred_quantile_pd.index == y_pred_quantile.index)
    assert y_pred_quantile_pd.equals(y_pred_quantile)

    # test for var (the original called predict_interval here, which
    # only repeated the interval check above)
    y_pred_var = estimator.predict_var(X_test)
    assert isinstance(y_pred_var, pd.DataFrame)

    y_pred_var_pl = convert_pandas_to_polars_eager(y_pred_var)
    y_pred_var_pd = convert_polars_to_pandas(y_pred_var_pl)
    assert all(y_pred_var_pd.columns == y_pred_var.columns)
    assert all(y_pred_var_pd.index == y_pred_var.index)
    assert y_pred_var_pd.equals(y_pred_var)