Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Enhancing polars support by introducing set_output #399

Open
wants to merge 111 commits into
base: main
Choose a base branch
from

Conversation

julian-fong
Copy link
Contributor

@julian-fong julian-fong commented Jun 21, 2024

Introduces files set_output inside skpro.utils and new tests file test_set_output inside the tests folder. As part of sktime/enhancement-proposals#34 and the notes written in my mentorship programme .

In this pr:

  • I have introduced basic functions to convert multi column pandas dataframes into single column pandas dataframes and vice versa, these are stored under the polars adapter file skpro.datatypes._adapter.polars. In the polars adapter file, convert_polars_to_pandas_with_index now checks to see if there was melted multi-index columns (these columns will be denoated via "foo__bar" convention) and convert_pandas_to_polars_with_index now checks to see if there are multi-index columns inside the pandas DataFrame (like in predict_interval and predict_quantile. If so, then we will melt down these multi-index columns into single-level columns before converting into a polars dataframe.
  • created skpro.utils.set_output.check_output_config to ensure that transformations set by the users are aligned with available skpro output data containers.
  • created skpro.utils.set_output.transform_output in order to convert the resulting DataFrame into user specified or default data containers. transform_output acts like a wrapper around the ordinary convert function, but instead it also checks whether to convert based upon the user specified mtype or to leverage the default original mtype seen in fit
  • I have introduced _config inside BaseProbaRegressor and a new function set_output which mirrors sklearn's set_output for familiarity.

Relates to #342 #449

}


def check_output_config(estimator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this function and the next one private, to avoid user calls and allowing us to make changes quickly



def transform_output(
obj, valid, from_type, default_to_type, default_scitype, output_config, store
Copy link
Collaborator

@fkiraly fkiraly Sep 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function has many parameters - this is a code smell.

I would suggest to remove the convert-call, and only use this function to get the from/to type - that way, we also do not need to replace convert by transform_output in the regressor

@@ -38,6 +39,10 @@ class BaseProbaRegressor(BaseEstimator):
"C_inner_mtype": "pd_DataFrame_Table",
}

_config = {
"transform": "default"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe transform_output?

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

I would request two changes:

  • make the new utilities private
  • decouple transform_output from the output, and call it only if the setting is not the default

@fkiraly fkiraly added enhancement module:regression probabilistic regression module labels Sep 8, 2024
@fkiraly
Copy link
Collaborator

fkiraly commented Sep 13, 2024

let me know in case you need a review

@julian-fong
Copy link
Contributor Author

julian-fong commented Sep 13, 2024

Another review would be quite helpful.. I think I'll try and re-write the description to be more clear what the goal of the pull request is and whats being added in the PR.

I'll also note that the new code that supports mi-columns and si-columns/indxes for polars exists inside adapters.polars and not inside the table conversion utilities. Instead, the polars conversion utilities for table import the conversion methods from adapters.polars and uses them to convert the pandas dataframes and polars dataframes.

So if needed, I can separate the polars conversion methods inside adapters.polars into different functions:

  • One for si column pandas to polars (for the table scitype)
  • One for si column polars to pandas (for the table scitype)
  • A new conversion method for mi column pandas to polars (for any new mi abstract data types that we decide upon)
  • A new conversion methods for mi column pandas to polars i.e reverse mapping (for any new mi abstract data types that we decide upon)
    All of these methods will still be inside adapters.polars unless you have another recommendation on the placement of these methods. Either way, this should allow the scitype conversion methods to be firmly independent from each other.

I'll also note that although predict_var is only single column, there are still some failures in the CI, because one of the notebook-examples (the one with plot_crossplot_std)

@julian-fong
Copy link
Contributor Author

I remember asking you before why in the base code the output conversion only existed inside the predict function and nothing else. Could you kindly remind me why?

And given that reasoning, do you think it makes sense to have the other predict_* functions have this line of code also? or should it always be returned in the same data container as the input. If this is indeed the case, we can always allow data container type changes in these methods via set_output. But most likely in baby steps first since this is quite a challenging design question to solve !

@fkiraly
Copy link
Collaborator

fkiraly commented Sep 14, 2024

From our discussion, summarizing what I think the key points for this PR, to avoid it moving around with minor changes:

  • a clear specification and use cases written down for set_output and the new config field, what it should do. Scenarios that we need to cover are pandas and polars inputs both.
  • this should, in particular, include a specification for what comes out of the proba methods if a polars output is produced
  • I would then add a section on how these outputs are getting mapped onto mtypes, and the datatypes module, a precise suggestion.

I am happy to write these, if you feel you are stuck - please let me know. Or, if they exist somewhere, links would be appreciated.

I'll also note that the new code that supports mi-columns and si-columns/indxes for polars exists inside adapters.polars and not inside the table conversion utilities.

I see - I would though not suggest to split here too much for now, instead we could just add the

I'll also note that although predict_var is only single column, there are still some failures in the CI, because one of the notebook-examples (the one with plot_crossplot_std)

This seems a bit suspicious, why would these changes impact the plotting function at all? I was under the impression that all existing code should run unchanged. Do you have an explanation for the failure?

@fkiraly
Copy link
Collaborator

fkiraly commented Sep 14, 2024

I remember asking you before why in the base code the output conversion only existed inside the predict function and nothing else. Could you kindly remind me why?

That is because the different contracts of the underscore-methods, which are hopfully correctly stated in the extension template:

  • _predict gets input as X_inner_mtype and y_inner_mtype. It should return an output as y_inner_mtype type.
  • _predict_quantiles ets gets input as X_inner_mtype and y_inner_mtype. It needs to return the output in the specific pd.DataFrame based format, there is currently only a single choice for this.

Due to this, in the case of _predict, we need to convert the output too, if the returned y_pred should be of the same type as y seen in fit; but we do not need to do so for _predict_quantiles.

@fkiraly
Copy link
Collaborator

fkiraly commented Sep 14, 2024

And given that reasoning, do you think it makes sense to have the other predict_* functions have this line of code also?

Yes, but weren't you already adding such a conversion in this PR? Which I think makes sense.

@julian-fong
Copy link
Contributor Author

julian-fong commented Sep 16, 2024

  • _predict_quantiles ets gets input as X_inner_mtype and y_inner_mtype. It needs to return the output in the specific pd.DataFrame based format, there is currently only a single choice for this.

Ah I see, so do you think in your opinion for predict_interval, predict_quantiles, and predict_var, that (for now) it should not trigger any convert function unless specified by the user via set_output? i.e

Scenario 1a: User passes in a pandas DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the output is returned as pd.DataFrame

Scenario 1b: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the output is returned as pd.DataFrame

Scenario 2a: User passes in a pandas DataFrame as X_inner_mtype and y_inner_mtype and utilizes set_output for polars, thus the output is returned as pl.DataFrame

Scenario 2b: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and utilizes set_output for pandas, thus the output is returned as pd.DataFrame

instead of automatically trying to convert it back to the input type of X_inner_mtype and y_inner_mtype i.e

Scenario 1: User passes in a pandas Series as y_inner_mtype and does not utilize set_output, thus the code will attempt to also output is returned as pd.Series due to the passed input y_inner_mtype

Scenario 2: User passes in a polars DataFrame as X_inner_mtype and y_inner_mtype and does not utilize set_output, thus the code will attempt to also output is returned as pl.DataFrame due to the passed input y_inner_mtype

@fkiraly
Copy link
Collaborator

fkiraly commented Sep 16, 2024

I would do the second, not the first set of scenarios, because it is consistent with input/output behaviour as current. The quoted line is just explaining the status quo.

I think once we have the datatype extension though, it will not be difficult to switch between the two.

@julian-fong
Copy link
Contributor Author

I would do the second, not the first set of scenarios, because it is consistent with input/output behaviour as current. The quoted line is just explaining the status quo.

I think once we have the datatype extension though, it will not be difficult to switch between the two.

I see, then it would be a good idea to write out the various possible data container types for the proba scitype to facilitate this. Maybe polars and pandas dataframes would be enough for now?

@julian-fong
Copy link
Contributor Author

julian-fong commented Sep 17, 2024

Part of these key points can be found in the design doc

  • a clear specification and use cases written down for set_output and the new config field, what it should do. Scenarios that we need to cover are pandas and polars inputs both.

This is covered in section 5.2. The main use cases for set_output would be 'pandas' and 'polars', with a third option named 'default' which is to indicate default behaviour.

  • this should, in particular, include a specification for what comes out of the proba methods if a polars output is produced

Polars dataframe outputs are described in section 3 for proba methods predict_interval, predict_quantiles, and predict_var

  • I would then add a section on how these outputs are getting mapped onto mtypes, and the datatypes module, a precise suggestion.

I may need some help for how the outputs get mapped onto mtypes and scitypes, i've described a solution inside section 5.3. Some help regarding how to design the mapping method, how we can integrate data container support for the Proba scitype and various scenarios would be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement module:regression probabilistic regression module
Projects
Status: PR in progress
Development

Successfully merging this pull request may close these issues.

3 participants