Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] mtype for multi-indexed pandas and polars dataframes #460

Open
fkiraly opened this issue Sep 10, 2024 · 4 comments
Open

[ENH] mtype for multi-indexed pandas and polars dataframes #460

fkiraly opened this issue Sep 10, 2024 · 4 comments
Labels
API design API design & software architecture feature request New feature or request module:datatypes datatypes module: data containers, checkers & converters module:regression probabilistic regression module

Comments

@fkiraly
Copy link
Collaborator

fkiraly commented Sep 10, 2024

Specification and API consolidation discussion related to polars support for probabilistic predictions.

PR #399 makes it clear that, for full polars support, we need an internal representation for polars returns of predict_interval and predict_quantiles, and some means to convert between pandas and polars representation.

The PR proposes to extend the Table mtype, though I think that is dangerous as it would redefine the abstract datatype to include multi-index columns, which then would either affect, by a "chain of architecture", all mtypes in the Table scitype, where it is unclear what would, for instance, have to happen to pd.Series or numpy based ones.

However, the best way to proceed does not seem clear to me - hence the discussion issue.

Personally, I see two options to maintain a clearer structure, both leaving Table mtypes unchanged:

  • extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.
  • build a new scitype, TableMI (or similarly named), which has as ADT tables that can have a column multi-index (and, potentially, a row multi-index, but perhaps only later as we do not need this now). For the start, it can have three mtype-s, polars eager and lazy, and pd.DataFrame based.

We would then use the new concrete data structure in a converter in predict_interval and predict_quantiles, after the output has been produced.

@fkiraly fkiraly added module:regression probabilistic regression module module:datatypes datatypes module: data containers, checkers & converters feature request New feature or request API design API design & software architecture labels Sep 10, 2024
@julian-fong
Copy link
Contributor

julian-fong commented Sep 10, 2024

I'm thinking about a potential third option, but need some more background on the 'proba' scitype

MTYPE_REGISTER_PROBA = [
    ("pred_interval", "Proba", "predictive intervals"),
    ("pred_quantiles", "Proba", "quantile predictions"),
    ("pred_var", "Proba", "variance predictions"),
    # ("pred_dost", "Proba", "full distribution predictions, tensorflow-probability"),
]

What are the possible datatypes that can be used for 'Proba' scitypes? i see pred_interval, pred_quantiles and pred_var as mtypes, but those don't indicate any explicit 'dataframe' types like the 'table' scitype does.

@fkiraly
Copy link
Collaborator Author

fkiraly commented Sep 10, 2024

It's there as the expected outputs of the probabilitstic predict_ methods, just exactly following the specification of each expected output - I think this was not a good idea, since they do not have a common abstract data type, but I just used this for checking.

@julian-fong
Copy link
Contributor

julian-fong commented Sep 10, 2024

I think a potential idea would just have proba scitype be the 'tablemi' scitype, and then define a specific common abstract data type for it (I think it would be pretty simple, just extend the table to include mi-index and mi-columns).

If we can map a common abstract type similar to table we can reduce the hassle of introducing a new scitype and purpose 'proba' for this.

During set_output we can infer the proper scitype based on the function that is specified:

  1. User calls predict -> follow the table scitype
  2. User calls predict_proba/interval/var -> follow the 'proba' scitype (in this case it would be the multi-index indices and multi-index columns dataframes)
  • extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.

Would it be possible to leave it as 3 but extend the scope for each to include pandas and polars implicitly?

Would like to know your thoughts or if you see any obvious issues with this

@fkiraly
Copy link
Collaborator Author

fkiraly commented Sep 11, 2024

Would like to know your thoughts or if you see any obvious issues with this

I think that would not work like this, because there is no common abstract data type in the Proba type, hence the conversions are unclear.

To see this, one would really have to write out the conversions and examples in all formality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design API design & software architecture feature request New feature or request module:datatypes datatypes module: data containers, checkers & converters module:regression probabilistic regression module
Projects
None yet
Development

No branches or pull requests

2 participants