[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460

fkiraly · 2024-09-10T17:15:27Z

Specification and API consolidation discussion related to polars support for probabilistic predictions.

PR #399 makes it clear that, for full polars support, we need an internal representation for polars returns of predict_interval and predict_quantiles, and some means to convert between pandas and polars representation.

The PR proposes to extend the Table mtype, though I think that is dangerous as it would redefine the abstract datatype to include multi-index columns, which then would either affect, by a "chain of architecture", all mtypes in the Table scitype, where it is unclear what would, for instance, have to happen to pd.Series or numpy based ones.

However, the best way to proceed does not seem clear to me - hence the discussion issue.

Personally, I see two options to maintain a clearer structure, both leaving Table mtypes unchanged:

extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.
build a new scitype, TableMI (or similarly named), which has as ADT tables that can have a column multi-index (and, potentially, a row multi-index, but perhaps only later as we do not need this now). For the start, it can have three mtype-s, polars eager and lazy, and pd.DataFrame based.

We would then use the new concrete data structure in a converter in predict_interval and predict_quantiles, after the output has been produced.

The text was updated successfully, but these errors were encountered:

julian-fong · 2024-09-10T20:12:40Z

I'm thinking about a potential third option, but need some more background on the 'proba' scitype

MTYPE_REGISTER_PROBA = [
    ("pred_interval", "Proba", "predictive intervals"),
    ("pred_quantiles", "Proba", "quantile predictions"),
    ("pred_var", "Proba", "variance predictions"),
    # ("pred_dost", "Proba", "full distribution predictions, tensorflow-probability"),
]

What are the possible datatypes that can be used for 'Proba' scitypes? i see pred_interval, pred_quantiles and pred_var as mtypes, but those don't indicate any explicit 'dataframe' types like the 'table' scitype does.

fkiraly · 2024-09-10T21:24:51Z

It's there as the expected outputs of the probabilitstic predict_ methods, just exactly following the specification of each expected output - I think this was not a good idea, since they do not have a common abstract data type, but I just used this for checking.

julian-fong · 2024-09-10T21:46:16Z

I think a potential idea would just have proba scitype be the 'tablemi' scitype, and then define a specific common abstract data type for it (I think it would be pretty simple, just extend the table to include mi-index and mi-columns).

If we can map a common abstract type similar to table we can reduce the hassle of introducing a new scitype and purpose 'proba' for this.

During set_output we can infer the proper scitype based on the function that is specified:

User calls predict -> follow the table scitype
User calls predict_proba/interval/var -> follow the 'proba' scitype (in this case it would be the multi-index indices and multi-index columns dataframes)

extend the existing Proba mtype with polars based ones. However, this would result in one polars based mtype per predict output, which seems now less clean a design than it originally seemed when there were only pandas based ones.

Would it be possible to leave it as 3 but extend the scope for each to include pandas and polars implicitly?

Would like to know your thoughts or if you see any obvious issues with this

fkiraly · 2024-09-11T07:27:38Z

Would like to know your thoughts or if you see any obvious issues with this

I think that would not work like this, because there is no common abstract data type in the Proba type, hence the conversions are unclear.

To see this, one would really have to write out the conversions and examples in all formality.

fkiraly added module:regression probabilistic regression module module:datatypes datatypes module: data containers, checkers & converters feature request New feature or request API design API design & software architecture labels Sep 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460

[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460

fkiraly commented Sep 10, 2024 •

edited

Loading

julian-fong commented Sep 10, 2024 •

edited

Loading

fkiraly commented Sep 10, 2024

julian-fong commented Sep 10, 2024 •

edited

Loading

fkiraly commented Sep 11, 2024

[ENH] mtype for multi-indexed pandas and polars dataframes #460

[ENH] mtype for multi-indexed pandas and polars dataframes #460

Comments

fkiraly commented Sep 10, 2024 • edited Loading

julian-fong commented Sep 10, 2024 • edited Loading

fkiraly commented Sep 10, 2024

julian-fong commented Sep 10, 2024 • edited Loading

fkiraly commented Sep 11, 2024

[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460

[ENH] mtype for multi-indexed `pandas` and `polars` dataframes #460

fkiraly commented Sep 10, 2024 •

edited

Loading

julian-fong commented Sep 10, 2024 •

edited

Loading

julian-fong commented Sep 10, 2024 •

edited

Loading