Support type-based column selectors #3034

wolthom · 2022-04-01T20:07:51Z

Currently, when applying a transformation to all columns of a specific type (or subtypes of an abstract type), a pattern such as transform(df, names(df, Number) .=> f) is used.
Ideally, this could be achieved with a column-selector, e.g. transform(df, Cols(Number) .=> f).

While a minor convenience feature, this may make the column-selector API (even) more consistent and users don't have to repeat the name of the DataFrame multiple times.
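For reference, a minimal sketch of the pattern that works today, assuming a small made-up data frame and a hypothetical transformation `double` (neither is from the issue). The requested `Cols(Number)` form is not shown because it is not supported.

```julia
using DataFrames

# Made-up data: two numeric columns and one string column.
df = DataFrame(a = [1, 2, 3], b = [1.5, 2.5, 3.5], c = ["x", "y", "z"])

# Hypothetical transformation applied to each selected column as a whole.
double(x) = 2 .* x

# names(df, Number) returns the names of columns whose eltype is a subtype of Number.
numeric_names = names(df, Number)

# Broadcasting .=> builds one `name => double` pair per selected column;
# the results are stored in new columns named a_double and b_double.
result = transform(df, numeric_names .=> double)
```

Note that `df` has to be written twice, which is the repetition the proposal aims to avoid.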

bkamins · 2022-04-01T20:35:11Z

Yes - I just need to think if there are any corner cases that would lead to problems. We could even potentially allow df[:, Number] if it does not lead to problems.

bkamins · 2022-04-02T10:04:38Z

OK - now I remember why we do not have this.

Except for the names function, all other column selectors currently get resolved in the context of AbstractIndex, not AbstractDataFrame (i.e. they have access only to column names, not to column contents).

So adding the requested functionality would require a significant redesign. This is of course doable.
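To illustrate the constraint, here is a toy sketch. `ToyIndex` and `resolve` are hypothetical names invented for this example and are not the DataFrames.jl internals; the only point is that an object holding just names cannot answer a question about element types unless the columns are handed to it separately.

```julia
# Hypothetical stand-in for an index: it knows column names and positions, nothing else.
struct ToyIndex
    lookup::Dict{Symbol, Int}
    names::Vector{Symbol}
end

ToyIndex(names::Vector{Symbol}) =
    ToyIndex(Dict(n => i for (i, n) in enumerate(names)), names)

# Name-based selectors can be resolved from the index alone.
resolve(idx::ToyIndex, name::Symbol) = idx.lookup[name]
resolve(idx::ToyIndex, pred::Function) =
    [i for (i, n) in enumerate(idx.names) if pred(string(n))]

# A type-based selector needs the column vectors, which the index does not hold,
# so they have to be passed in as an extra argument.
resolve(idx::ToyIndex, columns::Vector, T::Type) =
    [i for i in eachindex(idx.names) if eltype(columns[i]) <: T]

idx = ToyIndex([:a, :b, :c])
columns = AbstractVector[[1, 2], ["x", "y"], [1.0, 2.0]]

by_name = resolve(idx, :b)                  # position of column :b
by_pred = resolve(idx, startswith("a"))     # positions of names starting with "a"
by_type = resolve(idx, columns, Number)     # positions of numeric columns
```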

@nalimilan - what do you think?

nalimilan · 2022-04-06T20:54:50Z

I agree it would be nice to be able to do transform(df, Cols(Number) .=> f) at least. But yeah the implementation may not be trivial. (This was discussed briefly at #2400.)

bkamins · 2022-04-06T21:01:47Z

@nalimilan - I can do it. The only issue is that the PR might end up being 1000 lines and touch many files so it will be hard to review (not sure yet - maybe it will be easier). Essentially we need to drop using AbstractIndex almost everywhere and instead pass around AbstractDataFrame. This is challenging because we need to correctly handle all types that DataFrames.jl defines (as deep down they all use AbstractIndex somewhere).

In other words, the original design of DataFrames.jl assumes such functionality will not be needed (AbstractIndex is not aware of column element types), so we would need to change a fundamental element of the design here.

bkamins · 2022-05-08T12:29:12Z

@nalimilan - let us make a decision if we:

1. add it in the 1.4 release.
2. postpone it to later releases for a decision.
3. keep things as they are (i.e. require the names(df, "type") syntax).

I would like to finalize the scope of 1.4 release so that we can have it before JuliaCon.

bkamins · 2022-06-07T07:17:10Z

I am moving it to the 1.5 release for a decision.

bkamins · 2022-12-23T20:15:23Z

I was thinking about it. The issue is that AbstractIndex was designed as:

DataFrames.jl/src/other/index.jl, line 1 in b240458:

    # an AbstractIndex is a thing that can be used to look up ordered things by name, but that

so it - by design - only supports name lookup.

Now the issue is that to create a DataFrame, we have to construct its index first. So we cannot even naturally have a back-reference to a data frame in the index.

In summary this means that it is a major redesign of DataFrame, SubDataFrame, DataFrameRow, Index, and SubIndex if we wanted to allow for such a change. One particular consequence is that 1.5 release would be incompatible with 1.4 release on binary level (and people often serialize/jld data frames).

@nalimilan - the question is if we want to do it.

An alternative would be to special case such selector before passing it to index, but this will lead to ugly design (in many places we will have to apply a patch that is hard to maintain).

bkamins · 2023-02-05T08:40:47Z

After more thinking, I am giving it a 1.x milestone. Maybe we will add it at some point, but it is not likely we will do it soon. For now users need to use names or work with eachcol to filter on the element type of a column.
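A short sketch of both workarounds, assuming a made-up data frame (the column names are illustrative only):

```julia
using DataFrames

# Made-up data: `score` allows missing values, `label` is not numeric.
df = DataFrame(id = [1, 2, 3],
               score = [0.5, missing, 1.5],
               label = ["x", "y", "z"])

# names with a type: columns whose eltype is a subtype of Number.
# `score` is excluded because its eltype is Union{Missing, Float64}.
strict = names(df, Number)

# Widening the type to a Union also accepts columns that allow missing.
lenient = names(df, Union{Missing, Number})

# eachcol gives access to column contents, so any value-based test can be
# turned into a Bool mask with one entry per column.
has_missing = [any(ismissing, col) for col in eachcol(df)]
with_missing = df[:, has_missing]
```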

bkamins · 2023-08-17T21:57:44Z

In this issue let us track all requests for basing column selection on column values (as column element type is just a special case).

In this post I discuss the choice in more detail.

If you feel we should add this functionality please vote up: 👍.
If you feel it is OK not to have a special syntax for it please vote down: 👎.

Thank you!

alfaromartino · 2023-08-18T15:41:04Z

My two cents about why I wouldn't recommend adding a new method:

1. The operation can already be implemented in other ways. The more ways there are to implement the same feature, the harder it is to read code written by third parties. This affects new users, who can become confused about which methods to learn while they are learning the language.
2. Somewhat related to 1, adding new syntax to DataFrames forces new users to learn syntax specific to DataFrames (even if it's just to read other people's code). This is problematic if they're learning the Julia language in general.
3. From what's described, the implementation doesn't seem easy and there are some issues involved. Since it's not trivial, I think implementing other features would be more beneficial. For example, any performance improvement seems more beneficial than implementing one more method for the same operation (e.g., I read somewhere about an improvement to groupby operations when there are a lot of small groups).

kdpsingh · 2023-08-22T16:24:52Z

I appreciate the thoughtful examples in the blog post! With the examples you’ve given there, I think I should be able to wrap this functionality within TidierData.jl. The only piece I’m concerned about is making sure I escape the data frame in the right place since I have a bunch of functions that parse and modify the expression along the way. Will let you know if I run into roadblocks.

tp2750 · 2023-08-24T12:24:05Z

Looks like an interesting feature.
I like it being an explicit functionality, as it makes it easier to find in the documentation. I was not able to find examples of value-based column selection in the DataFrames.jl documentation.

If there is no performance benefit of

select(df, Cols(startswith("a")) .& Vals(x -> any(ismissing(x))))

over

select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])

perhaps it might as well be done by a macro in DataFramesMeta?

math4mad · 2023-12-15T23:52:51Z

When working with PCA or cor(Matrix), it is better to have only Number-typed columns.
How do I define the supertype to select on?

using Pipe, Tidier

df = load_csv("airbnb_nyc_2019", false)
type_df = @pipe describe(df) |> select(_, [:variable, :eltype])
int_df = @chain type_df begin
    @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))
end

Is there a better way to define the type in @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))?

kdpsingh · 2023-12-16T03:11:14Z

Hi @math4mad,

Thanks for the question. Just to clarify, are you asking:

In general, how to identify super types?
Or how to get this code to work in TidierData.jl?
Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?

Or all of the above?

That may help with tailoring the reply a bit better. Thanks!

math4mad · 2023-12-16T12:24:18Z

just select columns containing Numerical super-type

bkamins · 2023-12-16T17:12:15Z

just select columns containing Numerical super-type

Do you mean to select all columns (denoted col below) for which:

1. eltype(col) <: Number
2. all(x -> x isa Number, col)
3. eltype(col) <: Union{Missing, Number}
4. all(x -> x isa Union{Missing, Number}, col)

(I am listing the four most common cases you might want to select.)
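The difference between the four conditions can be seen on a few made-up columns. This is only a sketch on plain vectors; `cols` and the helper `pick` are hypothetical names introduced for the example.

```julia
# Made-up columns chosen so that each condition selects a different set.
cols = (
    a = [1, 2, 3],               # eltype Int64
    b = [1.0, missing, 3.0],     # eltype Union{Missing, Float64}
    c = Any[1, 2.5, 3],          # eltype Any, but every value is a Number
    d = Any[1, missing, "x"],    # eltype Any, contains a String
)

# Hypothetical helper: names of the columns that satisfy a predicate.
pick(pred) = [name for (name, col) in pairs(cols) if pred(col)]

# The four conditions, in the order they are listed above.
by_eltype         = pick(col -> eltype(col) <: Number)
by_values         = pick(col -> all(x -> x isa Number, col))
by_eltype_missing = pick(col -> eltype(col) <: Union{Missing, Number})
by_values_missing = pick(col -> all(x -> x isa Union{Missing, Number}, col))
```

The eltype-based checks only look at the declared element type, so they are cheap but miss column `c`; the value-based checks scan every element.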

math4mad · 2023-12-17T04:47:25Z

For now, I think it would be option 2.

bkamins added the feature label Apr 1, 2022

bkamins added this to the 1.4 milestone Apr 1, 2022

bkamins added the decision label Apr 2, 2022

bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022

bkamins mentioned this issue Dec 2, 2022

add an option to intersect arguments passed to Cols #3224

Merged

bkamins mentioned this issue Feb 2, 2023

Allow to pass multiple predicates in Cols and mix them with other selectors #3279

Merged

bkamins modified the milestones: 1.5, 1.x Feb 5, 2023

bkamins modified the milestones: 1.x, 1.7 Aug 17, 2023

bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
