Support type-based column selectors #3034

Open
wolthom opened this issue Apr 1, 2022 · 17 comments

@wolthom
Contributor

wolthom commented Apr 1, 2022

Currently, when applying a transformation to all columns of a specific type (or subtypes of an abstract type), a pattern such as transform(df, names(df, Number) .=> f) is used.
Ideally, this could be achieved with a column-selector, e.g. transform(df, Cols(Number) .=> f).

While a minor convenience feature, this may make the column-selector API (even) more consistent and users don't have to repeat the name of the DataFrame multiple times.
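A minimal sketch of the pattern that works today, for readers who have not seen it. The data frame and the function `double` (standing in for `f`) are made up for illustration.

```julia
using DataFrames

df = DataFrame(a = [1, 2, 3], b = [1.5, 2.5, 3.5], c = ["x", "y", "z"])

# names(df, Number) inspects column element types, so it needs the data frame itself.
numeric_cols = names(df, Number)

# `double` is a hypothetical stand-in for the `f` in the text above.
double(col) = 2 .* col

# Broadcasting `.=>` builds one `name => double` pair per numeric column.
out = transform(df, numeric_cols .=> double)
```

The result keeps the original columns and appends `a_double` and `b_double`; the string column `c` is left untouched.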

@bkamins bkamins added the feature label Apr 1, 2022
@bkamins bkamins added this to the 1.4 milestone Apr 1, 2022
@bkamins
Member

bkamins commented Apr 1, 2022

Yes - I just need to think if there are any corner cases that would lead to problems. We could even potentially allow df[:, Number] if it does not lead to problems.

@bkamins
Member

bkamins commented Apr 2, 2022

OK - now I remember why we do not have this.

Except for the names function, all other column selectors are currently resolved in the context of AbstractIndex, not AbstractDataFrame (i.e. they have access only to column names, not to column contents).

So adding the requested functionality would require a significant redesign. This is of course doable.
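To illustrate the distinction, here is a small sketch with made-up column names. The first three selectors can be answered from column names alone, which is all an index knows about; the last call has to look at the columns themselves, which is why it is a function of the data frame.

```julia
using DataFrames

df = DataFrame(a1 = [1, 2], a2 = ["x", "y"], b = [1.0, 2.0])

# Name-based selectors: resolvable from column names only.
by_regex = names(select(df, r"^a"))
by_pred  = names(select(df, Cols(startswith("a"))))
by_not   = names(select(df, Not(:b)))

# Type-based selection: requires the element type of each column.
by_type = names(df, Number)
```

Note that `a2` matches every name-based selector even though it holds strings, while `names(df, Number)` picks `a1` and `b` regardless of how they are named.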

@nalimilan - what do you think?

@nalimilan
Member

I agree it would be nice to be able to do transform(df, Cols(Number) .=> f) at least. But yeah the implementation may not be trivial. (This was discussed briefly at #2400.)

@bkamins
Member

bkamins commented Apr 6, 2022

@nalimilan - I can do it. The only issue is that the PR might end up being 1000 lines and touch many files so it will be hard to review (not sure yet - maybe it will be easier). Essentially we need to drop using AbstractIndex almost everywhere and instead pass around AbstractDataFrame. This is challenging because we need to correctly handle all types that DataFrames.jl defines (as deep down they all use AbstractIndex somewhere).

In other words, the original design of DataFrames.jl assumes such functionality will not be needed (AbstractIndex is not aware of column element types), so we need to change a fundamental element of the design here.

@bkamins
Member

bkamins commented May 8, 2022

@nalimilan - let us make a decision whether we:

  1. add it in the 1.4 release.
  2. postpone the decision to a later release.
  3. keep things as they are (i.e. require the names(df, type) syntax).

I would like to finalize the scope of the 1.4 release so that we can have it before JuliaCon.

@bkamins
Member

bkamins commented Jun 7, 2022

I am moving it to the 1.5 release for a decision.

@bkamins
Member

bkamins commented Dec 23, 2022

I was thinking about it. The issue is that AbstractIndex was designed as:

# an AbstractIndex is a thing that can be used to look up ordered things by name, but that

so it - by design - only supports name lookup.

Now the issue is that to create a DataFrame, we have to construct its index first. So we cannot even naturally have a back-reference to a data frame in the index.

In summary, this means that it would be a major redesign of DataFrame, SubDataFrame, DataFrameRow, Index, and SubIndex if we wanted to allow for such a change. One particular consequence is that the 1.5 release would be incompatible with the 1.4 release at the binary level (and people often serialize data frames or save them with JLD).

@nalimilan - the question is if we want to do it.

An alternative would be to special-case such a selector before passing it to the index, but this would lead to an ugly design (in many places we would have to apply a patch that is hard to maintain).
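For readers trying to picture the special-casing alternative, the following is a rough user-level sketch, not how DataFrames.jl does or would implement it. `TypeCols`, `resolve`, and `select_typed` are hypothetical names invented here; the only library calls used are `names` and `select`.

```julia
using DataFrames

# Hypothetical selector type; NOT part of DataFrames.jl.
struct TypeCols{T} end
TypeCols(::Type{T}) where {T} = TypeCols{T}()

# Turn the type-based selector into plain column names while the data frame
# is still available, i.e. before anything reaches the name-only index.
resolve(df::AbstractDataFrame, ::TypeCols{T}) where {T} = names(df, T)
# Every other selector is passed through unchanged.
resolve(::AbstractDataFrame, sel) = sel

# Hypothetical wrapper showing where the "patch" would have to be applied.
select_typed(df::AbstractDataFrame, sel) = select(df, resolve(df, sel))

df = DataFrame(a = [1, 2], b = [1.0, 2.0], c = ["x", "y"])
numeric = select_typed(df, TypeCols(Number))
plain   = select_typed(df, :c)
```

The maintenance concern raised above is visible even in this toy version: every entry point that accepts a selector would need its own `resolve` step.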

@bkamins
Member

bkamins commented Feb 5, 2023

After more thinking I am giving it a 1.x milestone. Maybe we will add it at some point, but it is not likely we will do it soon. For now users need to use names or work with eachcol to filter on the element type of a column.
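Both workarounds mentioned here can be sketched as follows, on a made-up data frame. The choice of `Union{Missing, Real}` as the target type is an assumption for the example.

```julia
using DataFrames

df = DataFrame(id = [1, 2, 3], score = [0.5, missing, 1.5], label = ["x", "y", "z"])

# Workaround 1: names with a type returns the matching column names.
by_names = names(df, Union{Missing, Real})

# Workaround 2: build a Bool mask from eachcol and use it as a column selector.
mask = [eltype(col) <: Union{Missing, Real} for col in eachcol(df)]
by_mask = select(df, mask)
```

The `eachcol` form is the more general one, since the comprehension can test anything about a column, not only its element type.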

@bkamins bkamins modified the milestones: 1.x, 1.7 Aug 17, 2023
@bkamins
Member

bkamins commented Aug 17, 2023

In this issue let us track all requests for basing column selection on column values (as column element type is just a special case).

In this post I discuss the choice in more detail.

If you feel we should add this functionality please vote up: 👍.
If you feel it is OK not to have a special syntax for it please vote down: 👎.

Thank you!

@alfaromartino
Contributor

My two cents about why I wouldn't recommend adding a new method:

  1. The operation can be implemented in other ways already. The more methods to implement the same feature, the harder to read code written by third parties. This aspect affects new users, who would become really confused about what methods to learn when they're learning the language.

  2. Somewhat related to 1, adding new syntax for DataFrames forces new users to learn syntax specific to DataFrames (even if it's just to read other people's code). This is problematic if they're learning the Julia language in general.

  3. From what's described, the implementation doesn't seem so easy and there are some issues involved. In a context where it's not trivial, I think implementing other features would be more beneficial. For example, any performance improvement seems more beneficial than implementing one more method for the same operation (e.g., I read somewhere about an improvement of groupby operations when there are a lot of small groups).

@kdpsingh

I appreciate the thoughtful examples in the blog post! With the examples you’ve given there, I think I should be able to wrap this functionality within TidierData.jl. The only piece I’m concerned about is making sure I escape the data frame in the right place since I have a bunch of functions that parse and modify the expression along the way. Will let you know if I run into roadblocks.

@tp2750

tp2750 commented Aug 24, 2023

Looks like an interesting feature.
I like it being an explicit functionality, as it makes it easier to find in the documentation. I was not able to find examples of value-based column selection in the DataFrames.jl documentation.

If there is no performance benefit of

select(df, Cols(startswith("a")) .& Vals(x -> any(ismissing(x))))

over

select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])

perhaps it might as well be done by a macro in DataFramesMeta?
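The second form above already runs with current DataFrames.jl. A small sketch on made-up data follows; the `Vals` form is proposed syntax only and, as far as this thread indicates, is not implemented.

```julia
using DataFrames

df = DataFrame(a1 = [1, missing, 3], a2 = [1, 2, 3], b = [missing, "y", "z"])

# pairs(eachcol(df)) yields (column name as Symbol) => (column vector).
mask = [startswith(string(n), "a") && any(ismissing, c)
        for (n, c) in pairs(eachcol(df))]

# A Bool vector with one entry per column is a valid column selector.
out = select(df, mask)
```

Only `a1` survives: `a2` has the right name but no missing values, and `b` has missing values but the wrong name.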

@math4mad

When working with PCA or cor(Matrix), it is better to keep only columns of a Number type. How should the supertype be defined?

using Pipe, Tidier

# load_csv is not defined in this snippet
df = load_csv("airbnb_nyc_2019", false)
type_df = @pipe describe(df) |> select(_, [:variable, :eltype])
int_df = @chain type_df begin
    @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))
end

In @filter(isa(eltype, Union{Type{Int64}, Type{Float64}})), is there a better way to define this type?

@kdpsingh

Hi @math4mad,

Thanks for the question. Just to clarify, are you asking:

  • In general, how to identify super types?
  • Or how to get this code to work in TidierData.jl?
  • Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?

Or all of the above?

That may help with tailoring the reply a bit better. Thanks!

@math4mad

I just want to select columns with a numerical supertype.

@bkamins
Member

bkamins commented Dec 16, 2023

> just select columns containing Numerical super-type

Do you mean to select all columns (denoted col below) for which:

  1. eltype(col) <: Number
  2. all(x -> x isa Number, col)
  3. eltype(col) <: Union{Missing, Number}
  4. all(x -> x isa Union{Missing, Number}, col)

(I am listing the four most common cases you might want to select.)
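The four conditions give different answers on the same data. Here is a sketch on a made-up data frame whose columns were chosen to separate the cases; `pick` is a hypothetical helper defined only for this example.

```julia
using DataFrames

df = DataFrame(
    ints     = [1, 2, 3],            # eltype Int
    maybe    = [1.0, missing, 3.0],  # eltype Union{Missing, Float64}
    any_num  = Any[1, 2.5, 3],       # eltype Any, every value is a number
    any_miss = Any[1, missing, 3],   # eltype Any, numbers and missing
    text     = ["x", "y", "z"],      # eltype String
)

# Hypothetical helper: names of the columns for which pred(col) is true.
pick(pred) = names(df)[[pred(col) for col in eachcol(df)]]

case1 = pick(col -> eltype(col) <: Number)
case2 = pick(col -> all(x -> x isa Number, col))
case3 = pick(col -> eltype(col) <: Union{Missing, Number})
case4 = pick(col -> all(x -> x isa Union{Missing, Number}, col))
```

Cases 1 and 3 depend only on the declared element type and match what `names(df, Number)` and `names(df, Union{Missing, Number})` return. Cases 2 and 4 scan the stored values, so they also accept loosely typed columns such as `any_num`.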

@math4mad

For now, I think it would be option 2.
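Option 2, combined with the cor(Matrix) use case mentioned earlier in the thread, might look like the sketch below. The data is made up, and converting to Float64 is an assumption that suits real-valued columns only.

```julia
using DataFrames, Statistics

df = DataFrame(x = Any[1, 2, 3, 4], y = [2.0, 4.0, 6.0, 8.0], name = ["a", "b", "c", "d"])

# Option 2: keep a column only if every stored value is a Number.
mask = [all(v -> v isa Number, col) for col in eachcol(df)]

# Convert the selected columns to a Float64 matrix before computing correlations,
# because a loosely typed column such as `x` would otherwise yield a Matrix{Any}.
X = Float64.(Matrix(df[:, mask]))
C = cor(X)
```

Because option 2 checks values rather than element types, the `Any`-typed column `x` is kept, and a column containing even one `missing` would be dropped.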

@bkamins bkamins modified the milestones: 1.7, 1.x Sep 14, 2024
7 participants