Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make stack create CategoricalVector{String} vector from column names #1947

Closed
bkamins opened this issue Sep 5, 2019 · 8 comments · Fixed by #2049
Closed

Make stack create CategoricalVector{String} vector from column names #1947

bkamins opened this issue Sep 5, 2019 · 8 comments · Fixed by #2049
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Sep 5, 2019

Currently it is Vector{Symbol}. See JuliaInterop/RCall.jl#331 for a discussion (but I think this would be also useful in general).

@bkamins bkamins added this to the 1.0 milestone Sep 5, 2019
@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2019

@nalimilan - given #2029 what do you think we should do here?
We can use PooledVector instead of CategoricalVector if we want to cut down the dependency on CategoricalArrays.jl. I planned to implement this change this week.

@nalimilan
Copy link
Member

Ah, so indeed that would be a reason to keep depending on CategoricalArrays. CategoricalVector{String} would be better than PooledVector{String} here to preserve the ordering of columns.

For reference, in the tidyverse, tidyr::gather uses factor_key = FALSE by default, while its predecessors reshape2::melt and utils::stack did the reverse. The discussion about the change at tidyverse/tidyr#96. I must say it's a bit surprising as the only justification seems to be that joining printed a silly warning about mixing factors and characters. I guess a more general reason could be to keep things simple by default. OTOH I could find several examples using factor_key=TRUE on the Web, so it looks like it's used.

@bkamins
Copy link
Member Author

bkamins commented Dec 2, 2019

In R I think that the problem is that "factor" is a bit broken. We do not have this problem in CategoricalArrays.jl.

Now as:

julia> @time using DataFrames
  1.334656 seconds (1.77 M allocations: 106.438 MiB, 0.28% gc time)

I am OK with this loading time, so as noted earlier I have no problem with keeping CategoricalArrays.jl (removing it would cut it to probably something around 0.7 seconds at best, so the perception of the lag would be of similar order I think).

@nalimilan - so it is a call to you to make a decision here 😄.

@quinnj - maybe you want to comment here or in #2029 how you see the CategoricalArrays.jl case (keep it or drop it?), especially in relation to CSV.jl. Thank you!

@nalimilan
Copy link
Member

I think people complain not only about the loading time, but also about the fact that loading CategoricalArrays invalidates lots of Base functions, which introduces many delays when calling the function for the first time in the session. But yeah, it sounds annoying to have to drop CategoricalArrays just because of that.

@quinnj
Copy link
Member

quinnj commented Dec 2, 2019

Yeah, I like the functionality of CategoricalArrays.jl, but I wish it didn't cause so much compiler trouble; it'd be great if we could figure all that out quickly to make it an easier/lighter/more friendly dependency.

@nalimilan
Copy link
Member

Unfortunately I don't know how...

@bkamins
Copy link
Member Author

bkamins commented Dec 9, 2019

@quinnj - do you have some time to spare and lay out what are the core problems that CategoricalArrays.jl causes for compiler (i.e. I know the design of the package but I never studied what exactly causes the most pain). Thank you!

@quinnj
Copy link
Member

quinnj commented Dec 10, 2019

From what I understand, it comes down to some problematic convert definitions, like convert(::Type{Any}, x::CategoricalValue), which, when defined, invalidate basically any other convert method, triggering them all to need to recompile. That's why I opened JuliaLang/julia#32613 as a way to allow a wrapper type like CategoricalValue to not need such problematic convert methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants