Apply datalimits to only some groups #412

Open
andreasnoack opened this issue Jul 6, 2022 · 2 comments
@andreasnoack
Contributor

Problem description

I have some data that I'd like to plot as histograms over two grouping factors. The histograms should be compared vertically, so I'd like datalimits to be computed per group of only one of the grouping variables (f2 in the example below). When datalimits is applied per combination of both grouping factors, the histograms are harder to compare because of the outliers.

Figure

[Screenshot 2022-07-06 at 10.57.37]

Source

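using AlgebraOfGraphics, CairoMakie, DataFrames  # CairoMakie is an assumption; any Makie backend should work

## Generate groups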
testdf = crossjoin(
    DataFrame(id = 1:100),
    DataFrame(f1 = ["Method A", "Method B"]),
    DataFrame(f2 = ["Type 1", "Type 2"])
)

## Generate data
transform!(testdf, "f2" => ByRow(t -> randn() + 2*(t == "Type 1")) => "value")

## Generate outliers
transform!(testdf, ["id", "f1", "value"] => ByRow((_id, _f1, _value) -> _id == 1 && _f1 == "Method B" ? _value + 20 : _value) => "value")

## Generate histograms
data(testdf) * mapping("value", row="f1", col="f2") * histogram(datalimits=extrema, bins=20) |> draw 

Proposed solution

I'd like to be able to use only the factor f2 for the grouping when computing the extrema in datalimits. Unfortunately, I'm not sure what a good way of specifying the preferred behavior would look like. Any ideas?
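
To make the request concrete, here is a minimal sketch of the two sets of limits involved, computed with DataFrames only. The small table `df` and its values are made up for illustration (they are not the data from the example above), and the names `per_cell` and `per_col` are hypothetical. The sketch assumes, as described in this issue, that `datalimits=extrema` is evaluated separately for each (f1, f2) combination.

```julia
using DataFrames

# Hand-written stand-in for testdf (illustrative values, one outlier in Method B / Type 1)
df = DataFrame(
    f1 = ["Method A", "Method A", "Method B", "Method B", "Method A", "Method B"],
    f2 = ["Type 1", "Type 1", "Type 1", "Type 1", "Type 2", "Type 2"],
    value = [1.5, 2.5, 2.0, 22.0, -0.5, 0.5],
)

# Limits per (f1, f2) cell: one pair of limits for every panel
per_cell = combine(
    groupby(df, [:f1, :f2]; sort=true),
    :value => minimum => :lo,
    :value => maximum => :hi,
)

# Limits per f2 only: one pair of limits shared by all panels in a column
per_col = combine(
    groupby(df, :f2; sort=true),
    :value => minimum => :lo,
    :value => maximum => :hi,
)
```

With limits like those in `per_col`, the outlier would stretch the bins of both panels in its column equally, so the two rows would stay comparable.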

@jkrumbiegel
Member

I've looked into this briefly and I didn't see an easy way to pull the information about the grouping into the datalimits computation: the positional arguments of the ProcessedLayer are already split into vectors of length 2x2 = 4. This would need a larger internal refactor, I think.

However, I did think of another solution, which is maybe not as elegant, but simple and functional. You layer two datasets, each filtered to one of the groups you want, and you leave datalimits unspecified so that the limits are synced within each layer:

(data(testdf[testdf.f2 .== "Type 1", :]) + data(testdf[testdf.f2 .== "Type 2", :])) * mapping("value", row="f1", col="f2") * histogram(bins=20) |> draw 
[Image: plot produced by the code above]

@jkrumbiegel
Member

And the filtering can actually be simplified or made more generic like this:

mapreduce(data, +, groupby(testdf, :f2)) * mapping("value", row="f1", col="f2") * histogram(bins=20) |> draw 
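
As a rough illustration of why this is equivalent to the manual filtering, the sketch below looks only at what `groupby(testdf, :f2)` iterates over, using DataFrames alone. The `sort=true` keyword and the name `summary_rows` are additions for this sketch, not part of the one-liner above.

```julia
using DataFrames

# Same group structure as in the issue, without the random values
testdf = crossjoin(
    DataFrame(id = 1:100),
    DataFrame(f1 = ["Method A", "Method B"]),
    DataFrame(f2 = ["Type 1", "Type 2"]),
)

# One sub-data-frame per level of f2; sort=true only fixes the group order
gdf = groupby(testdf, :f2; sort=true)

# Each group holds a single f2 level but still contains both f1 methods
summary_rows = [
    (f2 = first(sub.f2), rows = nrow(sub), methods = sort(unique(sub.f1)))
    for sub in gdf
]
```

Each element of `gdf` becomes its own layer via `data`, and `+` stacks those layers, so every layer covers exactly one column of the plot and both of its rows.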
