Apply datalimits to only some groups #412

andreasnoack · 2022-07-06T09:02:30Z

Problem description

I have some data that I'd like to plot as histograms split by two grouping factors. The histograms should be compared vertically, so I'd like datalimits to be computed only per group of f2 (the column factor) in the example below. When datalimits is applied per combination of both grouping factors, the histograms are harder to compare because of the outliers.

Figure

Source

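## Load packages (any Makie backend should work; CairoMakie is an assumption here)
using AlgebraOfGraphics, CairoMakie, DataFrames

## Generate groups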
testdf = crossjoin(
    DataFrame(id = 1:100),
    DataFrame(f1 = ["Method A", "Method B"]),
    DataFrame(f2 = ["Type 1", "Type 2"])
)

## Generate data
transform!(testdf, "f2" => ByRow(t -> randn() + 2*(t == "Type 1")) => "value")

## Generate outliers
transform!(testdf, ["id", "f1", "value"] => ByRow((_id, _f1, _value) -> _id == 1 && _f1 == "Method B" ? _value + 20 : _value) => "value")

## Generate histograms
data(testdf) * mapping("value", row="f1", col="f2") * histogram(datalimits=extrema, bins=20) |> draw

Proposed solution

I'd like to be able to use only the factor f2 for the grouping when computing the extrema in datalimits. Unfortunately, I'm not sure what a good way of specifying the preferred behavior would look like. Any ideas?
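To make the request concrete, the limits being asked for are the minimum and maximum of value within each level of f2, pooled over f1. A minimal sketch with DataFrames.jl follows; the small `df` stands in for testdf with made-up values, and the column names `lo` and `hi` are hypothetical. It only computes the limits and does not show a way to pass them to histogram.

```julia
using DataFrames

# Small stand-in for testdf so the block runs on its own (values are made up)
df = DataFrame(
    f1 = ["Method A", "Method B", "Method A", "Method B"],
    f2 = ["Type 1", "Type 1", "Type 2", "Type 2"],
    value = [1.0, 3.0, -2.0, 0.5],
)

# One row per level of f2, with limits pooled over both levels of f1
limits = combine(
    groupby(df, :f2),
    :value => minimum => :lo,  # hypothetical column name
    :value => maximum => :hi,  # hypothetical column name
)
```

Each row of `limits` holds the range that both histograms in the corresponding column of the figure would share.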


jkrumbiegel · 2024-08-26T09:04:46Z

I've looked into this briefly and didn't see an easy way to pull the grouping information into the datalimits computation: the positional arguments of the ProcessedLayer are already split into vectors of length 2x2 = 4. I think it would need a larger internal refactor.

However, I did think of another solution, which is maybe not as elegant, but simple and functional. You layer two data sets, each filtered to the group you want, and leave datalimits unspecified so that the limits are synced within each layer:

(data(testdf[testdf.f2 .== "Type 1", :]) + data(testdf[testdf.f2 .== "Type 2", :])) * mapping("value", row="f1", col="f2") * histogram(bins=20) |> draw

jkrumbiegel · 2024-08-26T09:06:47Z

And the filtering can actually be simplified or made more generic like this:

mapreduce(data, +, groupby(testdf, :f2)) * mapping("value", row="f1", col="f2") * histogram(bins=20) |> draw
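For readers unfamiliar with the one-liner above, here is the same idea spelled out step by step, as a sketch under stated assumptions: the small `testdf` is a stand-in with fewer rows, and `layers` and `spec` are illustrative names. Each sub-data-frame of the grouping becomes its own data layer, and the layers are then added together before the shared mapping and histogram are applied.

```julia
using AlgebraOfGraphics, DataFrames

# Stand-in for testdf from the issue, with fewer rows
testdf = crossjoin(
    DataFrame(id = 1:10),
    DataFrame(f1 = ["Method A", "Method B"]),
    DataFrame(f2 = ["Type 1", "Type 2"]),
)
transform!(testdf, "f2" => ByRow(t -> randn() + 2 * (t == "Type 1")) => "value")

# One data layer per level of f2; each layer still contains both levels of f1
layers = [data(sub) for sub in groupby(testdf, :f2)]

# Adding the layers is what mapreduce(data, +, ...) does in a single call
spec = reduce(+, layers) * mapping("value", row="f1", col="f2") * histogram(bins=20)

# With a Makie backend loaded (for example CairoMakie), draw(spec) renders the figure
```

Because datalimits is left unspecified, the limits are computed separately for each layer, which here means once per level of f2.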
