Inconsistent reuse of result in summarize() #72

Closed
krlmlr opened this issue Nov 9, 2023 · 1 comment · Fixed by #106

@krlmlr
Member

krlmlr commented Nov 9, 2023

@hadley: reusing summary variables is probably a mistake more often than not, and duckplyr does not handle it yet. Should we implement the dangerous behavior or find a better way in dplyr? For example, we could require that reusing a computed summary variable go through an adverb. The reprex below shows the current inconsistency:

options(conflicts.policy = list(warn = FALSE))
library(duckplyr)

data <- tibble(a = c(1L, 1:2), b = 1:3, c = 4:6)
data
#> # A tibble: 3 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     1     4
#> 2     1     2     5
#> 3     2     3     6

data |>
  summarize(
    .by = a,
    b = sum(b),
    c = sum(b * c),
  )
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     3    27
#> 2     2     3    18

data |>
  as_duckplyr_df() |>
  summarize(
    .by = a,
    b = sum(b),
    c = sum(b * c),
  )
#> materializing:
#> ---------------------
#> --- Relation Tree ---
#> ---------------------
#> Aggregate [a, sum(b), sum(*(b, c))]
#>   r_dataframe_scan(0x12cbe0ec8)
#> 
#> ---------------------
#> -- Result Columns  --
#> ---------------------
#> - a (INTEGER)
#> - b (HUGEINT)
#> - c (HUGEINT)
#> 
#> # A tibble: 2 × 3
#>       a     b     c
#>   <int> <dbl> <dbl>
#> 1     2     3    18
#> 2     1     3    14

sum(1:2 * 4:5)
#> [1] 14
sum(sum(1:2) * 4:5)
#> [1] 27

Created on 2023-11-09 with reprex v2.0.2
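
The two readings of `sum(b * c)` shown above (14 vs. 27 for group `a = 1`) can each be spelled out so that the result does not depend on whether a freshly computed `b` shadows the input column. The following is a minimal sketch using plain dplyr only; the name `b_total` is made up for illustration, and whether duckplyr translates the second form is not confirmed by this thread.

```r
library(dplyr)

data <- tibble(a = c(1L, 1:2), b = 1:3, c = 4:6)

# Reading 1: `b` means the input column.
# Compute `c` before `b` is overwritten, then restore the column order.
res_input <- data |>
  summarize(.by = a, c = sum(b * c), b = sum(b)) |>
  relocate(b, .before = c)

# Reading 2: `b` means the per-group total.
# Give the total its own (hypothetical) name so the reuse is explicit.
res_reuse <- data |>
  summarize(.by = a, b_total = sum(b), c = sum(b_total * c))

res_input$c
res_reuse$c
```

Under dplyr's sequential evaluation, `res_input$c` corresponds to the `sum(1:2 * 4:5)` reading and `res_reuse$c` to the `sum(sum(1:2) * 4:5)` reading.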

@hadley
Member

hadley commented Nov 10, 2023

I don't think we want to make any big design changes based on this unusual edge case. I would suggest duckplyr warns when it encounters this situation, and then we can use information from that to determine if we can pull it from dplyr.
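
One way such a check could look, as a rough sketch in base R rather than duckplyr's actual implementation: walk the summary expressions in order and report any name that a later expression reads after an earlier expression in the same call assigned it. The helper `find_reused_summaries()` is hypothetical and exists only for this illustration.

```r
# Hypothetical helper, not part of dplyr or duckplyr.
# Returns the names of summary variables that a later expression reads
# after an earlier expression in the same call has defined them.
find_reused_summaries <- function(...) {
  # Capture the expressions unevaluated, dropping the leading `list`
  exprs <- as.list(substitute(list(...)))[-1]
  nms <- names(exprs)
  if (is.null(nms)) nms <- rep("", length(exprs))

  defined <- character()
  reused <- character()
  for (i in seq_along(exprs)) {
    # Symbols read by this expression
    used <- all.vars(exprs[[i]])
    reused <- union(reused, intersect(used, defined))
    # Only after checking does the new name count as defined,
    # so `b = sum(b)` on its own is not flagged
    if (nzchar(nms[[i]])) defined <- union(defined, nms[[i]])
  }
  reused
}

# The call from the reprex: `c` reads `b` after `b` was redefined
reuse_found <- find_reused_summaries(b = sum(b), c = sum(b * c))

# Swapping the order removes the ambiguity
reuse_none <- find_reused_summaries(c = sum(b * c), b = sum(b))
```

A non-empty result is the situation in which a warning would be raised. The sketch only looks at symbol names, so it would also flag cases where the reuse is intentional.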
