Time utils #528
base: dev
Conversation
Dmitry added a basic function. `before = 99999` isn't needed anymore for a large window; `Inf` is the default.
maybe bug: optimization for daily -> weekly: two ideas:
We're also missing 7d time_value-sliding (rather than the "tiling"/downsampling/... here) aggregation of archives to archives (without thinning to weekly time_value resolution), but that has more interface concerns, like new column names. Still, maybe we should think a little about that now, so that if/when we add it the interfaces mesh well.
Glad to see that
This didn't work on dev w/out it, but given that this will probably be merging after the slide stuff I'll keep that in mind.
if I understand correctly, by default Sunday = 1, which I think is the only choice needed to specify the day-of-week convention uniquely. Week numbering is a whole different can of worms, where I'm still not sure whether lubridate's `epiweek` actually matches `MMWRweek`, which matches the more maintained epiweek package.
This is tricky, because cross language that's actually not the same; I ended up helping someone debug some results from scoringutils that came down to the fact that their system was using Hebrew instead of English. I suspect prototype days are probably the best way to go about this; "2020-09-27" is unambiguously Sunday, for example.
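One locale-proof way to implement that prototype-day idea (a sketch, not part of this PR; the function name is made up):

```r
# Day-of-week index defined by distance from a prototype date instead of
# locale-dependent weekday names; "2020-09-27" is unambiguously a Sunday.
day_of_week_index <- function(dates, prototype = as.Date("2020-09-27")) {
  # 1 = Sunday (same weekday as the prototype), ..., 7 = Saturday
  as.integer(as.numeric(dates) - as.numeric(prototype)) %% 7L + 1L
}

day_of_week_index(as.Date("2020-09-27")) # Sunday -> 1
day_of_week_index(as.Date("2020-10-03")) # Saturday -> 7
```

Because the arithmetic never consults locale, OS, or lubridate options, it gives the same answer on any system.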
yeah, its scope should probably be narrowed. I developed it on the weekly -> daily case; the longer the gaps, though, the more questionable it gets as a method. Though I think you may be misinterpreting the
I'll make sure to add a test for this; iirc because
I'm not sure how this would work, because for each time value at a given ref_time_value, we need to return a value. `summarize` would only return one value per group; are you suggesting adding a week indicator first, then grouping by that? I considered uncompressing the epi_archive and then doing the computation, but did this version because that was ~5m rows; I didn't know at the time that this method was also going to generate a lot of rows anyway. It would probably be faster to generate once rather than repeatedly append.
Not sure I follow really. Do you mean e.g. `frollmean` and friends?
[put ideas into line comment w/ suggestions]
```r
dates <- seq(as.Date("1900-01-01"), as.Date("2500-01-01"), by = "day")
all.equal(MMWRweek::MMWRweek(dates)$MMWRweek, lubridate::epiweek(dates))
#> [1] TRUE
```

oh, apparently there's also a
Agreed, I'm not suggesting using "Sunday".
Inferring sounds good.
I didn't see this, sorry.
Yeah, make week column, group by epikey x week column, summarize. [Would probably need some completion and/or filtering at the beginning (and/or stuff within the summarize op) to make sure we get the right implicit and/or explicit NAs out that we want given gaps, partial weeks at beginning, partial weeks at end.]
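A minimal sketch of that make-week-column-then-summarize approach (assuming dplyr/lubridate and Sun-start weeks; the completion/NA handling mentioned above is elided, so partial weeks are simply dropped here):

```r
library(dplyr)
library(lubridate)

daily <- tibble::tibble(
  geo_value = "ca",
  time_value = as.Date("2020-01-05") + 0:13, # two full Sun-Sat weeks
  value = 1
)

weekly <- daily %>%
  # label each day with the Saturday that ends its Sun-start week
  mutate(week_end = floor_date(time_value, "week", week_start = 7) + 6L) %>%
  group_by(geo_value, week_end) %>%
  summarize(value = sum(value), n_days = n(), .groups = "drop") %>%
  # keep only complete weeks rather than reporting misleading partial totals
  filter(n_days == 7L)
```

A real version would instead complete/NA-fill so that gaps and partial weeks at the boundaries come out as the intended implicit/explicit NAs.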
Probably, but the main thing is to try to avoid `epix_slide` overhead [and some annoying logic], and to adapt to the version / version-epikey-specific latency. Example of the latter: if there is a 1000d-latency day but most are 5d latency, then using the `before =` trick is still going to have a bunch of extra munging done (`before = 1006` [& chopping off any partial week at the beginning], even though `before = 11` [& chopping any starting partial] would work for most versions). (It could also avoid redundant calculations on some no-diff versions in certain edge cases with `epix_slide_default_ref_time_values`.) [Note: partial-week chopping correctness here is a bit gnarly to think about; I'm hoping the above is correct if we require that "real" partial weeks are summed to NA.] It might look something like:

```r
ranges <- DT[, list(mintv = min(time_value), maxtv = max(time_value)), by = setdiff(key(DT), "time_value")]
requests <- ranges %>%
  rowwise() %>%
  reframe(
    geo_value,
    time_value = seq(mintv - min_days_to_get_to_start_of_week(mintv, day_of_week_end), maxtv, by = "day"),
    version
  ) %>%
  setDT()
# ^ or something fully data.table-based, and maybe non-`Date`-based.
# ^ note this probably breaks if you have dtplyr loaded; we would need a better way. With dtplyr
#   unloaded, it's temporarily breaking DT ownership expectations and keeping around some attributes
#   which MIGHT make `setDT` acknowledge it's still sorted according to the same key as the original
#   & set that as the key without checking/re-sorting.
# ^ bug: wrong if we have any other_keys; would need a bit of tidyeval stuff
DT[requests][
  , .last_date_of_week := calc_last_date_of_same_week(time_value)
][
  , .... sum/mean(.....) ....., by = list(epikey, .last_date_of_week, version)
]
```

followed by applying whatever math is needed to convert. [Yet another perf idea, also relevant for other similar archive -> archive slides: looping over versions sequentially & actually applying the diffs sequentially, rather than using `epix_as_of` a bunch of times and having to do many rolling joins, or doing one massive rolling join.]
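For concreteness, here is a small runnable version of that `.last_date_of_week` grouping step. The `calc_last_date_of_same_week` helper is a hypothetical stand-in, hard-coded to Sun-Sat weeks ending Saturday:

```r
library(data.table)

# Hypothetical helper: map each date to the Saturday ending its Sun-Sat week.
# 1970-01-03 was a Saturday, so offsets against it give day-of-week without
# any locale- or option-dependent wday() call.
calc_last_date_of_same_week <- function(dates) {
  offset <- as.integer(dates - as.Date("1970-01-03")) %% 7L
  dates + (7L - offset) %% 7L
}

DT <- data.table(
  geo_value = "ca",
  time_value = as.Date("2020-01-05") + 0:13, # two full Sun-Sat weeks
  value = 1
)
weekly <- DT[, .last_date_of_week := calc_last_date_of_same_week(time_value)][
  , .(value = sum(value)), by = .(geo_value, .last_date_of_week)
]
```

The real version would also carry `version` (and any other_keys) in `by =` and handle partial weeks.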
overall seems sensible and handy to have, but we should think a bit about design. Like Logan says, we can either extend `sum_groups_epi_df` or make a complement to it (`expand_groups_epi_df`?). I suggest we do a use-case analysis here: prototype the example workflows we want and see if we can make a common interface.
```r
                      day_of_week = 4L,
                      day_of_week_end = 6L) {
  agg_method <- arg_match(agg_method)
  if (agg_method == "total") {
```

Suggested change:

```diff
- if (agg_method == "total") {
+ if (agg_method == "sum") {
```
```r
keys <- grep("time_value", key_colnames(epi_arch), invert = TRUE, value = TRUE)
too_many_tibbles <- epix_slide(
  epi_arch,
  before = 99999999L,
```

`Inf` is default, so this line can just be dropped:

```diff
- before = 99999999L,
```

(this should also give some perf boost)
```r
  ref_time_values = ref_time_values,
  function(x, group, ref_time) {
    x %>%
      group_by(across(all_of(keys))) %>%
```

after new updates you can just do the following and get rid of the `keys` assignment above:

```diff
- group_by(across(all_of(keys))) %>%
+ group_epi_df(exclude = "time_value") %>%
```
glad to hear it! I was getting very tired of recreating `keys` all over the place
```r
completed <- to_complete_datatable %>%
  full_join(
    completed_time_values,
    by = join_by(season, geo_value, version, time_value)
```

`season` seems particular to whatever you were working on. this probably needs to relate somehow to the groups above.
yeah, should probably just be `keys`. Relatedly, we probably need a function to create keys
there is `epiprocess::key_colnames`, unless you mean something like `epiprocess::add_key`, which would probably require a join and recreating the `epi_df` under the hood.
`key_colnames` includes `time_value`, unless there's an option I missed. So every time, I have to remove `time_value`; Dan even added a `kill_time_value` function in epipredict because it's such a recurring pattern.
I removed `kill_time_value`; instead you can now just do `key_colnames(edf, exclude = "time_value")`.
Time-resolution upsampling and downsampling definitely felt like a separate feature, so we punted on it in #519. Logan's comments about disallowing user-supplied groups seem wise. I would suggest we first try to group epi_dfs by default by
As for concerns about different conventions for `wday()`: it's good to be aware of these issues, but as long as we describe our convention in the docs and are consistent with whatever date-time library we use (whether `as.POSIXlt` or lubridate or other), I don't see what else there is to do. The fully general solution here would require a general week-format library.

I realize David already has all the options I was talking about (but forgot about while writing the data.table-ish approach):
```r
  select(-all_of(agg_columns)) %>%
  rename_with(~ gsub("slide_value_", "", .x)) %>%
  # only keep 1/week
  filter(wday(time_value) == day_of_week_end) %>%
```

```diff
- filter(wday(time_value) == day_of_week_end) %>%
+ filter(as.POSIXlt(time_value)$wday == day_of_week_end) %>%
```

If you want to ensure 0 = Sun, 6 = Sat. Could maybe call this `lt_day_of_week_end`, but as long as we avoid `wday` in the arg name without a clarifying prefix/suffix (which I didn't realize you already did), that sounds good.

OR

```diff
- filter(wday(time_value) == day_of_week_end) %>%
+ filter(wday(time_value, week_start = 1) == day_of_week_end) %>%
```

If you want to ensure 1 = Mon, 6 = Sat, 7 = Sun.

OR

```diff
- filter(wday(time_value) == day_of_week_end) %>%
+ filter(wday(time_value, week_start = 7) == day_of_week_end) %>%
```

If you want to ensure 1 = Sun, 7 = Sat. [Update the default `day_of_week_end` to match the docs then. Note `week_start = 7` is the default default, but we need to pass it because the user could have set the lubridate option to change the default.]
Based on you pointing out that the default is customizable, I was just planning on doing the 3rd one. Most of our collaborators/expected users are in the US, so defaulting to Sunday being the first day of the week is probably safest.
```r
  # only keep 1/week
  filter(wday(time_value) == day_of_week_end) %>%
  # switch time_value to the designated day of the week
  mutate(time_value = time_value - 7L + day_of_week) %>%
```

issue?: this might need to be

```diff
- mutate(time_value = time_value - 7L + day_of_week) %>%
+ mutate(time_value = time_value - (day_of_week_end - day_of_week) %% 7L) %>%
```

Probably easiest to construct a test case with `day_of_week_end` not Saturday & see if anything gives the right result. [And also one with `day_of_week` equal to `day_of_week_end`; I think I messed up in an earlier version of this suggestion for that case, but hopefully fixed.]
Did a little test, seems to back up current suggestion above.
```r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(tidyr)
crossing(end_date = as.Date("2020-01-01") + 0:6, day_of_week = 1:7) %>%
  mutate(day_of_week_end = lubridate::wday(end_date, week_start = 7L)) %>%
  mutate(
    method1 = end_date - 7L + day_of_week,
    method2 = end_date - 7L + (day_of_week - day_of_week_end) %% 7L,
    method3 = end_date - (day_of_week_end - day_of_week) %% 7L
  ) %>%
  pivot_longer(method1:method3, names_to = "method", values_to = "date") %>%
  mutate(
    day_of_week_actual = lubridate::wday(date, week_start = 7L),
    right_day_of_week = day_of_week == day_of_week_actual,
    right_week = as.integer(end_date - date) %in% 0:6
  ) %>%
  summarize(.by = "method", issues = sum(!right_day_of_week | !right_week))
#> # A tibble: 3 × 2
#>   method  issues
#>   <chr>    <int>
#> 1 method1     42
#> 2 method2      7
#> 3 method3      0
```
Created on 2024-09-28 with reprex v2.1.1
yeah this definitely seems right, thanks for the catch
Checklist

Please:

- (PR).
- brookslogan, nmdefries.
- `DESCRIPTION`. Always increment the patch version number (the third number), unless you are making a release PR from dev to main, in which case increment the minor version number (the second number).
- (backwards-incompatible changes to the documented interface) are noted. Collect the changes under the next release number (e.g. if you are on 1.7.2, then write your changes under the 1.8 heading).
- process.
Change explanations for reviewer

This adds a couple of functions that were enough of a headache that having them as utils with tests would be nice. Some of these are adapted from work that Richard did. These are:

- `daily_to_weekly`: an archive-to-archive transformation that converts daily data to weekly data, with the `time_value` being a single day of the week.
- `convert_to_period_upsample`: goes from e.g. weekly data to daily data, either splitting data equally between days (so dividing by 7 in the weekly -> daily case) or not (useful for e.g. % of cases).

Then there's a family of epiweek/season utilities. Hopefully those are all clear from the names.
@lcbrooks @dshemetov @nmdefries: before I do things like add tests and docs, I figured I'd get input on if/how these should change.
Things to do:

- wday