Time utils #528

Draft · wants to merge 5 commits into dev

Conversation

@dsweber2 (Contributor) commented Sep 26, 2024

Checklist

Please:

  • Make sure this PR is against "dev", not "main" (unless this is a release
    PR).
  • Request a review from one of the current main reviewers:
    brookslogan, nmdefries.
  • Make sure to bump the version number in DESCRIPTION. Always increment
    the patch version number (the third number), unless you are making a
    release PR from dev to main, in which case increment the minor version
    number (the second number).
  • Describe changes made in NEWS.md, making sure breaking changes
    (backwards-incompatible changes to the documented interface) are noted.
    Collect the changes under the next release number (e.g. if you are on
    1.7.2, then write your changes under the 1.8 heading).
  • See DEVELOPMENT.md for more information on the development
    process.

Change explanations for reviewer

This adds a couple of functions that were enough of a headache to write that having them as utils with tests would be nice. Some of these are adapted from work that Richard did. These are:

  • daily_to_weekly: an archive-to-archive transformation that converts daily data to weekly data, with the time_value being a single day of the week.
  • convert_to_period_upsample: goes from e.g. weekly data to daily data, either splitting values equally between days (so dividing by 7 in the weekly -> daily case) or not (useful for e.g. % of cases). A rough usage sketch follows this list.
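
For concreteness, a hypothetical call might look like this (the argument names agg_columns, agg_method, day_of_week, and day_of_week_end appear in the diff; the archive and column names here are made up):

daily_to_weekly(
  epi_arch,                # an epi_archive with daily time_values
  agg_columns = "cases",   # columns to aggregate within each week
  agg_method = "total",    # or "mean", e.g. for % columns
  day_of_week = 4L,        # day of the week used as the output time_value
  day_of_week_end = 6L     # last day of the aggregation week
)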

Then there's a family of epiweek/season utilities. Hopefully those are all clear from the names.

@lcbrooks @dshemetov @nmdefries before I do things like add tests and docs, I figured I'd get input on whether/how these should change.

Things to do:

  • be explicit about first day of the week for wday
  • add tests
    • maybe bug: fill downward may need to be grouped by output epikey
  • add docs

@brookslogan (Contributor) commented Sep 26, 2024

Dmitry added a basic function sum_groups_epi_df for epi_dfs in #477 that overlaps in functionality, though I'm not sure if either would be completely subsumed. I think his could theoretically have allowed coarsening of epikeys, which presents more interface issues (if we have both count and rate&percent cols, we don't want the same aggregation method for both, so we'd need something like count_cols and rate_cols selectors rather than agg_cols + agg_method; plus we'd want population data for aggregating the rate&percent cols).

before = 99999 isn't needed anymore for a large window; Inf is the default

wday() in lubridate does not follow the same numbering scheme as base R as.POSIXlt()$wday! There might be bugs here due to this. In writing week-related code, we should [maybe] be using naming conventions or types to clarify which definition we're using [actually I think your naming scheme is fine]. (For reference, I had some scheme like wday0, wday1, wday7 in brookslogan/epicalendar though maybe I just imagined adding the comments about which of the various time offering "wday"s these correspond to. But I think maybe there were like 3 different conventions in common packages.)
[transferred to inline comment]

convert_to_period_upsample seems suspiciously general... Allowing selection of groups at same time as periods seems like it might run into the same problematic edge cases as epi_slide did (& in epi_slide we're moving to just forbid things that can potentially allow those edge cases). We should probably be more rigid, perhaps not allowing groups parm, or ensuring that it's also upsampling epikeys? [epikey upsampling is harder if you allow count cols though... assuming no wday effects is very different from assuming equal epikey populations and risk rates --- actually risk rate factor also means that epikey upsampling on rates is also fraught. Easiest to just say we only do time upsampling and not allow user to specify groups?]

maybe bug: fill downward may need to be grouped by output epikey, or else it might complete early vals of one epikey with last vals of "preceding" epikey [if user has complete(geo_value, time_value = full_seq(....)) etc. and has NAs for its first time_values]
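
A minimal illustration of the leak with made-up data (assuming the downward fill is tidyr::fill or equivalent):

library(dplyr)
library(tidyr)
df <- tibble(
  geo_value  = c("ak", "ak", "al", "al"),
  time_value = as.Date("2020-01-01") + c(0, 1, 0, 1),
  value      = c(1, 2, NA, 3)
)
df %>% fill(value)                          # ungrouped: al's NA inherits ak's 2
df %>% group_by(geo_value) %>% fill(value)  # grouped: al's first value stays NA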

optimization for daily -> weekly: two ideas:

  • we could use epi_slide_opt [grouped summarize, we don't need slide, and if we write it in a friendly way, dplyr will transform our computation to use a specialized fast native sum/mean-by-group rather than a slower lapply(, base::sum) type of thing] within the epix_slide, and also set before according to something like max(version - time_value) + 6L
  • we can probably do even better directly using data.table ops, especially if there is significant variance in reporting latency from version to version
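
For idea 1, the before bound might be computed along these lines (a sketch; epi_arch$DT is the epi_archive's underlying data.table, and the exact form before accepts depends on the time_type):

# worst-case reporting latency across the archive, plus 6 days so the
# window always covers a full trailing week
before <- as.integer(max(epi_arch$DT$version - epi_arch$DT$time_value)) + 6L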

@brookslogan (Contributor) commented Sep 26, 2024

We're also missing 7d time_value-sliding (rather than the "tiling"/downsampling here) aggregation of archives to archives (without thinning to weekly time_value resolution), but that has more interface concerns, like new column names. Maybe we should think a little about that now so that if/when we add it, the interfaces mesh well.

@dsweber2 (Contributor, Author)

Glad to see that sum_groups_epi_df exists! The two should have fundamentally different approaches to rate and % data: aggregating by time, the first is a simple sum and the second is a mean, while aggregating by geo_value or some other group we need the associated weights (e.g. population). Given this, they should probably be 2 separate functions or family of functions.

before = 99999 isn't needed anymore for a large window; Inf is the default

This didn't work on dev without it, but given that this will probably merge after the slide stuff, I'll keep that in mind.

wday() in lubridate

If I understand correctly, by default Sunday = 1, which I think is the only choice needed to specify the day-of-week convention uniquely. Week numbering is a whole different can of worms, where I'm still confused whether lubridate's epiweek actually matches MMWRweek, which matches the more maintained epiweek package.

we should always be using naming conventions or types to clarify which definition we're using.

This is tricky, because across languages that's actually not the same; I ended up helping someone debug some scoringutils results that came down to their system locale being Hebrew instead of English. I suspect prototype days are probably the best way to go about this; "2020-09-27" is unambiguously Sunday, for example.
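
For example, a locale-independent check could compare against a prototype date instead of a weekday name or integer code (a sketch; the helper name is made up):

# "2020-09-27" is a Sunday regardless of locale or wday() convention
same_weekday <- function(x, prototype) {
  as.integer(x - prototype) %% 7L == 0L
}
same_weekday(as.Date("2020-10-04"), as.Date("2020-09-27"))
#> [1] TRUE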

convert_to_period_upsample seems suspiciously general

Yeah, its scope should probably be narrowed. I developed it on the weekly -> daily case; the longer the gaps, though, the more questionable it gets as a method. Though I think you may be misinterpreting the groups parameter; it's just intended to make sure values aren't leaking across group boundaries. It should probably be inferred from the epi_archive. I wasn't planning on getting into imputing values cross-group with this; fill is way too crude for that.

maybe bug: fill

I'll make sure to add a test for this; iirc because completed was grouped, after the join it was still grouped, but that could easily be wrong.

grouped summarize...

I'm not sure how this would work, because for each time value at a given ref_time_value, we need to return a value; summarize would only return one value per group. Are you suggesting adding a week indicator first, then grouping by that? I'm using epi_slide_sum or epi_slide_mean, both of which are actually using epi_slide_opt, so I suspect that's not actually the expensive part. I think the reason this ends up slow is that it generates ~5m rows, then compresses back down to 300k. If we had an archive-to-archive transform that compressed as we went, that might speed this up, but allocating all that memory is expensive.

I considered uncompressing the epi_archive and then doing the computation, but did this version because that was ~5m rows; I didn't know at the time that this method was also going to generate a lot of rows anyway. It would probably be faster to generate once rather than repeatedly append.

we can probably do even better directly using data.table ops

Not sure I follow really. Do you mean e.g. frollmean and friends?

@brookslogan (Contributor) commented Sep 26, 2024

if I understand correctly, by default Sunday = 1, which I think is the only choice needed to specify the day of the week convention uniquely.

  • lubridate::wday() without week_start =: who knows; it depends on the user's lubridate.week.start option. The default default is indeed Sunday = 1. If we use this definition we should think carefully about the implications of the option setting. [e.g., do we use the default, or do we ensure week_start = 1 and detect Saturday with wday == 7?]
  • as.POSIXlt()$wday, format(, "%w"): Sunday = 0 / "0"
  • format(, "%u"): Sunday = "7"
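
A quick demo of the three conventions on a known Sunday:

d <- as.Date("2020-09-27")           # a Sunday
lubridate::wday(d, week_start = 7)   # 1   (the lubridate "default default": Sunday = 1)
as.POSIXlt(d)$wday                   # 0   (Sunday = 0)
format(d, "%u")                      # "7" (Sunday = "7")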

[put ideas into line comment w/ suggestions]

Week number is a whole different can of worms where I'm still confused if lubridate's epiweek actually matches MMWRweek matches the more maintained epiweek package.

dates <- seq(as.Date("1900-01-01"), as.Date("2500-01-01"), by = "day")
all.equal(MMWRweek::MMWRweek(dates)$MMWRweek, lubridate::epiweek(dates))
#> [1] TRUE

oh, apparently there's also a lubridate::epiyear() now!

This is tricky, because cross language that's actually not the same

Agreed, I'm not suggesting using "Sunday". Rather, instead of wday we should use something that indicates which numbering scheme we're using. Using something other than integerish/character (an S3 class) and checking for that type could be a cleaner way, though I haven't ever seen that done. Current arg names seem fine; I fail at reading. ...day... seems better than ...wday... when there are 3+ wday definitions and not everybody knows this.

Though I think you may be misinterpreting the groups parameter; it's just intended to make sure values aren't leaking across group boundaries. It should probably be inferred from the epi_archive

Inferring sounds good.

I'm using epi_slide_sum or epi_slide_mean

I didn't see this, sorry.

Summarize would only return one value per group; are you suggesting adding a week indicator first, then grouping by that?

Yeah: make a week column, group by epikey x week column, summarize. [Would probably need some completion and/or filtering at the beginning (and/or stuff within the summarize op) to make sure we get the right implicit and/or explicit NAs out, given gaps, partial weeks at the beginning, and partial weeks at the end.]
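
A sketch with made-up column names (the completion/NA handling at the edges being the fiddly part):

library(dplyr)
x %>%
  # last day of the Sun--Sat week containing each time_value
  mutate(.week_end = time_value + (6L - as.POSIXlt(time_value)$wday)) %>%
  group_by(geo_value, .week_end) %>%
  summarize(cases = sum(cases), .groups = "drop") %>%
  rename(time_value = .week_end)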

I think the reason this ends up slow is that it generates ~5m rows, but can compress it back down to 300k.

before = <right non-huge value> [+ some annoying logic] might help here (though before = non-Inf has overhead not in before = Inf), or:

we can probably do even better directly using data.table ops

Not sure I follow really. Do you mean e.g. frollmean and friends?

Probably, but the main thing is to avoid epix_slide overhead [and some annoying logic], and to adapt to the version- / version-epikey-specific latency. Example of the latter: if there is a 1000d-latency day but most are 5d latency, then using the before = trick is still going to do a bunch of extra munging (before = 1006 [& chopping off any partial week at the beginning], even though before = 11 [& chopping any starting partial] would work for most versions). (It could also avoid redundant calculations on some no-diff versions in certain edge cases with epix_slide_default_ref_time_values.)

[Note: partial week chopping correctness here is a bit gnarly to think about; I'm hoping the above is correct if we require that "real" partial weeks are summed to NA (no na.rm = TRUE), but it still feels hacky. I think the data.table approach might actually make it easier to address this. We want to avoid before = including a partial first week, summing to NA or an incomplete number, and then saying something real was revised to that. But if we actually land on version - before being the starting wday, then no chopping is needed. However, we also want to avoid chopping off a partial week if it was actually partial to begin with, and not just from before = <not Inf> chopping off part of it.]

might look something like

ranges <- DT[,list(mintv = min(time_value), maxtv = max(time_value)), by = setdiff(key(DT), "time_value")]
requests <- ranges %>% rowwise() %>% reframe(geo_value, time_value = seq(mintv - min_days_to_get_to_start_of_week(mintv, day_of_week_end), maxtv, by = "day"), version) %>% setDT()
# ^ or something fully data.table-based, and maybe non-`Date`-based.
# ^ note this probably breaks if you have dtplyr loaded; we would need a better way.  With dtplyr unloaded, it's temporarily breaking DT ownership expectations and keeping around some attributes which MIGHT make `setDT` acknowledge it's still sorted according to the same key as the original & set that as the key without checking/re-sorting.
# ^ bug: wrong if we have any other_keys.  would need a bit of tidyeval stuff
DT[requests][, .last_date_of_week := calc_last_date_of_same_week(time_value)][, .... sum/mean(.....) ....., by = list(epikey, .last_date_of_week, version)]

followed by [applying whatever math needed to convert .last_date_of_week to the appropriate day_of_week and naming the result] time_value and feeding through as_epi_archive(compactify = TRUE)

[Yet another perf idea, also relevant for other similar archive -> archive slides: looping over versions sequentially & actually applying the diffs sequentially rather than using epix_as_of a bunch of times and having to do many rolling joins or doing the massive rolling join DT[requests] above ..... though this would also need extra logic to not be re-calculating things that haven't changed, and AFAIK R doesn't have efficient ways to grow vectors to apply the new-obs parts of the diffs ... maybe not the best idea]

@dshemetov (Contributor) left a comment

Overall seems sensible and handy to have, but we should think a bit about design - like Logan says, we can either extend sum_groups_epi_df or make a complement to it (expand_groups_epi_df?). I suggest we do a use-case analysis here: prototype the example workflows we want and see if we can make a common interface.

day_of_week = 4L,
day_of_week_end = 6L) {
agg_method <- arg_match(agg_method)
if (agg_method == "total") {
Contributor:

Suggested change
if (agg_method == "total") {
if (agg_method == "sum") {

keys <- grep("time_value", key_colnames(epi_arch), invert = TRUE, value = TRUE)
too_many_tibbles <- epix_slide(
epi_arch,
before = 99999999L,
Contributor:

Inf is default

Suggested change
before = 99999999L,

Contributor:

(this should also give some perf boost)

ref_time_values = ref_time_values,
function(x, group, ref_time) {
x %>%
group_by(across(all_of(keys))) %>%
Contributor:

after the new updates you can just do the following and get rid of the keys assignment above

Suggested change
group_by(across(all_of(keys))) %>%
group_epi_df(exclude = "time_value") %>%

Contributor (Author):

Glad to hear it! I was getting very tired of recreating keys all over the place.

completed <- to_complete_datatable %>%
full_join(
completed_time_values,
by = join_by(season, geo_value, version, time_value)
Contributor:

season seems particular to whatever you were working on. This probably needs to relate somehow to the groups above.

Contributor (Author):

Yeah, should probably just be keys. Relatedly, we probably need a function to create keys.

Contributor:

There is epiprocess::key_colnames, unless you mean something like epiprocess::add_key, which would probably require a join and recreating the epi_df under the hood.

Contributor (Author):

key_colnames includes time_value, unless there's an option I missed. So every time I have to remove time_value; Dan even added a kill_time_value function in epipredict because it's such a recurring pattern.

Contributor:

I removed kill_time_value and instead you can now just do key_colnames(edf, exclude = "time_value").

R/time_converters.R (resolved thread)
@dshemetov (Contributor)

Glad to see that sum_groups_epi_df exists! The two should have fundamentally different approaches to rate and % data: aggregating by time, the first is a simple sum and the second is a mean, while aggregating by geo_value or some other group we need the associated weights (e.g. population). Given this, they should probably be 2 separate functions or family of functions.

Time resolution upsampling and downsampling definitely felt like a separate feature, so we punted on it in #519.

Mostly what we're doing in sum_groups_epi_df is allowing users to contract data along a grouping key index. The intention there was to give an easy workaround to someone who used to use ungrouped epi_slides to aggregate to national at the same time as sliding - now they call sum_groups_epi_df and then slide.

Logan's comments about disallowing user-supplied groups seem wise. I would suggest we first try grouping epi_dfs by default by key_colnames(epi_df, exclude = "time_value") (this should be geo_value + other_keys) and handling all upsampling/downsampling there. Given our restriction on time_types, we might want to have the user explicitly mark their "from time_type" and "to time_type" (starting to feel like geo_coding) to avoid the user up- and downsampling to types that we don't understand how to handle. One of your functions requires a "period" length, which is set to 7 for weekly; going up to monthly or yearly, that approach won't work, so we'll need special code paths for those time_types if we plan to support them.

@brookslogan (Contributor) commented Sep 27, 2024

idea[non-blocking]: As I thought about optimizations, I'm also realizing that our logic here depends on whether user is using Sun--Sat weeks, Mon--Sun weeks, or something more esoteric (more esoteric things have been used, I believe, in some source mirrored in covidcast). We could start off with Sun--Sat for US epiweeks but might want this to be a parameter. Also there's the question of what day of the week should represent the week in the output, or if we add support for tsibble::yearweek to represent; day_of_week_end sounds like a decent default at least given how CDC/Hubverse stuff looks.

@dshemetov (Contributor)

As for concerns about different conventions for wday(), it's good to be aware of these issues, but as long as we describe our convention in the docs and are consistent with whatever date-time library we use (whether as.POSIX or lubridate or other), I don't see what else there is to do.

idea[non-blocking]: As I thought about optimizations, I'm also realizing that our logic here depends on whether user is using Sun--Sat weeks, Mon--Sun weeks, or something more esoteric (more esoteric things have been used, I believe, in some source mirrored in covidcast). We could start off with Sun--Sat for US epiweeks but might want this to be a parameter. Also there's the question of what day of the week should represent the week in the output, or if we add support for tsibble::yearweek to represent; day_of_week_end sounds like a decent default at least given how CDC/Hubverse stuff looks.

The fully general solution here would require a general week-format library - epi_week_sequence(start, end, wday_start, at_least_n_days_in_week) - and I don't think we're prepared to write or support that. Given that the general solution is infeasible, we should make some common choices and just support those until we get another request. Probably the CDC epiweek and a basic "sum to every Monday" (without the epiweek rounding) seem good enough to me.

@brookslogan (Contributor) commented Sep 27, 2024

I realize David already has all the options I was talking about but forgot about while writing the data.table-ish approach: day_of_week and day_of_week_end. (For week numbering you may also need owning_wday if you don't take it to be day_of_week.) Guess these 2 args could be cut if it's too complicated.

select(-all_of(agg_columns)) %>%
rename_with(~ gsub("slide_value_", "", .x)) %>%
# only keep 1/week
filter(wday(time_value) == day_of_week_end) %>%
@brookslogan (Contributor) commented Sep 27, 2024

Suggested change
filter(wday(time_value) == day_of_week_end) %>%
filter(as.POSIXlt(time_value)$wday == day_of_week_end) %>%

If you want to ensure 0 = Sun, 6 = Sat. Could maybe call this lt_day_of_week_end; but as long as we avoid a bare wday in the arg name without a clarifying prefix/suffix (which I didn't realize you already did), that sounds good.

OR

Suggested change
filter(wday(time_value) == day_of_week_end) %>%
filter(wday(time_value, week_start = 1) == day_of_week_end) %>%

If you want to ensure 1 = Mon, 6 = Sat, 7 = Sun.

OR

Suggested change
filter(wday(time_value) == day_of_week_end) %>%
filter(wday(time_value, week_start = 7) == day_of_week_end) %>%

If you want to ensure 1 = Sun, 7 = Sat. [Update default day_of_week_end to match docs then. Note week_start = 7 is the default default but we need to pass it because user could have set the lubridate option to change the default.]
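
A quick illustration of why we need to pass week_start explicitly:

old <- options(lubridate.week.start = 1)  # user has changed the default
lubridate::wday(as.Date("2020-09-27"))    # 7 under this setting, not 1
lubridate::wday(as.Date("2020-09-27"), week_start = 7)  # always 1
options(old)                              # restore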

Contributor (Author):

Since you pointed out that the default is customizable, I was planning on just doing the 3rd one. Most of our collaborators/expected users are in the US, so defaulting to Sunday as the first day of the week is probably safest.

# only keep 1/week
filter(wday(time_value) == day_of_week_end) %>%
# switch time_value to the designated day of the week
mutate(time_value = time_value - 7L + day_of_week) %>%
@brookslogan (Contributor) commented Sep 28, 2024

issue?: this might need to be

Suggested change
mutate(time_value = time_value - 7L + day_of_week) %>%
mutate(time_value = time_value - (day_of_week_end - day_of_week) %% 7L) %>%

Probably easiest to construct a test case with day_of_week_end not Saturday & see if anything gives the right result. [And also one with day_of_week equal to day_of_week_end; I think I messed up in an earlier version of this suggestion for that case, but hopefully fixed.]

Contributor:

Did a little test, seems to back up current suggestion above.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
crossing(end_date = as.Date("2020-01-01") + 0:6, day_of_week = 1:7) %>%
  mutate(day_of_week_end = lubridate::wday(end_date, week_start = 7L)) %>%
  mutate(
    method1 = end_date - 7L + day_of_week,
    method2 = end_date - 7L + (day_of_week - day_of_week_end) %% 7L,
    method3 = end_date - (day_of_week_end - day_of_week) %% 7L
  ) %>%
  pivot_longer(method1:method3, names_to = "method", values_to = "date") %>%
  mutate(day_of_week_actual = lubridate::wday(date, week_start = 7L),
         right_day_of_week = day_of_week == day_of_week_actual,
         right_week = as.integer(end_date - date) %in% 0:6) %>%
  summarize(.by = "method", issues = sum(!right_day_of_week | !right_week))
#> # A tibble: 3 × 2
#>   method  issues
#>   <chr>    <int>
#> 1 method1     42
#> 2 method2      7
#> 3 method3      0

Created on 2024-09-28 with reprex v2.1.1

Contributor (Author):

Yeah, this definitely seems right; thanks for the catch.
