-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Epix rbind #343
base: dev
Are you sure you want to change the base?
Epix rbind #343
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for wrapping this up in a utility function! I've made a bunch of comments which are a bit deep in the weeds; maybe don't look too much at those. I want to step back and ask/discuss design and naming, and get your insights/reflections after implementing this version. [marking "requested changes" maybe a bit premature, we need to discuss the design first]
Q1: Should we have this single function for both (i) combining archives for different-ish version ranges, and (ii) combining archives for different geo-time-other-key sets, or should (i) and (ii) be separate functions?
Q2: what sorts of helper features should we provide? If we combine (i) and (ii), some possibilities:
- later archives can clobber data for versions >= clobberable_versions_start of the previous ones for overlapping geo-time-other-keys, and assume any version data for non-overlapping geo-time-other-keys doesn't count as clobbering --- makes sense as the default [This probably gets detail-y if we're trying to do (i) and (ii) at the same time with a single function call. And the wording here overlooks some cases: if a previous
clobberable_versions_start
is.na
, then we should error on any version data<= versions_end
for geo-time-other-key combinations that that previous archive covered.] - allow all clobbering, ignoring
clobberable_versions_start
s - later archives can clobber data for versions >= clobberable_versions_start of the previous ones. Any missing geo-time-other-key combinations in later versions means deletion; we can't directly represent this, so put in NA rows. Warn if later versions completely miss a geo-time-other-key combination (or maybe geo-other-key combinations?) present in the earlier archives.
- [no clobbering allowed --- probably the simplest to implement! would make things work for various chunked downloads, but wouldn't work for cache updating unless we assume there's no clobbering]
I'm not sure if the current sync
&force_distinct
hit the behaviors I would expect. (Do they?)
Q3: what do we name things?
Q4: how do we present this all clearly in docs? I was a bit confused reading through the roxygen docs, and it was difficult trying to word things above. We've been thinking of introducing a name for geo+other keys, maybe "epikey", which might help. Plus maybe a few editing passes or something; I assume you had that planned already.
#' edge cases | ||
#' | ||
#' @param ... list of `epi_archive` objects to append in order. | ||
#' @param sync Optional; `"forbid"`, `"na"`, or `"locf"`; in the case that later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: the shared name with the epix_merge
parameter seems confusing as it seems like it's doing something a bit different
suggestion: rename this parameter either
- something like
fill
, likedata.table::nafill
(and alsotidyr::complete
, which is less similar), or maybe something slightly different so users don't assume they can pass a replacement value / list of replacement values like in those functions. - something more archive-specific, maybe involving the word "clobber" or "rewrite"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of the choices below should also be revised. E.g., if it's fill
then maybe "na"
should be "keep"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, tracks. I'd rather not use fill
directly since I have to call that function; maybe fill_na
, since it only effects NA
s? I think clobber
fits better with force_distinct
, and will probably be getting its own parameter based on other discussions. "keep"
is definitely the right word for what's currently "na"
#' this function is still under active development, so there may be remaining | ||
#' edge cases | ||
#' | ||
#' @param ... list of `epi_archive` objects to append in order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: delete "list of"
question: what order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the order ends up mattering only if there are duplicate keys, where earlier ones overwrite later ones. The wording here could be a lot clearer
#' maintained (up to archival compression). | ||
#' - `"locf"`: for every shared time value, use earlier versions of | ||
#' earlier archives to overwrite any `NA`'s found in later | ||
#' versions of later archives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I don't think we should allow "locf"
here. Any explicit NAs in "later" archives represent revisions to NA or row removals (we can't distinguish between the two in the current archive representation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed the motivation somewhat in the main text, though it might make sense to move that conversation here. One way to mitigate the risk of overwriting meaningful NA
s as well is to only locf
on shared time values when the user specifically asks for it. Reflecting, there are possibly 3 levels to doing this writing:
- overwrite all future
NA
's, which is whatsync="locf"
currently does - only overwrite
NA
's on sharedtime_value
's - only overwrite when the first entry in a given archive is
NA
; I think this is probably the best approach
2 and 3 are both somewhat more expensive, and definitely more complicated (you have to split the data, then perform the write on the problem half, and then rejoin).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sync="locf" is addressing a fairly subtle issue, where one indicator has versions released in a given month while the other doesn't.
I think I'm starting to understand, but I think this situation probably would emerge only when a mistake has been made already in setting up the component archives, in epix_merge
settings, or an epix_merge
bug. If signal A has version data where signal B does not, signal B should only have NA filled in if it's before its first measurement for a given epikey-time, or by filling forward explicit NAs for that epikey-time. Is this the situation you're thinking of, and do you know if epix_merge
has a bug where it's not doing this correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, just saw your great example. I agree we need something like this functionality, maybe by default.
Some contributors to the problem [maybe we could address later and then remove some choices in this function]:
- Archives were originally intended to hold complete version histories.
The archive format not distinguishing explicit NAs vs. missing rows.[nevermind, no, this is not connected. The issue is not being able to signal something like "out of range of the updates recorded in this partial archive for this signal/{epikey,signal}/{epikey,time_value,signal}......." / "its latest value prior to the earliest version covered by this archive"]Solving the point above is messier with our current[might or might not be true, but I'm not sure we want to mandate a lazy merge to get a working format.]epix_merge()
implementation. At some point I thought I had an issue to use a lazyepix_merge()
which would just store the parameters and re-implement some base archive operations; that might be useful if we want to make this cleaner later.
result_clobberable_versions_start <- if (all(is.na(clobberable_versions_start))) { | ||
NA # (any type of NA is fine here) | ||
} else { | ||
max(clobberable_versions_start) # unlike the case of merging, we want the later date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
musing: this seems like it could be a little off in edge cases. Hard to handle both aggregation across keys and across versions in the same function; should they be separate? Consider epix_rbind(archive1, archive2)
where: archive1
has data for CA, clobberable starting with version 500, and archive2
has data for AS, clobberable starting with version 800. That's pretty edge-case-y, probably an error. But we might also have archive1b
with CA data starting with version 500, clobberable starting with version 805. Then we might want something similar to the merge sync that says the clobberable versions start with version 800.
#' - `"forbid"`: emit an error; | ||
#' - "na": use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, but ensure that, if we request a snapshot as of a version after `min(x$versions_end, y$versions_end)`, the observation columns from the less up-to-date archive will be all NAs (i.e., imagine there was an update immediately after its `versions_end` which revised all observations to be `NA`); | ||
#' - `"locf"`: use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, allowing the last version of each observation to be carried forward to extrapolate unavailable versions for the less up-to-date input archive (i.e., imagining that in the less up-to-date archive's data set remained unchanged between its actual `versions_end` and the other archive's `versions_end`); or | ||
#' - `"truncate"`: use `min(x$versions_end, y$versions_end)` as the result's `versions_end`, and discard any rows containing update rows for later versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✨
#' earlier archives to overwrite any `NA`'s found in later | ||
#' versions of later archives. | ||
#' @param force_distinct Optional; `TRUE`, `FALSE`, or `NULL`; should the keys | ||
#' be forced to be distinct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: I thought this meant something similar but not quite the same as "forbid"
, but I can't tell what the implementation is doing. Maybe it's related to the two options I described above? I think part of this is because I've mixed up "shared time_values" and "shared keys" when reading the docs. We probably want to think about design, naming, and documentation some more here.
unique_geo_times <- function(x) { | ||
x %>% | ||
select(geo_value, time_value) %>% | ||
distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: take advantage of DT operations that know the keys & indices for performance. maybe unique(x, by=c("geo_value, "time_value"))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(resolved the duplicate, but I'll make sure to include something along these lines)
DT <- reduce(DTs, rbind) | ||
# remove any redundant keys | ||
if (force_distinct) { | ||
DT <- distinct(DT, geo_value, time_value, version, .keep_all = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: we need to prioritize later versions of versions here. could also use data.table
ops
maybe something with
unique(DT, by=c("geo_value", "time_value", "version"), fromLast = TRUE)
first off, thanks for the detailed feedback! I think epikey is a great name, I get tired of writing |
From a user perspective, I think it makes the most sense to combine the operations for different key subsets in the same function, and provide any different behavior via options. The core idea is the same, and ideally it "just works" without the user having to muck around with epikey details, warning them if there are strange things going on and allowing them to either confirm that's what they want or pointing them to what to fix. On reflection, I'm less sure exactly how clobbering is working in the rest of the package. Is there a list of examples somewhere? Searching the documentation it seems like its primarily described in the reference for as_epi_archive. If I'm understanding correctly, I've written this so that I think I definitely could be clearer in the write-up. |
Notes on skimming
To be clear, "epikey" or maybe "epigroup" (or maybe "unit", but there's been some dissent there) would be geo + other_keys, not time or version. Was this what you were thinking of, saying epikey-time-version instead of (geo_value, time_value, version)?
It's not really used too much right now. There is a warning about nonreproducibility in
Sanity check: we wouldn't be using this function to combine multiple indicators, right? Just different epikey-time-version sets for the same indicators?
This sounds like it will actually have to be |
Right. But the problem occurs if you're
Thanks for the clobbering example. Not sure how to handle it here, but it does seem like a relevant feature. |
So my todo for this at the moment, which will probably wait until after I have a functional forecaster: feel free to edit if you want to add/change things (I have a local copy)
|
related: #358 |
This simplifies binding two archives together, dealing with any problems caused by merging different sources before binding (see the tests for some examples that would cause the problem). In general, avoiding public accessing of internals when possible is a good idea.
The core behavior is well tested; it should probably have more edge case behavior (extra keys, strange types, etc) testing eventually.
There's also a very minor fix to the docs of
epix_merge
where a bulleted list was crammed into a single paragraph.