Epix rbind #343

dsweber2 · 2023-07-12T20:11:04Z

This simplifies binding two archives together, dealing with any problems caused by merging different sources before binding (see the tests for some examples that would cause the problem). In general, avoiding public accessing of internals when possible is a good idea.
The core behavior is well tested; it should probably have more edge case behavior (extra keys, strange types, etc) testing eventually.

There's also a very minor fix to the docs of epix_merge where a bulleted list was crammed into a single paragraph.

brookslogan

Thanks for wrapping this up in a utility function! I've made a bunch of comments which are a bit deep in the weeds; maybe don't look too much at those. I want to step back and ask/discuss design and naming, and get your insights/reflections after implementing this version. [marking "requested changes" maybe a bit premature, we need to discuss the design first]

Q1: Should we have this single function for both (i) combining archives for different-ish version ranges, and (ii) combining archives for different geo-time-other-key sets, or should (i) and (ii) be separate functions?

Q2: what sorts of helper features should we provide? If we combine (i) and (ii), some possibilities:

later archives can clobber data for versions >= clobberable_versions_start of the previous ones for overlapping geo-time-other-keys, and assume any version data for non-overlapping geo-time-other-keys doesn't count as clobbering --- makes sense as the default [This probably gets detail-y if we're trying to do (i) and (ii) at the same time with a single function call. And the wording here overlooks some cases: if a previous clobberable_versions_start is.na, then we should error on any version data <= versions_end for geo-time-other-key combinations that that previous archive covered.]
allow all clobbering, ignoring clobberable_versions_starts
later archives can clobber data for versions >= clobberable_versions_start of the previous ones. Any missing geo-time-other-key combinations in later versions means deletion; we can't directly represent this, so put in NA rows. Warn if later versions completely miss a geo-time-other-key combination (or maybe geo-other-key combinations?) present in the earlier archives.
[no clobbering allowed --- probably the simplest to implement! would make things work for various chunked downloads, but wouldn't work for cache updating unless we assume there's no clobbering]

I'm not sure if the current sync&force_distinct hit the behaviors I would expect. (Do they?)

Q3: what do we name things?

Q4: how do we present this all clearly in docs? I was a bit confused reading through the roxygen docs, and it was difficult trying to word things above. We've been thinking of introducing a name for geo+other keys, maybe "epikey", which might help. Plus maybe a few editing passes or something; I assume you had that planned already.

brookslogan · 2023-07-13T00:33:38Z

R/methods-epi_archive.R

+#' edge cases
+#'
+#' @param ... list of `epi_archive` objects to append in order.
+#' @param sync Optional; `"forbid"`, `"na"`, or `"locf"`; in the case that later


issue: the shared name with the epix_merge parameter seems confusing as it seems like it's doing something a bit different

suggestion: rename this parameter either

something like fill, like data.table::nafill (and also tidyr::complete, which is less similar), or maybe something slightly different so users don't assume they can pass a replacement value / list of replacement values like in those functions.

something more archive-specific, maybe involving the word "clobber" or "rewrite"

The naming of the choices below should also be revised. E.g., if it's fill then maybe "na" should be "keep".

Yeah, tracks. I'd rather not use fill directly since I have to call that function; maybe fill_na, since it only effects NAs? I think clobber fits better with force_distinct, and will probably be getting its own parameter based on other discussions. "keep" is definitely the right word for what's currently "na"

brookslogan · 2023-07-13T00:34:10Z

R/methods-epi_archive.R

+#' this function is still under active development, so there may be remaining
+#' edge cases
+#'
+#' @param ... list of `epi_archive` objects to append in order.


suggestion: delete "list of"
question: what order?

I think the order ends up mattering only if there are duplicate keys, where earlier ones overwrite later ones. The wording here could be a lot clearer

brookslogan · 2023-07-13T00:52:33Z

R/methods-epi_archive.R

+#'                 maintained (up to archival compression).
+#'   - `"locf"`: for every shared time value, use earlier versions of
+#'                   earlier archives to overwrite any `NA`'s found in later
+#'                   versions of later archives.


issue: I don't think we should allow "locf" here. Any explicit NAs in "later" archives represent revisions to NA or row removals (we can't distinguish between the two in the current archive representation).

I discussed the motivation somewhat in the main text, though it might make sense to move that conversation here. One way to mitigate the risk of overwriting meaningful NAs as well is to only locf on shared time values when the user specifically asks for it. Reflecting, there are possibly 3 levels to doing this writing:

overwrite all future NA's, which is what sync="locf" currently does

only overwrite NA's on shared time_value's

only overwrite when the first entry in a given archive is NA; I think this is probably the best approach
2 and 3 are both somewhat more expensive, and definitely more complicated (you have to split the data, then perform the write on the problem half, and then rejoin).

sync="locf" is addressing a fairly subtle issue, where one indicator has versions released in a given month while the other doesn't.

I think I'm starting to understand, but I think this situation probably would emerge only when a mistake has been made already in setting up the component archives, in epix_merge settings, or an epix_merge bug. If signal A has version data where signal B does not, signal B should only have NA filled in if it's before its first measurement for a given epikey-time, or by filling forward explicit NAs for that epikey-time. Is this the situation you're thinking of, and do you know if epix_merge has a bug where it's not doing this correctly?

Oh, just saw your great example. I agree we need something like this functionality, maybe by default.

Some contributors to the problem [maybe we could address later and then remove some choices in this function]:

Archives were originally intended to hold complete version histories.

~~The archive format not distinguishing explicit NAs vs. missing rows.~~ [nevermind, no, this is not connected. The issue is not being able to signal something like "out of range of the updates recorded in this partial archive for this signal/{epikey,signal}/{epikey,time_value,signal}......." / "its latest value prior to the earliest version covered by this archive"]

Solving the point above is messier with our current epix_merge() implementation. At some point I thought I had an issue to use a lazy epix_merge() which would just store the parameters and re-implement some base archive operations; that might be useful if we want to make this cleaner later. [might or might not be true, but I'm not sure we want to mandate a lazy merge to get a working format.]

brookslogan · 2023-07-13T01:10:10Z

R/methods-epi_archive.R

+  result_clobberable_versions_start <- if (all(is.na(clobberable_versions_start))) {
+    NA # (any type of NA is fine here)
+  } else {
+    max(clobberable_versions_start) # unlike the case of merging, we want the later date


musing: this seems like it could be a little off in edge cases. Hard to handle both aggregation across keys and across versions in the same function; should they be separate? Consider epix_rbind(archive1, archive2) where: archive1 has data for CA, clobberable starting with version 500, and archive2 has data for AS, clobberable starting with version 800. That's pretty edge-case-y, probably an error. But we might also have archive1b with CA data starting with version 500, clobberable starting with version 805. Then we might want something similar to the merge sync that says the clobberable versions start with version 800.

R/methods-epi_archive.R

brookslogan · 2023-07-13T01:31:03Z

R/methods-epi_archive.R

+#'   - `"forbid"`: emit an error;
+#'   - "na": use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, but ensure that, if we request a snapshot as of a version after `min(x$versions_end, y$versions_end)`, the observation columns from the less up-to-date archive will be all NAs (i.e., imagine there was an update immediately after its `versions_end` which revised all observations to be `NA`);
+#'   - `"locf"`: use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, allowing the last version of each observation to be carried forward to extrapolate unavailable versions for the less up-to-date input archive (i.e., imagining that in the less up-to-date archive's data set remained unchanged between its actual `versions_end` and the other archive's `versions_end`); or
+#'   - `"truncate"`: use `min(x$versions_end, y$versions_end)` as the result's `versions_end`, and discard any rows containing update rows for later versions.


brookslogan · 2023-07-13T01:34:44Z

R/methods-epi_archive.R

+#'                   earlier archives to overwrite any `NA`'s found in later
+#'                   versions of later archives.
+#' @param force_distinct Optional; `TRUE`, `FALSE`, or `NULL`; should the keys
+#'   be forced to be distinct?


issue: I thought this meant something similar but not quite the same as "forbid", but I can't tell what the implementation is doing. Maybe it's related to the two options I described above? I think part of this is because I've mixed up "shared time_values" and "shared keys" when reading the docs. We probably want to think about design, naming, and documentation some more here.

R/methods-epi_archive.R

brookslogan · 2023-07-13T01:40:20Z

R/methods-epi_archive.R

+  unique_geo_times <- function(x) {
+    x %>%
+      select(geo_value, time_value) %>%
+      distinct()


suggestion: take advantage of DT operations that know the keys & indices for performance. maybe unique(x, by=c("geo_value, "time_value"))

(resolved the duplicate, but I'll make sure to include something along these lines)

brookslogan · 2023-07-13T01:51:37Z

R/methods-epi_archive.R

+    DT <- reduce(DTs, rbind)
+    # remove any redundant keys
+    if (force_distinct) {
+      DT <- distinct(DT, geo_value, time_value, version, .keep_all = TRUE)


issue: we need to prioritize later versions of versions here. could also use data.table ops

maybe something with
unique(DT, by=c("geo_value", "time_value", "version"), fromLast = TRUE)

dsweber2 · 2023-07-13T18:06:40Z

first off, thanks for the detailed feedback!

I think epikey is a great name, I get tired of writing (geo_value, time_value, version). As far as understanding what the function does, I would suggest running the tests. I wrote them to demonstrate a minimal example of the sort of problem this solves. It may be a good idea to actually include those in the roxygen rather than the current example.

dsweber2 · 2023-07-13T19:06:27Z

From a user perspective, I think it makes the most sense to combine the operations for different key subsets in the same function, and provide any different behavior via options. The core idea is the same, and ideally it "just works" without the user having to muck around with epikey details, warning them if there are strange things going on and allowing them to either confirm that's what they want or pointing them to what to fix.

On reflection, I'm less sure exactly how clobbering is working in the rest of the package. Is there a list of examples somewhere? Searching the documentation it seems like its primarily described in the reference for as_epi_archive. If I'm understanding correctly, I've written this so that force_distinct=TRUE treats each subsequent archive as clobberable by the previous archives, regardless of any clobberable_versions_start. This is inheriting from the way distinct() works (earlier entries take precedence), and seems to be the opposite of the way you're thinking about it (your order is probably more appropriate given the way clobbering works). This parameter probably needs a different name/more granularity.

I think force_distinct is working more or less the way you understand it (modulo reversing the precedence). sync="locf" is addressing a fairly subtle issue, where one indicator has versions released in a given month while the other doesn't. This can result in spurious NA's, which we want to overwrite using locf, but since later NA's can actually be meaningful we need other options.

I definitely could be clearer in the write-up.

brookslogan · 2023-07-13T20:34:56Z

Notes on skimming

I think epikey is a great name, I get tired of writing (geo_value, time_value, version)

To be clear, "epikey" or maybe "epigroup" (or maybe "unit", but there's been some dissent there) would be geo + other_keys, not time or version. Was this what you were thinking of, saying epikey-time-version instead of (geo_value, time_value, version)?

On reflection, I'm less sure exactly how clobbering is working in the rest of the package. Is there a list of examples somewhere?

It's not really used too much right now. There is a warning about nonreproducibility in epix_as_of and epix_slide if you use versions that are marked as clobberable, and some bookkeeping in epix_merge() and maybe some other places. I introduced the clobbering fields partially in anticipation of doing smart archive updating with a cache. E.g., consider this scenario:

On 2023-05-05 at 1am, we fetch all issue data available. Let's say version 2023-05-05 wasn't published at [1am] [but we do have some data for version 2023-05-04].
On 2023-05-05 at 3am, the DB replicas update. Turns out we hotfixed version 2023-05-04 data and it just got through to the replicas.
On 2023-05-05 at 8pm, the DB replicas update with data for version 2023-05-05.
On 2023-05-12, we want all issue data available.
- (A) We could re-request all issue data from the API.
- (B) We could try to match (A) more efficiently by requesting issue data beginning with clobberable_versions_start, and perform the clobbering by favoring these later versions of the version data. (~~By default, clobberable versions start would be 2023-05-04~~ [probably not anymore, sorry; I think we changed the default to NA --- "clobbering should be impossible"], and versions_end might be 2023-05-04 or 2023-05-05. So [If we had the correct clobberable_versions_start,] this approach would work to match (A).)
- (C) We could try to do something to keep around the versions of version data that are being clobbered... this might be useful for debugging historical bad forecasts when there were hotfixes to the version data, but seems complicated to think about.

where one indicator has versions released in a given month while the other doesn't

Sanity check: we wouldn't be using this function to combine multiple indicators, right? Just different epikey-time-version sets for the same indicators?

I think force_distinct is working more or less the way you understand it (modulo reversing the precedence).

This sounds like it will actually have to be TRUE for other archive operations to work properly. We currently assume epikey-time-version combinations in an archive are distinct as an invariant. That might change but not soon; we should probably do some validation soon.

dsweber2 · 2023-07-13T21:25:47Z

Was this what you were thinking of, saying epikey-time-version instead of (geo_value, time_value, version)?
Oh, I see, no I was thinking epikey as an alias for the (geo_value, time_value, version) set, since that's the default key for the archives. I don't really have strong opinions about this.

Sanity check: we wouldn't be using this function to combine multiple indicators, right? Just different epikey-time-version sets for the same indicators?

Right. But the problem occurs if you're rbinding the results from several epix_merges, like if you're updating an archive that combines from different sources: yoinking the relevant lines from the test

first_merge <- epix_merge(x1,y1) # old data in an already existing archive
second_merge <- epix_merge(x2,y2) # new data to be added
epix_rbind(first_merge, second_merge,sync = "locf") # overwrites nonsense NA's from forming second_merge

This sounds like it will actually have to be TRUE for other archive operations to work properly.
I was a little unclear about the behavior. If force_distinct is false, as_epi_archive will throw an error if there are duplicate keys. So either way, if it returns (rather than erroring), the keys are distinct, if there are non-distinct keys, its just a question of whether the user gets rid of them before calling epix_rbind or just allows the overwrite to go through as per default.

Thanks for the clobbering example. Not sure how to handle it here, but it does seem like a relevant feature.

dsweber2 · 2023-07-13T22:07:15Z

dsweber2 · 2023-10-26T18:41:53Z

related: #358

dsweber2 added 2 commits July 12, 2023 11:56

feat: initial epix_rbind and test

99e8860

feat: package version passes tests, doc fixes

ad9105b

dsweber2 requested review from dajmcdon and brookslogan as code owners July 12, 2023 20:11

dsweber2 added 4 commits July 12, 2023 13:49

fix: get remote tests functional

18436b1

fix: but actually working remote this time

68b33e9

fix: roxygen example was wrong

489d924

fix: document force_distinct, Abort->abort

4abc409

brookslogan requested changes Jul 13, 2023

View reviewed changes

fix: partial variable rename (possibly_redundant->DT)

fbbafe2

fix: various Logan catches

70c76c7

brookslogan mentioned this pull request Jul 26, 2023

(How) should we explicitly support partial version histories? #352

Open

brookslogan mentioned this pull request Oct 27, 2023

Consider disk-backed, caching, derived epi_archives / targets/etc. interop #358

Open

dsweber2 added this to the Epiprocess Issue Triage milestone Oct 31, 2023

brookslogan changed the base branch from dev to main November 15, 2023 23:59

brookslogan marked this pull request as draft November 30, 2023 00:45

dsweber2 added the blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete label Dec 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Epix rbind #343

Epix rbind #343

dsweber2 commented Jul 12, 2023

brookslogan left a comment •

edited

Loading

brookslogan Jul 13, 2023

brookslogan Jul 13, 2023

dsweber2 Jul 13, 2023

brookslogan Jul 13, 2023

dsweber2 Jul 13, 2023

brookslogan Jul 13, 2023

dsweber2 Jul 13, 2023

brookslogan Jul 21, 2023 •

edited

Loading

brookslogan Jul 21, 2023 •

edited

Loading

brookslogan Jul 13, 2023

brookslogan Jul 13, 2023

brookslogan Jul 13, 2023

brookslogan Jul 13, 2023

dsweber2 Jul 13, 2023

brookslogan Jul 13, 2023

dsweber2 commented Jul 13, 2023

dsweber2 commented Jul 13, 2023

brookslogan commented Jul 13, 2023 •

edited

Loading

dsweber2 commented Jul 13, 2023

dsweber2 commented Jul 13, 2023 •

edited by brookslogan

Loading

dsweber2 commented Oct 26, 2023

Epix rbind #343

Are you sure you want to change the base?

Epix rbind #343

Conversation

dsweber2 commented Jul 12, 2023

brookslogan left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brookslogan Jul 21, 2023 • edited Loading

Choose a reason for hiding this comment

brookslogan Jul 21, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dsweber2 commented Jul 13, 2023

dsweber2 commented Jul 13, 2023

brookslogan commented Jul 13, 2023 • edited Loading

dsweber2 commented Jul 13, 2023

dsweber2 commented Jul 13, 2023 • edited by brookslogan Loading

dsweber2 commented Oct 26, 2023

brookslogan left a comment •

edited

Loading

brookslogan Jul 21, 2023 •

edited

Loading

brookslogan Jul 21, 2023 •

edited

Loading

brookslogan commented Jul 13, 2023 •

edited

Loading

dsweber2 commented Jul 13, 2023 •

edited by brookslogan

Loading