Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epix rbind #343

Draft
wants to merge 8 commits into
base: dev
Choose a base branch
from
Draft

Epix rbind #343

wants to merge 8 commits into from

Conversation

dsweber2
Copy link
Contributor

This simplifies binding two archives together, dealing with any problems caused by merging different sources before binding (see the tests for some examples that would cause the problem). In general, avoiding public accessing of internals when possible is a good idea.
The core behavior is well tested; it should probably have more edge case behavior (extra keys, strange types, etc) testing eventually.

There's also a very minor fix to the docs of epix_merge where a bulleted list was crammed into a single paragraph.

Copy link
Contributor

@brookslogan brookslogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for wrapping this up in a utility function! I've made a bunch of comments which are a bit deep in the weeds; maybe don't look too much at those. I want to step back and ask/discuss design and naming, and get your insights/reflections after implementing this version. [marking "requested changes" maybe a bit premature, we need to discuss the design first]

Q1: Should we have this single function for both (i) combining archives for different-ish version ranges, and (ii) combining archives for different geo-time-other-key sets, or should (i) and (ii) be separate functions?

Q2: what sorts of helper features should we provide? If we combine (i) and (ii), some possibilities:

  • later archives can clobber data for versions >= clobberable_versions_start of the previous ones for overlapping geo-time-other-keys, and assume any version data for non-overlapping geo-time-other-keys doesn't count as clobbering --- makes sense as the default [This probably gets detail-y if we're trying to do (i) and (ii) at the same time with a single function call. And the wording here overlooks some cases: if a previous clobberable_versions_start is.na, then we should error on any version data <= versions_end for geo-time-other-key combinations that that previous archive covered.]
  • allow all clobbering, ignoring clobberable_versions_starts
  • later archives can clobber data for versions >= clobberable_versions_start of the previous ones. Any missing geo-time-other-key combinations in later versions means deletion; we can't directly represent this, so put in NA rows. Warn if later versions completely miss a geo-time-other-key combination (or maybe geo-other-key combinations?) present in the earlier archives.
  • [no clobbering allowed --- probably the simplest to implement! would make things work for various chunked downloads, but wouldn't work for cache updating unless we assume there's no clobbering]

I'm not sure if the current sync&force_distinct hit the behaviors I would expect. (Do they?)

Q3: what do we name things?

Q4: how do we present this all clearly in docs? I was a bit confused reading through the roxygen docs, and it was difficult trying to word things above. We've been thinking of introducing a name for geo+other keys, maybe "epikey", which might help. Plus maybe a few editing passes or something; I assume you had that planned already.

#' edge cases
#'
#' @param ... list of `epi_archive` objects to append in order.
#' @param sync Optional; `"forbid"`, `"na"`, or `"locf"`; in the case that later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: the shared name with the epix_merge parameter seems confusing as it seems like it's doing something a bit different

suggestion: rename this parameter either

  • something like fill, like data.table::nafill (and also tidyr::complete, which is less similar), or maybe something slightly different so users don't assume they can pass a replacement value / list of replacement values like in those functions.
  • something more archive-specific, maybe involving the word "clobber" or "rewrite"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of the choices below should also be revised. E.g., if it's fill then maybe "na" should be "keep".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, tracks. I'd rather not use fill directly since I have to call that function; maybe fill_na, since it only effects NAs? I think clobber fits better with force_distinct, and will probably be getting its own parameter based on other discussions. "keep" is definitely the right word for what's currently "na"

#' this function is still under active development, so there may be remaining
#' edge cases
#'
#' @param ... list of `epi_archive` objects to append in order.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: delete "list of"
question: what order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the order ends up mattering only if there are duplicate keys, where earlier ones overwrite later ones. The wording here could be a lot clearer

#' maintained (up to archival compression).
#' - `"locf"`: for every shared time value, use earlier versions of
#' earlier archives to overwrite any `NA`'s found in later
#' versions of later archives.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: I don't think we should allow "locf" here. Any explicit NAs in "later" archives represent revisions to NA or row removals (we can't distinguish between the two in the current archive representation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed the motivation somewhat in the main text, though it might make sense to move that conversation here. One way to mitigate the risk of overwriting meaningful NAs as well is to only locf on shared time values when the user specifically asks for it. Reflecting, there are possibly 3 levels to doing this writing:

  1. overwrite all future NA's, which is what sync="locf" currently does
  2. only overwrite NA's on shared time_value's
  3. only overwrite when the first entry in a given archive is NA; I think this is probably the best approach
    2 and 3 are both somewhat more expensive, and definitely more complicated (you have to split the data, then perform the write on the problem half, and then rejoin).

Copy link
Contributor

@brookslogan brookslogan Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sync="locf" is addressing a fairly subtle issue, where one indicator has versions released in a given month while the other doesn't.

I think I'm starting to understand, but I think this situation probably would emerge only when a mistake has been made already in setting up the component archives, in epix_merge settings, or an epix_merge bug. If signal A has version data where signal B does not, signal B should only have NA filled in if it's before its first measurement for a given epikey-time, or by filling forward explicit NAs for that epikey-time. Is this the situation you're thinking of, and do you know if epix_merge has a bug where it's not doing this correctly?

Copy link
Contributor

@brookslogan brookslogan Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just saw your great example. I agree we need something like this functionality, maybe by default.

Some contributors to the problem [maybe we could address later and then remove some choices in this function]:

  • Archives were originally intended to hold complete version histories.
  • The archive format not distinguishing explicit NAs vs. missing rows. [nevermind, no, this is not connected. The issue is not being able to signal something like "out of range of the updates recorded in this partial archive for this signal/{epikey,signal}/{epikey,time_value,signal}......." / "its latest value prior to the earliest version covered by this archive"]
  • Solving the point above is messier with our current epix_merge() implementation. At some point I thought I had an issue to use a lazy epix_merge() which would just store the parameters and re-implement some base archive operations; that might be useful if we want to make this cleaner later. [might or might not be true, but I'm not sure we want to mandate a lazy merge to get a working format.]

result_clobberable_versions_start <- if (all(is.na(clobberable_versions_start))) {
NA # (any type of NA is fine here)
} else {
max(clobberable_versions_start) # unlike the case of merging, we want the later date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

musing: this seems like it could be a little off in edge cases. Hard to handle both aggregation across keys and across versions in the same function; should they be separate? Consider epix_rbind(archive1, archive2) where: archive1 has data for CA, clobberable starting with version 500, and archive2 has data for AS, clobberable starting with version 800. That's pretty edge-case-y, probably an error. But we might also have archive1b with CA data starting with version 500, clobberable starting with version 805. Then we might want something similar to the merge sync that says the clobberable versions start with version 800.

R/methods-epi_archive.R Outdated Show resolved Hide resolved
#' - `"forbid"`: emit an error;
#' - "na": use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, but ensure that, if we request a snapshot as of a version after `min(x$versions_end, y$versions_end)`, the observation columns from the less up-to-date archive will be all NAs (i.e., imagine there was an update immediately after its `versions_end` which revised all observations to be `NA`);
#' - `"locf"`: use `max(x$versions_end, y$versions_end)` as the result's `versions_end`, allowing the last version of each observation to be carried forward to extrapolate unavailable versions for the less up-to-date input archive (i.e., imagining that in the less up-to-date archive's data set remained unchanged between its actual `versions_end` and the other archive's `versions_end`); or
#' - `"truncate"`: use `min(x$versions_end, y$versions_end)` as the result's `versions_end`, and discard any rows containing update rows for later versions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#' earlier archives to overwrite any `NA`'s found in later
#' versions of later archives.
#' @param force_distinct Optional; `TRUE`, `FALSE`, or `NULL`; should the keys
#' be forced to be distinct?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: I thought this meant something similar but not quite the same as "forbid", but I can't tell what the implementation is doing. Maybe it's related to the two options I described above? I think part of this is because I've mixed up "shared time_values" and "shared keys" when reading the docs. We probably want to think about design, naming, and documentation some more here.

R/methods-epi_archive.R Show resolved Hide resolved
unique_geo_times <- function(x) {
x %>%
select(geo_value, time_value) %>%
distinct()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: take advantage of DT operations that know the keys & indices for performance. maybe unique(x, by=c("geo_value, "time_value"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(resolved the duplicate, but I'll make sure to include something along these lines)

DT <- reduce(DTs, rbind)
# remove any redundant keys
if (force_distinct) {
DT <- distinct(DT, geo_value, time_value, version, .keep_all = TRUE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: we need to prioritize later versions of versions here. could also use data.table ops

maybe something with
unique(DT, by=c("geo_value", "time_value", "version"), fromLast = TRUE)

@dsweber2
Copy link
Contributor Author

first off, thanks for the detailed feedback!

I think epikey is a great name, I get tired of writing (geo_value, time_value, version). As far as understanding what the function does, I would suggest running the tests. I wrote them to demonstrate a minimal example of the sort of problem this solves. It may be a good idea to actually include those in the roxygen rather than the current example.

@dsweber2
Copy link
Contributor Author

From a user perspective, I think it makes the most sense to combine the operations for different key subsets in the same function, and provide any different behavior via options. The core idea is the same, and ideally it "just works" without the user having to muck around with epikey details, warning them if there are strange things going on and allowing them to either confirm that's what they want or pointing them to what to fix.

On reflection, I'm less sure exactly how clobbering is working in the rest of the package. Is there a list of examples somewhere? Searching the documentation it seems like its primarily described in the reference for as_epi_archive. If I'm understanding correctly, I've written this so that force_distinct=TRUE treats each subsequent archive as clobberable by the previous archives, regardless of any clobberable_versions_start. This is inheriting from the way distinct() works (earlier entries take precedence), and seems to be the opposite of the way you're thinking about it (your order is probably more appropriate given the way clobbering works). This parameter probably needs a different name/more granularity.

I think force_distinct is working more or less the way you understand it (modulo reversing the precedence). sync="locf" is addressing a fairly subtle issue, where one indicator has versions released in a given month while the other doesn't. This can result in spurious NA's, which we want to overwrite using locf, but since later NA's can actually be meaningful we need other options.

I definitely could be clearer in the write-up.

@brookslogan
Copy link
Contributor

brookslogan commented Jul 13, 2023

Notes on skimming

I think epikey is a great name, I get tired of writing (geo_value, time_value, version)

To be clear, "epikey" or maybe "epigroup" (or maybe "unit", but there's been some dissent there) would be geo + other_keys, not time or version. Was this what you were thinking of, saying epikey-time-version instead of (geo_value, time_value, version)?

On reflection, I'm less sure exactly how clobbering is working in the rest of the package. Is there a list of examples somewhere?

It's not really used too much right now. There is a warning about nonreproducibility in epix_as_of and epix_slide if you use versions that are marked as clobberable, and some bookkeeping in epix_merge() and maybe some other places. I introduced the clobbering fields partially in anticipation of doing smart archive updating with a cache. E.g., consider this scenario:

  • On 2023-05-05 at 1am, we fetch all issue data available. Let's say version 2023-05-05 wasn't published at [1am] [but we do have some data for version 2023-05-04].
  • On 2023-05-05 at 3am, the DB replicas update. Turns out we hotfixed version 2023-05-04 data and it just got through to the replicas.
  • On 2023-05-05 at 8pm, the DB replicas update with data for version 2023-05-05.
  • On 2023-05-12, we want all issue data available.
    • (A) We could re-request all issue data from the API.
    • (B) We could try to match (A) more efficiently by requesting issue data beginning with clobberable_versions_start, and perform the clobbering by favoring these later versions of the version data. (By default, clobberable versions start would be 2023-05-04 [probably not anymore, sorry; I think we changed the default to NA --- "clobbering should be impossible"], and versions_end might be 2023-05-04 or 2023-05-05. So [If we had the correct clobberable_versions_start,] this approach would work to match (A).)
    • (C) We could try to do something to keep around the versions of version data that are being clobbered... this might be useful for debugging historical bad forecasts when there were hotfixes to the version data, but seems complicated to think about.

where one indicator has versions released in a given month while the other doesn't

Sanity check: we wouldn't be using this function to combine multiple indicators, right? Just different epikey-time-version sets for the same indicators?

I think force_distinct is working more or less the way you understand it (modulo reversing the precedence).

This sounds like it will actually have to be TRUE for other archive operations to work properly. We currently assume epikey-time-version combinations in an archive are distinct as an invariant. That might change but not soon; we should probably do some validation soon.

@dsweber2
Copy link
Contributor Author

Was this what you were thinking of, saying epikey-time-version instead of (geo_value, time_value, version)?
Oh, I see, no I was thinking epikey as an alias for the (geo_value, time_value, version) set, since that's the default key for the archives. I don't really have strong opinions about this.

Sanity check: we wouldn't be using this function to combine multiple indicators, right? Just different epikey-time-version sets for the same indicators?

Right. But the problem occurs if you're rbinding the results from several epix_merges, like if you're updating an archive that combines from different sources: yoinking the relevant lines from the test

first_merge <- epix_merge(x1,y1) # old data in an already existing archive
second_merge <- epix_merge(x2,y2) # new data to be added
epix_rbind(first_merge, second_merge,sync = "locf") # overwrites nonsense NA's from forming second_merge

This sounds like it will actually have to be TRUE for other archive operations to work properly.
I was a little unclear about the behavior. If force_distinct is false, as_epi_archive will throw an error if there are duplicate keys. So either way, if it returns (rather than erroring), the keys are distinct, if there are non-distinct keys, its just a question of whether the user gets rid of them before calling epix_rbind or just allows the overwrite to go through as per default.

Thanks for the clobbering example. Not sure how to handle it here, but it does seem like a relevant feature.

@dsweber2
Copy link
Contributor Author

dsweber2 commented Jul 13, 2023

So my todo for this at the moment, which will probably wait until after I have a functional forecaster: feel free to edit if you want to add/change things (I have a local copy)

@dsweber2
Copy link
Contributor Author

related: #358

@dsweber2 dsweber2 added this to the Epiprocess Issue Triage milestone Oct 31, 2023
@brookslogan brookslogan changed the base branch from dev to main November 15, 2023 23:59
@brookslogan brookslogan marked this pull request as draft November 30, 2023 00:45
@dsweber2 dsweber2 added the blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete label Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked-by-internal Can't/shouldn't proceed w/o internal action --- link to relevant Issue, rm label when complete
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants