Skip to content

edm‐publishing

Jack Rosacker edited this page Aug 8, 2024 · 3 revisions

edm-publishing is our Digital Ocean Space for managing the data we actually build, qa, package, and/or distribute

DE Data Products

aka the db-{product name} folders

Each of DE's data products (with the exception of KPDB) has a corresponding top-level folder in edm-publishing. These folders contain the build outputs for their respective data products. With some exceptions, they contain outputs in 3 stages of the products lifecycle. These three stages are represented in the folder structure.

Folder Path to individual outputs Description/Rationale
build build/{build_name}/ When a dataset build is run, the outputs end up here. Build name, the unique key for a build output, corresponds to build environment, and is generally the name of the branch that this build was run on, though a value can be provided manually when running a build (or has a special name, such as our "nightly_qa" action run on main.

Builds are somewhat transient - there's no contract on them living for any amount of time, and our utilities will overwrite them
draft draft/{version}/{revision}/ Drafts are builds that we've promoted as to mark them ready for QA. Their unique key is now not their build environment/name, but instead is comprised of

1. the "official" version, which if this dataset is published will be its unique identifier to the world, and
2. the "revision". Our utilities generate this revision number, so they will always start at 1 and increment. In addition to the revision number, they have an optional description ("1", "2-fix-null-bbls", etc).

The latest revision of a draft should always be considered the only current "valid" draft.
Drafts are immutable - we will never delete or overwrite a draft.
publish publish/{version}
publish/{version}.{patch}
publish/latest
Once a draft has passed QA and is approved to be published/distributed/etc, we promote it from the draft folder to the publish folder, as well as the latest folder if applicable. If there is uncertainty to the freshness of what is in the latest folder, the version can be checked in build_metadata.json.

We also patch versions as needed. This rarely happens, but sometimes we make corrections to a dataset after its been published. Since published versions, like drafts, are immutable, we create a new patched version in this case. Essentially, this is just a new version, but as it's slightly specialized behavior it seemed worth highlighting. We haven't really discussed downstream ramifications of this (does a patched version label end up on Bytes? Or do we just replace the previous version and keep the patch number internal?