Use nextstrain/ingest #32

Merged: 4 commits, Aug 21, 2023
13 changes: 13 additions & 0 deletions README.md
@@ -22,6 +22,19 @@ RSV sequences and metadata can be downloaded in the ```/ingest``` folder using
The ingest pipeline is based on the Nextstrain Monkeypox ingest (nextstrain.org/monkeypox/ingest).
Running the ingest pipeline produces ```ingest/data/{a and b}/metadata.tsv``` and ```ingest/data/{a and b}/sequences.fasta```.

#### `ingest/vendored`

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of the ingest scripts in `ingest/vendored`, which are pulled from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, first install `git subrepo`, then run:

```sh
git subrepo pull ingest/vendored
```

Changes should not be pushed using `git subrepo push`. Instead:

1. For pathogen-specific changes, make them in this repository via a pull request.
2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subrepo pull` to add those changes to this repository.
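
A typical update cycle, sketched below with a hypothetical branch name, pulls the upstream changes into a commit and opens a pull request as usual:

```sh
# Sketch of a routine vendored-scripts update, run from the repository root.
git checkout -b update-vendored       # hypothetical branch name
git subrepo pull ingest/vendored      # fetches upstream and records a commit
git push --set-upstream origin update-vendored
```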


## Run Analysis Pipeline

16 changes: 16 additions & 0 deletions ingest/vendored/.github/pull_request_template.md
@@ -0,0 +1,16 @@
### Description of proposed changes

<!-- What is the goal of this pull request? What does this pull request change? -->

### Related issue(s)

<!-- Link any related issues here. -->

### Checklist

<!-- Make sure checks are successful at the bottom of the PR. -->

- [ ] Checks pass
- [ ] If adding a script, add an entry for it in the README.

<!-- 🙌 Thank you for contributing to Nextstrain! ✨ -->
13 changes: 13 additions & 0 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,13 @@
name: CI

on:
  - push
  - pull_request
  - workflow_dispatch

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master
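
For a rough local equivalent of this job, run shellcheck directly over the vendored shell scripts. The explicit file list below is an assumption for illustration; the shared action discovers shell scripts on its own.

```sh
# Approximate local run of the shellcheck CI job; the file selection here
# is an assumption, since the shared action locates scripts itself.
shellcheck ingest/vendored/notify-* \
           ingest/vendored/download-from-s3 \
           ingest/vendored/upload-to-s3
```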
12 changes: 12 additions & 0 deletions ingest/vendored/.gitrepo
@@ -0,0 +1,12 @@
; DO NOT EDIT (unless you know what you are doing)
;
; This subdirectory is a git "subrepo", and this file is maintained by the
; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
;
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = 1eb8b30428d5f66adac201f0a246a7ab4bdc9792
parent = 9f6b59f1ce418d9e5bdd1c4e0bbf5a070d15072e
method = merge
cmdver = 0.4.6
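
This file is written and updated by `git subrepo` itself and normally should not be edited by hand; the vendored state it records can be inspected with:

```sh
# Show the subrepo's remote, branch, and last pulled commit
# as recorded in ingest/vendored/.gitrepo.
git subrepo status ingest/vendored
```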
6 changes: 6 additions & 0 deletions ingest/vendored/.shellcheckrc
@@ -0,0 +1,6 @@
# Use of this file requires Shellcheck v0.7.0 or newer.
#
# SC2064 - We intentionally want variables to expand immediately within traps
# so the trap can not fail due to variable interpolation later.
#
disable=SC2064
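
For context, SC2064 warns that a double-quoted trap string expands when the trap is set, not when it fires. That immediate expansion is exactly what these scripts rely on, as in this minimal sketch:

```sh
# Minimal sketch of the intentional pattern. Double quotes expand "$tmp"
# immediately, so the trap removes the right file even if the variable is
# later reassigned or unset.
tmp="$(mktemp)"
trap "rm -f '$tmp'" EXIT
```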
91 changes: 91 additions & 0 deletions ingest/vendored/README.md
@@ -0,0 +1,91 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subrepo`.

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## Vendoring

Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3)

If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
Then add the latest ingest scripts to the pathogen repo by running:

```
git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
```

Any future updates of ingest scripts can be pulled in with:

```
git subrepo pull ingest/vendored
```

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged through
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
through copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was created.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
  If the S3 object's metadata does not include `recordcount`, the script downloads the object and counts lines locally; this fallback only supports `xz`-compressed S3 objects.
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`.
  A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
Adds the following user defined metadata to uploaded S3 object:
- `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
- `recordcount` - the line count of the file
- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
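
Taken together, the two S3 helpers form a round trip; a usage sketch (the bucket and keys are illustrative, and `upload-to-s3`'s full option set is not shown here):

```sh
# Illustrative round trip; the bucket and paths are hypothetical.
# Compression is inferred from the S3 URL's file extension.
./ingest/vendored/upload-to-s3 data/metadata.tsv s3://example-bucket/metadata.tsv.zst
./ingest/vendored/download-from-s3 s3://example-bucket/metadata.tsv.zst data/metadata.tsv
```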

Potential augur curate scripts

- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
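
These curate scripts operate on NDJSON streams and are meant to be chained; a hypothetical pipeline (the flag names below are assumptions for illustration):

```sh
# Hypothetical NDJSON curation pipeline; flag names are assumptions.
cat records.ndjson \
    | ./ingest/vendored/transform-field-names --field-map "collected=date" \
    | ./ingest/vendored/apply-geolocation-rules --geolocation-rules geolocation-rules.tsv \
    > curated.ndjson
```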
ingest/vendored/apply-geolocation-rules
@@ -32,7 +32,7 @@ def load_geolocation_rules(geolocation_rules_file):
     with open(geolocation_rules_file, 'r') as rules_fh:
         for line in rules_fh:
             # ignore comments
-            if line.lstrip()[0] == '#':
+            if line.strip()=="" or line.lstrip()[0] == '#':
                 continue
 
             row = line.strip('\n').split('\t')
@@ -222,7 +222,7 @@ if __name__ == '__main__':
         record = json.loads(record)
 
         try:
-            annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
+            annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
         except CyclicGeolocationRulesError as e:
             print(e, file=stderr)
             exit(1)
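
These two changes make the script skip blank rule lines and tolerate records missing a location field. A quick smoke test of the latter (the input record, rules file, and flag name are illustrative):

```sh
# A record missing some location fields should no longer raise a KeyError
# after this change; the missing fields default to empty strings.
echo '{"country": "Canada"}' \
    | ./ingest/vendored/apply-geolocation-rules --geolocation-rules geolocation-rules.tsv
```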
File renamed without changes.
48 changes: 48 additions & 0 deletions ingest/vendored/download-from-s3
@@ -0,0 +1,48 @@
#!/bin/bash
set -euo pipefail

bin="$(dirname "$0")"

main() {
    local src="${1:?A source s3:// URL is required as the first argument.}"
    local dst="${2:?A destination file path is required as the second argument.}"
    # How many lines to subsample to. 0 means no subsampling. Optional.
    # It is not advised to use this for actual subsampling! This is intended to be
    # used for debugging workflows with large datasets such as ncov-ingest as
    # described in https://github.com/nextstrain/ncov-ingest/pull/367

    # Uses `tsv-sample` to subsample, so it will not work as expected with files
    # that have a single record split across multiple lines (i.e. FASTA sequences)
    local n="${3:-0}"

    local s3path="${src#s3://}"
    local bucket="${s3path%%/*}"
    local key="${s3path#*/}"

    local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
    dst_hash="$("$bin/sha256sum" < "$dst" || true)"
    src_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

    echo "[ INFO] Downloading $src → $dst"
    if [[ $src_hash != "$dst_hash" ]]; then
        aws s3 cp --no-progress "$src" - |
            if [[ "$src" == *.gz ]]; then
                gunzip -cfq
            elif [[ "$src" == *.xz ]]; then
                xz -T0 -dcq
            elif [[ "$src" == *.zst ]]; then
                zstd -T0 -dcq
            else
                cat
            fi |
            if [[ "$n" -gt 0 ]]; then
                tsv-sample -H -i -n "$n"
            else
                cat
            fi >"$dst"
    else
        echo "[ INFO] Files are identical, skipping download"
    fi
}

main "$@"
File renamed without changes.
35 changes: 35 additions & 0 deletions ingest/vendored/notify-on-diff
@@ -0,0 +1,35 @@
#!/bin/bash

set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"

src="${1:?A source file is required as the first argument.}"
dst="${2:?A destination s3:// URL is required as the second argument.}"

dst_local="$(mktemp -t s3-file-XXXXXX)"
diff="$(mktemp -t diff-XXXXXX)"

trap "rm -f '$dst_local' '$diff'" EXIT

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0

"$bin"/download-from-s3 "$dst" "$dst_local"

# diff's exit code is 0 for no differences, 1 for differences found, and >1 for errors
diff_exit_code=0
diff "$dst_local" "$src" > "$diff" || diff_exit_code=$?

if [[ "$diff_exit_code" -eq 1 ]]; then
echo "Notifying Slack about diff."
"$bin"/notify-slack --upload "$src.diff" < "$diff"
elif [[ "$diff_exit_code" -gt 1 ]]; then
echo "Notifying Slack about diff failure"
"$bin"/notify-slack "Diff failed for $src"
else
echo "No change in $src."
fi
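
A usage sketch (the token, channel, and S3 URL are illustrative placeholders):

```sh
export SLACK_TOKEN="xoxb-replace-me"   # illustrative placeholder
export SLACK_CHANNELS="#scratch"       # illustrative placeholder
# Diff a freshly built file against its uploaded copy; posts only on changes.
./ingest/vendored/notify-on-diff data/metadata.tsv s3://example-bucket/metadata.tsv.zst
```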
23 changes: 23 additions & 0 deletions ingest/vendored/notify-on-job-fail
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"

echo "Notifying Slack about failed ${job_name} job."
message="❌ ${job_name} job has FAILED 😞 "

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+="See AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>) for error details. "
elif [[ -n "${GITHUB_RUN_ID}" ]]; then
message+="See GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details. "
fi

"$bin"/notify-slack "$message"
27 changes: 27 additions & 0 deletions ingest/vendored/notify-on-job-start
@@ -0,0 +1,27 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"
build_dir="${3:-ingest}"

echo "Notifying Slack about started ${job_name} job."
message="${job_name} job has started."

if [[ -n "${GITHUB_RUN_ID}" ]]; then
message+=" The job was submitted by GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
fi

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+=" The job was launched as AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>)."
message+=" Follow along in your local clone of ${github_repo} with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ${build_dir}"'```'
fi

"$bin"/notify-slack "$message"
53 changes: 53 additions & 0 deletions ingest/vendored/notify-on-record-change
@@ -0,0 +1,53 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"

src="${1:?A source ndjson file is required as the first argument.}"
dst="${2:?A destination ndjson s3:// URL is required as the second argument.}"
source_name=${3:?A record source name is required as the third argument.}

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0

s3path="${dst#s3://}"
bucket="${s3path%%/*}"
key="${s3path#*/}"

src_record_count="$(wc -l < "$src")"

# Try getting record count from S3 object metadata
dst_record_count="$(aws s3api head-object --bucket "$bucket" --key "$key" --query "Metadata.recordcount || ''" --output text 2>/dev/null || true)"
if [[ -z "$dst_record_count" ]]; then
# This object doesn't have the record count stored as metadata
# We have to download it and count the lines locally
dst_record_count="$(wc -l < <(aws s3 cp --no-progress "$dst" - | xz -T0 -dcfq))"
fi

added_records="$(( src_record_count - dst_record_count ))"

printf "%'4d %s\n" "$src_record_count" "$src"
printf "%'4d %s\n" "$dst_record_count" "$dst"
printf "%'4d added records\n" "$added_records"

slack_message=""

if [[ $added_records -gt 0 ]]; then
    echo "Notifying Slack about added records (n=$added_records)"
    slack_message="📈 New records (n=$added_records) found on $source_name."

elif [[ $added_records -lt 0 ]]; then
    echo "Notifying Slack about fewer records (n=$added_records)"
    slack_message="📉 Fewer records (n=$added_records) found on $source_name."

else
    echo "Notifying Slack about same number of records"
    slack_message="⛔ No new records found on $source_name."
fi

slack_message+=" (Total record count: $src_record_count)"

"$bin"/notify-slack "$slack_message"