Use nextstrain/ingest #32

Merged: 4 commits, Aug 21, 2023
13 changes: 13 additions & 0 deletions README.md
@@ -22,6 +22,19 @@ RSV sequences and metadata can be downloaded in the ```/ingest``` folder using
The ingest pipeline is based on the Nextstrain Monkeypox ingest (nextstrain.org/monkeypox/ingest).
Running the ingest pipeline produces ```ingest/data/{a and b}/metadata.tsv``` and ```ingest/data/{a and b}/sequences.fasta```.

#### `ingest/vendored`

This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of the ingest scripts in `ingest/vendored`, which are pulled from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, first install `git subrepo`, then run:

```sh
git subrepo pull ingest/vendored
```

Changes should not be pushed using `git subrepo push`. Instead:

1. For pathogen-specific changes, make them in this repository via a pull request.
2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subrepo pull` to add those changes to this repository.
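
A typical update cycle, sketched below with a hypothetical branch name, pulls the upstream changes into a commit and opens a pull request as usual:

```sh
# Sketch of a routine vendored-scripts update, run from the repository root.
git checkout -b update-vendored       # hypothetical branch name
git subrepo pull ingest/vendored      # fetches upstream and records a commit
git push --set-upstream origin update-vendored
```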


## Run Analysis Pipeline

16 changes: 16 additions & 0 deletions ingest/vendored/.github/pull_request_template.md
@@ -0,0 +1,16 @@
### Description of proposed changes

<!-- What is the goal of this pull request? What does this pull request change? -->

### Related issue(s)

<!-- Link any related issues here. -->

### Checklist

<!-- Make sure checks are successful at the bottom of the PR. -->

- [ ] Checks pass
- [ ] If adding a script, add an entry for it in the README.

<!-- 🙌 Thank you for contributing to Nextstrain! ✨ -->
13 changes: 13 additions & 0 deletions ingest/vendored/.github/workflows/ci.yaml
@@ -0,0 +1,13 @@
name: CI

on:
  - push
  - pull_request
  - workflow_dispatch

jobs:
  shellcheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: nextstrain/.github/actions/shellcheck@master
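
For a rough local equivalent of this job, run shellcheck directly over the vendored shell scripts. The explicit file list below is an assumption for illustration; the shared action discovers shell scripts on its own.

```sh
# Approximate local run of the shellcheck CI job; the file selection here
# is an assumption, since the shared action locates scripts itself.
shellcheck ingest/vendored/notify-* \
           ingest/vendored/download-from-s3 \
           ingest/vendored/upload-to-s3
```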
12 changes: 12 additions & 0 deletions ingest/vendored/.gitrepo
@@ -0,0 +1,12 @@
; DO NOT EDIT (unless you know what you are doing)
;
; This subdirectory is a git "subrepo", and this file is maintained by the
; git-subrepo command. See https://github.com/ingydotnet/git-subrepo#readme
;
[subrepo]
remote = https://github.com/nextstrain/ingest
branch = main
commit = 1eb8b30428d5f66adac201f0a246a7ab4bdc9792
parent = 9f6b59f1ce418d9e5bdd1c4e0bbf5a070d15072e
method = merge
cmdver = 0.4.6
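
This file is written and updated by `git subrepo` itself and normally should not be edited by hand; the vendored state it records can be inspected with:

```sh
# Show the subrepo's remote, branch, and last pulled commit
# as recorded in ingest/vendored/.gitrepo.
git subrepo status ingest/vendored
```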
6 changes: 6 additions & 0 deletions ingest/vendored/.shellcheckrc
@@ -0,0 +1,6 @@
# Use of this file requires Shellcheck v0.7.0 or newer.
#
# SC2064 - We intentionally want variables to expand immediately within traps
# so the trap can not fail due to variable interpolation later.
#
disable=SC2064
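
For context, SC2064 warns that a double-quoted trap string expands when the trap is set, not when it fires. That immediate expansion is exactly what these scripts rely on, as in this minimal sketch:

```sh
# Minimal sketch of the intentional pattern. Double quotes expand "$tmp"
# immediately, so the trap removes the right file even if the variable is
# later reassigned or unset.
tmp="$(mktemp)"
trap "rm -f '$tmp'" EXIT
```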
91 changes: 91 additions & 0 deletions ingest/vendored/README.md
@@ -0,0 +1,91 @@
# ingest

Shared internal tooling for pathogen data ingest. Used by our individual
pathogen repos which produce Nextstrain builds. Expected to be vendored by
each pathogen repo using `git subrepo`.

Some tools may only live here temporarily before finding a permanent home in
`augur curate` or Nextstrain CLI. Others may happily live out their days here.

## Vendoring

Nextstrain maintained pathogen repos will use [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to vendor ingest scripts.
(See discussion on this decision in https://github.com/nextstrain/ingest/issues/3)

If you don't already have `git subrepo` installed, follow the [git subrepo installation instructions](https://github.com/ingydotnet/git-subrepo#installation).
Then add the latest ingest scripts to the pathogen repo by running:

```
git subrepo clone https://github.com/nextstrain/ingest ingest/vendored
```

Any future updates of ingest scripts can be pulled in with:

```
git subrepo pull ingest/vendored
```

## History

Much of this tooling originated in
[ncov-ingest](https://github.com/nextstrain/ncov-ingest) and was passaged through
[monkeypox's ingest/](https://github.com/nextstrain/monkeypox/tree/@/ingest/).
It subsequently proliferated from [monkeypox][] to other pathogen repos
([rsv][], [zika][], [dengue][], [hepatitisB][], [forecasts-ncov][]) primarily
through copying. To [counter that
proliferation](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079),
this repo was created.

[monkeypox]: https://github.com/nextstrain/monkeypox
[rsv]: https://github.com/nextstrain/rsv
[zika]: https://github.com/nextstrain/zika/pull/24
[dengue]: https://github.com/nextstrain/dengue/pull/10
[hepatitisB]: https://github.com/nextstrain/hepatitisB
[forecasts-ncov]: https://github.com/nextstrain/forecasts-ncov

## Elsewhere

The creation of this repo, in both the abstract and concrete, and the general
approach to "ingest" has been discussed in various internal places, including:

- https://github.com/nextstrain/private/issues/59
- @joverlee521's [workflows document](https://docs.google.com/document/d/1rLWPvEuj0Ayc8MR0O1lfRJZfj9av53xU38f20g8nU_E/edit#heading=h.4g0d3mjvb89i)
- [5 July 2023 Slack thread](https://bedfordlab.slack.com/archives/C7SDVPBLZ/p1688577879947079)
- [6 July 2023 team meeting](https://docs.google.com/document/d/1FPfx-ON5RdqL2wyvODhkrCcjgOVX3nlXgBwCPhIEsco/edit)
- _…many others_

## Scripts

Scripts for supporting ingest workflow automation that don’t really belong in any of our existing tools.

- [notify-on-diff](notify-on-diff) - Send Slack message with diff of a local file and an S3 object
- [notify-on-job-fail](notify-on-job-fail) - Send Slack message with details about failed workflow job on GitHub Actions and/or AWS Batch
- [notify-on-job-start](notify-on-job-start) - Send Slack message with details about workflow job on GitHub Actions and/or AWS Batch
- [notify-on-record-change](notify-on-record-change) - Send Slack message with details about line count changes for a file compared to an S3 object's metadata `recordcount`.
  If the S3 object's metadata does not include `recordcount`, the script downloads the object and counts lines locally; this fallback only supports `xz`-compressed S3 objects.
- [notify-slack](notify-slack) - Send message or file to Slack
- [s3-object-exists](s3-object-exists) - Used to prevent 404 errors during S3 file comparisons in the notify-* scripts
- [trigger](trigger) - Triggers downstream GitHub Actions via the GitHub API using repository_dispatch events.
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`.
  A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
- [cloudfront-invalidate](cloudfront-invalidate) - CloudFront invalidation is already supported in the [nextstrain remote command for S3 files](https://github.com/nextstrain/cli/blob/a5dda9c0579ece7acbd8e2c32a4bbe95df7c0bce/nextstrain/cli/remote/s3.py#L104).
This exists as a separate script to support CloudFront invalidation when using the upload-to-s3 script.
- [upload-to-s3](upload-to-s3) - Upload file to AWS S3 bucket with compression based on file extension in S3 URL.
Skips upload if the local file's hash is identical to the S3 object's metadata `sha256sum`.
Adds the following user defined metadata to uploaded S3 object:
- `sha256sum` - hash of the file generated by [sha256sum](sha256sum)
- `recordcount` - the line count of the file
- [download-from-s3](download-from-s3) - Download file from AWS S3 bucket with decompression based on file extension in S3 URL.
Skips download if the local file already exists and has a hash identical to the S3 object's metadata `sha256sum`.
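
Taken together, the two S3 helpers form a round trip; a usage sketch (the bucket and keys are illustrative, and `upload-to-s3`'s full option set is not shown here):

```sh
# Illustrative round trip; the bucket and paths are hypothetical.
# Compression is inferred from the S3 URL's file extension.
./ingest/vendored/upload-to-s3 data/metadata.tsv s3://example-bucket/metadata.tsv.zst
./ingest/vendored/download-from-s3 s3://example-bucket/metadata.tsv.zst data/metadata.tsv
```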

Potential augur curate scripts

- [apply-geolocation-rules](apply-geolocation-rules) - Applies user curated geolocation rules to NDJSON records
- [merge-user-metadata](merge-user-metadata) - Merges user annotations with NDJSON records
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
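
These curate scripts operate on NDJSON streams and are meant to be chained; a hypothetical pipeline (the flag names below are assumptions for illustration):

```sh
# Hypothetical NDJSON curation pipeline; flag names are assumptions.
cat records.ndjson \
    | ./ingest/vendored/transform-field-names --field-map "collected=date" \
    | ./ingest/vendored/apply-geolocation-rules --geolocation-rules geolocation-rules.tsv \
    > curated.ndjson
```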
ingest/vendored/apply-geolocation-rules
@@ -32,7 +32,7 @@ def load_geolocation_rules(geolocation_rules_file):
     with open(geolocation_rules_file, 'r') as rules_fh:
         for line in rules_fh:
             # ignore comments
-            if line.lstrip()[0] == '#':
+            if line.strip()=="" or line.lstrip()[0] == '#':
                 continue
 
             row = line.strip('\n').split('\t')
@@ -222,7 +222,7 @@ if __name__ == '__main__':
         record = json.loads(record)
 
         try:
-            annotated_values = transform_geolocations(geolocation_rules, [record[field] for field in location_fields])
+            annotated_values = transform_geolocations(geolocation_rules, [record.get(field, '') for field in location_fields])
         except CyclicGeolocationRulesError as e:
             print(e, file=stderr)
             exit(1)
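
These two changes make the script skip blank rule lines and tolerate records missing a location field. A quick smoke test of the latter (the input record, rules file, and flag name are illustrative):

```sh
# A record missing some location fields should no longer raise a KeyError
# after this change; the missing fields default to empty strings.
echo '{"country": "Canada"}' \
    | ./ingest/vendored/apply-geolocation-rules --geolocation-rules geolocation-rules.tsv
```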
File renamed without changes.
48 changes: 48 additions & 0 deletions ingest/vendored/download-from-s3
@@ -0,0 +1,48 @@
#!/bin/bash
set -euo pipefail

bin="$(dirname "$0")"

main() {
    local src="${1:?A source s3:// URL is required as the first argument.}"
    local dst="${2:?A destination file path is required as the second argument.}"
    # How many lines to subsample to. 0 means no subsampling. Optional.
    # It is not advised to use this for actual subsampling! This is intended to be
    # used for debugging workflows with large datasets such as ncov-ingest as
    # described in https://github.com/nextstrain/ncov-ingest/pull/367

    # Uses `tsv-sample` to subsample, so it will not work as expected with files
    # that have a single record split across multiple lines (i.e. FASTA sequences)
    local n="${3:-0}"

    local s3path="${src#s3://}"
    local bucket="${s3path%%/*}"
    local key="${s3path#*/}"

    local src_hash dst_hash no_hash=0000000000000000000000000000000000000000000000000000000000000000
    dst_hash="$("$bin/sha256sum" < "$dst" || true)"
    src_hash="$(aws s3api head-object --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"

    echo "[ INFO] Downloading $src → $dst"
    if [[ $src_hash != "$dst_hash" ]]; then
        aws s3 cp --no-progress "$src" - |
            if [[ "$src" == *.gz ]]; then
                gunzip -cfq
            elif [[ "$src" == *.xz ]]; then
                xz -T0 -dcq
            elif [[ "$src" == *.zst ]]; then
                zstd -T0 -dcq
            else
                cat
            fi |
            if [[ "$n" -gt 0 ]]; then
                tsv-sample -H -i -n "$n"
            else
                cat
            fi >"$dst"
    else
        echo "[ INFO] Files are identical, skipping download"
    fi
}

main "$@"
File renamed without changes.
35 changes: 35 additions & 0 deletions ingest/vendored/notify-on-diff
@@ -0,0 +1,35 @@
#!/bin/bash

set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"

src="${1:?A source file is required as the first argument.}"
dst="${2:?A destination s3:// URL is required as the second argument.}"

dst_local="$(mktemp -t s3-file-XXXXXX)"
diff="$(mktemp -t diff-XXXXXX)"

trap "rm -f '$dst_local' '$diff'" EXIT

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0

"$bin"/download-from-s3 "$dst" "$dst_local"

# diff's exit code is 0 for no differences, 1 for differences found, and >1 for errors
diff_exit_code=0
diff "$dst_local" "$src" > "$diff" || diff_exit_code=$?

if [[ "$diff_exit_code" -eq 1 ]]; then
echo "Notifying Slack about diff."
"$bin"/notify-slack --upload "$src.diff" < "$diff"
elif [[ "$diff_exit_code" -gt 1 ]]; then
echo "Notifying Slack about diff failure"
"$bin"/notify-slack "Diff failed for $src"
else
echo "No change in $src."
fi
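
A usage sketch (the token, channel, and S3 URL are illustrative placeholders):

```sh
export SLACK_TOKEN="xoxb-replace-me"   # illustrative placeholder
export SLACK_CHANNELS="#scratch"       # illustrative placeholder
# Diff a freshly built file against its uploaded copy; posts only on changes.
./ingest/vendored/notify-on-diff data/metadata.tsv s3://example-bucket/metadata.tsv.zst
```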
23 changes: 23 additions & 0 deletions ingest/vendored/notify-on-job-fail
@@ -0,0 +1,23 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"

echo "Notifying Slack about failed ${job_name} job."
message="❌ ${job_name} job has FAILED 😞 "

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+="See AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>) for error details. "
elif [[ -n "${GITHUB_RUN_ID}" ]]; then
message+="See GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}> for error details. "
fi

"$bin"/notify-slack "$message"
27 changes: 27 additions & 0 deletions ingest/vendored/notify-on-job-start
@@ -0,0 +1,27 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

: "${AWS_BATCH_JOB_ID:=}"
: "${GITHUB_RUN_ID:=}"

bin="$(dirname "$0")"
job_name="${1:?A job name is required as the first argument}"
github_repo="${2:?A GitHub repository with owner and repository name is required as the second argument}"
build_dir="${3:-ingest}"

echo "Notifying Slack about started ${job_name} job."
message="${job_name} job has started."

if [[ -n "${GITHUB_RUN_ID}" ]]; then
message+=" The job was submitted by GitHub Action <https://github.com/${github_repo}/actions/runs/${GITHUB_RUN_ID}?check_suite_focus=true|${GITHUB_RUN_ID}>."
fi

if [[ -n "${AWS_BATCH_JOB_ID}" ]]; then
message+=" The job was launched as AWS Batch job \`${AWS_BATCH_JOB_ID}\` (<https://console.aws.amazon.com/batch/v2/home?region=us-east-1#jobs/detail/${AWS_BATCH_JOB_ID}|link>)."
message+=" Follow along in your local clone of ${github_repo} with: "'```'"nextstrain build --aws-batch --no-download --attach ${AWS_BATCH_JOB_ID} ${build_dir}"'```'
fi

"$bin"/notify-slack "$message"
53 changes: 53 additions & 0 deletions ingest/vendored/notify-on-record-change
@@ -0,0 +1,53 @@
#!/bin/bash
set -euo pipefail

: "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
: "${SLACK_CHANNELS:?The SLACK_CHANNELS environment variable is required.}"

bin="$(dirname "$0")"

src="${1:?A source ndjson file is required as the first argument.}"
dst="${2:?A destination ndjson s3:// URL is required as the second argument.}"
source_name=${3:?A record source name is required as the third argument.}

# if the file is not already present, just exit
"$bin"/s3-object-exists "$dst" || exit 0

s3path="${dst#s3://}"
bucket="${s3path%%/*}"
key="${s3path#*/}"

src_record_count="$(wc -l < "$src")"

# Try getting record count from S3 object metadata
dst_record_count="$(aws s3api head-object --bucket "$bucket" --key "$key" --query "Metadata.recordcount || ''" --output text 2>/dev/null || true)"
if [[ -z "$dst_record_count" ]]; then
# This object doesn't have the record count stored as metadata
# We have to download it and count the lines locally
dst_record_count="$(wc -l < <(aws s3 cp --no-progress "$dst" - | xz -T0 -dcfq))"
fi

added_records="$(( src_record_count - dst_record_count ))"

printf "%'4d %s\n" "$src_record_count" "$src"
printf "%'4d %s\n" "$dst_record_count" "$dst"
printf "%'4d added records\n" "$added_records"

slack_message=""

if [[ $added_records -gt 0 ]]; then
    echo "Notifying Slack about added records (n=$added_records)"
    slack_message="📈 New records (n=$added_records) found on $source_name."

elif [[ $added_records -lt 0 ]]; then
    echo "Notifying Slack about fewer records (n=$added_records)"
    slack_message="📉 Fewer records (n=$added_records) found on $source_name."

else
    echo "Notifying Slack about same number of records"
    slack_message="⛔ No new records found on $source_name."
fi

slack_message+=" (Total record count: $src_record_count)"

"$bin"/notify-slack "$slack_message"