Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to augur curate titlecase and format-dates subcommands #34

Merged
merged 2 commits into from
Sep 15, 2023

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Aug 15, 2023

Description of proposed changes

As part of an ongoing migration of transform scripts to augur curate subcommands, this PR moves to replace the titlecase and format-dates scripts with their corresponding augur curate subcommands:

During the process of manual testing, an issue related to multiple headers in the fetched data was identified and resolved. This PR includes those changes.

Related issue(s)

Testing

  • Checks pass

For a manual check, could run:

git clone https://github.com/nextstrain/rsv.git
cd rsv/ingest
git checkout curate/format-dates
nextstrain build . data/a/metadata.tsv

@j23414 j23414 requested a review from a team August 15, 2023 18:48
@j23414 j23414 changed the title Curate/format dates Switch to augur curate titlecase and format-dates subcommands Aug 15, 2023
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The switch over to the new augur curate commands look good to me!

I've left a comment for an alternative solution to the header lines issue, which is more "snakemake-y" and allows us to parallelize the rules.

ingest/workflow/snakemake_rules/fetch_sequences.smk Outdated Show resolved Hide resolved
@j23414 j23414 force-pushed the curate/format-dates branch 3 times, most recently from 93082c7 to bab8cee Compare August 22, 2023 13:32
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me from inspection!

I triggered a trial run for the ingest and we can do a diff with the production data to make sure nothing has changed.

@joverlee521
Copy link
Contributor

Trial run completed successfully and uploaded outputs to s3://nextstrain-data/files/workflows/rsv/trial/curate-format-dates/*. I compared the output metadata TSV files with the production files on S3 and there were no differences in the existing data, just a couple of new records for rsv/b that's been added since the last automated run.

Compared outputs with the current files with:
ingest/vendored/download-from-s3 s3://nextstrain-data/files/workflows/rsv/trial/curate-format-dates/a/metadata.tsv.gz ingest/data/test-run/a-metadata.tsv
ingest/vendored/download-from-s3 s3://nextstrain-data/files/workflows/rsv/a/metadata.tsv.gz ingest/data/a-metadata.tsv
daff diff ingest/data/a-metadata.tsv ingest/data/test-run/a-metadata.tsv > /tmp/rsv-a-metadata.tsv.diff

ingest/vendored/download-from-s3 s3://nextstrain-data/files/workflows/rsv/trial/curate-format-dates/b/metadata.tsv.gz ingest/data/test-run/b-metadata.tsv
ingest/vendored/download-from-s3 s3://nextstrain-data/files/workflows/rsv/b/metadata.tsv.gz ingest/data/b-metadata.tsv
daff diff ingest/data/b-metadata.tsv ingest/data/test-run/b-metadata.tsv > /tmp/rsv-b-metadata.tsv.diff

@j23414 j23414 merged commit c07627e into master Sep 15, 2023
6 checks passed
@j23414 j23414 deleted the curate/format-dates branch September 15, 2023 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

2 participants