Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingest: Switch to NCBI Datasets CLI to fetch data #37

Merged
merged 5 commits into from
Sep 13, 2023

Conversation

joverlee521
Copy link
Contributor

@joverlee521 joverlee521 commented Sep 12, 2023

Replace the config URLs for the NCBI Virus API with the NCBI Datasets CLI commands that are driven by the ncbi_taxon_id present in the config.

NCBI datasets downloads a virus dataset ZIP file that includes a metadata JSON Lines file and a sequences FASTA file. To maintain a record of the single NDJSON file on S3, extract the sequences FASTA file and format the metadata into a TSV file that are parsed into a single NDJSON file using augur curate passthru. The metadata TSV is created using the NCBI dataformat command so that we do not have to parse the nested JSON lines files ourselves and header fields are renamed to match the previous fields we used for NCBI Virus.

The NDJSON file created here no longer includes equivalent fields for "title" or "publication".

Copying over a lot of the same steps from nextstrain/mpox#179, but with additional wildcards to support the ingest of different subtypes.

Resolves #36

Checklist

Replace the config URLs for the NCBI Virus API with the NCBI Datasets
CLI commands that are driven by the `ncbi_taxon_id` present in the
config.

NCBI datasets downloads a virus dataset ZIP file that includes a
metadata JSON Lines file and a sequences FASTA file. To maintain a record
of the single NDJSON file on S3, extract the sequences FASTA file and
format the metadata into a TSV file that are parsed into a single NDJSON
file using `augur curate passthru`. The metadata TSV is created using
the NCBI `dataformat` command so that we do not have to parse the nested
JSON lines files ourselves and header fields are renamed to match the
previous fields we used for NCBI Virus.

The NDJSON file created here no longer includes equivalent fields
for "title" or "publication".
Use the `--env` option for `nextstrain build` to pass envvars to the
build runtime so that we no longer need to use the `--exec env` +
`envdir env.d snakemake` invocations, which removes the need to maintain
the --cores flag.

Removes the unused `bin/write-envdir` script and the unused envvars
GITHUB_RUN_ID and AWS_DEFAULT_REGION.
Switch over to the central reusable pathogen-repo-build workflow so
there's less overall maintenance and we get the nice functionality of
printing the AWS Batch summary.
The results are expected to be uploaded to the same AWS S3 bucket,
so remove the unnecessary complexity caused by the nested `dst`
config param. This will make it easier to set up trial runs for the
ingest GitHub Action workflow.
Allow trial runs to be uploaded to the S3 bucket with the additional
`/trial/<trial_name>` prefix and do not trigger rebuilds when doing a
trial run.

This was easily done because the `s3_dst` has been made a top level
config param in the previous commit.
@joverlee521 joverlee521 marked this pull request as ready for review September 13, 2023 21:21
@joverlee521
Copy link
Contributor Author

Added a couple more commits to update the ingest/rebuild GitHub workflows to use the latest pathogen-repo-build reusable workflow and added trials to the ingest workflow.

@joverlee521
Copy link
Contributor Author

The trial ingest workflow ran successfully and I diffed outputs for:

s3://nextstrain-data/files/workflows/rsv/trial/ncbi-datasets-cli/a/metadata.tsv.gz with the production metadata (version id DA.E1gQNtP2W3e4zKc25HieT57fefGdW).
s3://nextstrain-data/files/workflows/rsv/trial/ncbi-datasets-cli/b/metadata.tsv.gz with production metadata (version id tv8smtae3ZsXjVwVHVnAsr4S3opKXVlY).

The only difference was whitespace changes in the authors column.

@joverlee521
Copy link
Contributor Author

Merging and triggering a re-run of ingest which should trigger the rebuild. Will monitor the new runs.

@joverlee521 joverlee521 merged commit cd662d4 into master Sep 13, 2023
9 checks passed
@joverlee521 joverlee521 deleted the ncbi-datasets-cli branch September 13, 2023 21:47
@joverlee521
Copy link
Contributor Author

triggering a re-run of ingest which should trigger the rebuild

Turns out I was wrong, the rebuild is just scheduled to run after ingest, but the ingest workflow does not actually trigger the rebuild.

Triggered the rebuild manually (after fixing with 4ecf498).

joverlee521 added a commit that referenced this pull request Sep 14, 2023
Realized through #37 that the
ingest pipeline does _not_ trigger the rebuild. The rebuild is just
scheduled to run after the ingest workflow. Removing all parameters and
references to trigger in this commit so that it does not confuse anyone
else in the future.

Keeping the schedule as-is since it's been working fine and we are
planning to be shift pathogen workflows in the future to be able to
go from ingest to a build within a single run without going through
triggers and S3 interactions.
joverlee521 added a commit that referenced this pull request Sep 15, 2023
This should have been done as a part of #37,
but I totally missed that we have this section in the build's
description.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Ingest currently blocked by NCBI Virus changes
1 participant