ingest: Switch to NCBI Datasets CLI to fetch data #37
Replace the config URLs for the NCBI Virus API with NCBI Datasets CLI commands driven by the `ncbi_taxon_id` present in the config. NCBI Datasets downloads a virus dataset ZIP file that includes a metadata JSON Lines file and a sequences FASTA file. To maintain a record of the single NDJSON file on S3, extract the sequences FASTA file and format the metadata into a TSV file; the two are then parsed into a single NDJSON file using `augur curate passthru`. The metadata TSV is created using the NCBI `dataformat` command so that we do not have to parse the nested JSON Lines files ourselves, and header fields are renamed to match the previous fields we used for NCBI Virus. The NDJSON file created here no longer includes equivalent fields for "title" or "publication".
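To make the TSV + FASTA to NDJSON step concrete, here is a minimal Python sketch of the kind of join that step performs. It is an illustration only, not the `augur curate passthru` implementation; the function names, the `accession_version` ID column, and the `sequence` field name are assumptions chosen for the example.

```python
import csv
import io
import json


def read_fasta(handle):
    """Yield (sequence_id, sequence) pairs from a FASTA file handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            seq_id, chunks = line[1:].split()[0], []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def join_to_ndjson(metadata_tsv, fasta, id_column="accession_version",
                   seq_field="sequence"):
    """Return one JSON string per metadata row, with its sequence attached.

    `id_column` and `seq_field` are hypothetical names for this sketch.
    Rows without a matching FASTA record get an empty sequence.
    """
    sequences = dict(read_fasta(fasta))
    lines = []
    for row in csv.DictReader(metadata_tsv, delimiter="\t"):
        row[seq_field] = sequences.get(row[id_column], "")
        lines.append(json.dumps(row))
    return lines


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the extracted TSV and FASTA files.
    tsv = io.StringIO("accession_version\tstrain\nAB000001.1\texample\n")
    fasta = io.StringIO(">AB000001.1 example record\nACGT\nTTAA\n")
    for record in join_to_ndjson(tsv, fasta):
        print(record)
```

Keeping one record per line means the combined file can still be streamed line by line, which is the property the single NDJSON file on S3 relies on.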
Use the `--env` option for `nextstrain build` to pass envvars to the build runtime so that we no longer need to use the `--exec env` + `envdir env.d snakemake` invocations, which removes the need to maintain the `--cores` flag. Remove the unused `bin/write-envdir` script and the unused envvars `GITHUB_RUN_ID` and `AWS_DEFAULT_REGION`.
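As a rough sketch of what the `--env` invocation looks like, the helper below assembles a `nextstrain build` command line that passes named envvars through to the runtime. The helper name, the envvar names, and the `ingest` directory are assumptions for illustration; the actual list of envvars lives in the workflow files.

```python
import shlex


def nextstrain_build_command(directory, env_names):
    """Build a `nextstrain build` argv that passes envvars through by name.

    `--env NAME` (without a value) forwards NAME from the current
    environment to the build runtime.
    """
    command = ["nextstrain", "build"]
    for name in env_names:
        command += ["--env", name]
    command.append(directory)
    return command


if __name__ == "__main__":
    # Hypothetical envvar names and build directory.
    argv = nextstrain_build_command(
        "ingest", ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    )
    print(shlex.join(argv))
```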
Switch over to the central reusable pathogen-repo-build workflow so there's less overall maintenance and we get the nice functionality of printing the AWS Batch summary.
The results are expected to be uploaded to the same AWS S3 bucket, so remove the unnecessary complexity caused by the nested `dst` config param. This will make it easier to set up trial runs for the ingest GitHub Action workflow.
Allow trial runs to be uploaded to the S3 bucket with the additional `/trial/<trial_name>` prefix and do not trigger rebuilds when doing a trial run. This was easily done because `s3_dst` was made a top-level config param in the previous commit.
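The trial prefix logic can be sketched as a small function. This is a simplified illustration under assumptions: the function name and the example bucket path are hypothetical, and the real workflow sets the destination through config rather than Python.

```python
def s3_destination(s3_dst, trial_name=None):
    """Return the upload destination, adding /trial/<trial_name> for trial runs."""
    base = s3_dst.rstrip("/")
    if trial_name:
        return f"{base}/trial/{trial_name}"
    return base


if __name__ == "__main__":
    # Hypothetical bucket path, for illustration only.
    print(s3_destination("s3://example-bucket/files/workflows/example"))
    print(s3_destination("s3://example-bucket/files/workflows/example", "my-test"))
```

Because `s3_dst` is a single top-level value, a trial run only has to append one path segment instead of rewriting several nested destinations.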
166b348 to 99b830e
Added a couple more commits to update the ingest/rebuild GitHub workflows to use the latest pathogen-repo-build reusable workflow and added trials to the ingest workflow.
The trial ingest workflow ran successfully and I diffed outputs for:
The only difference was whitespace changes in the authors column.
Merging and triggering a re-run of ingest, which should trigger the rebuild. Will monitor the new runs.
Turns out I was wrong: the rebuild is just scheduled to run after ingest, and the ingest workflow does not actually trigger it. Triggered the rebuild manually (after fixing with 4ecf498).
Realized through #37 that the ingest pipeline does _not_ trigger the rebuild. The rebuild is just scheduled to run after the ingest workflow. Removing all parameters and references to the trigger in this commit so that it does not confuse anyone else in the future. Keeping the schedule as-is since it's been working fine and we are planning to shift pathogen workflows in the future to go from ingest to a build within a single run, without going through triggers and S3 interactions.
This should have been done as a part of #37, but I totally missed that we have this section in the build's description.
Copying over a lot of the same steps from nextstrain/mpox#179, but with additional wildcards to support the ingest of different subtypes.
Resolves #36