Separate curated data by L and S segments #12
@@ -17,5 +17,9 @@ s3_dst: "s3://nextstrain-data/files/workflows/lassa"
 # Mapping of files to upload
 files_to_upload:
   ncbi.ndjson.zst: data/ncbi.ndjson
-  metadata.tsv.zst: results/metadata.tsv
-  sequences.fasta.zst: results/sequences.fasta
+  metadata_all.tsv.zst: results/metadata_all.tsv
+  sequences_all.fasta.zst: results/sequences_all.fasta
+  metadata_L.tsv.zst: results/metadata_L.tsv
+  sequences_L.fasta.zst: results/sequences_L.fasta
+  metadata_S.tsv.zst: results/metadata_S.tsv
+  sequences_S.fasta.zst: results/sequences_S.fasta

Similar to nextstrain/dengue#72, I suggest reorganizing to hierarchical structure on S3
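To make the hierarchical suggestion concrete, here is a minimal Python sketch of how the flat upload keys could be regrouped by segment. The `<segment>/<file>` layout and the helper name `hierarchical_key` are assumptions for illustration only; the layout actually adopted in nextstrain/dengue#72 is not shown on this page and may differ.

```python
import re

# Hypothetical pattern: "<stem>_<segment>.<ext>", where segment is all, L, or S.
SEGMENT_SUFFIX = re.compile(r"^(?P<stem>[^_.]+)_(?P<segment>all|L|S)(?P<ext>\..+)$")


def hierarchical_key(flat_key: str) -> str:
    """Map e.g. 'metadata_L.tsv.zst' to 'L/metadata.tsv.zst' (assumed layout).

    Keys without a segment suffix are returned unchanged.
    """
    match = SEGMENT_SUFFIX.match(flat_key)
    if match is None:
        return flat_key
    return f"{match['segment']}/{match['stem']}{match['ext']}"


# The mapping from the diff above, as a plain dict.
files_to_upload = {
    "ncbi.ndjson.zst": "data/ncbi.ndjson",
    "metadata_all.tsv.zst": "results/metadata_all.tsv",
    "sequences_all.fasta.zst": "results/sequences_all.fasta",
    "metadata_L.tsv.zst": "results/metadata_L.tsv",
    "sequences_L.fasta.zst": "results/sequences_L.fasta",
    "metadata_S.tsv.zst": "results/metadata_S.tsv",
    "sequences_S.fasta.zst": "results/sequences_S.fasta",
}

# Re-key the mapping; the local source paths stay the same.
nested = {hierarchical_key(key): path for key, path in files_to_upload.items()}

for key, path in nested.items():
    print(f"{key}: {path}")
```

Only the S3 keys change in this sketch, so the rules that produce `results/metadata_{segment}.tsv` and `results/sequences_{segment}.fasta` would not need to move.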
@@ -0,0 +1,54 @@
+"""
+This part of the workflow handles running Nextclade on the curated metadata
+and sequences to split the sequences into L and S segments.
+
+REQUIRED INPUTS:
+
+    metadata = data/subset_metadata.tsv
+    sequences = "results/sequences_all.fasta"

j23414 marked this conversation as resolved.

+
+OUTPUTS:
+
+    metadata = results/metadata_{segment}.tsv
+    sequences = results/sequences_{segment}.fasta
+
+See Nextclade docs for more details on usage, inputs, and outputs if you would
+like to customize the rules:
+https://docs.nextstrain.org/projects/nextclade/page/user/nextclade-cli.html
+"""
+
+rule run_nextclade_to_identify_segment:
+    input:
+        metadata = "data/subset_metadata.tsv",
+        sequences = "results/sequences_all.fasta",
+        segment_reference = config["nextclade"]["segment_reference"],
+    output:
+        sequences = "results/sequences_{segment}.fasta",
+    params:
+        min_seed_cover = config["nextclade"]["min_seed_cover"],
+    shell:
+        """
+        nextclade run \

I'm curious what the decision process was around

Good question! At least in dengue, I admit I haven't done the same comparison for lassa data, mostly just wanted one that worked and didn't seem noticeably slow.

+            --input-ref {input.segment_reference} \
+            --output-fasta {output.sequences} \
+            --min-seed-cover {params.min_seed_cover} \
+            --silent \
+            {input.sequences}
+        """
+
+rule subset_metadata_by_segment:
+    input:
+        metadata = "data/subset_metadata.tsv",
+        sequences = "results/sequences_{segment}.fasta",

maybe consider

+    output:
+        metadata = "results/metadata_{segment}.tsv",
+    params:
+        strain_id_field = config["curate"]["output_id_field"],
+    shell:
+        """
+        augur filter \
+            --sequences {input.sequences} \
+            --metadata {input.metadata} \
+            --metadata-id-columns {params.strain_id_field} \
+            --output-metadata {output.metadata}
+        """
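For readers less familiar with `augur filter`, the effect of the `subset_metadata_by_segment` rule can be approximated in plain Python: keep only the metadata rows whose ID also appears as a sequence name in the segment FASTA. This is a rough sketch of the idea, not augur's implementation. The function names, the `accession` ID column, and the sample records are all made up for illustration; in the workflow the ID column comes from `config["curate"]["output_id_field"]`.

```python
import csv
import io


def fasta_ids(fasta_text: str) -> set[str]:
    """Collect sequence IDs: the first whitespace-delimited token after '>'."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.add(header[0])
    return ids


def subset_metadata(metadata_tsv: str, fasta_text: str, id_column: str) -> list[dict[str, str]]:
    """Return the metadata rows whose id_column value has a sequence in the FASTA."""
    keep = fasta_ids(fasta_text)
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [row for row in reader if row[id_column] in keep]


# Made-up example data: three metadata records, two of which have an L-segment sequence.
metadata_tsv = "accession\tcountry\nA1\tNigeria\nA2\tSierra Leone\nA3\tLiberia\n"
sequences_l = ">A1\nACGT\n>A3 extra description\nTTGA\n"

rows = subset_metadata(metadata_tsv, sequences_l, id_column="accession")
print([row["accession"] for row in rows])  # ['A1', 'A3']
```

Because the subset is driven entirely by which records survive in `results/sequences_{segment}.fasta`, any record that Nextclade does not write to that file also disappears from the per-segment metadata.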
[using this line just to organise discussion into a thread]
The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?
Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.
Good question. I have no idea how the underlying data for those builds were processed. The early README implies the data came from fauna, but I don't see any lassa table in fauna...
For sure, feel free to add thoughts on segment analysis to nextstrain/pathogen-repo-guide#59.
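As a starting point for the segment-matching question raised in this thread, the sketch below pairs strain names across the L and S metadata and sets aside any name that occurs more than once within a segment, since those cannot be matched unambiguously. It is only an illustration under stated assumptions: the function name `match_segments` and the strain names are hypothetical, and nothing here reflects how the 2018 builds were actually processed.

```python
from collections import Counter


def match_segments(l_strains: list[str], s_strains: list[str]) -> tuple[list[str], list[str]]:
    """Return (paired, duplicated) strain names.

    paired:     names that occur exactly once in each segment.
    duplicated: names that occur more than once within either segment.
    """
    l_counts = Counter(l_strains)
    s_counts = Counter(s_strains)
    duplicated = sorted(
        name
        for name in l_counts.keys() | s_counts.keys()
        if l_counts[name] > 1 or s_counts[name] > 1
    )
    paired = sorted(
        name
        for name in l_counts.keys() & s_counts.keys()
        if l_counts[name] == 1 and s_counts[name] == 1
    )
    return paired, duplicated


# Made-up strain names: strainB is duplicated in L, strainC and strainD are unpaired.
l_strains = ["strainA", "strainB", "strainB", "strainC"]
s_strains = ["strainA", "strainB", "strainD"]

paired, duplicated = match_segments(l_strains, s_strains)
print("paired:", paired)          # paired: ['strainA']
print("duplicated:", duplicated)  # duplicated: ['strainB']
```

Reporting the duplicated names separately, rather than silently picking one record, keeps the ambiguity visible for whoever revisits the matching later.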