Separate curated data by L and S segments #12

Merged: 4 commits, Jul 29, 2024
13 changes: 10 additions & 3 deletions ingest/Snakefile
@@ -8,6 +8,11 @@ workdir: workflow.current_basedir
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

segments = ['L', 'S']
Comment from a Member:

[using this line just to organise discussion into a thread]

The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

Reply from a Contributor:

> The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Good question. I have no idea how the underlying data for those builds were processed. The early README implies the data came from fauna, but I don't see any lassa table in fauna...

> Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

For sure, feel free to add thoughts on segment analysis to nextstrain/pathogen-repo-guide#59.
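Since cross-segment sample matching keeps coming up in this thread, here is a hypothetical sketch (not part of this PR; function and variable names are illustrative) of how matched strains could be identified from per-segment strain lists while treating duplicated names as ambiguous rather than silently pairing them:

```python
from collections import Counter

def match_segments(l_strains, s_strains):
    """Return (matched, ambiguous) strain-name sets across L and S.

    A name duplicated within either segment cannot be matched
    reliably, so it is reported as ambiguous instead of matched.
    """
    dup_l = {s for s, n in Counter(l_strains).items() if n > 1}
    dup_s = {s for s, n in Counter(s_strains).items() if n > 1}
    ambiguous = dup_l | dup_s
    matched = (set(l_strains) & set(s_strains)) - ambiguous
    return matched, ambiguous
```

For example, with L strains `["A", "B", "B"]` and S strains `["A", "B", "C"]`, only `"A"` is a clean match; `"B"` is flagged as ambiguous because it appears twice on the L side.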


wildcard_constraints:
segment = "|".join(segments)
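For context on the constraint above: `"|".join(segments)` builds a regex alternation that restricts which values the `segment` wildcard may take. A rough stand-in (full-match semantics assumed here for illustration; Snakemake actually embeds the constraint into a larger filename-matching regex):

```python
import re

segments = ["L", "S"]
pattern = "|".join(segments)  # "L|S"

def is_valid_segment(value):
    # Approximates the wildcard_constraints check: the wildcard
    # value must match the constraint regex in full.
    return re.fullmatch(pattern, value) is not None

print(is_valid_segment("L"))    # True
print(is_valid_segment("all"))  # False
```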

# This is the default rule that Snakemake will run when there are no specified targets.
# The default output of the ingest workflow is usually the curated metadata and sequences.
# Nextstrain-maintained ingest workflows will produce metadata files with the
@@ -17,8 +22,9 @@ configfile: "defaults/config.yaml"
# TODO: Add link to centralized docs on standard Nextstrain metadata fields
rule all:
input:
"results/sequences.fasta",
"results/metadata.tsv",
sequences=expand("results/sequences_{segment}.fasta", segment=segments),
metadata=expand("results/metadata_{segment}.tsv", segment=segments),
metadata_all="results/metadata_all.tsv",
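For readers new to Snakemake, `expand()` substitutes each wildcard value into the pattern to produce the full target list. A simplified single-wildcard stand-in (not Snakemake's actual implementation) shows the effect:

```python
def expand(pattern, **wildcards):
    # Simplified stand-in for snakemake.io.expand with one wildcard:
    # substitute each value into the pattern, preserving order.
    (name, values), = wildcards.items()
    return [pattern.format(**{name: v}) for v in values]

targets = expand("results/sequences_{segment}.fasta", segment=["L", "S"])
print(targets)  # ['results/sequences_L.fasta', 'results/sequences_S.fasta']
```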


# Note that only PATHOGEN-level customizations should be added to these
@@ -28,12 +34,13 @@ rule all:
# by build-specific rules.
include: "rules/fetch_from_ncbi.smk"
include: "rules/curate.smk"
include: "rules/nextclade.smk"

rule create_final_metadata:
input:
metadata="data/subset_metadata.tsv"
output:
metadata="results/metadata.tsv"
metadata="results/metadata_all.tsv"
shell:
"""
mv {input.metadata} {output.metadata}
8 changes: 6 additions & 2 deletions ingest/build-configs/nextstrain-automation/config.yaml
@@ -17,5 +17,9 @@ s3_dst: "s3://nextstrain-data/files/workflows/lassa"
# Mapping of files to upload
files_to_upload:
ncbi.ndjson.zst: data/ncbi.ndjson
metadata.tsv.zst: results/metadata.tsv
sequences.fasta.zst: results/sequences.fasta
metadata_all.tsv.zst: results/metadata_all.tsv
sequences_all.fasta.zst: results/sequences_all.fasta
metadata_L.tsv.zst: results/metadata_L.tsv
sequences_L.fasta.zst: results/sequences_L.fasta
metadata_S.tsv.zst: results/metadata_S.tsv
sequences_S.fasta.zst: results/sequences_S.fasta
Comment from a Contributor:

Similar to nextstrain/dengue#72, I suggest reorganizing to hierarchical structure on S3

<segment>/metadata.tsv.zst
<segment>/sequences.fasta.zst
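Either layout could be generated rather than hand-listed; a hypothetical helper (names are illustrative, not part of the workflow) that produces the flat mapping in this PR or the hierarchical one suggested in review:

```python
def files_to_upload(segments, hierarchical=False):
    """Build the files_to_upload mapping for per-segment artifacts.

    hierarchical=True yields the <segment>/metadata.tsv.zst layout
    suggested in review; False matches the flat layout in this PR.
    """
    mapping = {"ncbi.ndjson.zst": "data/ncbi.ndjson"}
    for seg in ["all"] + list(segments):
        for kind, ext in (("metadata", "tsv"), ("sequences", "fasta")):
            local = f"results/{kind}_{seg}.{ext}"
            if hierarchical:
                mapping[f"{seg}/{kind}.{ext}.zst"] = local
            else:
                mapping[f"{kind}_{seg}.{ext}.zst"] = local
    return mapping
```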

4 changes: 4 additions & 0 deletions ingest/defaults/config.yaml
@@ -116,3 +116,7 @@ curate:
"abbr_authors",
"institution",
]

nextclade:
segment_reference: "../shared/lassa_{segment}.fasta"
min_seed_cover: 0.01
2 changes: 1 addition & 1 deletion ingest/rules/curate.smk
@@ -62,7 +62,7 @@ rule curate:
annotations=config["curate"]["annotations"],
output:
metadata="data/all_metadata.tsv",
sequences="results/sequences.fasta",
sequences="results/sequences_all.fasta",
log:
"logs/curate.txt",
benchmark:
54 changes: 54 additions & 0 deletions ingest/rules/nextclade.smk
@@ -0,0 +1,54 @@
"""
This part of the workflow handles running Nextclade on the curated metadata
and sequences to split the sequences into L and S segments.

REQUIRED INPUTS:

metadata = data/subset_metadata.tsv
sequences = "results/sequences_all.fasta"
(j23414 marked this conversation as resolved.)

OUTPUTS:

metadata = results/metadata_{segment}.tsv
sequences = results/sequences_{segment}.fasta

See Nextclade docs for more details on usage, inputs, and outputs if you would
like to customize the rules:
https://docs.nextstrain.org/projects/nextclade/page/user/nextclade-cli.html
"""

rule run_nextclade_to_identify_segment:
input:
metadata = "data/subset_metadata.tsv",
sequences = "results/sequences_all.fasta",
segment_reference = config["nextclade"]["segment_reference"],
output:
sequences = "results/sequences_{segment}.fasta",
params:
min_seed_cover = config["nextclade"]["min_seed_cover"],
shell:
"""
nextclade run \
Comment from a Contributor:

I'm curious what the decision process was around nextclade run versus augur align

Reply from @j23414 (Contributor, Author), Jul 26, 2024:


Good question! At least in dengue, nextclade run ran much faster and on more data than augur align from slide 14 here.

[Screenshot from Jul 26, 2024: runtime comparison slide (image not included)]

I admit I haven't done the same comparison for lassa data, mostly just wanted one that worked and didn't seem noticeably slow.

--input-ref {input.segment_reference} \
--output-fasta {output.sequences} \
--min-seed-cover {params.min_seed_cover} \
--silent \
{input.sequences}
"""

rule subset_metadata_by_segment:
input:
metadata = "data/subset_metadata.tsv",
sequences = "results/sequences_{segment}.fasta",
Comment from a Contributor:

maybe consider results/{segment}/sequences.fasta to keep things consistent across the build and upload?

output:
metadata = "results/metadata_{segment}.tsv",
params:
strain_id_field = config["curate"]["output_id_field"],
shell:
"""
augur filter \
--sequences {input.sequences} \
--metadata {input.metadata} \
--metadata-id-columns {params.strain_id_field} \
--output-metadata {output.metadata}
"""