Separate curated data by L and S segments #12
The original lassa build creates two segment trees (l, s) with references defined in config/lassa_{segment}.gb files. This commit creates the corresponding fasta files for the segment trees, to be used with Nextclade in either the ingest or the phylogenetic workflow as needed.
Lassa is a segmented virus, ergo the pairs of metadata and sequences files can be separated by segment using Nextclade. We still maintain a pair of metadata_all.tsv and sequences_all.fasta which also include samples that may be too short to align to either segment.
Comments from afar: Why are separate per-segment metadata files necessary? We could follow the (avian-)flu approach of having one metadata file (I think you are calling this
What use is the combined sequences file? If a strain
😆 Ah, I should have said "The combined files are necessary to accommodate samples that do not classify with either segment specifically. I'm guessing from being too short." My primary goal was to create metadata files to feed into https://github.com/j23414/generated-reports/blob/main/reports/lassa.pdf for a closer examination. Preliminary exploration shows that the non-L and non-S sequences are not just short sequences (my hypothesis is wrong). I plan to look into this later.
In the end, I want this 😄 (or a "segment" field that contains "L" or "S"... I don't think a record has both). This way we can modify I'm very interested in learning more about how avian-flu is ingesting data (doesn't seem to be NCBI datasets) and handling segments; and what lessons we can apply to the lassa workflow. Any suggestions or insights/walkthroughs would be greatly appreciated!
I find the avian-flu structure quite nice, and would think it provides a good jumping off point for all segmented viruses. Perhaps @huddlej and/or @joverlee521 could comment on any differences with seasonal-flu. We have a single metadata file and n sequences files, one-per-segment. A strain, e.g. We also encode a column in the metadata TSV
I just tried the avian-flu
The results are not great, with 38% not labelled
Yeah, it's unfortunate that it's not so well annotated, though that was perhaps my expectation after reading Jennifer's comments in this thread. I still think it's worth wrapping up the complexity of segment assignment & forming strain names which match across segments into the ingest pipeline. The end result of ingest being three canonical files:
Yeah, the strain name matching is the hard part. It doesn't look like there are reliable standardized strain names in the lassa data. We can try to construct the strain name with There would be some overlap, i.e., more than 2 sequences per "strain"
How should we handle records that have more than 2 sequences per strain name?
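One way to see how large the overlap problem is would be to group records by constructed strain name and segment, then list the strains that carry more than one sequence for the same segment. The following is only a minimal sketch: the field names `strain`, `segment`, and `accession`, and the example records, are hypothetical and not taken from the lassa ingest output.

```python
from collections import defaultdict


def group_by_strain(records):
    """Map strain name -> segment -> list of accessions."""
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        groups[rec["strain"]][rec["segment"]].append(rec["accession"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {strain: dict(segs) for strain, segs in groups.items()}


def ambiguous_strains(groups):
    """Return strains that have more than one sequence for any single segment."""
    return sorted(
        strain
        for strain, segs in groups.items()
        if any(len(accessions) > 1 for accessions in segs.values())
    )


# Hypothetical records; real strain names and accessions would come from ingest.
records = [
    {"accession": "A1", "strain": "strain_x", "segment": "L"},
    {"accession": "A2", "strain": "strain_x", "segment": "S"},
    {"accession": "A3", "strain": "strain_y", "segment": "L"},
    {"accession": "A4", "strain": "strain_y", "segment": "L"},
    {"accession": "A5", "strain": "strain_y", "segment": "S"},
]

groups = group_by_strain(records)
print(ambiguous_strains(groups))
```

Here `strain_x` pairs cleanly (one L, one S), while `strain_y` is flagged because it has two L sequences. What to do with flagged strains (drop, pick one, or keep all) remains the open question above.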
I think fine to have the separate metadata/sequences files per segment until we figure out a better way to link segments to the metadata. This will at least simplify the downstream phylogenetic workflow.
metadata_all.tsv.zst: results/metadata_all.tsv
sequences_all.fasta.zst: results/sequences_all.fasta
metadata_L.tsv.zst: results/metadata_L.tsv
sequences_L.fasta.zst: results/sequences_L.fasta
metadata_S.tsv.zst: results/metadata_S.tsv
sequences_S.fasta.zst: results/sequences_S.fasta
Similar to nextstrain/dengue#72, I suggest reorganizing to hierarchical structure on S3
<segment>/metadata.tsv.zst
<segment>/sequences.fasta.zst
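As an illustration of the suggested layout, the mapping from remote keys to local files could be generated rather than listed by hand. This is a rough sketch, not the repository's upload configuration: the function name `upload_map` is hypothetical, and the local `results/<segment>/` layout is an assumption that mirrors the remote structure.

```python
def upload_map(segments, results_dir="results"):
    """Build a mapping of remote key -> local file for each segment.

    Remote keys follow <segment>/metadata.tsv.zst and
    <segment>/sequences.fasta.zst; the local paths assume the same
    hierarchy under results_dir (an assumption for this sketch).
    """
    mapping = {}
    for segment in segments:
        mapping[f"{segment}/metadata.tsv.zst"] = f"{results_dir}/{segment}/metadata.tsv"
        mapping[f"{segment}/sequences.fasta.zst"] = f"{results_dir}/{segment}/sequences.fasta"
    return mapping


files_to_upload = upload_map(["all", "L", "S"])
for remote, local in files_to_upload.items():
    print(f"{remote}: {local}")
```

Adding a segment then only requires extending the list passed to the function, and the build and upload paths stay in step.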
I thought maybe we could link segments by BioSample accession, but >80% of records do not have a linked BioSample 😞
ingest/rules/nextclade.smk
Outdated
rule subset_metadata_by_segment:
    input:
        metadata = "data/subset_metadata.tsv",
        sequences = "results/sequences_{segment}.fasta",
maybe consider results/{segment}/sequences.fasta to keep things consistent across the build and upload?
        min_seed_cover = config["nextclade"]["min_seed_cover"],
    shell:
        """
        nextclade run \
I'm curious what the decision process was around nextclade run versus augur align
Good question! At least in dengue, nextclade run ran much faster and on more data than augur align (see slide 14 here).
I admit I haven't done the same comparison for lassa data, mostly just wanted one that worked and didn't seem noticeably slow.
Co-authored-by: John SJ Anderson <[email protected]>
@@ -8,6 +8,11 @@ workdir: workflow.current_basedir
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

segments = ['L', 'S']
[using this line just to organise discussion into a thread]
The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?
Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.
> The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?
Good question. I have no idea how the underlying data for those builds were processed. The early README implies the data came from fauna, but I don't see any lassa table in fauna...
> Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.
For sure, feel free to add thoughts on segment analysis to nextstrain/pathogen-repo-guide#59.
Across the pipeline and upload steps, organize the files into a hierarchical structure
* <segment>/metadata.tsv.zst
* <segment>/sequences.fasta.zst
a3c9794 to a0a4cf8
Description of proposed changes
After further consideration of the discussion in #8 (comment), I have decided to implement the separation of segments in the ingest workflow. Since Lassa is a segmented virus, it can be treated similarly to influenza, where each segment is stored in a separate file.
The proposed changes include:
Creating separate files for each segment [using Nextclade align]:
results/L/metadata.tsv (previously results/metadata_L.tsv)
results/L/sequences.fasta (previously results/sequences_L.fasta)
results/S/metadata.tsv (previously results/metadata_S.tsv)
results/S/sequences.fasta (previously results/sequences_S.fasta)
Maintaining combined files for all samples:
results/all/metadata.tsv (previously results/metadata_all.tsv)
results/all/sequences.fasta (previously results/sequences_all.fasta)
The combined files are necessary to accommodate samples that [fail to] align to either segment specifically [and for validation and debugging purposes].
Note on segment identification:
The NCBI data dump did not include explicit "segment" annotations. To separate the segments, I employed a combination of Nextclade and segment reference comparisons.
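Once a per-segment FASTA exists, the matching metadata subset can be derived from the record IDs it contains. The sketch below illustrates that idea only and is not the rule's actual implementation: it assumes the per-segment FASTA holds only records assigned to that segment, and that the metadata has an ID column named `accession` (a hypothetical name here). The example data is made up.

```python
import csv
import io


def fasta_ids(handle):
    """Collect record IDs (first token after '>') from a FASTA stream."""
    ids = set()
    for line in handle:
        if line.startswith(">") and len(line.strip()) > 1:
            ids.add(line[1:].split()[0])
    return ids


def subset_metadata(metadata_handle, ids, id_column="accession"):
    """Return the metadata rows (as dicts) whose ID appears in ids."""
    reader = csv.DictReader(metadata_handle, delimiter="\t")
    return [row for row in reader if row[id_column] in ids]


# Made-up inputs standing in for the combined metadata and an L-segment FASTA.
metadata_tsv = "accession\tcountry\nA1\tNigeria\nA2\tSierra Leone\nA3\tLiberia\n"
fasta_L = ">A1\nACGT\n>A3 some description\nGGCC\n"

rows_L = subset_metadata(io.StringIO(metadata_tsv), fasta_ids(io.StringIO(fasta_L)))
print([row["accession"] for row in rows_L])
```

Records that appear in no per-segment FASTA (here `A2`) remain only in the combined files, which is why those are kept.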
Related issue(s)
Checklist