Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate curated data by L and S segments #12

Merged
merged 4 commits into from
Jul 29, 2024
Merged

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jul 23, 2024

Description of proposed changes

After further consideration of the discussion in #8 (comment), I have decided to implement the separation of segments in the ingest workflow. Since Lassa is a segmented virus, it can be treated similarly to influenza, where each segment is stored in a separate file.

The proposed changes include:

Creating separate files for each segment [using Nextclade align]:

  • results/metadata_L.tsv results/L/metadata.tsv
  • results/sequences_L.fasta results/L/sequences.fasta
  • results/metadata_S.tsv results/S/metadata.tsv
  • results/sequences_S.fasta results/S/sequences.fasta

Maintaining combined files for all samples:

  • results/metadata_all.tsv results/all/metadata.tsv
  • results/sequences_all.fasta results/all/sequences.fasta

The combined files are necessary to accommodate samples that may be too short to [fail to] align to either segment specifically [and for validation and debugging purposes].

Note on segment identification:

The NCBI data dump did not include explicit "segment" annotations. To separate the segments, I employed a combination of Nextclade and segment reference comparisons.

nextstrain build ingest dump_ncbi_dataset_report
less ingest/data/ncbi_dataset_report_raw.tsv

Related issue(s)

Checklist

  • Checks pass

The original lassa build creates two segment trees (l, s) with references
defined in config/lassa_{segment}.gb files. This commit creates the cooresponding
fasta files for the segment trees to be used in Nextclade in both ingest or
phylogenetic workflows as needed.
Lassa is a segmented virus, ergo the pairs of metadata and sequences files
can be separated by segment using Nextclade. We still maintain a pair of
metadata_all.tsv and sequences_all.fasta which also include samples that
may be too short to align to either segment.
@jameshadfield
Copy link
Member

Comments from afar:

Why are separate per-segment metadata files necessary? We could follow the (avian-)flu approach of having one metadata file (I think you are calling this metadata_all.tsv). I've found this pattern simplifying in the flu work. You can optionally add boolean columns such as "L_seq", "S_seq", which can be helpful in some situations.

The combined files are necessary to accommodate samples that may be too short to align to either segment specifically.

What use is the combined sequences file? If a strain X is too short to align to either segment then what analysis will make use of it?

@j23414
Copy link
Contributor Author

j23414 commented Jul 24, 2024

If a strain X is too short to align to either segment then what analysis

😆 Ah, I should have said "The combined files are necessary to accommodate samples that do not classify with either segment specifically. I'm guessing from being too short." My primary goal was to create metadata files to feed into https://github.com/j23414/generated-reports/blob/main/reports/lassa.pdf for a closer examination.

Preliminary exploration shows that the non-L and non-S sequences are not just short sequences (my hypothesis is wrong). I plan to look into this later.

Screenshot 2024-07-24 at 9 57 33 AM

@j23414
Copy link
Contributor Author

j23414 commented Jul 24, 2024

having one metadata file ... optionally add boolean columns such as "L_seq", "S_seq"

In the end, I want this 😄 (or a "segment" field that contains "L" or "S"... I don't think a record has both). This way we can modify augur filter in the phylogenetic workflow to separate the segments and there are less files to track.

I'm very interested in learning more about how avian-flu is ingesting data (doesn't seem to be NCBI datasets) and handling segments; and what lessons we can apply to the lassa workflow. Any suggestions or insights/walkthroughs would be greatly appreciated!

@jameshadfield
Copy link
Member

jameshadfield commented Jul 25, 2024

I'm very interested in learning more about how avian-flu is ingesting data

I find the avian-flu structure quite nice, and would think it provides a good jumping off point for all segmented viruses. Perhaps @huddlej and/or @joverlee521 could comment on any differences with seasonal-flu.

We have a single metadata file and n sequences files, one-per-segment. A strain, e.g. A/Texas/37/2024 is present in a single row in the metadata TSV and is also present in each sequence FASTA for which that segment's been sequenced. The segment name is not encoded in the strain name, i.e. the strain name is the same for each segment, and this makes it trivial to join segments together for a genome analysis, draw tangletrees etc. This seems to be different from the current lassa approach as you say "I don't think a record has both ["L" and "S" segments]".

We also encode a column in the metadata TSV n_segments, but we could equally encode 8 columns ("segment_NA", "segment_HA" etc). I think this aspect is workflow-dependent.

@joverlee521
Copy link
Contributor

I just tried the avian-flu fetch-from-ncbi-virus script for lassa to see what we get from the segment field.

./build-configs/ncbi/bin/fetch-from-ncbi-virus 3052310 nextstrain/lassa > data/lassa.csv

The results are not great, with 38% not labelled

segment count percent
1069 38.65
S 955 34.53
L 740 26.75
I 1 0.04
II 1 0.04

@jameshadfield
Copy link
Member

Yeah, that's unfortunate that it's no so well annotated, and was perhaps my expectation after reading Jennifer's comments in this thread. I still think it's worth wrapping up the complexity of segment assignment & forming strain names which match across segments into the ingest pipeline. The end result of ingest being three canonical files: metadata.tsv, L.fasta, S.fasta. Perhaps there's subtleties I'm not quite grasping.

@joverlee521
Copy link
Contributor

I still think it's worth wrapping up the complexity of segment assignment & forming strain names which match across segments into the ingest pipeline.

Yeah, the strain name matching is the hard part. It doesn't look like there's reliable standardized strain names in the lassa data.

We can try to construct the strain name with geo-location/strain or isolate-lineage/year

There would be some overlap, i.e more than 2 sequences per "strain"
geo-location strain isolate-lineage year count count>2 percent
40 True 2.46
Josiah 8 True 0.49
USA: MD Josiah 5 True 0.31
AV 4 True 0.25
Cote d'Ivoire LASV_3604 2015 4 True 0.25
Cote d'Ivoire LASV_3609 2015 4 True 0.25
Cote d'Ivoire LASV_3625 2015 4 True 0.25
Cote d'Ivoire LASV_3629 2015 4 True 0.25
Cote d'Ivoire LASV_3630 2015 4 True 0.25
Cote d'Ivoire LASV_R124 2013 4 True 0.25
Cote d'Ivoire LASV_R132 2013 4 True 0.25
Cote d'Ivoire LASV_CIV139 2013 4 True 0.25
Cote d'Ivoire LASV_3706 2015 4 True 0.25
Cote d'Ivoire LASV_3713 2015 4 True 0.25
Cote d'Ivoire LASV_3715 2015 4 True 0.25
Cote d'Ivoire LASV_3523 2015 4 True 0.25
Cote d'Ivoire LASV_3711 2015 4 True 0.25
Nigeria: Plateau State Nig07-05 2007 3 True 0.18
Nigeria: Federal Capital Territory Nig08-02 2008 3 True 0.18
Ghana: Jirandogo JIR76 2011 3 True 0.18
Sierra Leone LM0034 2009 3 True 0.18
Sierra Leone LM0036 2009 3 True 0.18
Sierra Leone LM0054 2009 3 True 0.18
Sierra Leone LM0058 2009 3 True 0.18
Sierra Leone LM0064 2009 3 True 0.18
Sierra Leone LM0091 2009 3 True 0.18
Sierra Leone LM0092 2009 3 True 0.18
Sierra Leone LM0093 2009 3 True 0.18
Sierra Leone LM0111 2009 3 True 0.18
Sierra Leone LM0122 2009 3 True 0.18
Sierra Leone LM0123 2009 3 True 0.18
Sierra Leone LM0124 2009 3 True 0.18
Sierra Leone LM0224 2009 3 True 0.18
Sierra Leone LM0250 2009 3 True 0.18
Sierra Leone LM0273 2009 3 True 0.18
Sierra Leone LM0473 2009 3 True 0.18
Sierra Leone LM0395 2009 3 True 0.18
Sierra Leone LM0434 2009 3 True 0.18
Sierra Leone LM0582 2009 3 True 0.18
Sierra Leone LM0610 2009 3 True 0.18
Sierra Leone LM0645 2009 3 True 0.18
Sierra Leone LM0657 2009 3 True 0.18
Sierra Leone LM0660 2009 3 True 0.18
Sierra Leone LM0661 2009 3 True 0.18
Sierra Leone LM0676 2009 3 True 0.18
Sierra Leone LM0677 2009 3 True 0.18
Sierra Leone LM0678 2009 3 True 0.18
Sierra Leone LM0680 2009 3 True 0.18
Sierra Leone LM0716 2009 3 True 0.18
Sierra Leone Z0005 2009 3 True 0.18
Sierra Leone Z0007 2009 3 True 0.18
Guinea: Bantou Ban-377 2003 3 True 0.18
Guinea: Brissa Bri-08 2013 3 True 0.18
Guinea: Dalafilani Dal-11 2013 3 True 0.18
Guinea: Damania Dam-05 2013 3 True 0.18
Guinea: Damania Dam-15 2013 3 True 0.18
Guinea: Damania Dam-49 2013 3 True 0.18
Guinea: Denguedou Den-43 2005 3 True 0.18
Guinea: Khoria Kho-29 2013 3 True 0.18
Guinea: Safrani Saf-01 2013 3 True 0.18
Guinea: Silimi Sil-03 2013 3 True 0.18
Guinea: Tanganya Tan-491 2003 3 True 0.18
Guinea: Yarawalia Yar-24 2013 3 True 0.18
Guinea: Yarawalia Yar-46 2013 3 True 0.18
Sierra Leone: Badala Bad-27 2018 3 True 0.18
Sierra Leone: Falaba Fal-26 2017 3 True 0.18
Sierra Leone: Falaba Fal-31 2017 3 True 0.18
Sierra Leone: Makakura Mak-67 2017 3 True 0.18

How should we handle records that have more than 2 sequences per strain name?

Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fine to have the separate metadata/sequences files per segment until we figure out a better way to link segments to the metadata. This will at least simplify the downstream phylogenetic workflow.

Comment on lines 20 to 25
metadata_all.tsv.zst: results/metadata_all.tsv
sequences_all.fasta.zst: results/sequences_all.fasta
metadata_L.tsv.zst: results/metadata_L.tsv
sequences_L.fasta.zst: results/sequences_L.fasta
metadata_S.tsv.zst: results/metadata_S.tsv
sequences_S.fasta.zst: results/sequences_S.fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to nextstrain/dengue#72, I suggest reorganizing to hierarchical structure on S3

<segment>/metadata.tsv.zst
<segment>/sequences.fasta.zst

@joverlee521
Copy link
Contributor

I thought maybe we could link segments by BioSample accession, but >80% of records do not have a linked BioSample 😞

ingest/rules/nextclade.smk Outdated Show resolved Hide resolved
rule subset_metadata_by_segment:
input:
metadata = "data/subset_metadata.tsv",
sequences = "results/sequences_{segment}.fasta",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider results/{segment}/sequences.fasta to keep things consistent across the build and upload?

min_seed_cover = config["nextclade"]["min_seed_cover"],
shell:
"""
nextclade run \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what the decision process was around nextclade run versus augur align

Copy link
Contributor Author

@j23414 j23414 Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! At least in dengue, nextclade run ran much faster and on more data than augur align from slide 14 here.

Screenshot 2024-07-26 at 2 14 32 PM

I admit I haven't done the same comparison for lassa data, mostly just wanted one that worked and didn't seem noticeably slow.

Co-authored-by: John SJ Anderson <[email protected]>
@@ -8,6 +8,11 @@ workdir: workflow.current_basedir
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

segments = ['L', 'S']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[using this line just to organise discussion into a thread]

The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Good question. I have no idea how the underlying data for those builds were processed. The early README implies the data came from fauna, but I don't see any lassa table in fauna...

Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

For sure, feel free to add thoughts on segment analysis to nextstrain/pathogen-repo-guide#59.

Across the pipeline and upload steps, organize the files into a hierarchical structure

* <segment>/metadata.tsv.zst
* <segment>/sequences.fasta.zst
@j23414 j23414 merged commit 9f95104 into master Jul 29, 2024
6 checks passed
@j23414 j23414 deleted the draft-segment-coverage branch July 29, 2024 15:44
@j23414 j23414 mentioned this pull request Aug 12, 2024
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants