Separate curated data by L and S segments #12

j23414 · 2024-07-23T22:17:54Z

Description of proposed changes

After further consideration of the discussion in #8 (comment), I have decided to implement the separation of segments in the ingest workflow. Since Lassa is a segmented virus, it can be treated similarly to influenza, where each segment is stored in a separate file.

The proposed changes include:

Creating separate files for each segment [using Nextclade align]:

~~results/metadata_L.tsv~~ results/L/metadata.tsv
~~results/sequences_L.fasta~~ results/L/sequences.fasta
~~results/metadata_S.tsv~~ results/S/metadata.tsv
~~results/sequences_S.fasta~~ results/S/sequences.fasta

Maintaining combined files for all samples:

~~results/metadata_all.tsv~~ results/all/metadata.tsv
~~results/sequences_all.fasta~~ results/all/sequences.fasta

The combined files are necessary to accommodate samples ~~that may be too short to~~ [fail to] align to either segment specifically [and for validation and debugging purposes].

Note on segment identification:

The NCBI data dump did not include explicit "segment" annotations. To separate the segments, I employed a combination of Nextclade and segment reference comparisons.

nextstrain build ingest dump_ncbi_dataset_report
less ingest/data/ncbi_dataset_report_raw.tsv

Related issue(s)

prereq for Add phylogenetic #8

Checklist

Checks pass

The original lassa build creates two segment trees (l, s) with references defined in config/lassa_{segment}.gb files. This commit creates the cooresponding fasta files for the segment trees to be used in Nextclade in both ingest or phylogenetic workflows as needed.

Lassa is a segmented virus, ergo the pairs of metadata and sequences files can be separated by segment using Nextclade. We still maintain a pair of metadata_all.tsv and sequences_all.fasta which also include samples that may be too short to align to either segment.

jameshadfield · 2024-07-23T22:29:31Z

Comments from afar:

Why are separate per-segment metadata files necessary? We could follow the (avian-)flu approach of having one metadata file (I think you are calling this metadata_all.tsv). I've found this pattern simplifying in the flu work. You can optionally add boolean columns such as "L_seq", "S_seq", which can be helpful in some situations.

The combined files are necessary to accommodate samples that may be too short to align to either segment specifically.

What use is the combined sequences file? If a strain X is too short to align to either segment then what analysis will make use of it?

j23414 · 2024-07-24T14:12:50Z

If a strain X is too short to align to either segment then what analysis

😆 Ah, I should have said "The combined files are necessary to accommodate samples that do not classify with either segment specifically. I'm guessing from being too short." My primary goal was to create metadata files to feed into https://github.com/j23414/generated-reports/blob/main/reports/lassa.pdf for a closer examination.

Preliminary exploration shows that the non-L and non-S sequences are not just short sequences (my hypothesis is wrong). I plan to look into this later.

j23414 · 2024-07-24T14:12:58Z

having one metadata file ... optionally add boolean columns such as "L_seq", "S_seq"

In the end, I want this 😄 (or a "segment" field that contains "L" or "S"... I don't think a record has both). This way we can modify augur filter in the phylogenetic workflow to separate the segments and there are less files to track.

I'm very interested in learning more about how avian-flu is ingesting data (doesn't seem to be NCBI datasets) and handling segments; and what lessons we can apply to the lassa workflow. Any suggestions or insights/walkthroughs would be greatly appreciated!

jameshadfield · 2024-07-25T00:28:46Z

I'm very interested in learning more about how avian-flu is ingesting data

I find the avian-flu structure quite nice, and would think it provides a good jumping off point for all segmented viruses. Perhaps @huddlej and/or @joverlee521 could comment on any differences with seasonal-flu.

We have a single metadata file and n sequences files, one-per-segment. A strain, e.g. A/Texas/37/2024 is present in a single row in the metadata TSV and is also present in each sequence FASTA for which that segment's been sequenced. The segment name is not encoded in the strain name, i.e. the strain name is the same for each segment, and this makes it trivial to join segments together for a genome analysis, draw tangletrees etc. This seems to be different from the current lassa approach as you say "I don't think a record has both ["L" and "S" segments]".

We also encode a column in the metadata TSV n_segments, but we could equally encode 8 columns ("segment_NA", "segment_HA" etc). I think this aspect is workflow-dependent.

joverlee521 · 2024-07-25T21:37:23Z

I just tried the avian-flu fetch-from-ncbi-virus script for lassa to see what we get from the segment field.

./build-configs/ncbi/bin/fetch-from-ncbi-virus 3052310 nextstrain/lassa > data/lassa.csv

The results are not great, with 38% not labelled

segment	count	percent
	1069	38.65
S	955	34.53
L	740	26.75
I	1	0.04
II	1	0.04

jameshadfield · 2024-07-25T22:05:01Z

Yeah, that's unfortunate that it's no so well annotated, and was perhaps my expectation after reading Jennifer's comments in this thread. I still think it's worth wrapping up the complexity of segment assignment & forming strain names which match across segments into the ingest pipeline. The end result of ingest being three canonical files: metadata.tsv, L.fasta, S.fasta. Perhaps there's subtleties I'm not quite grasping.

joverlee521 · 2024-07-25T22:43:06Z

I still think it's worth wrapping up the complexity of segment assignment & forming strain names which match across segments into the ingest pipeline.

Yeah, the strain name matching is the hard part. It doesn't look like there's reliable standardized strain names in the lassa data.

We can try to construct the strain name with geo-location/strain or isolate-lineage/year

There would be some overlap, i.e more than 2 sequences per "strain"

geo-location	strain	isolate-lineage	year	count	count>2	percent
				40	True	2.46
	Josiah			8	True	0.49
USA: MD		Josiah		5	True	0.31
	AV			4	True	0.25
Cote d'Ivoire		LASV_3604	2015	4	True	0.25
Cote d'Ivoire		LASV_3609	2015	4	True	0.25
Cote d'Ivoire		LASV_3625	2015	4	True	0.25
Cote d'Ivoire		LASV_3629	2015	4	True	0.25
Cote d'Ivoire		LASV_3630	2015	4	True	0.25
Cote d'Ivoire		LASV_R124	2013	4	True	0.25
Cote d'Ivoire		LASV_R132	2013	4	True	0.25
Cote d'Ivoire		LASV_CIV139	2013	4	True	0.25
Cote d'Ivoire		LASV_3706	2015	4	True	0.25
Cote d'Ivoire		LASV_3713	2015	4	True	0.25
Cote d'Ivoire		LASV_3715	2015	4	True	0.25
Cote d'Ivoire		LASV_3523	2015	4	True	0.25
Cote d'Ivoire		LASV_3711	2015	4	True	0.25
Nigeria: Plateau State	Nig07-05		2007	3	True	0.18
Nigeria: Federal Capital Territory	Nig08-02		2008	3	True	0.18
Ghana: Jirandogo		JIR76	2011	3	True	0.18
Sierra Leone		LM0034	2009	3	True	0.18
Sierra Leone		LM0036	2009	3	True	0.18
Sierra Leone		LM0054	2009	3	True	0.18
Sierra Leone		LM0058	2009	3	True	0.18
Sierra Leone		LM0064	2009	3	True	0.18
Sierra Leone		LM0091	2009	3	True	0.18
Sierra Leone		LM0092	2009	3	True	0.18
Sierra Leone		LM0093	2009	3	True	0.18
Sierra Leone		LM0111	2009	3	True	0.18
Sierra Leone		LM0122	2009	3	True	0.18
Sierra Leone		LM0123	2009	3	True	0.18
Sierra Leone		LM0124	2009	3	True	0.18
Sierra Leone		LM0224	2009	3	True	0.18
Sierra Leone		LM0250	2009	3	True	0.18
Sierra Leone		LM0273	2009	3	True	0.18
Sierra Leone		LM0473	2009	3	True	0.18
Sierra Leone		LM0395	2009	3	True	0.18
Sierra Leone		LM0434	2009	3	True	0.18
Sierra Leone		LM0582	2009	3	True	0.18
Sierra Leone		LM0610	2009	3	True	0.18
Sierra Leone		LM0645	2009	3	True	0.18
Sierra Leone		LM0657	2009	3	True	0.18
Sierra Leone		LM0660	2009	3	True	0.18
Sierra Leone		LM0661	2009	3	True	0.18
Sierra Leone		LM0676	2009	3	True	0.18
Sierra Leone		LM0677	2009	3	True	0.18
Sierra Leone		LM0678	2009	3	True	0.18
Sierra Leone		LM0680	2009	3	True	0.18
Sierra Leone		LM0716	2009	3	True	0.18
Sierra Leone		Z0005	2009	3	True	0.18
Sierra Leone		Z0007	2009	3	True	0.18
Guinea: Bantou	Ban-377		2003	3	True	0.18
Guinea: Brissa	Bri-08		2013	3	True	0.18
Guinea: Dalafilani	Dal-11		2013	3	True	0.18
Guinea: Damania	Dam-05		2013	3	True	0.18
Guinea: Damania	Dam-15		2013	3	True	0.18
Guinea: Damania	Dam-49		2013	3	True	0.18
Guinea: Denguedou	Den-43		2005	3	True	0.18
Guinea: Khoria	Kho-29		2013	3	True	0.18
Guinea: Safrani	Saf-01		2013	3	True	0.18
Guinea: Silimi	Sil-03		2013	3	True	0.18
Guinea: Tanganya	Tan-491		2003	3	True	0.18
Guinea: Yarawalia	Yar-24		2013	3	True	0.18
Guinea: Yarawalia	Yar-46		2013	3	True	0.18
Sierra Leone: Badala	Bad-27		2018	3	True	0.18
Sierra Leone: Falaba	Fal-26		2017	3	True	0.18
Sierra Leone: Falaba	Fal-31		2017	3	True	0.18
Sierra Leone: Makakura	Mak-67		2017	3	True	0.18

How should we handle records that have more than 2 sequences per strain name?

joverlee521

I think fine to have the separate metadata/sequences files per segment until we figure out a better way to link segments to the metadata. This will at least simplify the downstream phylogenetic workflow.

joverlee521 · 2024-07-25T22:56:42Z

ingest/build-configs/nextstrain-automation/config.yaml

+  metadata_all.tsv.zst: results/metadata_all.tsv
+  sequences_all.fasta.zst: results/sequences_all.fasta
+  metadata_L.tsv.zst: results/metadata_L.tsv
+  sequences_L.fasta.zst: results/sequences_L.fasta
+  metadata_S.tsv.zst: results/metadata_S.tsv
+  sequences_S.fasta.zst: results/sequences_S.fasta


Similar to nextstrain/dengue#72, I suggest reorganizing to hierarchical structure on S3

<segment>/metadata.tsv.zst <segment>/sequences.fasta.zst

joverlee521 · 2024-07-25T23:37:31Z

I thought maybe we could link segments by BioSample accession, but >80% of records do not have a linked BioSample 😞

ingest/rules/nextclade.smk

genehack · 2024-07-26T17:52:16Z

ingest/rules/nextclade.smk

+rule subset_metadata_by_segment:
+    input:
+        metadata = "data/subset_metadata.tsv",
+        sequences = "results/sequences_{segment}.fasta",


maybe consider results/{segment}/sequences.fasta to keep things consistent across the build and upload?

genehack · 2024-07-26T17:53:10Z

ingest/rules/nextclade.smk

+        min_seed_cover = config["nextclade"]["min_seed_cover"],
+    shell:
+        """
+        nextclade run \


I'm curious what the decision process was around nextclade run versus augur align

Good question! At least in dengue, nextclade run ran much faster and on more data than augur align from slide 14 here.

I admit I haven't done the same comparison for lassa data, mostly just wanted one that worked and didn't seem noticeably slow.

Co-authored-by: John SJ Anderson <[email protected]>

jameshadfield · 2024-07-26T20:44:38Z

ingest/Snakefile

@@ -8,6 +8,11 @@ workdir: workflow.current_basedir
 # Use default configuration values. Override with Snakemake's --configfile/--config options.
 configfile: "defaults/config.yaml"

+segments = ['L', 'S']


[using this line just to organise discussion into a thread]

The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

The 2018 Nextstrain Lassa builds have the majority of the strains matched up. Was this using a subset of the data? Or were we perhaps not considering strain name duplications and so the matching may be incorrect?

Good question. I have no idea how the underlying data for those builds were processed. The early README implies the data came from fauna, but I don't see any lassa table in fauna...

Matching samples across segments is a key part of segmented analyses. Absolutely fine to not be part of this PR, and perhaps is something we can never generalise, but is something we should certainly keep revisiting.

For sure, feel free to add thoughts on segment analysis to nextstrain/pathogen-repo-guide#59.

Across the pipeline and upload steps, organize the files into a hierarchical structure * <segment>/metadata.tsv.zst * <segment>/sequences.fasta.zst

j23414 added 2 commits July 23, 2024 17:27

joverlee521 approved these changes Jul 25, 2024

View reviewed changes

joverlee521 mentioned this pull request Jul 26, 2024

ingest: How to handle segmented viruses nextstrain/pathogen-repo-guide#59

Open

genehack approved these changes Jul 26, 2024

View reviewed changes

Fixup: typo in docs

bd5b432

Co-authored-by: John SJ Anderson <[email protected]>

jameshadfield reviewed Jul 26, 2024

View reviewed changes

Set segment/filename structure for ingest workflow

a0a4cf8

Across the pipeline and upload steps, organize the files into a hierarchical structure * <segment>/metadata.tsv.zst * <segment>/sequences.fasta.zst

j23414 force-pushed the draft-segment-coverage branch from a3c9794 to a0a4cf8 Compare July 29, 2024 15:24

j23414 merged commit 9f95104 into master Jul 29, 2024
6 checks passed

j23414 deleted the draft-segment-coverage branch July 29, 2024 15:44

miparedes mentioned this pull request Jul 31, 2024

Update ingest to accommodate oropouche segments nextstrain/oropouche#1

Merged

j23414 mentioned this pull request Aug 12, 2024

Update references to match RefSeq #23

Merged

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate curated data by L and S segments #12

Separate curated data by L and S segments #12

j23414 commented Jul 23, 2024 •

edited

Loading

jameshadfield commented Jul 23, 2024

j23414 commented Jul 24, 2024

j23414 commented Jul 24, 2024 •

edited

Loading

jameshadfield commented Jul 25, 2024 •

edited

Loading

joverlee521 commented Jul 25, 2024

jameshadfield commented Jul 25, 2024

joverlee521 commented Jul 25, 2024

joverlee521 left a comment

joverlee521 Jul 25, 2024

joverlee521 commented Jul 25, 2024

genehack Jul 26, 2024

genehack Jul 26, 2024

j23414 Jul 26, 2024 •

edited

Loading

jameshadfield Jul 26, 2024

joverlee521 Jul 26, 2024

Separate curated data by L and S segments #12

Separate curated data by L and S segments #12

Conversation

j23414 commented Jul 23, 2024 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

jameshadfield commented Jul 23, 2024

j23414 commented Jul 24, 2024

j23414 commented Jul 24, 2024 • edited Loading

jameshadfield commented Jul 25, 2024 • edited Loading

joverlee521 commented Jul 25, 2024

jameshadfield commented Jul 25, 2024

joverlee521 commented Jul 25, 2024

joverlee521 left a comment

Choose a reason for hiding this comment

joverlee521 Jul 25, 2024

Choose a reason for hiding this comment

joverlee521 commented Jul 25, 2024

genehack Jul 26, 2024

Choose a reason for hiding this comment

genehack Jul 26, 2024

Choose a reason for hiding this comment

j23414 Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

jameshadfield Jul 26, 2024

Choose a reason for hiding this comment

joverlee521 Jul 26, 2024

Choose a reason for hiding this comment

j23414 commented Jul 23, 2024 •

edited

Loading

j23414 commented Jul 24, 2024 •

edited

Loading

jameshadfield commented Jul 25, 2024 •

edited

Loading

j23414 Jul 26, 2024 •

edited

Loading