Add Dengue virus DENVx genotypes dataset #203

j23414 · 2024-05-24T17:36:38Z

Add a dengue dataset to Nextclade.

ivan-aksamentov · 2024-05-24T17:45:53Z

Cool!

I'll just drop short links for testing:

DENV1: https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv1

DENV2:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv2

DENV3:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv3

DENV4:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv4

Sadly I don't have any example sequences to run :(

Are there any sequences with permissive licenses available to add them as example sequences into datasets?

Would be nice to fill-in some info to the readme if you have a second. Readme is an optional file though.

j23414 · 2024-05-24T17:49:05Z

I can add some example sequences, it may take me a moment (aka. not in the next hour).

ivan-aksamentov · 2024-05-24T17:50:12Z

@j23414 No worries at all. I will not be able to asses the coolness of it anyways, because I lack required science knowledge. But I'll happily test how it runs and whether any bugs manifest themselves sometimes :)

AngieHinrichs · 2024-05-24T17:50:20Z

FWIW here are some arbitrarily chosen dengue sequences that I use as examples on dev.usher.bio:

https://www.ncbi.nlm.nih.gov/nuccore/OQ605998.1
https://www.ncbi.nlm.nih.gov/nuccore/OQ445967.1
https://www.ncbi.nlm.nih.gov/nuccore/OQ821618.1|
https://www.ncbi.nlm.nih.gov/nuccore/OQ622206.1

And NCBI Virus can provide a bunch: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Dengue%20virus,%20taxid:12637

j23414 · 2024-05-24T17:55:26Z

I lack required science knowledge.

Haha, I also lack the required science knowledge. I'm mostly wandering in the dark.

alex-ranieri · 2024-05-24T19:02:12Z

We need that! @jamessiqueirap and I proposed a lineage system for dengue. Our work utilizes the genotype mutations table from Nextstrain Dengue.

ivan-aksamentov · 2024-05-29T13:35:09Z

A small technical suggestion. These sequences seem to contain many mutations - too many for browser SVG engine to render efficiently in Nextclade's sequence views. If there's a clear "main" gene/CDS of interest for this virus, then one workaround would be to set the default CDS in pathogen.json, so that sequence view automatically switches to it when first rendering:

"defaultCds": "S",

Nuc sequence will of course still be available in the dropdown. But users will pay the associated performance price only if they switch to it.

And, on related note, if you need to customize the order of genes in the dropdown, then you could also add

"cdsOrderPreference": [
  "S",
  "N",
  "M",
  "E"
],

Both are just eye-candy features, so no rush.

The ultimate solution will be to implement a more performant sequence viewer in Nextclade. But this is quite far away.

j23414 · 2024-05-29T16:43:32Z

Thanks @ivan-aksamentov! I think the "main" gene/cds-of-interest for dengue is the E gene, and sometimes it's the only portion of the genome sequenced based on this user comment. I can add to the pathogen.json files following this pattern.

"defaultCds": "E",

Good to know about customizing the order of genes in drop down!

Right now, the dropdown menu matches the gene/cds order in the genome, which feels logical and straightforward to me. However, I welcome differing perspectives on this matter from others in the field. Open to alternatives or potential improvements.

Since dengue sequences seem to contain many mutations - too many for the browser SVG engine to render efficiently in Nextclade's sequence views - we will set the default CDS to display to the E gene as the "main" gene of interest. Viewing the full genome and other gene/CDS regions can still be displayed by selection from the dropdown menu at the top. Flagged by the following comment: nextstrain/nextclade_data#203 (comment)

rneher · 2024-06-02T09:11:03Z

A few additional remarks:

you have added example sequences to the DENV1-4 builds, but the file is missing from the files struct in the pathogen.json.
you didn't enable any QC. is this on purpose? Stop and frameshift QC is quite universal.
I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non serotype sequences. The branch to that outgroup could be artificially shortened is necessary.

Incorporated some changes suggested by comment: #203 (comment) * Fixed pathogen.json for genotype-level dataset to include the example sequences fasta nextstrain/dengue@c029f1d * Enabled stop and frameshift QC nextstrain/dengue@90523a7 * Include reconstructed ancestor for the genotype-level dataset nextstrain/dengue@616979c

j23414 · 2024-06-04T16:52:19Z

Thanks @rneher! I tried to incorporate your suggested changes in 610e3f5

[example sequences] file is missing from the files struct in the pathogen.json.

An oversight on my part, fixed.

you didn't enable any QC. is this on purpose?

I had turned off several QC during development since dengue sequences seemed very divergent. I agree with adding stop and frameshift QC back in, done.

I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non serotype sequences. The branch to that outgroup could be artificially shortened is necessary.

For genotype-level datasets (denv1-4), I swapped in the inferred ancestral root in for the reference and root of the tree. Done, although I could use help in evaluating the genotype-level datasets or any suggested next steps.

I thought about blasting a serotype's sequences against the other 3 serotypes to find the nearest cross-serotype outgroup, but wasn't sure if that would be more or less effective then the inferred ancestral root. Or using the other 3 serotype's inferred-ancestral roots as outgroups. I wasn't sure, but suggestions welcome.

rneher · 2024-06-04T20:14:17Z

I am not sure, but this is now the root of the DENV1 tree:

and an entire clade in that tree is DENV2/S

are you sure this is correct?

rneher · 2024-06-04T20:17:53Z

I think the root should be given a clade unassigned or outgroup or not DENV1 and you could root the tree mid-way to the outgroup. I would also color it grey.

j23414 · 2024-06-04T20:38:19Z

are you sure this is correct?

Gah, I must have copied in the wrong dataset files. I was experimenting with using the "dengue/all" reconstructed root for all 4 serotypes. However, as you observed, it was giving me weird genotype calls (e.g. DENV2 genotypes in the DENV1 tree).

I'll copy the correct ones (and double check this time) in a moment.

ivan-aksamentov · 2024-06-04T21:14:45Z

I allowed myself to resolve merge conflict which appeared after merging measles #202

rneher · 2024-06-07T08:19:37Z

thanks, Jennifer. The dataset also contains the genotype annotations. If these are good, you could enable them by adding to them to the meta.extensions as clade-like attributes.

Also, the example data contain two sequences that don't align. That is not a problem per se if these sequences are very weird (and having examples of bad sequences is fine), but if this is unexpected that one could maybe tune parameters.

j23414 · 2024-06-07T16:41:34Z

The dataset also contains the genotype annotations.

Thanks for the question @rneher! Some clarification that the genotype annotations (named genotype_nextclade) are from some placeholder DENV1-4 Nextclade datasets that were created so long ago I'm unsure of dataset provenance. (ergo, I wouldn't trust it yet). It requires predetermining the serotype (from NCBI annotations) before querying it against the DENV1-4 datasets individually and being appended to the metadata.

The more concerning problem occurs when we zoom into individual serotype trees (e.g. DENV2) where the genotype_nextclade does not quite match the augur clades's clade_membership annotations.

I believe @trvrb was going to explore modifying aa-mut defining mutations in clades_genotypes.tsv to apply to the "all" tree. Currently the aa-coords are by serotype reference (e.g. against the DENV1 reference, against the DENV3 reference which has a two amino acid deletion in E gene, etc).

j23414 · 2024-06-07T16:49:55Z

the example data contain two sequences that don't align.

Thanks for flagging! I assume it's an example sequence with >GenBankID_? headers and not something with a serotype annotation (e.g. >GenBankID_DENV1). I'm expecting them not to align and are representing some sylvatic (in the wild and lowly monitored) samples, but yes we can explore "rescuing" them (or assigning them a new serotype) if that makes biological sense.

ivan-aksamentov · 2024-06-07T17:47:56Z

@j23414 I think why Richard is asking about the unalignable or otherwise "broken" (from the point of view of Nextclade results) example sequences is that we had a situation with SC2 dataset, when users come confused after trying Nextclade with example sequences and receiving error or warning messages. They thought they did something wrong or that there is a bug. So we try to keep examples nice and high quality since then.

From one side, "broken" sequence might tell a story about some interesting science fact or just showcase how Nextclade software handles that particular situation technically - which is interesting. On the other hand, without context it might be unclear for the target audience. If you plan on keeping these samples, then perhaps you could explain the details in the readme. Alternatively, there might be a sciency solution, as you mentioned, to make them "good". Otherwise you could just delete the bad examples to avoid the troubles.

j23414 · 2024-06-07T21:42:23Z

I see, thank you for the context! I agree that dropping them (unalignable example sequences) is the smoother path forward.

Dropped from the other PR #208 in commit 932affc

Since the "all" directory is being added in the other PR, I've ommited the "all" directory here as commit 88171f4

Incorporated some changes suggested by comment: #203 (comment) * Fixed pathogen.json for genotype-level dataset to include the example sequences fasta nextstrain/dengue@c029f1d * Enabled stop and frameshift QC nextstrain/dengue@90523a7 * Include reconstructed ancestor for the genotype-level dataset nextstrain/dengue@616979c

j23414 temporarily deployed to refs/pull/203/merge May 24, 2024 17:36 — with GitHub Actions Inactive

j23414 mentioned this pull request May 28, 2024

Add workflow for producing the Nextclade dengue dataset nextstrain/dengue#25

Merged

2 tasks

j23414 had a problem deploying to refs/pull/203/merge May 28, 2024 21:35 — with GitHub Actions Failure

j23414 temporarily deployed to refs/pull/203/merge May 28, 2024 21:47 — with GitHub Actions Inactive

j23414 temporarily deployed to refs/pull/203/merge May 28, 2024 22:33 — with GitHub Actions Inactive

j23414 temporarily deployed to refs/pull/203/merge May 29, 2024 17:16 — with GitHub Actions Inactive

j23414 temporarily deployed to refs/pull/203/merge May 31, 2024 17:12 — with GitHub Actions Inactive

j23414 added SARS-CoV-2 Related to SARS-CoV-2 and removed SARS-CoV-2 Related to SARS-CoV-2 labels Jun 3, 2024

j23414 mentioned this pull request Jun 3, 2024

Fine tuning the Nextclade all dataset nextstrain/dengue#58

Merged

2 tasks

j23414 temporarily deployed to refs/pull/203/merge June 4, 2024 16:32 — with GitHub Actions Inactive

ivan-aksamentov temporarily deployed to refs/pull/203/merge June 4, 2024 21:14 — with GitHub Actions Inactive

ivan-aksamentov temporarily deployed to refs/pull/203/merge June 4, 2024 21:23 — with GitHub Actions Inactive

j23414 mentioned this pull request Jun 7, 2024

Add Dengue virus All serotypes dataset #208

Merged

j23414 changed the title ~~Add dengue dataset~~ Add Dengue virus DENVx genotypes dataset Jun 7, 2024

j23414 temporarily deployed to refs/pull/203/merge June 7, 2024 21:35 — with GitHub Actions Inactive

j23414 mentioned this pull request Jun 7, 2024

drop unalignable example sequences nextstrain/dengue#66

Merged

1 task

j23414 and others added 5 commits June 17, 2024 12:02

Add dengue dataset

e4f97e2

fixup: Add example sequences for denv1-4

32746aa

fixup: Set defaultCds to E gene for browser SVG engine speedup

625ba7c

feat: remove word "dataset" from dataset names

b3cc967

j23414 force-pushed the add-dengue-dataset branch from d9db8ce to b3cc967 Compare June 17, 2024 19:27

j23414 temporarily deployed to refs/pull/203/merge June 17, 2024 19:27 — with GitHub Actions Inactive

chore: rebuild [skip ci]

5507096

j23414 marked this pull request as draft June 17, 2024 19:28

fixup: Add outgroups

485c186

j23414 deployed to refs/pull/203/merge June 17, 2024 20:31 — with GitHub Actions Active

chore: rebuild [skip ci]

f5fca71

j23414 marked this pull request as ready for review June 17, 2024 21:26

j23414 mentioned this pull request Jun 18, 2024

Setting an outgroup for the "Dengue virus DENVx genotypes" dataset nextstrain/dengue#67

Merged

2 tasks

Add serotype-level annotations to genotype-level trees

17b5bfd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Dengue virus DENVx genotypes dataset #203

Add Dengue virus DENVx genotypes dataset #203

j23414 commented May 24, 2024

ivan-aksamentov commented May 24, 2024 •

edited by j23414

Loading

j23414 commented May 24, 2024

ivan-aksamentov commented May 24, 2024

AngieHinrichs commented May 24, 2024

j23414 commented May 24, 2024

alex-ranieri commented May 24, 2024

ivan-aksamentov commented May 29, 2024 •

edited

Loading

j23414 commented May 29, 2024

rneher commented Jun 2, 2024

j23414 commented Jun 4, 2024

rneher commented Jun 4, 2024 •

edited

Loading

rneher commented Jun 4, 2024

j23414 commented Jun 4, 2024 •

edited

Loading

ivan-aksamentov commented Jun 4, 2024

rneher commented Jun 7, 2024

j23414 commented Jun 7, 2024 •

edited

Loading

j23414 commented Jun 7, 2024

ivan-aksamentov commented Jun 7, 2024 •

edited

Loading

j23414 commented Jun 7, 2024

Add Dengue virus DENVx genotypes dataset #203

Are you sure you want to change the base?

Add Dengue virus DENVx genotypes dataset #203

Conversation

j23414 commented May 24, 2024

ivan-aksamentov commented May 24, 2024 • edited by j23414 Loading

j23414 commented May 24, 2024

ivan-aksamentov commented May 24, 2024

AngieHinrichs commented May 24, 2024

j23414 commented May 24, 2024

alex-ranieri commented May 24, 2024

ivan-aksamentov commented May 29, 2024 • edited Loading

j23414 commented May 29, 2024

rneher commented Jun 2, 2024

j23414 commented Jun 4, 2024

rneher commented Jun 4, 2024 • edited Loading

rneher commented Jun 4, 2024

j23414 commented Jun 4, 2024 • edited Loading

ivan-aksamentov commented Jun 4, 2024

rneher commented Jun 7, 2024

j23414 commented Jun 7, 2024 • edited Loading

j23414 commented Jun 7, 2024

ivan-aksamentov commented Jun 7, 2024 • edited Loading

j23414 commented Jun 7, 2024

ivan-aksamentov commented May 24, 2024 •

edited by j23414

Loading

ivan-aksamentov commented May 29, 2024 •

edited

Loading

rneher commented Jun 4, 2024 •

edited

Loading

j23414 commented Jun 4, 2024 •

edited

Loading

j23414 commented Jun 7, 2024 •

edited

Loading

ivan-aksamentov commented Jun 7, 2024 •

edited

Loading