Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dengue virus DENVx genotypes dataset #203

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented May 24, 2024

Add a dengue dataset to Nextclade.

@j23414 j23414 temporarily deployed to refs/pull/203/merge May 24, 2024 17:36 — with GitHub Actions Inactive
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented May 24, 2024

Cool!

I'll just drop short links for testing:

DENV1: https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv1

DENV2:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv2

DENV3:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv3

DENV4:https://clades.nextstrain.org/?dataset-server=gh:@add-dengue-dataset@&dataset-name=nextstrain/dengue/denv4

Sadly I don't have any example sequences to run :(

Are there any sequences with permissive licenses available to add them as example sequences into datasets?

Would be nice to fill-in some info to the readme if you have a second. Readme is an optional file though.

@j23414
Copy link
Contributor Author

j23414 commented May 24, 2024

I can add some example sequences, it may take me a moment (aka. not in the next hour).

@ivan-aksamentov
Copy link
Member

@j23414 No worries at all. I will not be able to asses the coolness of it anyways, because I lack required science knowledge. But I'll happily test how it runs and whether any bugs manifest themselves sometimes :)

@AngieHinrichs
Copy link

FWIW here are some arbitrarily chosen dengue sequences that I use as examples on dev.usher.bio:

https://www.ncbi.nlm.nih.gov/nuccore/OQ605998.1
https://www.ncbi.nlm.nih.gov/nuccore/OQ445967.1
https://www.ncbi.nlm.nih.gov/nuccore/OQ821618.1|
https://www.ncbi.nlm.nih.gov/nuccore/OQ622206.1

And NCBI Virus can provide a bunch: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Dengue%20virus,%20taxid:12637

@j23414
Copy link
Contributor Author

j23414 commented May 24, 2024

I lack required science knowledge.

Haha, I also lack the required science knowledge. I'm mostly wandering in the dark.

@alex-ranieri
Copy link

We need that! @jamessiqueirap and I proposed a lineage system for dengue. Our work utilizes the genotype mutations table from Nextstrain Dengue.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented May 29, 2024

A small technical suggestion. These sequences seem to contain many mutations - too many for browser SVG engine to render efficiently in Nextclade's sequence views. If there's a clear "main" gene/CDS of interest for this virus, then one workaround would be to set the default CDS in pathogen.json, so that sequence view automatically switches to it when first rendering:

"defaultCds": "S",

Nuc sequence will of course still be available in the dropdown. But users will pay the associated performance price only if they switch to it.

And, on related note, if you need to customize the order of genes in the dropdown, then you could also add

"cdsOrderPreference": [
  "S",
  "N",
  "M",
  "E"
],

Both are just eye-candy features, so no rush.

The ultimate solution will be to implement a more performant sequence viewer in Nextclade. But this is quite far away.

@j23414
Copy link
Contributor Author

j23414 commented May 29, 2024

Thanks @ivan-aksamentov! I think the "main" gene/cds-of-interest for dengue is the E gene, and sometimes it's the only portion of the genome sequenced based on this user comment. I can add to the pathogen.json files following this pattern.

"defaultCds": "E",

Good to know about customizing the order of genes in drop down!

Right now, the dropdown menu matches the gene/cds order in the genome, which feels logical and straightforward to me. However, I welcome differing perspectives on this matter from others in the field. Open to alternatives or potential improvements.

Screenshot 2024-05-29 at 9 27 21 AM

j23414 added a commit to nextstrain/dengue that referenced this pull request May 29, 2024
Since dengue sequences seem to contain many mutations - too many for the browser
SVG engine to render efficiently in Nextclade's sequence views - we will set the
default CDS to display to the E gene as the "main" gene of interest.

Viewing the full genome and other gene/CDS regions can still be displayed by selection
from the dropdown menu at the top.

Flagged by the following comment:

nextstrain/nextclade_data#203 (comment)
@j23414 j23414 temporarily deployed to refs/pull/203/merge May 29, 2024 17:16 — with GitHub Actions Inactive
j23414 added a commit to nextstrain/dengue that referenced this pull request May 30, 2024
Since dengue sequences seem to contain many mutations - too many for the browser
SVG engine to render efficiently in Nextclade's sequence views - we will set the
default CDS to display to the E gene as the "main" gene of interest.

Viewing the full genome and other gene/CDS regions can still be displayed by selection
from the dropdown menu at the top.

Flagged by the following comment:

nextstrain/nextclade_data#203 (comment)
@j23414 j23414 temporarily deployed to refs/pull/203/merge May 31, 2024 17:12 — with GitHub Actions Inactive
@rneher
Copy link
Member

rneher commented Jun 2, 2024

A few additional remarks:

  • you have added example sequences to the DENV1-4 builds, but the file is missing from the files struct in the pathogen.json.
  • you didn't enable any QC. is this on purpose? Stop and frameshift QC is quite universal.
  • I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non serotype sequences. The branch to that outgroup could be artificially shortened is necessary.

@j23414 j23414 added SARS-CoV-2 Related to SARS-CoV-2 and removed SARS-CoV-2 Related to SARS-CoV-2 labels Jun 3, 2024
j23414 added a commit that referenced this pull request Jun 4, 2024
Incorporated some changes suggested by comment:
#203 (comment)

* Fixed pathogen.json for genotype-level dataset to include the example sequences fasta nextstrain/dengue@c029f1d
* Enabled stop and frameshift QC nextstrain/dengue@90523a7
* Include reconstructed ancestor for the genotype-level dataset nextstrain/dengue@616979c
@j23414 j23414 temporarily deployed to refs/pull/203/merge June 4, 2024 16:32 — with GitHub Actions Inactive
@j23414
Copy link
Contributor Author

j23414 commented Jun 4, 2024

Thanks @rneher! I tried to incorporate your suggested changes in 610e3f5

[example sequences] file is missing from the files struct in the pathogen.json.

An oversight on my part, fixed.

you didn't enable any QC. is this on purpose?

I had turned off several QC during development since dengue sequences seemed very divergent. I agree with adding stop and frameshift QC back in, done.

I would include one outgroup sequence (for example your reconstructed ancestor) to pick up non serotype sequences. The branch to that outgroup could be artificially shortened is necessary.

For genotype-level datasets (denv1-4), I swapped in the inferred ancestral root in for the reference and root of the tree. Done, although I could use help in evaluating the genotype-level datasets or any suggested next steps.

I thought about blasting a serotype's sequences against the other 3 serotypes to find the nearest cross-serotype outgroup, but wasn't sure if that would be more or less effective then the inferred ancestral root. Or using the other 3 serotype's inferred-ancestral roots as outgroups. I wasn't sure, but suggestions welcome.

@rneher
Copy link
Member

rneher commented Jun 4, 2024

I am not sure, but this is now the root of the DENV1 tree:
image

and an entire clade in that tree is DENV2/S

image

are you sure this is correct?

@rneher
Copy link
Member

rneher commented Jun 4, 2024

I think the root should be given a clade unassigned or outgroup or not DENV1 and you could root the tree mid-way to the outgroup. I would also color it grey.

@j23414
Copy link
Contributor Author

j23414 commented Jun 4, 2024

are you sure this is correct?

Gah, I must have copied in the wrong dataset files. I was experimenting with using the "dengue/all" reconstructed root for all 4 serotypes. However, as you observed, it was giving me weird genotype calls (e.g. DENV2 genotypes in the DENV1 tree).

I'll copy the correct ones (and double check this time) in a moment.

@ivan-aksamentov
Copy link
Member

I allowed myself to resolve merge conflict which appeared after merging measles #202

@rneher
Copy link
Member

rneher commented Jun 7, 2024

thanks, Jennifer. The dataset also contains the genotype annotations. If these are good, you could enable them by adding to them to the meta.extensions as clade-like attributes.

Also, the example data contain two sequences that don't align. That is not a problem per se if these sequences are very weird (and having examples of bad sequences is fine), but if this is unexpected that one could maybe tune parameters.

@j23414 j23414 changed the title Add dengue dataset Add Dengue virus DENVx genotypes dataset Jun 7, 2024
@j23414
Copy link
Contributor Author

j23414 commented Jun 7, 2024

The dataset also contains the genotype annotations.

Thanks for the question @rneher! Some clarification that the genotype annotations (named genotype_nextclade) are from some placeholder DENV1-4 Nextclade datasets that were created so long ago I'm unsure of dataset provenance. (ergo, I wouldn't trust it yet). It requires predetermining the serotype (from NCBI annotations) before querying it against the DENV1-4 datasets individually and being appended to the metadata.

The more concerning problem occurs when we zoom into individual serotype trees (e.g. DENV2) where the genotype_nextclade does not quite match the augur clades's clade_membership annotations.

I believe @trvrb was going to explore modifying aa-mut defining mutations in clades_genotypes.tsv to apply to the "all" tree. Currently the aa-coords are by serotype reference (e.g. against the DENV1 reference, against the DENV3 reference which has a two amino acid deletion in E gene, etc).

@j23414
Copy link
Contributor Author

j23414 commented Jun 7, 2024

the example data contain two sequences that don't align.

Thanks for flagging! I assume it's an example sequence with >GenBankID_? headers and not something with a serotype annotation (e.g. >GenBankID_DENV1). I'm expecting them not to align and are representing some sylvatic (in the wild and lowly monitored) samples, but yes we can explore "rescuing" them (or assigning them a new serotype) if that makes biological sense.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jun 7, 2024

@j23414 I think why Richard is asking about the unalignable or otherwise "broken" (from the point of view of Nextclade results) example sequences is that we had a situation with SC2 dataset, when users come confused after trying Nextclade with example sequences and receiving error or warning messages. They thought they did something wrong or that there is a bug. So we try to keep examples nice and high quality since then.

From one side, "broken" sequence might tell a story about some interesting science fact or just showcase how Nextclade software handles that particular situation technically - which is interesting. On the other hand, without context it might be unclear for the target audience. If you plan on keeping these samples, then perhaps you could explain the details in the readme. Alternatively, there might be a sciency solution, as you mentioned, to make them "good". Otherwise you could just delete the bad examples to avoid the troubles.

@j23414 j23414 temporarily deployed to refs/pull/203/merge June 7, 2024 21:35 — with GitHub Actions Inactive
@j23414
Copy link
Contributor Author

j23414 commented Jun 7, 2024

I see, thank you for the context! I agree that dropping them (unalignable example sequences) is the smoother path forward.

Dropped from the other PR #208 in commit 932affc

Since the "all" directory is being added in the other PR, I've ommited the "all" directory here as commit 88171f4

j23414 and others added 5 commits June 17, 2024 12:02
Incorporated some changes suggested by comment:
#203 (comment)

* Fixed pathogen.json for genotype-level dataset to include the example sequences fasta nextstrain/dengue@c029f1d
* Enabled stop and frameshift QC nextstrain/dengue@90523a7
* Include reconstructed ancestor for the genotype-level dataset nextstrain/dengue@616979c
@j23414 j23414 marked this pull request as draft June 17, 2024 19:28
@j23414 j23414 deployed to refs/pull/203/merge June 17, 2024 20:31 — with GitHub Actions Active
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants