Add workflow for producing the Nextclade dengue dataset #25

Merged
merged 12 commits into from
May 30, 2024
1 change: 1 addition & 0 deletions README.md
@@ -4,6 +4,7 @@ This repository contains two workflows for the analysis of dengue virus data:

- [`ingest/`](./ingest) - Download data from GenBank, clean and curate it and upload it to S3
- [`phylogenetic/`](./phylogenetic) - Make phylogenetic trees for nextstrain.org
- [`nextclade/`](./nextclade) - Make Nextclade datasets for nextstrain/nextclade_data

Each folder contains a README.md with more information.

50 changes: 50 additions & 0 deletions nextclade/README.md
@@ -0,0 +1,50 @@
# Nextclade

Previously, all "official" Nextclade workflows lived in a [central GitHub repository](https://github.com/neherlab/nextclade_data_workflows).
The new standard is to include the Nextclade workflow within the pathogen repo.

This workflow is used to create the Nextclade datasets for this pathogen.
All official Nextclade datasets are available at https://github.com/nextstrain/nextclade_data.

## Workflow Usage

The workflow can be run from the top level pathogen repo directory:
```
nextstrain build nextclade
```

Alternatively, the workflow can be run from within the nextclade directory:
```
cd nextclade
nextstrain build .
```

This produces the default outputs of the nextclade workflow:

- nextclade_dataset(s) = datasets/<build_name>/*

## Defaults

The defaults directory contains all of the default configurations for the Nextclade workflow.

[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters
used for the Nextclade workflow. Use Snakemake's `--configfile`/`--config`
options to override these default values.
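
As a rough illustration of what overriding a default means here, the sketch below merges an override mapping into a defaults mapping in plain Python. It is not Snakemake's own code: `merge_config` and the example values are hypothetical, and it assumes nested keys are merged recursively rather than replaced wholesale.

```python
from copy import deepcopy


def merge_config(defaults, overrides):
    """Return a new dict in which `overrides` take precedence over `defaults`.

    Nested dicts are merged key by key; any other value is replaced.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only; the real defaults live in the workflow's config file.
defaults = {
    "strain_id_field": "genbank_accession",
    "filter": {"group_by": "year region", "min_length": {"genome": 5000}},
}
overrides = {"filter": {"group_by": "year"}}

config = merge_config(defaults, overrides)
print(config["filter"])
```

Only `filter.group_by` changes in this example; sibling keys such as `filter.min_length` keep their default values.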

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules of the core Nextclade workflow.
The modules of the workflow are in separate files to keep the main nextclade [Snakefile](Snakefile) succinct and organized.

The `workdir` is hardcoded to be the nextclade directory so all filepaths for
inputs/outputs should be relative to the nextclade directory.

Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.

## Build configs

The build-configs directory contains custom configs and rules that override and/or
extend the default workflow.

- [test-dataset](build-configs/test-dataset/) - build to test new Nextclade dataset
34 changes: 34 additions & 0 deletions nextclade/Snakefile
@@ -0,0 +1,34 @@
configfile: "config/config_dengue.yaml"

serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']
genes = ['genome']

wildcard_constraints:
    serotype = "|".join(serotypes),
    gene = "|".join(genes)

rule all:
    input:
        auspice_json = expand("auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene=genes),
        nextclade_dataset = expand("datasets/{serotype}/tree.json", serotype=serotypes),
        test_dataset = expand("test_output/{serotype}", serotype=serotypes),

include: "rules/prepare_sequences.smk"
include: "rules/construct_phylogeny.smk"
include: "rules/annotate_phylogeny.smk"
include: "rules/export.smk"
include: "rules/assemble_dataset.smk"

# Include custom rules defined in the config.
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file

rule clean:
    """Removing directories: {params}"""
    params:
        "results ",
        "auspice"
    shell:
        "rm -rfv {params}"
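
To make the targets of `rule all` concrete, here is a plain-Python sketch of the file lists that the `expand(...)` calls above should produce. `expand_targets` is a hypothetical stand-in for Snakemake's `expand`, written under the assumption that every combination of wildcard values is generated.

```python
from itertools import product

serotypes = ["all", "denv1", "denv2", "denv3", "denv4"]
genes = ["genome"]


def expand_targets(pattern, **wildcards):
    """Fill `pattern` with every combination of the given wildcard values."""
    names = list(wildcards)
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*wildcards.values())
    ]


auspice_json = expand_targets(
    "auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene=genes
)
nextclade_dataset = expand_targets("datasets/{serotype}/tree.json", serotype=serotypes)

# The wildcard constraint is a regex alternation built from the same list.
serotype_constraint = "|".join(serotypes)

for path in auspice_json + nextclade_dataset:
    print(path)
```

With five serotypes and one gene, each list holds five paths, so adding a gene to `genes` would multiply the Auspice targets accordingly.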
39 changes: 39 additions & 0 deletions nextclade/config/config_dengue.yaml
@@ -0,0 +1,39 @@
# Sequences must be FASTA and metadata must be TSV
# Both files must be zstd compressed
# Both files must have a {serotype} expandable field to be replaced by all, denv1-denv4
sequences_url: "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst"
metadata_url: "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst"

strain_id_field: "genbank_accession"
display_strain_field: "strain"

filter:
  exclude: "../phylogenetic/config/exclude.txt"
  include: "../phylogenetic/config/include_{serotype}.txt"
  group_by: "year region"
  min_length:
    genome: 5000
    E: 1000
  sequences_per_group:
    all: '10'
    denv1: '36'
    denv2: '36'
    denv3: '36'
    denv4: '36'

traits:
  sampling_bias_correction: '3'
  traits_columns:
    all: 'region serotype_genbank genotype_nextclade'
    denv1: 'country region serotype_genbank genotype_nextclade'
    denv2: 'country region serotype_genbank genotype_nextclade'
    denv3: 'country region serotype_genbank genotype_nextclade'
    denv4: 'country region serotype_genbank genotype_nextclade'

clades:
  clade_definitions:
    all: '../phylogenetic/config/clades_serotypes.tsv'
    denv1: '../phylogenetic/config/clades_genotypes.tsv'
    denv2: '../phylogenetic/config/clades_genotypes.tsv'
    denv3: '../phylogenetic/config/clades_genotypes.tsv'
    denv4: '../phylogenetic/config/clades_genotypes.tsv'
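
The config mixes two per-serotype mechanisms: `{serotype}` placeholders inside strings and mappings keyed by serotype. The sketch below shows one way both could be resolved for a single serotype. How the workflow's rules actually consume these values is not shown in this diff, so `resolve_for_serotype` and the trimmed `config` dict are assumptions made for illustration.

```python
# A trimmed copy of the config above, written as a Python dict.
config = {
    "sequences_url": "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst",
    "metadata_url": "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst",
    "filter": {
        "include": "../phylogenetic/config/include_{serotype}.txt",
        "sequences_per_group": {"all": "10", "denv1": "36", "denv2": "36", "denv3": "36", "denv4": "36"},
    },
}


def resolve_for_serotype(config, serotype):
    """Return the concrete settings for one serotype (hypothetical helper)."""
    return {
        # Placeholder style: substitute the serotype into the template string.
        "sequences_url": config["sequences_url"].format(serotype=serotype),
        "metadata_url": config["metadata_url"].format(serotype=serotype),
        "include": config["filter"]["include"].format(serotype=serotype),
        # Mapping style: look the serotype up as a key; values are quoted in the YAML.
        "sequences_per_group": int(config["filter"]["sequences_per_group"][serotype]),
    }


resolved = resolve_for_serotype(config, "denv2")
print(resolved["sequences_url"])
print(resolved["sequences_per_group"])
```

Because the numeric settings are quoted strings in the YAML, any arithmetic on them needs an explicit conversion like the `int(...)` call here.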
5 changes: 5 additions & 0 deletions nextclade/datasets/all/CHANGELOG.md
@@ -0,0 +1,5 @@
## Unreleased

Initial release for Nextclade v3!

Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
7 changes: 7 additions & 0 deletions nextclade/datasets/all/README.md
@@ -0,0 +1,7 @@
# Nextclade dataset for "Dengue Virus"

## Dataset attributes

Nextclade dataset

Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html
14 changes: 14 additions & 0 deletions nextclade/datasets/all/genome_annotation.gff3
@@ -0,0 +1,14 @@
##gff-version 3
##sequence-region NC_002640.1 1 10649
NC_002640.1 feature gene 102 440 . + . codon_start=1;gene=C;gene_name=C;
NC_002640.1 feature gene 441 713 . + . codon_start=1;gene=pr;gene_name=pr;
NC_002640.1 feature gene 441 938 . + . codon_start=1;gene=M;gene_name=M;
NC_002640.1 feature gene 939 2423 . + . codon_start=1;gene=E;gene_name=E;
NC_002640.1 feature gene 2424 3479 . + . codon_start=1;gene=NS1;gene_name=NS1;
NC_002640.1 feature gene 3480 4133 . + . codon_start=1;gene=NS2A;gene_name=NS2A;
NC_002640.1 feature gene 4134 4523 . + . codon_start=1;gene=NS2B;gene_name=NS2B;
NC_002640.1 feature gene 4524 6377 . + . codon_start=1;gene=NS3;gene_name=NS3;
NC_002640.1 feature gene 6378 6758 . + . codon_start=1;gene=NS4A;gene_name=NS4A;
NC_002640.1 feature gene 6759 6827 . + . codon_start=1;gene=2K;gene_name=2K;
NC_002640.1 feature gene 6828 7562 . + . codon_start=1;gene=NS4B;gene_name=NS4B;
NC_002640.1 feature gene 7563 10262 . + . codon_start=1;gene=NS5;gene_name=NS5;
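
A quick sanity check for an annotation like this one is to confirm that every gene spans a whole number of codons. The sketch below parses the rows above and computes each gene's length; `parse_genes` is a hypothetical helper, and splitting on whitespace is an assumption that holds only because no field here contains a space.

```python
GFF3_TEXT = """\
##gff-version 3
##sequence-region NC_002640.1 1 10649
NC_002640.1 feature gene 102 440 . + . codon_start=1;gene=C;gene_name=C;
NC_002640.1 feature gene 441 713 . + . codon_start=1;gene=pr;gene_name=pr;
NC_002640.1 feature gene 441 938 . + . codon_start=1;gene=M;gene_name=M;
NC_002640.1 feature gene 939 2423 . + . codon_start=1;gene=E;gene_name=E;
NC_002640.1 feature gene 2424 3479 . + . codon_start=1;gene=NS1;gene_name=NS1;
NC_002640.1 feature gene 3480 4133 . + . codon_start=1;gene=NS2A;gene_name=NS2A;
NC_002640.1 feature gene 4134 4523 . + . codon_start=1;gene=NS2B;gene_name=NS2B;
NC_002640.1 feature gene 4524 6377 . + . codon_start=1;gene=NS3;gene_name=NS3;
NC_002640.1 feature gene 6378 6758 . + . codon_start=1;gene=NS4A;gene_name=NS4A;
NC_002640.1 feature gene 6759 6827 . + . codon_start=1;gene=2K;gene_name=2K;
NC_002640.1 feature gene 6828 7562 . + . codon_start=1;gene=NS4B;gene_name=NS4B;
NC_002640.1 feature gene 7563 10262 . + . codon_start=1;gene=NS5;gene_name=NS5;
"""


def parse_genes(text):
    """Map gene name to its (start, end) coordinates, 1-based and inclusive."""
    genes = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives
        fields = line.split()
        start, end = int(fields[3]), int(fields[4])
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if item
        )
        genes[attributes["gene_name"]] = (start, end)
    return genes


genes = parse_genes(GFF3_TEXT)
lengths = {name: end - start + 1 for name, (start, end) in genes.items()}
not_whole_codons = [name for name, length in lengths.items() if length % 3 != 0]

print(lengths["E"], lengths["E"] // 3)
print(not_whole_codons)
```

An empty `not_whole_codons` list means every annotated gene length is a multiple of three.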
67 changes: 67 additions & 0 deletions nextclade/datasets/all/pathogen.json
@@ -0,0 +1,67 @@
{
  "alignmentParams": {
    "minSeedCover": 0.01,
    "minLength": 1000
  },
  "attributes": {
    "name": "Dengue virus All serotypes",
    "reference accession": "NC_002640",
    "reference name": "dengue virus type 4"
  },
  "compatibility": {
    "cli": "3.0.0-alpha.0",
    "web": "3.0.0-alpha.0"
  },
  "deprecated": false,
  "enabled": true,
  "experimental": true,
  "files": {
    "changelog": "CHANGELOG.md",
    "examples": "sequences.fasta",
    "genomeAnnotation": "genome_annotation.gff3",
    "pathogenJson": "pathogen.json",
    "readme": "README.md",
    "reference": "reference.fasta",
    "treeJson": "tree.json"
  },
  "meta": {
    "bugs": "https://github.com/nextstrain/nextclade_data/issues",
    "source code": "https://github.com/nextstrain/nextclade_data"
  },
  "qc": {
    "frameShifts": {
      "enabled": false
    },
    "missingData": {
      "enabled": false,
      "missingDataThreshold": 2700,
      "scoreBias": 300
    },
    "mixedSites": {
      "enabled": false,
      "mixedSitesThreshold": 10
    },
    "privateMutations": {
      "cutoff": 24,
      "enabled": false,
      "typical": 8,
      "weightLabeledSubstitutions": 2,
      "weightReversionSubstitutions": 1,
      "weightUnlabeledSubstitutions": 1
    },
    "snpClusters": {
      "clusterCutOff": 5,
      "enabled": false,
      "scoreWeight": 50,
      "windowSize": 100
    },
    "stopCodons": {
      "enabled": false
    }
  },
  "schemaVersion": "3.0.0",
  "version": {
    "tag": "unreleased"
  },
  "defaultCds": "E"
}
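
Two things in this file are easy to get wrong: `defaultCds` has to name a gene that exists in the annotation, and every QC rule here is currently disabled. The sketch below checks both on an excerpt of the JSON. `default_cds_is_annotated` and `enabled_qc_rules` are hypothetical helpers, not part of Nextclade, and the gene names are copied by hand from `genome_annotation.gff3`.

```python
import json

# Excerpt of pathogen.json; only the fields used by the checks are kept.
PATHOGEN_JSON = """
{
  "files": {"genomeAnnotation": "genome_annotation.gff3", "treeJson": "tree.json"},
  "qc": {
    "frameShifts": {"enabled": false},
    "missingData": {"enabled": false, "missingDataThreshold": 2700, "scoreBias": 300},
    "mixedSites": {"enabled": false, "mixedSitesThreshold": 10},
    "stopCodons": {"enabled": false}
  },
  "defaultCds": "E"
}
"""

# Gene names as listed in genome_annotation.gff3.
ANNOTATED_GENES = ["C", "pr", "M", "E", "NS1", "NS2A", "NS2B", "NS3", "NS4A", "2K", "NS4B", "NS5"]


def default_cds_is_annotated(pathogen, gene_names):
    """True when `defaultCds` refers to a gene present in the annotation."""
    return pathogen.get("defaultCds") in gene_names


def enabled_qc_rules(pathogen):
    """Names of QC rules whose `enabled` flag is true, in sorted order."""
    return sorted(
        name for name, rule in pathogen.get("qc", {}).items() if rule.get("enabled")
    )


pathogen = json.loads(PATHOGEN_JSON)
print(default_cds_is_annotated(pathogen, ANNOTATED_GENES))
print(enabled_qc_rules(pathogen))
```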