This is the nf-core-based pipeline for SpeciesAbundance. This pipeline estimates the relative abundance of sequence reads originating from different species in a sample. This pipeline is designed to be integrated into IRIDA Next. However, it may be run as a stand-alone pipeline.
This pipeline is designed to estimate taxonomic abundance using both single- and paired-end Illumina short-read data. It does not currently accommodate long-read sequencing data (Nanopore or PacBio).
The input to the pipeline is a standard sample sheet (passed as --input samplesheet.csv) that looks like:
sample | fastq_1 | fastq_2 |
---|---|---|
SampleA | file_1.fastq.gz | file_2.fastq.gz |
An example samplesheet has been provided with the pipeline.
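A samplesheet of this shape can also be generated programmatically. The following is a minimal sketch using Python's standard csv module. The helper name write_samplesheet, the SampleB row, and the convention of leaving fastq_2 empty for single-end samples are assumptions for illustration, not something the pipeline documentation confirms.

```python
import csv
import io


def write_samplesheet(rows, handle):
    """Write sample rows as CSV with the columns sample, fastq_1, fastq_2."""
    writer = csv.DictWriter(
        handle,
        fieldnames=["sample", "fastq_1", "fastq_2"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "sample": row["sample"],
            "fastq_1": row["fastq_1"],
            # Assumption: single-end samples leave fastq_2 empty.
            "fastq_2": row.get("fastq_2", ""),
        })


# Hypothetical rows; SampleA mirrors the table above.
buffer = io.StringIO()
write_samplesheet(
    [
        {"sample": "SampleA", "fastq_1": "file_1.fastq.gz", "fastq_2": "file_2.fastq.gz"},
        {"sample": "SampleB", "fastq_1": "single.fastq.gz"},
    ],
    buffer,
)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

To write a real file, pass an open file handle (for example open("samplesheet.csv", "w", newline="")) instead of the in-memory buffer.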
The structure of this file is defined in assets/schema_input.json. Validation of the sample sheet is performed by nf-validation.
speciesabundance accepts the IRIDA-Next format for samplesheets, which can contain an additional column: sample_name.
sample_name: An optional column that overrides sample for outputs (filenames and sample names) and reference assembly identification.
sample_name allows more flexibility in naming output files or sample identification. Unlike sample, sample_name is not required to contain unique values. Nextflow requires unique sample names; therefore, when sample_name values repeat, sample will be suffixed to any sample_name. Non-alphanumeric characters (excluding _, -, and .) will be replaced with "_".
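The naming rules above can be sketched roughly as follows. This is an illustration only, not the pipeline's own code: the function name resolve_output_names is hypothetical, the "<sample_name>_<sample>" suffix format is an assumption, and so is the choice to suffix only the repeated names.

```python
import re
from collections import Counter


def resolve_output_names(rows):
    """Map each unique sample ID to the name that would be used for outputs.

    rows is a list of (sample, sample_name) pairs; sample_name may be empty.
    """
    cleaned = {}
    for sample, sample_name in rows:
        # sample_name overrides sample when it is provided.
        name = sample_name if sample_name else sample
        # Replace anything that is not alphanumeric, "_", "-" or "." with "_".
        cleaned[sample] = re.sub(r"[^A-Za-z0-9_.\-]", "_", name)

    counts = Counter(cleaned.values())
    resolved = {}
    for sample, name in cleaned.items():
        # Assumption: repeated names become "<sample_name>_<sample>".
        resolved[sample] = f"{name}_{sample}" if counts[name] > 1 else name
    return resolved


names = resolve_output_names([
    ("S1", "A1"),
    ("S2", "A1"),      # repeated sample_name
    ("S3", "B 2#x"),   # contains characters that get replaced
    ("S4", ""),        # no sample_name, falls back to sample
])
print(names)
```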
The sample sheet, when including the optional sample_name column, should look like:
sample | sample_name | fastq_1 | fastq_2 |
---|---|---|---|
SampleA | A1 | file_1.fastq.gz | file_2.fastq.gz |
An example samplesheet that includes the sample_name column has been provided with the pipeline.
The mandatory parameters are as follows:
- --input: a URI to the samplesheet as specified in the Input section.
- --outdir: to specify the output results directory.
It is mandatory to provide either --database or both --kraken2_db and --bracken_db.
Please use only:
- --database /path/to/database: to specify the directory containing the database files required by both Kraken2 and Bracken
Or:
- --kraken2_db /path/to/kraken2database: to specify the directory containing the Kraken2 database files, and
- --bracken_db /path/to/brackendatabase: to specify the directory containing the Bracken database files
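The either/or rule for the database options can be checked before launching a run. The sketch below is a hypothetical pre-flight helper (database_args is not part of the pipeline). It assumes that combining --database with the split options should be rejected, which the text above implies but does not state as pipeline behaviour.

```python
def database_args(database=None, kraken2_db=None, bracken_db=None):
    """Return command-line arguments for exactly one database mode."""
    split_given = kraken2_db is not None or bracken_db is not None

    # Assumption: mixing the combined and split options is treated as an error.
    if database is not None and split_given:
        raise ValueError("use --database OR --kraken2_db with --bracken_db, not both")

    if database is not None:
        return ["--database", database]

    # The split mode needs both directories.
    if kraken2_db is not None and bracken_db is not None:
        return ["--kraken2_db", kraken2_db, "--bracken_db", bracken_db]

    raise ValueError("provide --database, or both --kraken2_db and --bracken_db")


print(database_args(database="/path/to/database"))
print(database_args(kraken2_db="/path/to/kraken2database",
                    bracken_db="/path/to/brackendatabase"))
```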
Additionally, you may choose to provide:
- --taxonomic_level: to specify the taxonomic level of the Bracken abundance estimation.
  - Must be one of: S (species) (default), G (genus), O (order), F (family), P (phylum), or K (kingdom)
- --kmer_len: to specify the k-mer length for the Bracken distribution file used to estimate the abundance at the specified taxonomic level.
  - Must be one of: 50, 75, 100 (default), 150, 200, 250, or 300
  - Selecting a lower k-mer length enhances sensitivity, while a higher k-mer length increases specificity.
- --top_n: to specify the number of top results to keep and include in the metadata for IRIDA Next.
  - Default: 5
- -profile: to specify which profile to use (ex: -profile singularity)
- -r [branch]: to specify which GitHub branch you would like to use (ex: -r dev)
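The allowed values and defaults listed above can be captured in a small validation helper. This is a sketch under assumptions: optional_args is a hypothetical name, and the requirement that --top_n be a positive integer is a guess, since only its default is documented.

```python
# Allowed values and defaults taken from the parameter list above.
TAXONOMIC_LEVELS = {
    "S": "species", "G": "genus", "O": "order",
    "F": "family", "P": "phylum", "K": "kingdom",
}
KMER_LENGTHS = (50, 75, 100, 150, 200, 250, 300)


def optional_args(taxonomic_level="S", kmer_len=100, top_n=5):
    """Validate the optional parameters and return them as CLI arguments."""
    if taxonomic_level not in TAXONOMIC_LEVELS:
        raise ValueError(f"taxonomic_level must be one of {sorted(TAXONOMIC_LEVELS)}")
    if kmer_len not in KMER_LENGTHS:
        raise ValueError(f"kmer_len must be one of {KMER_LENGTHS}")
    # Assumption: top_n must be a positive integer.
    if not isinstance(top_n, int) or top_n < 1:
        raise ValueError("top_n must be a positive integer")
    return [
        "--taxonomic_level", taxonomic_level,
        "--kmer_len", str(kmer_len),
        "--top_n", str(top_n),
    ]


print(optional_args())
print(optional_args(taxonomic_level="G", kmer_len=150, top_n=3))
```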
Other parameters (defaults from nf-core) are defined in nextflow_schema.json.
To run the pipeline using the test profile, please run:
nextflow run phac-nml/speciesabundance -profile docker,test --outdir results
The pipeline output will be written to a directory named results. A JSON file for integrating with IRIDA Next will be written to results/iridanext.output.json.gz (as detailed in the Output section).
The following output files are generated by the pipeline:
- fastp/
  - sampleID_{R1/R2}_trimmed.fastq.gz
  - sampleID.fastp.json
  - sampleID.fastp.html
- kraken2/
  - sampleID_kraken2_output.tsv.gz
  - sampleID_kraken2_report.txt
- bracken/
  - sampleID_S_bracken_abundance_unsorted.tsv
  - sampleID_S_bracken.txt
- failure/
  - failures_report.csv
- adjust/
  - sampleID_S_bracken_abundance.csv
  - sampleID_S_adjusted_report.txt
- top/
  - sampleID_S_top_N.csv
- csvtk/
  - merged_topN.csv
- bracken2krona/
  - sampleID.txt
- krona/
  - sampleID.krona.html
A JSON file for loading metadata into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).
An example of what the contents of the IRIDA Next JSON file look like for this pipeline is as follows:
{
  "files": {
    "global": [
      {
        "path": "failure/failures_report.csv"
      }
    ],
    "samples": {
      "sampleID": [
        {"path": "adjust/sampleID_S_bracken_abundances.csv"},
        {"path": "krona/sampleID.krona.html"},
        {"path": "fastp/sampleID.fastp.html"}
      ]
    }
  },
  "metadata": {
    "samples": {
      "sampleID": {
        "taxonomy_level": "S",
        "abundance_1_name": "Bacteroides fragilis",
        "abundance_1_ncbi_taxonomy_id": "817",
        "abundance_1_num_assigned_reads": "28877",
        "abundance_1_fraction_total_reads": "57.77018",
        "abundance_2_name": "Escherichia coli",
        "abundance_2_ncbi_taxonomy_id": "562",
        "abundance_2_num_assigned_reads": "21065",
        "abundance_2_fraction_total_reads": "42.1418",
        "abundance_3_name": "",
        "abundance_3_ncbi_taxonomy_id": "",
        "abundance_3_num_assigned_reads": "",
        "abundance_3_fraction_total_reads": "",
        "abundance_4_name": "",
        "abundance_4_ncbi_taxonomy_id": "",
        "abundance_4_num_assigned_reads": "",
        "abundance_4_fraction_total_reads": "",
        "abundance_5_name": "",
        "abundance_5_ncbi_taxonomy_id": "",
        "abundance_5_num_assigned_reads": "",
        "abundance_5_fraction_total_reads": "",
        "unclassified_name": "unclassified",
        "unclassified_ncbi_taxonomy_id": "0",
        "unclassified_num_assigned_reads": "44",
        "unclassified_fraction_total_reads": "0.08802"
      }
    }
  }
}
Within the files section of this JSON file, all of the output paths are relative to the outdir. Therefore, "path": "adjust/SAMPLE1_S_bracken_abundances.csv" refers to a file located within outdir/adjust/SAMPLE1_S_bracken_abundances.csv.
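For downstream use outside IRIDA Next, the metadata block can be read with the Python standard library. The sketch below assumes the key layout shown in the example JSON (abundance_N_name and so on, with unused ranks stored as empty strings). The function names load_iridanext and top_abundances are hypothetical.

```python
import gzip
import json


def load_iridanext(path):
    """Read a gzip-compressed IRIDA Next JSON file into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def top_abundances(sample_metadata):
    """Collect the non-empty abundance_N_* entries for one sample, by rank."""
    results = []
    rank = 1
    while f"abundance_{rank}_name" in sample_metadata:
        name = sample_metadata[f"abundance_{rank}_name"]
        if name:  # unused ranks are empty strings in the example
            results.append({
                "rank": rank,
                "name": name,
                "ncbi_taxonomy_id": sample_metadata[f"abundance_{rank}_ncbi_taxonomy_id"],
                "num_assigned_reads": int(sample_metadata[f"abundance_{rank}_num_assigned_reads"]),
                "fraction_total_reads": float(sample_metadata[f"abundance_{rank}_fraction_total_reads"]),
            })
        rank += 1
    return results


# Values copied from the example JSON above (ranks 4 and 5 omitted for brevity).
example_metadata = {
    "taxonomy_level": "S",
    "abundance_1_name": "Bacteroides fragilis",
    "abundance_1_ncbi_taxonomy_id": "817",
    "abundance_1_num_assigned_reads": "28877",
    "abundance_1_fraction_total_reads": "57.77018",
    "abundance_2_name": "Escherichia coli",
    "abundance_2_ncbi_taxonomy_id": "562",
    "abundance_2_num_assigned_reads": "21065",
    "abundance_2_fraction_total_reads": "42.1418",
    "abundance_3_name": "",
    "abundance_3_ncbi_taxonomy_id": "",
    "abundance_3_num_assigned_reads": "",
    "abundance_3_fraction_total_reads": "",
}
top = top_abundances(example_metadata)
for entry in top:
    print(entry["rank"], entry["name"], entry["num_assigned_reads"])
```

With a real run, the per-sample dictionaries would come from load_iridanext("results/iridanext.output.json.gz")["metadata"]["samples"].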
If one or more samples fail during pipeline execution, the workflow will still run all other samples in the samplesheet. The samples that fail will be reported in a file named results/failure/failures_report.csv. This CSV file has three columns:
- sample: the name of the sample that failed (matching the input samplesheet)
- module: the module (or process) where the error occurred
- error_message: suggestions that aim to provide insight into potential reasons for the sample failure in the respective process
For example:
sample,module,error_message
[SAMPLE1],FASTP,The input FASTQ file(s) might exhibit either a mismatch in PAIRED files; corruption in one or both SINGLE/PAIRED file(s); or file(s) may not exist in PATH provided by input samplesheet
[SAMPLE2],KRAKEN2,The reads may not have passed the quality control and trimming process OR the database directory may be missing required KRAKEN2 files
[SAMPLE3],BRACKEN,The reads may have failed to classify against the selected Kraken2 database OR the database directory may be missing the Bracken distribution files
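A report with these three columns can be summarised with the standard csv module, for example to see which module each failed sample stopped at. This is a minimal sketch: failures_by_module is a hypothetical helper, and the shortened error messages in the example input are made up for brevity.

```python
import csv
import io
from collections import defaultdict


def failures_by_module(handle):
    """Group failed sample names by the module in which they failed."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["module"]].append(row["sample"])
    return dict(grouped)


# Hypothetical report content using the documented column names.
example_report = io.StringIO(
    "sample,module,error_message\n"
    "SAMPLE1,FASTP,input files may be mismatched or missing\n"
    "SAMPLE2,KRAKEN2,database directory may be incomplete\n"
    "SAMPLE3,KRAKEN2,reads may not have passed trimming\n"
)
print(failures_by_module(example_report))
```

For a real run, open results/failure/failures_report.csv with newline="" and pass the file handle in place of the in-memory example.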
Copyright 2024 Government of Canada
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
https://opensource.org/license/mit/
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
This pipeline includes source code from a Nextflow pipeline for taxon-abundance and an IRIDA plugin for SpeciesAbundance developed by Dan Fornika as a work of the BC Centre for Disease Control Public Health Laboratory (BCCDC-PHL).
The included source code developed by Dan Fornika as a work of the BCCDC-PHL was distributed within the public domain under the Apache Software License version 2.0.
Any such source files in this project that are included from or derived from the original work by Dan Fornika will include a notice.