A relatively simple metagenomics analysis pipeline written in nextflow [1]. The pipeline is based on kraken2/bracken and kaiju, and is supplemented with Krona visualizations and interactive html tables. It is designed to get taxonomic and abundance information for many samples, not to compare different taxonomy assignment tools (although it can be used for that as well).
The pipeline runs in a docker container by default. Both Illumina and Nanopore data can be processed (separately). For a set of fastq files it executes:
fastp - filter and trim reads with default parameters
kraken2 [2] - taxonomic assignment of the reads
bracken [3] - abundance estimation at a single level in the taxonomic tree, e.g. species, using the kraken2 output
kaiju [4] - taxonomic classification of the reads based on maximum exact matches on protein level
krona [5] - plots are generated from the output of kraken2
DataTables - generates an interactive HTML table with the results from bracken for each sample, as well as a summary table for all the samples
MultiQC [6] - aggregates the results into a single html report
The pipeline runs kraken2/bracken or kaiju depending on the parameters supplied: use --kraken_db to run kraken2/bracken or --kaiju_db to run kaiju (or both parameters to run both).
The --kraken_db parameter is a path to a previously downloaded kraken2 database. A collection of ready-to-use kraken2/bracken RefSeq indexes can be downloaded from here.
The --kaiju_db parameter can be one of refseq, progenomes, viruses, plasmids, fungi, nr, nr_euk, mar or rvdb. See the links above for available databases for each tool.
If neither of these parameters is used, the pipeline will just run fastp.
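The parameter logic above can be sketched as a small shell helper that assembles the command line. This is only an illustration: `build_cmd` is a hypothetical function, not part of the pipeline, `/data/k2_standard` is a made-up database path, and the options for pointing the pipeline at your reads are left out on purpose (check `--help` for those). The commands are printed rather than executed.

```shell
# Sketch: print the nextflow command for a given combination of databases.
# build_cmd is a hypothetical helper; only --kraken_db and --kaiju_db are
# taken from this README, everything else is an assumption.
build_cmd() {
  kraken_db="$1"   # path to a kraken2 database, or empty
  kaiju_db="$2"    # kaiju database name, or empty
  cmd="nextflow run angelovangel/nextflow-kraken2"

  if [ -n "$kraken_db" ]; then
    cmd="$cmd --kraken_db $kraken_db"
  fi

  if [ -n "$kaiju_db" ]; then
    # accept only the database names listed in this README
    case "$kaiju_db" in
      refseq|progenomes|viruses|plasmids|fungi|nr|nr_euk|mar|rvdb)
        cmd="$cmd --kaiju_db $kaiju_db" ;;
      *)
        echo "unknown kaiju database: $kaiju_db" >&2
        return 1 ;;
    esac
  fi

  echo "$cmd"
}

build_cmd /data/k2_standard ""        # kraken2/bracken only
build_cmd "" viruses                  # kaiju only
build_cmd /data/k2_standard viruses   # both tools
build_cmd "" ""                       # no database: only fastp would run
```

Validating the kaiju database name before launching avoids starting a run that would fail only after trimming has finished.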
Nothing to install, as long as you have docker and nextflow. Choose a kraken2 and/or a kaiju database (see below), and run the pipeline:
# run with a test dataset (included)
nextflow run angelovangel/nextflow-kraken2 -profile test
# see options and how to run
nextflow run angelovangel/nextflow-kraken2 --help
All output files are in the folder results-kraken2, which is found in the folder with the reads data used for running the pipeline. An example of the outputs, generated with a small Illumina dataset, can be downloaded here.
The outputs are:
timmed_fastq/ - directory with fastq files after trimming; these are also used for taxonomic profiling
bracken_summary_heatmap/table.html - standalone html files with summary information from bracken. Note that these files will be generated only if there are fewer than 34 samples
bracken_summary_long/wide.csv - summary bracken information (all found taxa in all samples), in different formats
kraken2taxonomy_krona.html - an interactive Krona plot of the kraken2 output for all samples
samples/ - directory with individual (per sample) kraken2 and bracken-corrected report files and with the abundance table from bracken (as html and tsv). Tip: the report files can be directly imported in Pavian for nice interactive visualizations.
The --kraken_db parameter takes an absolute path to a folder containing a kraken2 database. See the kraken2 homepage or Ben Langmead's collection for a list of available pre-built databases. These databases have the required Bracken files included (for read lengths 50, 100, 150, 200 and 250). Take care to use the correct --readlen parameter according to your reads data.
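One possible way to choose a --readlen value is to measure the mean read length of a sample and take the closest of the five lengths listed above. The sketch below does that under some assumptions: `mean_read_length` and `nearest_readlen` are hypothetical helpers (not part of the pipeline), the input is standard four-line FASTQ read from stdin, and "closest available length" is a heuristic rather than a rule stated by kraken2 or Bracken.

```shell
# Sketch: estimate a --readlen value from the reads themselves.
# Both functions are hypothetical helpers, not part of the pipeline.

# Mean sequence length of a four-line-per-record FASTQ stream on stdin.
mean_read_length() {
  awk 'NR % 4 == 2 { total += length($0); n++ }
       END { if (n > 0) printf "%d\n", total / n; else print 0 }'
}

# Closest of the read lengths for which the pre-built databases include
# Bracken files (50, 100, 150, 200, 250). Ties go to the shorter length.
nearest_readlen() {
  best=50
  best_diff=-1
  for len in 50 100 150 200 250; do
    diff=$(( $1 - $len ))
    if [ "$diff" -lt 0 ]; then
      diff=$(( 0 - $diff ))
    fi
    if [ "$best_diff" -lt 0 ] || [ "$diff" -lt "$best_diff" ]; then
      best=$len
      best_diff=$diff
    fi
  done
  echo "$best"
}

nearest_readlen 151   # prints 150
```

For a gzipped file, something like `gunzip -c sample.fastq.gz | mean_read_length` gives the mean length, and the result can be passed to `nearest_readlen` to get the value for --readlen.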
Note: although still controversial, recent work has shown that kraken2 may perform better than QIIME in the analysis of 16S amplicons.
The --kaiju_db argument can be one of refseq, progenomes, viruses, plasmids, fungi, nr, nr_euk, mar or rvdb. When this parameter is used, a source database and the taxonomy files are downloaded from the NCBI FTP server, converted into a protein database and indexed (kaiju-makedb). Check the memory and space requirements here before using.
This pipeline just uses some really nice work from others:
[1] P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) https://doi.org/10.1038/nbt.3820
[2] Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019) https://doi.org/10.1186/s13059-019-1891-0
[3] Lu J, Breitwieser FP, Thielen P, Salzberg SL. 2017. Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science 3:e104 https://doi.org/10.7717/peerj-cs.104
[4] Menzel, P., Ng, K. & Krogh, A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat Commun 7, 11257 (2016). https://doi.org/10.1038/ncomms11257
[5] Ondov BD, Bergman NH, Phillippy AM. Interactive metagenomic visualization in a Web browser. BMC Bioinformatics. 2011;12:385. Published 2011 Sep 30. https://doi.org/10.1186/1471-2105-12-385
[6] Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics (2016). https://doi.org/10.1093/bioinformatics/btw354