Yu Wan
This repository consists of helper scripts for submitting ARIBA jobs or processing their outputs.
This pipeline submits PBS (Portable Batch System) jobs of ARIBA to a high-performance computing (HPC) cluster with a control for queue sizes. Users may also opt to run ARIBA locally (without submitting jobs to the HPC system) using this pipeline. Two Nextflow scripts implement this pipeline:
ariba.nf
: Main Nextflow script for the pipeline;ariba.config
: Configuration file ofariba.nf
. By default, each PBS job requests four CPUs and 16 GB RAM from the system. The same request applies to local jobs as well.
Dependencies
- Singularity-compatible Docker image of ARIBA
- Nextflow v20.0 and above
Parameters
fastq
: A glob for read files. It must be braced by a pair of quotation marks and no square brackets should be used. Default:"*_{1,2}.fastq.gz"
.output_dir
: Name and path for the parental directory of output files. Default:output
.db_dir
: Name and path to access an existing ARIBA-compatible ResFinder database. Default:resfinder
.queue_size
: Maximum number of concurrent PBS jobs submitted to the HPC system. Some computation facilities have restrictions for the queue size for fair use. Default: 10.
Example commands
# Install ARIBA's Singularity image through Docker
singularity pull docker://staphb/ariba # Rename the Image file to ariba.sif
# Run the ARIBA Nextflow pipeline for gene detection
nextflow -Djava.io.tmpdir=$PWD run ariba.nf --fastq "$PWD/reads/*_{1,2}.fastq.gz" --db_dir $HOME/db/resfinder --output_dir output -c ariba.config -profile pbs --queue_size 15 -with-singularity $HOME/software/docker/ariba.sif
Outputs
Five subdirectories are created under the parental output directory by the pipeline:
/report
:[sample]_report.tsv
/gene
:[sample]_genes.fna
, decompressed and renamed from ARIBA's output fileassembled_genes.fa.gz
./stats
:[sample]_stats.tsv
, renamed fromdebug.report.tsv
./contig
:[sample]_assemblies.fna.gz
and[sample]_seqs.fna.gz
, renamed from output filesassemblies.fa.gz
andassembled_seqs.fa.gz
of each sample, respectively./log
:[sample]_log_clusters.gz
and[sample]_log_version.txt
, renamed from output fileslog.clusters.gz
andversion_info.txt
of each sample, respectively.
Users may refer to instructions for PAMmaker for a method converting this pipeline's outputs to an allelic presence-absence matrix.
Compile outputs (mlst_report.tsv
, mlst_report.details.tsv
, and report.tsv
) from MLST analysis by ARIBA, assuming the names of ARIBA's output directories are genome names. Namely, the MLST reports should be generated with the command
ariba run db genome1_1.fastq.gz genome1_2.fastq.gz ./genome1
where an output directory shares the name of a genome.
Example command line
find ./* -maxdepth 1 -type d | python ~/ARIBA_toolkit/compileMLST.py
Outputs
STs.tsv
: compiled frommlst_report.tsv
files;scores.tsv
: compiled fromreport.tsv
files, which include values of nucleotide identities;hits.tsv
: compiled frommlst_report.details.tsv
, which contain values of reference coverage (%) and read depths of allele hits.
Concatenate allele sequences of a given list of STs. It is useful for comparing STs at the nucleotide level and MLST-based phylogenetic construction. Currently this script does not support allele variants, namely, requires exact hits of query genomes to reference allele sequences.
Example command lines:
python concMLSTalleles.py -p st_profiles.tsv -s ./sequence_dir -e tfa > mlstSeqs.fna
# Append sequences to an extant FASTA file
python concMLSTalleles.py -p st_profiles_new.tsv -s ./sequence_dir -e tfa >> mlstSeqs.fna
This script assumes input ST profiles are offered in a tab-delimited file with columns of STs and names of MLST loci. (No other columns should be allowed) For example:
ST | locus_1 | locus_2 | locus_3 | locus_4 | locus_5 | locus_6 | locus_7 |
---|---|---|---|---|---|---|---|
1 | 11 | 21 | 31 | 41 | 51 | 61 | 71 |
2 | 12 | 21 | 31 | 41 | 51 | 61 | 71 |
... | ... | ... | ... | ... | ... | ... | ... |
The directory of MLST allele sequences is expected to be the same as that downloaded from the PubMLST
database using ARIBA. (pubmlst_download
). The '-s' parameter does not require an end forward slash
for the path. By default, the delimiter for locus name (for instance, gene adk) and allele number in
an allele name (for example, adk.1) is a period, although some databases use an underscore as the
delimiter.