Tutorial

Arkadiy-Garber edited this page Jun 25, 2024 · 11 revisions

Quick-start

To run SprayNPray on a single genome assembly:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16

The above run will predict genes from the provided genome assembly and query them against a reference database of proteins (using 16 threads). If a metagenome is provided, simply add the --meta flag:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta

Add the --makedb flag if a DIAMOND database does not already exist for your reference protein database. You only need to add this flag once; after the DIAMOND database has been created, it will be detected automatically without this flag.

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --makedb

If you created the DIAMOND database outside of the spraynpray conda environment, be sure to check that the DIAMOND version it was built with exactly matches the version used by SprayNPray. You can check the version by running:

diamond --version
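
If you want to compare versions across environments in a script, a small helper along these lines may help. This is only a sketch: it assumes that the output of diamond --version contains a dotted version number (for example, a line such as "diamond version 2.1.8"), and parse_diamond_version and same_version are hypothetical helpers, not part of SprayNPray or DIAMOND.

```python
import re


def parse_diamond_version(text):
    """Extract a (major, minor, patch) tuple from `diamond --version` output.

    Assumption: the output contains a dotted version number such as
    'diamond version 2.1.8'. Returns None if no such number is found.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def same_version(output_a, output_b):
    """Return True only when both outputs report exactly the same version."""
    version_a = parse_diamond_version(output_a)
    version_b = parse_diamond_version(output_b)
    return version_a is not None and version_a == version_b
```

The text passed to these helpers could be captured with subprocess.run(["diamond", "--version"], capture_output=True, text=True).stdout in each environment.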

Main Output Files

  1. output.csv: Main output summary file, contains all relevant information for each contig: number of genes, length, gene density, GC content, and (importantly) top taxonomic match for each predicted gene.
  2. spraynpray-top100.csv: This lists the top 100 taxonomic matches (along with functional annotation) to each gene. The exact number of top hits can be changed by the user, using the -hits flag.
  3. spraynpray.words.tiff: Word cloud generated from the taxonomic hits listed in the output.csv file.

Differentiating eukaryotic from prokaryotic contigs

Eukaryotic contigs are often markedly different from bacterial and archaeal contigs due to their lower coding density and lower GC content. Moreover, Prodigal, the software used here for gene prediction, is not designed to identify genes in eukaryotes. Thus, contigs assembled from eukaryotic DNA will have predicted genes that are few and far between. Taking advantage of these principles, SprayNPray can segregate eukaryotic contigs from the assembly using the --euk flag (the --fa flag directs the software to write the FASTA files):

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta --euk --fa

By default, this will split the assembly into two separate FASTA files: one with prokaryotic contigs, and one with contigs predicted to be eukaryotic (gene density below 1 gene per 5 kb and GC content below 40%). Alternatively, users can set these parameters manually using flags such as -CD and -GC, and can also incorporate other information that may be known about the sample (e.g. coverage, contig length, taxonomic affiliation):

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta -CD 0.2 -gc 10 -GC 40 -COV 10 -l 10000 --fa

The above command will write contigs with coding densities below 0.2 genes per kb and GC content between 10% and 40% to a separate FASTA file. As additional constraints, it keeps only contigs with an average coverage of less than 10 and a minimum length of 10,000 bp. A maximum coding density of 0.2 means that contigs with fewer than 1 gene per 5 kb are considered eukaryotic.
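
To make the combined thresholds concrete, here is a minimal sketch of the kind of per-contig test described above. It is not SprayNPray's actual code: looks_eukaryotic is a hypothetical function, its parameters merely mirror the -CD, -gc, -GC, -COV and -l values from the example command, and whether each boundary is inclusive or exclusive is an assumption.

```python
def looks_eukaryotic(n_genes, length_bp, gc_percent, coverage,
                     max_density=0.2, min_gc=10.0, max_gc=40.0,
                     max_cov=10.0, min_length=10000):
    """Apply the example thresholds to a single contig.

    Hypothetical illustration only; boundary handling (< vs <=) is assumed.
    """
    if length_bp <= 0:
        return False
    # Coding density expressed as predicted genes per kilobase.
    density = n_genes / (length_bp / 1000.0)
    return (density < max_density
            and min_gc <= gc_percent <= max_gc
            and coverage < max_cov
            and length_bp >= min_length)
```

For instance, a 50,000 bp contig with 3 predicted genes has a density of 0.06 genes per kb and passes the density test, whereas the same contig with 45 genes (0.9 genes per kb) does not.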

Differentiating prokaryotic contigs

Prokaryotes are harder to differentiate from each other. GC content often does not track with phylogeny (with the exception of Bacteria vs. Archaea), and it can vary even within a single clade or lineage. Thus, many binning tools also use tetranucleotide frequency and read coverage. SprayNPray can do this as well, and can also incorporate codon usage bias to hierarchically cluster contigs. However, the strength of SprayNPray lies not in these compositional metrics but in taxonomic metrics: it profiles each contig in an input assembly by querying each predicted gene against a reference database. Users often know the expected taxonomic composition of a sequenced sample. For example, if the microbe of interest in a sample is Pseudomonas aeruginosa, SprayNPray can be invoked like this:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --fa -species aeruginosa -genus Pseudomonas -Class Gammaproteobacteria -phylum Proteobacteria -domain Bacteria

It is important to specify all of the above taxonomic ranks, as redundant names at the species level can cause ambiguities. The above command will produce two FASTA files, one with the suffix 'unmatched.fasta' and one with the suffix 'matched.fasta'. The matched.fasta file will contain contigs where 50% or more of the genes match the indicated species. You can change this cutoff if you'd like to be more or less stringent:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --fa -species aeruginosa -genus Pseudomonas -Class Gammaproteobacteria -phylum Proteobacteria -domain Bacteria -perc 65

This will exclude from the matched.fasta file all contigs where fewer than 65% of the genes match the indicated species. You can apply these criteria at whatever taxonomic rank you'd like; for example, if you only want the Gammaproteobacterial contigs:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --fa -species aeruginosa -genus Pseudomonas -Class Gammaproteobacteria -perc 65
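
The percentage cutoff described above can be pictured with a short sketch. This is an illustration under stated assumptions rather than SprayNPray's implementation: contig_matches is a hypothetical function, each gene is represented by the name of its top taxonomic hit, and genes without a hit (None) are assumed to count toward the total.

```python
def contig_matches(gene_hits, target, perc=50.0):
    """Decide whether a contig belongs in the 'matched' set.

    gene_hits: one top-hit taxon name per predicted gene (None if no hit).
    target:    the taxon name of interest.
    perc:      minimum percentage of genes that must match the target.
    Hypothetical illustration only.
    """
    if not gene_hits:
        return False
    matching = sum(1 for hit in gene_hits if hit == target)
    return 100.0 * matching / len(gene_hits) >= perc
```

With 6 of 10 genes matching the target, the contig would be kept at the default cutoff of 50 but dropped at a cutoff of 65.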

Taxonomic ranks can also be introduced into the main output summary CSV file using the -lvl argument. For example, instead of listing the top species match to each gene, one can set the following to identify the Phylum of each gene hit:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta -lvl Phylum

To incorporate coverage information into binning:

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta -bam genome.bam

Identifying bacteria-to-eukaryote HGTs

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta --hgt

Identifying phages

spray-and-pray.py -g genome.fna -ref reference_proteins.faa -t 16 --meta --phage