kishori82 edited this page Mar 23, 2017 · 9 revisions

Welcome to the FAST wiki! Here we have a knowledge base detailing installation and usage of FAST.

FAST: Optimized threading for fast annotation

Overview

Comparative genomic research relies heavily on protein alignment to infer the metabolic potential of an organism or community. The local alignment search tool LAST uses adaptive seed lengths to run approximately 50 times faster than the standard seed-and-extend algorithm, BLAST. However, LAST can currently use only a single CPU. Here, we present FAST, a multi-threaded, I/O-optimized implementation of LAST that uses a model for algorithmic thread synchronization to make efficient use of multiple CPUs while requiring only 4-8 GB of memory. We demonstrate that FAST is approximately 6 times faster than LAST on real-world environmental data with more than 50,000 sequences, and show that FAST is over 2 times faster than DIAMOND, the fastest available aligner prior to FAST. Finally, we implement new features such as the calculation of BLAST-like e-values and tabular output formats compatible with multiple downstream analysis tools.

Setup and Dependencies

FAST has the following dependencies:

  • make
  • g++ (with pthread support)

To install FAST, please follow the steps below:

  • It is highly suggested to pull the source code and build it rather than downloading a release, as the releases may be behind the current source code.
  1. Clone the GitHub repository, which will create a folder called FAST

OR

Download the latest release and unzip the downloaded file, which will generate a folder called FAST

  2. cd into the FAST folder
  • cd FAST
  3. Compile the executables. Afterwards you will see two executables, "fastdb" and "fastal"
  • make clean
  • make
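
As a quick sanity check after the build, the sketch below confirms that both executables exist and are executable. The function name check_fast_build is hypothetical and not part of FAST; it only inspects the directory you pass to it.

```shell
# Check that a FAST build directory contains both executables.
# Usage: check_fast_build <dir>   (defaults to the current directory)
check_fast_build() {
    dir=${1:-.}
    status=0
    for exe in fastdb fastal; do
        if [ -x "$dir/$exe" ]; then
            echo "found: $exe"
        else
            echo "missing: $exe"
            status=1
        fi
    done
    return $status
}
```

Running `check_fast_build .` from inside the FAST folder after `make` should report both binaries as found.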

Running FAST

The recommended usage is with the optional e-value and bit-score cutoffs, a limit on the number of top hits, and as many threads as there are cores.
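
One way to match the thread count to the number of cores is to query the system before calling fastal. This is a minimal sketch: the helper name cpu_count is hypothetical, and it assumes that either `nproc` (GNU coreutils) or `getconf _NPROCESSORS_ONLN` is available, falling back to 1 otherwise.

```shell
# Print the number of online CPU cores, falling back to 1 if it cannot be determined.
cpu_count() {
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;   # empty or non-numeric answer: use a single thread
    esac
    echo "$n"
}

# Example call (database and file names are placeholders):
#   ./fastal -P "$(cpu_count)" -K 10 -E 1e-6 -S 20 -o output_file my_database my_query.fasta
```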

SUPPORTED ALIGNMENT TYPES:
  • Protein - Protein (blastp like): fastal [options] -o outputFile amino-acid-lastdb-name amino-acid-fasta-sequence-file(s)
  • DNA - DNA (blastn like): fastal [options] -o outputFile DNA-lastdb-name DNA-fasta-sequence-file(s)
INPUT FILE CASES:
  • Both files in an alignment must be of the same sequence type: the target (reference) database and the query (input) sequences must both be nucleotide or both be protein. The file formats must match as well. To align a FASTA query, the reference database must have been formatted with fastdb from a FASTA file; to align a FASTQ query, the reference database must have been formatted with fastdb from a FASTQ file. FASTA and FASTQ alignment guidelines are given below. Please do not mix FASTA and FASTQ inputs and databases, as this is unsupported.
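
A simple pre-flight check can catch a FASTA/FASTQ mismatch before any database is built. The sketch below is not part of FAST, and the function names seq_format and same_seq_format are hypothetical. It relies only on the convention that FASTA records start with ">" and FASTQ records start with "@", and it inspects the source sequence files (not a formatted database). It does not check whether sequences are nucleotide or protein.

```shell
# Guess the sequence format from the first character of the first non-empty line:
# '>' starts a FASTA record, '@' starts a FASTQ record.
seq_format() {
    first=$(awk 'NF { print substr($1, 1, 1); exit }' "$1")
    case $first in
        '>') echo fasta ;;
        '@') echo fastq ;;
        *)   echo unknown ;;
    esac
}

# Succeed only if both files are recognised and share the same format.
same_seq_format() {
    a=$(seq_format "$1")
    b=$(seq_format "$2")
    [ "$a" != unknown ] && [ "$a" = "$b" ]
}
```

For example, `same_seq_format reference.fasta query.fasta || echo "format mismatch"` would warn before fastdb and fastal are run (the file names here are placeholders).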
SIMPLE QUICK START USE EXAMPLE:

Suppose your query sequences are in a file called "query.fasta" and your reference sequences are in a file called "COG.fasta"; both files contain protein sequences in FASTA format.

(a) Prepare sequences with fastdb for subsequent alignment with fastal.

  • ./fastdb [options] [reference formatted database name] [fasta-sequence-file(s)]

    -- e.g. ./fastdb -p COG_formatted COG.fasta

    (note you need -p for protein sequences, for nucleotide sequences omit the -p)

(b) Now align the query.fasta sequences against the COG_formatted database.

  • ./fastal -P [# cores to use ] -K [parsing limit for HSPs] -E [e-value cutoff] -S [score cutoff] -o [outputFile] [reference formatted database name] [input file]

    -- e.g. ./fastal -P 3 -K 10 -E 1e-6 -S 20 -o output_file COG_formatted query.fasta
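
Steps (a) and (b) can be wrapped in a small shell function. This is a minimal sketch rather than anything shipped with FAST: the names run and fast_quickstart and the DRY_RUN variable are hypothetical, the cutoffs are the ones from the example above, and it assumes protein sequences (hence -p) with fastdb and fastal in the current directory.

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: fast_quickstart <reference.fasta> <query.fasta> <output_file> [threads]
fast_quickstart() {
    ref_fasta=$1
    query=$2
    out_file=$3
    threads=${4:-1}
    # Name the formatted database after the reference file: COG.fasta -> COG_formatted
    db=${ref_fasta%.*}_formatted
    run ./fastdb -p "$db" "$ref_fasta" &&
    run ./fastal -P "$threads" -K 10 -E 1e-6 -S 20 -o "$out_file" "$db" "$query"
}
```

Calling `DRY_RUN=1 fast_quickstart COG.fasta query.fasta output_file 3` prints the two commands from the example without executing them, which is a convenient way to review them first.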

EXAMPLE USAGE:

protein to protein (BLASTP like) (fasta-input) (fasta-database)

  • fastal [options] -o outputFile -P 24 amino-acid-fasta-database amino-acid-fasta

-P (number of threads)

nucleotide to nucleotide (BLASTN like) (fasta-input) (fasta-database)

  • fastal [options] -o outputFile -P 24 nucleotide-database-fasta nucleotide-fasta

-P (number of threads)

nucleotide to nucleotide (BLASTN like) (fastq-input) (fastq-database)

  • fastal [options] -o outputFile -P 24 -F -Q 3 nucleotide-database-fastq nucleotide-fastq

-P (number of threads)

-Q (input format) 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina, 4=prb, 5=PSSM (0, FASTA by default)

FAST Options

FAST Functionality:

  • -V: Version information
  • -S: Optional bit-score cutoff value (20)
  • -E: Optional e-value cutoff value (1e-06)
  • -P: Optional number of threads (1)
  • -K: Optional number of top hits wanted (10)
  • -o: output file
  • -X: Temporary directory path for sorting files

Inherited LAST Functionality: Score options (default settings):

  • -r: match score (DNA: 1, 0<Q<5: 6)
  • -q: mismatch cost (DNA: 1, 0<Q<5: 18)
  • -p: match/mismatch score matrix (protein-protein: BL62, DNA-protein: BL80)
  • -a: gap existence cost (DNA: 7, protein: 11, 0<Q<5: 21)
  • -b: gap extension cost (DNA: 1, protein: 2, 0<Q<5: 9)
  • -A: insertion existence cost (a)
  • -B: insertion extension cost (b)
  • -c: unaligned residue pair cost (off)
  • -x: maximum score drop for gapped alignments (max[y, e-1])
  • -y: maximum score drop for gapless alignments (t*10)
  • -z: maximum score drop for final gapped alignments (x)
  • -d: minimum score for gapless alignments (min[e, t*ln(1000*refSize/n)])
  • -e: minimum score for gapped alignments (DNA: 40, protein: 100, 0<Q<5: 180)

Cosmetic options (default settings):

  • -h: show all options and their default settings
  • -v: be verbose: write messages about what fastal is doing

Miscellaneous options (default settings):

  • -s: strand: 0=reverse, 1=forward, 2=both (2 for DNA, 1 for protein)
  • -T: type of alignment: 0=local, 1=overlap (0)
  • -m: maximum initial matches per query position (10)
  • -l: length threshold for initial matches (1 if -j0, else infinity)
  • -n: maximum gapless alignments per query position (infinity if m=0, else m)
  • -k: step-size along the query sequence (1)
  • -i: query batch size (8 KiB, unless there are multiple lastdb volumes)
  • -u: mask lowercase during extensions: 0=never, 1=gapless, 2=gapless+gapped but not final, 3=always (2 if lastdb -c and Q<5, else 0)
  • -w: suppress repeats inside exact matches, offset by this distance or less (1000)
  • -G: genetic code file
  • -t: 'temperature' for calculating probabilities (1/lambda)
  • -g: 'gamma' parameter for gamma-centroid and LAMA (1)
  • -j: output type: 0=match counts, 1=gapless, 2=redundant gapped, 3=gapped, 4=column ambiguity estimates, 5=gamma-centroid, 6=LAMA (3)
  • -Q: input format: 0=fasta, 1=fastq-sanger, 2=fastq-solexa, 3=fastq-illumina, 4=prb, 5=PSSM (0)
NOTE:

In order to use FAST, a reference database must first be formatted with the fastdb binary from a reference sequence file. The "fastdb" binary is created inside the FAST root directory when make is run from within that directory.

Output

FAST produces its output in the tabular format popularized by BLAST. The number of hits reported can be limited with the -K flag, which retains only the top K hits; the retained hits are sorted.
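
Tabular output of this kind is easy to post-process with standard tools. The sketch below keeps the highest-scoring hit per query. It is an illustration under a stated assumption: it expects the standard 12-column BLAST tabular layout, with the query ID in column 1 and the bit score in column 12. Check the column order of your own FAST output before relying on it. The function name best_hits is hypothetical.

```shell
# Keep the highest-scoring hit per query from BLAST-style tabular output.
# ASSUMPTION: tab-separated columns, query ID in column 1, bit score in column 12.
# Queries are printed in the order in which they first appear.
best_hits() {
    awk -F '\t' '
        !($1 in best) || $12 + 0 > score[$1] { best[$1] = $0; score[$1] = $12 + 0 }
        !($1 in seen) { seen[$1] = 1; order[++n] = $1 }
        END { for (i = 1; i <= n; i++) print best[order[i]] }
    ' "$@"
}
```

For example, `best_hits output_file > best_hits.tsv` writes one line per query (the output file name is a placeholder).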

FAST/LAST+ Result Differences

FAST normalizes the score and e-value using the same normalization scheme as BLAST and most modern aligners. This may lead to differences between FAST and LAST results.

More information on the normalization scheme can be found here: http://www.ncbi.nlm.nih.gov/BLAST/tutorial/
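
For orientation, the normalization described in the BLAST tutorial converts a raw alignment score S into a bit score S' using the statistical parameters λ and K of the scoring system, and derives the e-value from the bit score. Here m and n denote the query and database lengths, following the tutorial's notation. This summarizes the standard BLAST formulas and is not a statement of FAST's exact implementation.

```latex
S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad
E = m \, n \, 2^{-S'} = K \, m \, n \, e^{-\lambda S}
```

The two expressions for E agree because 2^{-S'} = e^{-(\lambda S - \ln K)} = K e^{-\lambda S}.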

FAST uses disk-based algorithms to sort the output when memory is low. Memory usage therefore scales gracefully, so that even low-resource machines can use FAST for alignment.