forked from apetkau/ffp-3.19-custom
Modifications to http://sourceforge.net/projects/ffp-phylogeny/ 3.19.
ddooley/ffp-3.19-custom
FFP 3.18 - Feature Frequency Profile Phylogenetics Package
Feb 20, 2012
Author: Gregory E. Sims

This is a collection of programs and utilities for implementing the FFP (Feature Frequency Profile) method of phylogenetic comparison. FFP is a class of alignment-free methods suitable for (whole genome) comparisons from viral to mammalian scale genomes. This method has been used to perform various phylogenetic analyses:

Sims GE and Kim SH (2011) Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs). PNAS, 108, 8329-34.

Jun SR, Sims GE, Wu GA, Kim SH (2010) Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. PNAS, 107, 133-8.

Sims GE, Jun SR, Wu GA, Kim SH (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. PNAS, 106, 2677-82.

Sims GE, Jun SR, Wu GA, Kim SH (2009) Whole-genome phylogeny of mammals: evolutionary information in genic and nongenic regions. PNAS, 106, 17077-82.

The utilities are designed to be combined using unix command pipes. In other words, the output of one program can be linked to the input of another. Many of the scripts can therefore be used as filters in intermediate steps.

This package contains the following programs/scripts:

ffpgui
    [Experimental] A perl/Tk based GUI for performing some of the basic FFP operations. This utility doesn't support grid-based/multiprocessor job flow. ffpgui is still in the beta stage, but it should give an example of what can be done with the utilities listed below. Note: currently ffpgui will only work properly in Cygwin if you download and compile the v804.029 perl/Tk module from CPAN. Automatic installation of perl/Tk by the Cygwin setup.exe will not provide proper functionality (it uses a much older version of perl/Tk). See INSTALL for further instructions. Use --disable-gui to bypass installation.

ffpry
    Constructs an FFP from nucleic acid sequences in FASTA format (.fna).

ffpaa
    Constructs an FFP from amino acid sequences in FASTA format (.faa).

ffprwn
    Performs row normalization of the raw FFP matrix.

ffpjsd
    Calculates the Jensen-Shannon divergence between FFPs and outputs a divergence (distance) matrix. A variety of other distance/similarity metrics are available as well.

ffpboot
    Performs bootstrapping or jackknifing permutation of a raw FFP produced by ffpry or ffpaa.

ffpvocab
    Counts the number of words which are used more often than a particular threshold in the FFP. This utility is used to determine the best range of word lengths to use for a genome collection.

ffpre
    Calculates the relative entropy between the expected and observed frequencies of features of length l (specified on the command line) using an L-2 Markov model.

ffpvprof
    Script which calculates the word usage for a range of l. Runs ffpvocab.

ffpreprof
    Script which calculates the relative entropy between observed and predicted frequencies for a range of l. Runs ffpre.

ffpmerge
    Merges all rows of an FFP into a single row. Use this for merging segments of an FFP, for example different chromosomes of a larger genome.

ffpcol
    Converts an FFP which has been written out in key/value format to a columnar format, so that each column corresponds to the same feature in each row of the FFP.

ffptxt
    Creates a key/value FFP of text data. This is useful for performing an FFP analysis of human language texts. All non-alphanumeric characters are ignored.

ffpfilt
    Eliminates high/low frequency features using frequency cutoffs or probability based cutoffs, assuming a normal or extreme value distribution.

ffpcomplex
    Eliminates high/low complexity features using a complexity cutoff or a probability based cutoff, assuming a normal distribution.

ffpdf
    Finds clade distinguishing (diagnostic) features.
    See Sims GE and Kim SH (2011) PNAS 108.

ffptree
    Builds neighbor joining and UPGMA trees from ffpjsd output.


OTHER REQUISITE PROGRAMS

We suggest that you obtain a copy of PHYLIP (http://evolution.genetics.washington.edu/phylip.html) for building trees; however, you can use any tree building program which will accept distance matrix input. The utility ffpjsd will produce Phylip style 'infile's as well as raw distance matrices. As of version 3.06 a tree building utility, ffptree, is included. It allows you to build Newick style tree output directly as part of an ffp pipeline, and that output is compatible with the Phylip (3.69) utilities.


QUICK START

The best way to get a quick start is to read through the simple tutorial PDF file located in the ./doc dir of this distribution. It is also installed during the make install process in the /usr/local/share/doc/ffp directory (unless you have specified a different base directory using ./configure --prefix during the build process).


EXAMPLES

***** How do I perform FFP comparison on a collection of nucleic acid sequences, using a particular length of feature?

Assuming your nucleic acid files are all in the current working directory and are named with the .fna extension:

    ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd > matrix

For just two files (test1.fna and test2.fna):

    ffpry -l 5 test1.fna test2.fna | ffpcol | ffprwn | ffpjsd > matrix

The above example uses all features of length 5. The output of ffpry will be in key/value form, i.e. pairs of feature sequence followed by the raw count. Each row corresponds to the features from one sequence, in the order of input. The output of ffpry is piped to ffpcol, which converts the key/value form into a columnar form, so that the raw counts in a column correspond to the same feature across rows. The utility ffprwn then normalizes each row of the FFP feature matrix (here, the columnar output of ffpcol), so that each element of that row is a relative frequency.
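To make the row normalization step concrete, here is a minimal awk sketch. It is not the ffprwn implementation. It assumes a columnar FFP is plain text with whitespace-separated counts, one row per sequence. The function name row_normalize and the tab-separated, four-decimal output are illustrative choices only.

```shell
# Hypothetical helper, not part of FFP: divide every count in a row by
# that row's total so that each row becomes a set of relative frequencies.
row_normalize() {
    awk '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i
        for (i = 1; i <= NF; i++) {
            # Guard against an all-zero row to avoid division by zero.
            freq = (total > 0) ? $i / total : 0
            printf "%s%.4f", (i > 1 ? "\t" : ""), freq
        }
        printf "\n"
    }' "$@"
}

# Two toy rows of raw feature counts, one row per sequence.
printf '2 3 0 5\n1 1 1 1\n' | row_normalize
```

Each output row sums to 1, which is what lets a divergence between rows be computed on frequencies rather than on raw counts that depend on sequence length.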
The output is now piped to ffpjsd, which calculates a Jensen-Shannon divergence matrix.

Alternatively you can save the output of each step in intermediate files, in the following form:

    ffpry -l 5 *.fna > vectors
    ffpcol vectors > vectors.col
    ffprwn vectors.col > vectors.row
    ffpjsd vectors.row > matrix

This may be useful if you want to perform multiple analyses on some intermediate file (for example bootstrapping, see below).

**** How do I script and run commands?

All of these commands can be run from a shell script file. For example, in a file named ffptest.sh:

    #!/bin/sh
    ffpry -l 5 *.fna > vectors
    ffpcol vectors > vectors.col
    ffprwn vectors.col > vectors.row
    ffpjsd vectors.row > matrix

Save the script and make it executable using:

    chmod +x ffptest.sh

then run the script from the command line:

    ./ffptest.sh

**** How do I perform bootstrapping?

You can use the utility ffpboot to perform bootstrapping on the output of ffpry or ffpaa.

    ffpry -l 5 *.fna | ffpcol > vectors
    ffpboot vectors | ffprwn | ffpjsd > matrix

**** How do I create multiple bootstrap sets?

The example below creates 20 bootstrap pseudoreplicate JSD matrices.

    ffpry -l 5 *.fna | ffpcol > vectors

    for i in $(seq 1 1 20)
    do
        ffpboot vectors | ffprwn | ffpjsd > matrix.$i
    done

**** How do I create phylip format infiles?

To create phylip format infiles for use with programs such as NEIGHBOR, use the command

    ffpjsd -p [FILE]

which will generate a phylip format infile. FILE names the taxa in the fna files you are using. The taxa names in this file should be delimited by whitespace or newlines, i.e.

    Taxa_1
    Taxa_2
    ...
    Taxa_N

Note that the taxa should be listed in the order in which they are read into ffpry (use ls *.fna to get that ordering).
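One way to keep the names file in step with the input order is to derive it from the same file list. The sketch below is a hypothetical helper (taxa_names is not an FFP utility). It assumes that each .fna file name, minus its directory and extension, is an acceptable taxon name and contains no whitespace.

```shell
# Hypothetical helper, not part of FFP: print one taxon name per input
# file, in argument order, by stripping the directory and .fna extension.
taxa_names() {
    for f in "$@"; do
        basename "$f" .fna
    done
}

# Demonstration with made-up file names; the files need not exist.
taxa_names genomes/Ecoli_K12.fna genomes/Shigella_flexneri.fna
```

Running taxa_names *.fna > names.txt in the working directory writes the names in the same order in which the shell passes the files to ffpry. PHYLIP's classic distance matrix format reserves ten characters for each name, so long file names may need shortening by hand.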
For example:

    ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt > infile

Or without the wildcard (*.fna):

    ffpry -l 5 1.fna 2.fna 3.fna | ffpcol | ffprwn | ffpjsd -p names.txt > infile

The file names.txt should contain the taxa names of 1.fna, 2.fna and 3.fna, in that order.

**** How do I build a neighbor joining tree?

In the spirit of the pipeline concept, you can pipe output directly from ffpjsd into ffptree, provided it is in phylip infile format. Continuing the example from above:

    ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt | ffptree -q > tree

This produces a tree in Newick format. If you want to see the human-readable tree too, remove the -q switch. Note that a lot of output will be produced, but it is written to standard error, so you should redirect standard error to a file if you want to save it for later:

    ffpry -l 5 *.fna | ffpcol | ffprwn | ffpjsd -p names.txt | ffptree 2> progress > tree

**** How do I do bootstrapping for use with Phylip?

    ffpry -l 5 *.fna | ffpcol > vectors

    for i in $(seq 1 1 20)
    do
        ffpboot vectors | ffprwn | ffpjsd -p species.txt >> infile
    done

This will create a multiple dataset file for use with phylip. Use the 'multiple datasets' option in the neighbor program.

**** How do I perform FFP on large genomes?

The most effective way to do this is to calculate FFPs of segments or units of the genome, for instance by chromosome or by contig. The FFPs of the individual units can then be merged together using the ffpmerge utility. Say you have 10 chromosomes: calculate the FFP of each as a separate process and merge at the end of the calculations. This is especially effective on multiprocessor machines. For example:

    for fna_file in *.fna
    do
        ffpry -l 10 "$fna_file" > "$fna_file.vector" &
    done
    wait
    ffpmerge *.vector > merged.vector

The wait command holds the script until all background ffpry jobs have finished, so that ffpmerge does not read incomplete vector files.

If your cluster machine uses a queueing system (e.g. Grid Engine), then you can create individual shell scripts to give to the scheduler and merge the unit vector files after all scheduled jobs have completed.
A simple example using Grid Engine employs the $SGE_TASK_ID variable. Save a file containing the paths to your sequences (here there are 10 files in total):

    $ cat > sequences.txt
    seq.fna
    seq2.fna
    seq3.fna
    ....
    Ctrl-D

    $ cat > submit.sh
    #!/bin/bash
    #submit.sh
    FILE=`head -n $SGE_TASK_ID < $1 | tail -n 1`
    ffpry -l 10 $FILE > $FILE.vector
    Ctrl-D

In your shell:

    chmod +x submit.sh
    qsub -t 1-10 submit.sh sequences.txt

**** What if the FFP I get from ffprwn is very large and ffpjsd takes a long time?

In this case you may want to break up the calculation of the JSD matrix by assigning specific rows to different CPUs using the -r option. Starting with a normalized FFP with 10 rows:

    ffpjsd -r 1 vector.row > row.1

The other 9 rows can be calculated by other CPUs, once again using a cluster machine and the SGE_TASK_ID variable. The results can be merged again using shell scripting:

    for i in $(seq 1 1 10)
    do
        cat row.$i >> matrix
    done

**** What is the difference between key/value FFPs and columnar FFPs?

A key/value FFP is the form which is generated by default by the programs ffpry and ffpaa. For instance, the following command will generate an FFP of this form:

    ffpry -l 5 test*.fna

The format will resemble:

    RYRRR 2 RRRRY 3 RRRRR 0 ....
    YRRRR 1 YYYRR 1 YRRRY 2 ...
    ...

Each row of the file is an FFP derived from a different sequence file.

Columnar formats are required for input to ffprwn and ffpboot. In this format no feature keys are printed, and the columns in the file correspond to the counts of the same feature in each of the sequence files. For very sparse FFPs the key/value form can generate smaller files.

To convert a key/value FFP into a columnar format for input to the other utilities, the ffpcol utility should be used as a filter.
    ffpry -l 5 test*.fna | ffpcol

The output from ffpcol can be used in ffprwn and ffpjsd:

    ffpry -l 5 test*.fna | ffpcol | ffprwn | ffpjsd

**** How do I use a full 4 letter nucleotide or 20 letter amino acid alphabet?

By default, character classing is used both for nucleotides in ffpry (RY coding) and for amino acids in ffpaa. To disable this classing, specify option -d for ffpry, ffpaa and ffpcol. Please also take note that when you disable classing with the -d option, you may need to add the -d option to subsequent filters such as ffpcol and ffpmerge. For example:

    ffpry -l 5 -d test*.fna | ffpcol -d | ffprwn | ffpjsd

For amino acids:

    ffpaa -l 5 -d test*.faa | ffpcol -d -a | ffprwn | ffpjsd

**** How do I use a spaced seed hash with FFP?

FFP refers to spaced seeds as masks, after the manner in which masks are used in computer programming to 'mask out' certain bit positions in low level bit manipulations of numbers stored in binary format. A spaced seed or mask of '01110' will allow both CAAAG and TAAAA to match each other, as well as to match any 5 letter word with AAA in the middle. A mask can be specified using -w, for example:

    ffpaa -l 5 -d -w "01110" test*.faa | ffpcol -d -a | ffprwn | ffpjsd

If you don't want to supply a mask string explicitly but want to allow a certain number of mismatches, use -z. Here, for example, is how to create a random mask with two mismatches allowed:

    ffpaa -l 5 -d -z 2 test*.faa | ffpcol -d -a | ffprwn | ffpjsd

**** How do I compare text files with FFP?

FFP can compare text files with the utility ffptxt. The procedure is much the same as with nucleic acid and amino acid sequences. Specify text with the -t option when using ffpcol.

    ffptxt -l 4 file*.txt | ffpcol -t | ffprwn | ffpjsd

**** How do I get help?

All the utilities come with their own manual page, which is installed by default when using 'make install'.

    man ffpry

will retrieve the manual for the ffpry utility. If you have installed ffp in some alternate location (i.e. by using ./configure --prefix), then the ffp manuals may not be in a directory listed in your MANPATH environment variable. You can add the manual directory to MANPATH, or simply read the manuals directly using

    man /pathtomanual/ffpry.1

(Note the manual section extension '1'.)

**** Example trees

Some published examples are shown in the distribution 'examples' directory.

**** How do I run FFP in Windows?

Use Cygwin (www.cygwin.com). This package has been developed and tested to perform in both Linux and Cygwin. Cygwin is designed as an emulated Linux/Unix environment which runs on the Windows operating system. It includes the majority of the GNU compilers and utilities that are part of a standard Linux distribution and performs superbly (albeit with a small performance loss because of emulation).

**** Multiple fastas in a single file

By default the ffpry and ffpaa programs assume that a single file, regardless of the number of fasta records contained in that file, represents a single species/genome/proteome. Therefore the l-mer frequencies represent the counts for all fasta records in the file. If you specify multiple files on the command line, i.e.

    ffpry *.fna

which might expand (the expansion is done by your shell, of course) to:

    ffpry test1.fna test2.fna test3.fna

then 3 separate FFP lines will be printed in the output. If in fact you want all of these results to be merged together into one FFP, you can use the ffpmerge utility or simply use the cat command, both of which should produce equivalent results.

    cat *.fna | ffpry

or

    ffpry *.fna | ffpmerge --keys

If, however, you have a single fasta file (or multiple fasta files) with multiple records which you want to be individual FFPs, then you must specify the -m option.

    ffpry -m *.fna

**** How do I implement the alternate Hamming based distance referred to as the Evolutionary FFP distance in Sims and Kim (2011), PNAS, 108, 8329-34?

Trees presented in this paper are included in the examples subdirectory.
To implement this type of analysis on your own, use the following snippet of shell script code (assuming you are using the bash shell). Run it from your working directory containing all your genome fasta files (assuming small single fasta genomes).

    ffpry -l 20 *.fasta | ffpcol | ffpfilt -l 0.05 -u 0.95 -e > ffp.filtered

    for i in {1..100} ; do
        ffpboot -j -p 0.1 ffp.filtered | ffpjsd -H -p species.txt
    done > infile

The features are filtered to remove high and low frequency features. Then 100 pseudoreplicates are created using 10% jackknife sampling. The output is a PHYLIP style infile which can be used directly as input to the PHYLIP tool NEIGHBOR. The option argument to -p is a tab or newline delimited file containing the names of the taxa, in the order in which the files were specified on the command line of the original ffpry invocation. To confirm that the taxa in your 'species.txt' taxa name file are in the right order, execute this shell expansion:

    echo *.fasta

This will show you the order in which you need to specify the names in species.txt (it is identical to the ordering observed using ls *.fasta). Rather than relying on the order from shell expansion, you can specify the genome file arguments explicitly:

    ffpry -l 20 genome1.fasta genome2.fasta genome3.fasta ...

See the examples subdirectory for more information, including sample trees generated using FFP and Phylip.

**** How do I implement the block-FFP method mentioned in Sims GE, Jun SR, Wu GA, Kim SH (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. PNAS, 106, 2677-82?

There is currently no script included in this distribution which will implement this method of genome comparison. Future releases will contain executables which implement Block-FFP.
The main point to keep in mind is that FFP works best when you are comparing genomes/sequences of similar length. A good guideline is to make sure that your genomes are within A-fold of each other in size, where A is the number of symbols in your alphabet.

********

Copyright (C) 2009-2012
Author: Gregory E. Sims
Report Bugs to [email protected]