5. Commands
Annotate is Pseudofinder's core command. Calling this command will identify pseudogene candidates in the input genome annotation, and will produce various output files, explained in detail below.
As with any other Python script, there are two ways to run it:
# Call it directly with python3 (or just python if python3 is your default).
python3 pseudofinder.py
# Or make the file executable once, then rely on its shebang line [#!/usr/bin/env python3].
chmod u+x ./pseudofinder.py
./pseudofinder.py
Providing input files:
# Run the full pipeline on 16 processors (for BlastX/BlastP searches).
# Unless the $BLASTDB environment variable is set on your system, you must provide the full path to the NR database.
python3 pseudofinder.py annotate --genome GENOME.GBF --outprefix PREFIX --database /PATH/TO/NR/nr --threads 16
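When Annotate is launched from another script, it can help to assemble the argument list in one place. The following is a minimal sketch, assuming the script is run from the directory that contains pseudofinder.py. The helper name `build_annotate_command` is hypothetical, and only the flags shown in the command above are used.

```python
import shlex


def build_annotate_command(genome, outprefix, database, threads=16):
    """Assemble the argument list for `pseudofinder.py annotate`.

    Hypothetical helper: it only mirrors the flags shown in the command above.
    """
    return [
        "python3", "pseudofinder.py", "annotate",
        "--genome", genome,
        "--outprefix", outprefix,
        "--database", database,
        "--threads", str(threads),
    ]


cmd = build_annotate_command("GENOME.GBF", "PREFIX", "/PATH/TO/NR/nr")

# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(cmd))

# To actually start the run, pass the list to subprocess:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.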
Output of Annotate:
Every run will produce the following files:
File | Description |
---|---|
[prefix]_interactive.html | Interactive plots which summarize the genome-wide analysis. |
[prefix]_intact.gff | Intact genes in GFF3 format. |
[prefix]_intact.faa | Intact genes in fasta format. |
[prefix]_intergenic.fasta | Intergenic regions in fasta format. |
[prefix]_blastX_output.tsv | Tab-delimited output of BLASTX run on intergenic regions. |
[prefix]_log.txt | Summary of all inputs, outputs, parameters and results. |
[prefix]_map.pdf | Concatenated chromosome map. Input genes appear on the inner track in blue, and candidate pseudogenes are shown in red on the outer track. |
[prefix]_proteome.faa | All protein sequences in fasta format. |
[prefix]_blastP_output.tsv | Tab-delimited output of BLASTP run on proteome. |
[prefix]_pseudos.gff | Candidate pseudogenes in GFF3 format. |
[prefix]_pseudos.fasta | Candidate pseudogenes in fasta format. |
If you include a reference genome, the run will also produce:
File | Description |
---|---|
[prefix]_interactive_dnds.html | Interactive genome-wide dN/dS plot. |
[prefix]_dnds | Directory containing output from the dnds module: BLAST results, a dN/dS summary file, and a folder containing the nucleotide, amino acid, and codon alignments that were used to calculate dN and dS values. |
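After a run, a quick check that every file from the tables above exists can catch an interrupted job. This is a sketch only: it assumes each output is named exactly as the prefix followed by the suffix listed in the tables, and the function names are hypothetical.

```python
from pathlib import Path

# Suffixes copied from the output tables above.
STANDARD_SUFFIXES = [
    "_interactive.html",
    "_intact.gff",
    "_intact.faa",
    "_intergenic.fasta",
    "_blastX_output.tsv",
    "_log.txt",
    "_map.pdf",
    "_proteome.faa",
    "_blastP_output.tsv",
    "_pseudos.gff",
    "_pseudos.fasta",
]
# Produced only when a reference genome is supplied.
REFERENCE_SUFFIXES = ["_interactive_dnds.html", "_dnds"]


def expected_outputs(prefix, with_reference=False):
    """Return the paths a run with this prefix is expected to produce."""
    suffixes = STANDARD_SUFFIXES + (REFERENCE_SUFFIXES if with_reference else [])
    return [Path(prefix + suffix) for suffix in suffixes]


def missing_outputs(prefix, with_reference=False):
    """Return the expected paths that do not exist on disk."""
    return [p for p in expected_outputs(prefix, with_reference) if not p.exists()]
```

For example, `missing_outputs("PREFIX", with_reference=True)` returns an empty list when all thirteen outputs are present.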
The interactive plot is a good place to start engaging with your data. It summarizes all data collected for each feature on the genome, and hovering over an individual feature opens a popup with a more detailed look. Red bars indicate features that have been flagged as pseudogenes, and the popup tells you specifically what kind of pseudogene each one is.
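For a scripted look at the same candidates, the GFF3 outputs can be read directly. The sketch below relies only on the standard nine-column, tab-separated GFF3 layout. It makes no assumption about which attributes Pseudofinder writes in the ninth column, and the two example records are invented for illustration.

```python
import io


def read_gff3_features(handle):
    """Parse feature lines from a GFF3 file handle into dictionaries."""
    features = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section ends the feature table
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        features.append({
            "seqid": seqid,
            "type": ftype,
            "start": int(start),
            "end": int(end),
            "strand": strand,
            "attributes": attrs,
        })
    return features


# Invented records standing in for the contents of [prefix]_pseudos.gff.
example = io.StringIO(
    "##gff-version 3\n"
    "contig_1\texample\tgene\t100\t400\t.\t+\t.\tID=demo_1\n"
    "contig_1\texample\tgene\t900\t1200\t.\t-\t.\tID=demo_2\n"
)
features = read_gff3_features(example)

# GFF3 coordinates are 1-based and inclusive, hence the +1.
lengths = [f["end"] - f["start"] + 1 for f in features]
print(len(features), "features; lengths:", lengths)
```

With a real run, replace the `io.StringIO` example with `open("PREFIX_pseudos.gff")`.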
The sleuth command compares a genome against another, closely related genome. After homologous genes are identified, this module runs PAML on the aligned genes to generate codon alignments and calculate per-gene dN/dS values. These dN/dS values can be used to infer neutral selection and to identify potential cryptic pseudogenes. This module can be invoked within the Annotate command by providing a closely related reference genome with the --reference (-ref) flag.
Usage:
# Call within annotate
python3 pseudofinder.py annotate --genome GENOME.GBF --reference REFERENCE.GBF --outprefix PREFIX --database /PATH/TO/NR/nr --threads 16
# Standalone dN/dS calculation
pseudofinder.py sleuth -a GENOME_PROTS -n GENOME_GENES -ra REFERENCE_PROTS -rn REFERENCE_GENES
Whenever the sleuth module is invoked through the annotate command (use `annotate` with the `--reference` flag), an interactive dN/dS plot will automatically be generated. This plot is helpful for exploring your data and refining the parameters you have chosen for determining pseudogenes. The plot includes a linear regression, a line indicating the chosen dN/dS cutoff (`--max_dnds`), and the calculated genome-wide mean dN/dS value.
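To make the role of the cutoff concrete, the sketch below applies a dN/dS threshold to a handful of per-gene values. Everything here is an assumption for illustration: the gene names, the dN and dS numbers, and the cutoff of 0.25 are invented, and skipping genes with dS = 0 and taking a simple arithmetic mean are choices made for this example, not a description of how Pseudofinder computes its values.

```python
from statistics import mean


def summarize_dnds(dn_ds_by_gene, max_dnds):
    """Compute per-gene dN/dS, the mean ratio, and genes above the cutoff.

    Genes with dS == 0 are skipped because the ratio is undefined
    (an assumption of this sketch).
    """
    ratios = {
        gene: dn / ds
        for gene, (dn, ds) in dn_ds_by_gene.items()
        if ds > 0
    }
    genome_mean = mean(ratios.values())
    above_cutoff = sorted(g for g, r in ratios.items() if r > max_dnds)
    return ratios, genome_mean, above_cutoff


# Invented (dN, dS) pairs; a real analysis would take them from the dN/dS summary file.
example_values = {
    "gene_a": (0.01, 0.10),
    "gene_b": (0.02, 0.40),
    "gene_c": (0.09, 0.30),
    "gene_d": (0.05, 0.0),  # dS of zero: ratio undefined, skipped
}

ratios, genome_mean, above_cutoff = summarize_dnds(example_values, max_dnds=0.25)
print("genes above cutoff:", above_cutoff)
```

Rerunning the function with different `max_dnds` values shows how sensitive the list of flagged genes is to the chosen cutoff, which is the same question the interactive plot helps answer visually.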
Reannotate will run the annotate workflow, beginning after the computationally intensive BLAST and codon alignment steps. This command can very quickly reannotate pseudogenes if you would like to change any downstream parameters. The log file from the previous run will be parsed for previous parameters and files, so please keep the files in the locations described in the log file.
Usage:
pseudofinder.py reannotate -g GENOME -log LOGFILE -op OUTPREFIX
One strength of Pseudofinder is its ability to be fine-tuned to the user's preferences.
To help visualize the effects of changing the parameters of this program, we have provided the visualize command.
This command will display how many pseudogenes will be detected based on any combination of `--length_pseudo` and `--shared_hits`.
Similar to the reannotate module, the log file will be parsed for information about relevant files and parameters.
Usage:
pseudofinder.py visualize -g GENOME -log LOGFILE -op OUTPREFIX
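Because reannotate and visualize take the same three arguments, a small wrapper can build either command from one set of inputs. This is a sketch with a hypothetical helper name. It uses only the `-g`, `-log`, and `-op` flags shown in the usage lines above and assumes pseudofinder.py is in the current directory.

```python
LOG_BASED_SUBCOMMANDS = ("reannotate", "visualize")


def build_log_based_command(subcommand, genome, logfile, outprefix):
    """Assemble the argument list for a subcommand that reuses a previous log file."""
    if subcommand not in LOG_BASED_SUBCOMMANDS:
        raise ValueError(f"expected one of {LOG_BASED_SUBCOMMANDS}, got {subcommand!r}")
    return [
        "python3", "pseudofinder.py", subcommand,
        "-g", genome,
        "-log", logfile,
        "-op", outprefix,
    ]


reannotate_cmd = build_log_based_command("reannotate", "GENOME", "LOGFILE", "OUTPREFIX")
visualize_cmd = build_log_based_command("visualize", "GENOME", "LOGFILE", "OUTPREFIX")
print(" ".join(visualize_cmd))
```

Either list can be handed to `subprocess.run(..., check=True)` once the log file and the files it references are in place.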
With a single command, the entire Pseudofinder workflow can be run on the 139 kbp genome of Candidatus Tremblaya princeps strain PCIT (or optionally, you may provide your own genome).
Simply enter the following command:
python3 pseudofinder.py test --database /PATH/TO/NR/nr
The workflow will begin immediately and write the results to a timestamped folder found in `/pseudo-finder/test/`.
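After several test runs, the most recent results folder can be located programmatically. The sketch below picks the subdirectory with the latest modification time rather than parsing the folder name, because the timestamp format is not specified here. The function name is hypothetical.

```python
from pathlib import Path


def newest_subdirectory(parent):
    """Return the most recently modified subdirectory of `parent`, or None if there is none."""
    subdirs = [p for p in Path(parent).iterdir() if p.is_dir()]
    return max(subdirs, key=lambda p: p.stat().st_mtime, default=None)


# Example (path taken from the text above; adjust to your installation):
#   latest = newest_subdirectory("/pseudo-finder/test/")
#   print(latest)
```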