
5. Commands

Mitch Syberg-Olsen edited this page Jul 13, 2022 · 5 revisions

Annotate

Annotate is Pseudofinder's core command. Calling this command will identify pseudogene candidates in the input genome annotation, and will produce various output files, explained in detail below.

As with any other Python script, there are two ways to run it:

# Call it directly with python3 (or just python if python3 is your default).
python3 pseudofinder.py

# Or make the file executable and rely on its shebang line (#!/usr/bin/env python3).
chmod u+x ./pseudofinder.py
./pseudofinder.py

Providing input files:

# Run the full pipeline with 16 threads (used for the BLASTX/BLASTP searches).
# Unless the $BLASTDB environment variable is set on your system, you must provide the full path to the NR database.
python3 pseudofinder.py annotate --genome GENOME.GBF --outprefix PREFIX --database /PATH/TO/NR/nr --threads 16
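If you launch Pseudofinder from another Python script, the same call can be assembled as an argument list. The following is a minimal sketch: `build_annotate_command` is a hypothetical helper, not part of Pseudofinder, and it uses only the flags shown in the command above. It also assumes `pseudofinder.py` is in the current working directory and runs it with the interpreter that is executing the script.

```python
import subprocess
import sys


def build_annotate_command(genome, outprefix, database, threads=16,
                           script="pseudofinder.py"):
    """Return the argument list for the annotate call shown above.

    Hypothetical helper: only the flags from the example command are used.
    """
    return [
        sys.executable, script, "annotate",
        "--genome", str(genome),
        "--outprefix", str(outprefix),
        "--database", str(database),
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_annotate_command("GENOME.GBF", "PREFIX", "/PATH/TO/NR/nr")
    print(" ".join(cmd))
    # Uncomment to actually start the run (requires Pseudofinder and BLAST+):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.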

Output of Annotate:

Every run will produce the following files:

| File | Description |
|---|---|
| [prefix]_interactive.html | Interactive plots which summarize the genome-wide analysis. |
| [prefix]_intact.gff | Intact genes in GFF3 format. |
| [prefix]_intact.faa | Intact genes in fasta format. |
| [prefix]_intergenic.fasta | Intergenic regions in fasta format. |
| [prefix]_blastX_output.tsv | Tab-delimited output of BLASTX run on intergenic regions. |
| [prefix]_log.txt | Summary of all inputs, outputs, parameters and results. |
| [prefix]_map.pdf | Concatenated chromosome map. Input genes appear on the inner track in blue, and candidate pseudogenes are shown in red on the outer track. |
| [prefix]_proteome.faa | All protein sequences in fasta format. |
| [prefix]_blastP_output.tsv | Tab-delimited output of BLASTP run on proteome. |
| [prefix]_pseudos.gff | Candidate pseudogenes in GFF3 format. |
| [prefix]_pseudos.fasta | Candidate pseudogenes in fasta format. |
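
After a run, it can be handy to confirm that every file listed above was written. A minimal sketch, assuming each output path is simply the value passed to --outprefix followed by the suffix from the table; `missing_outputs` is a hypothetical helper, not part of Pseudofinder.

```python
from pathlib import Path

# Suffixes copied from the table above.
ANNOTATE_SUFFIXES = [
    "_interactive.html",
    "_intact.gff",
    "_intact.faa",
    "_intergenic.fasta",
    "_blastX_output.tsv",
    "_log.txt",
    "_map.pdf",
    "_proteome.faa",
    "_blastP_output.tsv",
    "_pseudos.gff",
    "_pseudos.fasta",
]


def missing_outputs(prefix):
    """Return the expected output files that do not exist for this prefix."""
    expected = [Path(f"{prefix}{suffix}") for suffix in ANNOTATE_SUFFIXES]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    for path in missing_outputs("PREFIX"):
        print(f"missing: {path}")
```

An empty result means all eleven files are present for that prefix.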

If you include a reference genome, the run will also produce:

| File | Description |
|---|---|
| [prefix]_interactive_dnds.html | Interactive genome-wide dN/dS plot. |
| [prefix]_dnds | Directory containing output from the dnds module: BLAST results, dN/dS summary file, and a folder containing the nucleotide, amino acid, and codon alignments that were used to calculate dN and dS values. |

The interactive plot is a good place to start engaging with your data. It summarizes all data collected for each feature on the genome, and hovering over an individual feature opens a popup with a more detailed look. Red bars indicate features which have been flagged as pseudogenes, and the popup will tell you specifically what kind of pseudogene.

Sleuth

The sleuth command compares a genome against a closely related genome. After homologous genes are identified, this module runs PAML on aligned genes to generate codon alignments and calculate per-gene dN/dS values. These dN/dS values can be used to infer neutral selection and potential cryptic pseudogenes. This module can also be invoked within the Annotate command by providing a closely related reference genome with the --reference flag.

Usage:

# Call within annotate
python3 pseudofinder.py annotate --genome GENOME.GBF --reference REFERENCE.GBF --outprefix PREFIX --database /PATH/TO/NR/nr --threads 16 

# Stand-alone dN/dS calculation
pseudofinder.py sleuth -a GENOME_PROTS -n GENOME_GENES -ra REFERENCE_PROTS -rn REFERENCE_GENES

Whenever the sleuth module is invoked through the annotate command (that is, annotate with the --reference flag), an interactive dN/dS plot is generated automatically. This plot is helpful for exploring your data and refining the parameters you use to call pseudogenes. The plot includes a linear regression, a line indicating the chosen dN/dS cutoff (--max_dnds), and the calculated genome-wide mean dN/dS value.
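
To make the cutoff and the mean concrete, here is a minimal sketch using made-up per-gene dN and dS values. It assumes that genes whose dN/dS exceeds the cutoff are the ones of interest, and that the genome-wide mean is the plain average of the per-gene ratios; Pseudofinder's own calculation may differ. The gene names, the values, and the cutoff of 0.3 are illustrative only.

```python
from statistics import mean

# Hypothetical per-gene (dN, dS) values, not real Pseudofinder output.
GENES = {
    "geneA": (0.02, 0.40),
    "geneB": (0.15, 0.30),
    "geneC": (0.01, 0.50),
    "geneD": (0.30, 0.35),
}


def dnds_ratios(genes):
    """Return dN/dS per gene, skipping genes where dS is zero."""
    return {name: dn / ds for name, (dn, ds) in genes.items() if ds > 0}


def above_cutoff(ratios, max_dnds):
    """Return the sorted names of genes whose dN/dS exceeds the cutoff."""
    return sorted(name for name, ratio in ratios.items() if ratio > max_dnds)


if __name__ == "__main__":
    ratios = dnds_ratios(GENES)
    print(f"mean dN/dS: {mean(ratios.values()):.3f}")
    print("above cutoff:", above_cutoff(ratios, max_dnds=0.3))
```

Re-running the filter with a few different cutoff values shows how sensitive the candidate list is to the threshold, which is the same question the interactive plot helps answer visually.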


Reannotate

Reannotate will run the annotate workflow, beginning after the computationally intensive BLAST and codon alignment steps. This command can very quickly reannotate pseudogenes if you would like to change any downstream parameters. The log file from the previous run will be parsed for previous parameters and files, so please keep the files in the locations described in the log file.

Usage:

pseudofinder.py reannotate -g GENOME -log LOGFILE -op OUTPREFIX

Visualize

One strength of Pseudofinder is its ability to be fine-tuned to the user's preferences. To help visualize the effects of changing the parameters of this program, we have provided the visualize command. This command will display how many pseudogenes will be detected based on any combination of --length_pseudo and --shared_hits. Similar to the reannotate module, the log file will be parsed for information about relevant files and parameters.

Usage:

pseudofinder.py visualize -g GENOME -log LOGFILE -op OUTPREFIX 


Test

With a single command, the entire Pseudofinder workflow can be run on the 139 kbp genome of Candidatus Tremblaya princeps strain PCIT (or optionally, you may provide your own genome).

Simply enter the following command:

python3 pseudofinder.py test --database /PATH/TO/NR/nr

The workflow will begin immediately and write the results to a timestamped folder found in /pseudo-finder/test/.

