3. Input Files

Mitch Syberg-Olsen edited this page Feb 24, 2022 · 2 revisions

Pseudofinder requires the genome in GenBank format and a non-redundant protein database formatted for BlastP/BlastX searches. If a reference genome is also provided, Pseudofinder can include dN/dS calculations when identifying pseudogenes.

Genome Assembly Recommendations

We recommend several rounds of Pilon polishing with Illumina reads to improve your consensus sequence [https://github.com/broadinstitute/pilon/wiki], particularly if you are interested in finding pseudogenes in MinION/PacBio assemblies. However, Pseudofinder can also help find sequencing/basecalling errors that may be breaking genes in MinION/PacBio-only assemblies.

Genomes closed into one or several circular-mapping molecules (chromosomes, plasmids, and phages) should ideally be oriented by their origin of replication [e.g. with Ori-Finder 2; http://tubic.tju.edu.cn/Ori-Finder2/] so that genes are not broken at the ends of contigs that the genome assembler linearized at random positions.

Annotation Recommendations

We recommend GenBank (.gbf/.gbk) files generated by Prokka [https://github.com/tseemann/prokka] with the --compliant and --rfam flags. Annotating rRNAs, tRNAs, and other ncRNAs in Prokka is recommended to eliminate false positive 'pseudogene' candidates, because ORFs overlapping non-coding RNAs such as rRNA are sometimes misannotated in databases as 'hypothetical proteins'.

The better your gene predictions are, the more reliable Pseudofinder results will be. If a full-length protein-coding gene in your genome was completely missed by the gene prediction algorithm (e.g. Prodigal), Pseudofinder will currently flag this 'intergenic region' as a potential pseudogene.

Annotating signal peptides (--gram neg/pos option in Prokka) is also recommended, but please be aware that signal peptide presence can vary between species (depending on whether the protein is exported) and that signal peptides are often missing from protein sequences in databases. Very strict gene length cut-offs in Pseudofinder (--length_pseudo >0.90) should therefore be avoided, since they can lead to biased pseudogene calls in short proteins due to signal peptide presence/absence (a difference of <35 AA).

prokka --compliant --rfam contigs.fa
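The effect of a strict length cut-off on short proteins can be illustrated with a little arithmetic. This is only a sketch: it assumes, as a simplification, that the cut-off is compared against the ratio of the predicted protein length to the length of its database hit, and the function name and example lengths are hypothetical.

```shell
# Hypothetical helper: ratio of predicted protein length to database hit
# length (both in amino acids), printed with two decimals.
length_ratio() {
    awk -v query="$1" -v hit="$2" 'BEGIN { printf "%.2f\n", query / hit }'
}

# A 100 AA protein compared with a hit that is 25 AA longer because it
# carries a signal peptide (125 AA in total).
length_ratio 100 125   # prints 0.80, below a 0.90 cut-off

# The same 25 AA difference on a 500 AA protein (hit of 525 AA).
length_ratio 500 525   # prints 0.95, above a 0.90 cut-off
```

With the same 25 AA difference, only the short protein falls under a 0.90 threshold, which is the bias described above.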

Database Recommendations

Database selection is critical to the speed and sensitivity of Pseudofinder. Users can provide any database they like, provided it is a non-redundant protein database formatted for BlastP/BlastX searches. Keep in mind that larger databases increase runtime, while smaller databases can lose sensitivity if they lack relevant protein sequences. For those who don't have a manually curated database tailored to their specific microbe, we recommend the NCBI-NR (non-redundant) protein database (or a similar one such as SwissProt).
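For a custom database, formatting a protein FASTA file for BlastP/BlastX searches can be sketched as follows. This assumes NCBI BLAST+ is installed and provides makeblastdb; the function name and the file names are hypothetical placeholders.

```shell
# Hypothetical wrapper: format a protein FASTA file as a BLAST protein database.
#   $1 = input protein FASTA file
#   $2 = name (prefix) of the database to create
build_protein_db() {
    makeblastdb -in "$1" -dbtype prot -out "$2"
}

# Example call (placeholder file names):
# build_protein_db my_proteins.faa my_proteins_db
```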

Note also that while the pipeline runs standard BlastP/BlastX, Diamond is integrated and can be invoked with the --diamond flag, which significantly reduces runtime.
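A run with Diamond might look like the sketch below. Only the --diamond flag is taken from this page; the annotate subcommand and the --genome, --database, and --outprefix options are assumptions that should be checked against the help output of your Pseudofinder installation, and the function name and file names are placeholders.

```shell
# Hypothetical wrapper: run Pseudofinder with Diamond instead of BlastP/BlastX.
#   $1 = annotated genome in GenBank format
#   $2 = protein database
#   $3 = prefix for output files
# Assumption: option names other than --diamond are not confirmed by this page.
annotate_with_diamond() {
    pseudofinder.py annotate --genome "$1" --database "$2" --outprefix "$3" --diamond
}

# Example call (placeholder file names):
# annotate_with_diamond genome.gbk my_proteins_db pseudofinder_out
```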