The Achilles heel of gene-centric methods is domains shared with other, functionally unrelated protein families. Because these methods annotate in a vacuum, they are susceptible to classifying non-homologous sequences that would otherwise be correctly assigned to a different protein family by methods that use large databases (e.g. EggNOG-mapper or methods that use RefSeq). When such domains are present in a protein family, the risk of false positives increases greatly.
After building a reference package (RefPkg), users should have a simple way to annotate these domains in their RefPkg using a sufficiently large database. Based on these annotated domains, treesapp assign may be able to filter out spurious homologous queries, and users would be better informed of their RefPkg's protein structure.
I propose that the layer subcommand accept a '--domains' flag that automatically uses the Pfam database to perform HMM-to-HMM alignment, identifies the protein domains found in the RefPkg's profile HMM used for searching, and annotates those loci as such. The annotated domains can be preserved in a new 'domains' attribute of the reference package, and possibly propagated to future updates of the RefPkg.
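As a rough illustration of what the proposed 'domains' attribute might hold, here is a minimal sketch. Everything in it is an assumption for discussion: `DomainAnnotation`, `RefPkgSketch`, and their field names are hypothetical and are not part of TreeSAPP's existing ReferencePackage class. Coordinates are assumed to be 1-based, inclusive positions on the RefPkg's profile HMM, and the accession and name values in the usage note are placeholders.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class DomainAnnotation:
    """Hypothetical record for one domain located on a RefPkg's profile HMM."""
    accession: str      # database accession of the matched domain (placeholder values only)
    name: str           # human-readable domain name
    hmm_start: int      # first profile HMM position covered (1-based, inclusive)
    hmm_end: int        # last profile HMM position covered (1-based, inclusive)
    probability: float  # confidence reported by the HMM-to-HMM search, if available


@dataclass
class RefPkgSketch:
    """Stand-in for a reference package; only the proposed attribute is modelled."""
    prefix: str
    domains: List[DomainAnnotation] = field(default_factory=list)

    def add_domain(self, domain: DomainAnnotation) -> None:
        # Reject impossible coordinates before storing the annotation.
        if domain.hmm_start < 1 or domain.hmm_end < domain.hmm_start:
            raise ValueError("invalid domain coordinates")
        self.domains.append(domain)
        # Keep annotations ordered along the profile HMM.
        self.domains.sort(key=lambda d: d.hmm_start)

    def to_dict(self) -> dict:
        # Plain dict/list structure, so it could be written out as JSON.
        return asdict(self)
```

For example, `RefPkgSketch("Example").add_domain(DomainAnnotation("ACC0001", "example_domain", 12, 95, 0.99))` would store one annotation, and `to_dict()` would return a structure that could be carried along when the RefPkg is updated.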
TODO list:
Add 'domains' attribute to ReferencePackage class
Add '--domains' flag to treesapp layer arguments
Add HHsuite to requirements (conda, Docker and relevant Wiki pages)
Automate downloading of the latest HHsuite Pfam database to the installation's data/ directory when necessary
Develop workflow for searching for and annotating domains in a RefPkg's profile HMM
Allow users and/or treesapp assign to filter queries mapped to domains
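The last item could work in several ways. One possible approach, sketched below under stated assumptions, is to measure how much of a query's aligned region on the profile HMM falls inside annotated domains and to flag queries that align almost entirely to them. The function names, the interval representation (1-based, inclusive HMM coordinates), and the 0.9 threshold are all hypothetical and do not reflect existing treesapp assign behaviour.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # (start, end) on the profile HMM, 1-based and inclusive


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or adjacent intervals so shared positions are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def domain_fraction(hit: Interval, domains: Iterable[Interval]) -> float:
    """Fraction of the query's aligned HMM span that lies within annotated domains."""
    hit_start, hit_end = hit
    hit_len = hit_end - hit_start + 1
    if hit_len <= 0:
        raise ValueError("hit end must not precede hit start")
    covered = 0
    for d_start, d_end in merge_intervals(domains):
        overlap = min(hit_end, d_end) - max(hit_start, d_start) + 1
        if overlap > 0:
            covered += overlap
    return covered / hit_len


def is_domain_only_hit(hit: Interval, domains: Iterable[Interval],
                       max_fraction: float = 0.9) -> bool:
    """Flag a query whose alignment is (almost) entirely explained by shared domains."""
    return domain_fraction(hit, domains) >= max_fraction
```

With domains annotated at positions 10-50 and 40-60, a query aligned to positions 20-59 would be flagged, while a query aligned across positions 1-100 would not. Whether flagged queries should be dropped outright or only reported to the user is left open here.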