TreeSAPP Version [e.g. 0.11.4]
janstett added the bug label on Oct 1, 2024
I noticed that when creating reference packages that have guaranteed sequences from TIGRFAM, the headers get truncated and, as a result, when querying the NCBI, the sequences get misclassified as "r__Root".
For example, for NapA, here is the base treesapp create command:
treesapp create -c NapA -p 0.85 --min_taxonomic_rank c -n 16 -i RefPkgs/Nitrogen_metabolism/Denitrification/NapA/ENOG501NS3T.faa --guarantee RefPkgs/Nitrogen_metabolism/Denitrification/NapA/TIGR01706.faa --cluster --trim_align --outdet_align --headless --fast --overwrite -o TS_Make_Lin_Table_For_Eval/Base/NapA/ --profile RefPkgs/Nitrogen_metabolism/Denitrification/NapA/TIGR01706.HMM --deduplicate --min_seq_length 600
For the TIGRFAM file, here are the sequence headers:
In both the accession table and any trees that are generated for this package, these headers get truncated to:
SP| r__Root
When they should be:
Q56350 r__Root; d__Bacteria; p__Pseudomonadota; c__Alphaproteobacteria; o__Rhodobacterales; f__Paracoccaceae; g__Paracoccus; s__Paracoccus pantotrophus
P39185 r__Root; d__Bacteria; p__Pseudomonadota; c__Betaproteobacteria; o__Burkholderiales; f__Burkholderiaceae; g__Cupriavidus; s__Cupriavidus necator
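To spot affected entries quickly, a small check along these lines might help. It is only a sketch: it assumes each accession table line is an identifier followed by whitespace and then the lineage, as in the lines shown above, and the function name `root_only_entries` is made up for illustration rather than taken from TreeSAPP.

```python
def root_only_entries(lines):
    """Return identifiers whose lineage stopped at 'r__Root'.

    Assumption: each line is '<identifier><whitespace><lineage>',
    matching the accession table excerpts quoted in this issue.
    """
    flagged = []
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        identifier, lineage = parts
        # A lineage with no rank below the root means the lookup failed
        if lineage.strip().rstrip(";") == "r__Root":
            flagged.append(identifier)
    return flagged
```

Running it over the lines of an accession table would list the truncated identifiers (for instance the bare prefix) that never resolved past the root.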
Removing the prefix from the headers fixes the issue; however, I'm wondering whether the header truncation itself needs to be addressed.
This doesn't seem to be an issue when running treesapp with these prefixes in the base FASTA input file (for example, RadA), or when treesapp update is used and the final clustered sequences are then passed to treesapp create.
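For anyone needing the workaround in the meantime, the prefix removal can be scripted roughly as follows. This is a minimal sketch, not part of TreeSAPP: it assumes the guaranteed headers look like ">SP|Q56350|..." (a short alphabetic database tag, a pipe, then the accession), which is inferred from the truncated "SP|" output rather than confirmed, and the helper names are hypothetical.

```python
import re

# Assumed header shape: '>' + alphabetic database tag + '|' + accession.
# The lookahead avoids stripping when nothing follows the pipe.
_DB_PREFIX = re.compile(r"^>[A-Za-z]+\|(?=\S)")


def strip_db_prefix(header):
    """Drop a leading database tag such as 'SP|' from a FASTA header line."""
    return _DB_PREFIX.sub(">", header, count=1)


def strip_fasta_prefixes(lines):
    """Yield FASTA lines with header prefixes removed; sequences are untouched."""
    for line in lines:
        yield strip_db_prefix(line) if line.startswith(">") else line
```

The cleaned lines can be written to a new file and passed to --guarantee in place of the original TIGRFAM FASTA. Headers that already start with the accession are left unchanged.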