highquality_cluster30 - fragmented sequences split on undetermined aminoacid #53

Open
valentynbez opened this issue Mar 27, 2024 · 0 comments
Labels
help wanted Extra attention is needed

Comments


valentynbez commented Mar 27, 2024

Hello!
I tried using highquality_clust30 as a reference and ran into the following issue.
The database has around 200k repeated entries; they appear to be fragments of proteins that were split at each undetermined (X) amino acid.
(Additional information was removed from the headers; only the unique MG IDs are stored in my FASTA files, for indexing with samtools faidx.)
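To estimate how many IDs are affected, one could count repeated names in the lookup file. The sketch below assumes the layout visible in the `grep` output of the examples: whitespace-separated columns with the ID in the second column. The function name `count_repeated_ids` is made up for illustration.

```python
from collections import Counter
import io


def count_repeated_ids(lines):
    """Return {id: count} for every ID that occurs more than once.

    Assumption: each line is whitespace-separated and the ID is column 2,
    as in the lookup lines quoted in this issue.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return {name: n for name, n in counts.items() if n > 1}


# Lookup lines copied from the two examples in this issue.
sample = io.StringIO(
    "32543322\tMGYP003384474486\t0\n"
    "32543327\tMGYP003384474486\t0\n"
    "32543390\tMGYP003384474486\t0\n"
    "32543528\tMGYP003384474486\t0\n"
    "32543587\tMGYP003384474486\t0\n"
    "31381065\tMGYP003343806611\t0\n"
    "31381071\tMGYP003343806611\t0\n"
)
repeated = count_repeated_ids(sample)
print(repeated)
```

On the real file, the same function can be fed an open file handle, for example `open("highquality_clust30.lookup")`.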

Example 1

> grep "MGYP003384474486" highquality_clust30.lookup
32543322        MGYP003384474486        0
32543327        MGYP003384474486        0
32543390        MGYP003384474486        0
32543528        MGYP003384474486        0
32543587        MGYP003384474486        0
> zgrep -A 1 "MGYP003384474486" highquality_clust30.fasta.gz
>MGYP003384474486
MFSSKCNLCR
--
>MGYP003384474486
IDQER
--
>MGYP003384474486
KYNEVKIY
--
>MGYP003384474486
ETIIGIYDF
--
>MGYP003384474486
FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK

When I query the ESM API, I get:

{"sequence": "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFXFLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"}

Example 2

> grep "MGYP003343806611" highquality_clust30.lookup
31381065        MGYP003343806611        0
31381071        MGYP003343806611        0
> zgrep -A 1 "MGYP003343806611" highquality_clust30.fasta.gz
>MGYP003343806611
MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA
--
>MGYP003343806611
YY

ESM API

{"sequence": "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX"}
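The split-on-X hypothesis can be checked directly against both examples: splitting the ESM API sequence on `X` and dropping empty pieces should reproduce the database fragments in order. This is only a minimal sketch; the helper name `fragments_match` is hypothetical, and the sequences are copied from the output above.

```python
def fragments_match(fragments, full_sequence, placeholder="X"):
    """True if splitting full_sequence on the placeholder residue yields
    exactly the given fragments, in order (empty pieces are ignored)."""
    expected = [part for part in full_sequence.split(placeholder) if part]
    return expected == list(fragments)


example_1 = (
    ["MFSSKCNLCR", "IDQER", "KYNEVKIY", "ETIIGIYDF",
     "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK"],
    "MFSSKCNLCRXIDQERXKYNEVKIYXETIIGIYDFX"
    "FLLLSFTYASGKEYEISNFVNLLSIQLGLTDTLYGIIK",
)
example_2 = (
    ["MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA", "YY"],
    "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWAXYYX",
)

for fragments, full in (example_1, example_2):
    print(fragments_match(fragments, full))
```

Note that the second sequence ends in `X`, so simply joining the fragments with `X` would not restore it exactly; the trailing placeholder is lost in the database.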
valentynbez changed the title from "highquality_cluster30 - fragmented sequences" to "highquality_cluster30 - fragmented sequences split on X aminoacid" Mar 27, 2024
valentynbez changed the title from "highquality_cluster30 - fragmented sequences split on X aminoacid" to "highquality_cluster30 - fragmented sequences split on undetermined aminoacid" Mar 27, 2024
valentynbez added a commit to bioinf-mcb/Metagenomic-DeepFRI that referenced this issue Mar 29, 2024
- a hack using `sed` to correct headers in the database
- 5-minute alignment of 5.7k proteins against the database - tolerable
- greatly increased coverage of structures
- silenced warnings in faidx - database contains fragmented sequences steineggerlab/foldcomp#53
- fixes #80
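The commit's `sed` command is not shown here, so the following is not that hack. It is only one possible way to keep an indexer from seeing duplicate names: append a running counter to each FASTA header. The `_1`, `_2` suffix scheme and the function name `uniquify_headers` are assumptions made for illustration.

```python
from collections import defaultdict


def uniquify_headers(lines):
    """Yield FASTA lines with a per-ID counter appended to each header.

    Hypothetical naming scheme: '>ID' becomes '>ID_1', '>ID_2', ...
    Every header gets a suffix, including IDs that occur only once.
    """
    seen = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            seen[name] += 1
            yield f">{name}_{seen[name]}"
        else:
            yield line


# Records copied from Example 2 of this issue.
fasta = [
    ">MGYP003343806611",
    "MLRIKITDADRAGRAGEWCQANLGRDDWNLYGHNLFTGTPYYEFEFTDSETAMMFALRWA",
    ">MGYP003343806611",
    "YY",
]
result = list(uniquify_headers(fasta))
print("\n".join(result))
```

Renaming headers this way changes the IDs, so any downstream lookup by the original MGYP ID would need to strip the suffix again.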
khb7840 added the help wanted label Apr 2, 2024