Subsetting databases #39

Open
patrickbryant1 opened this issue Aug 9, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@patrickbryant1

Hi,

Thank you for the great resource!

I am having trouble subsetting databases and decompressing subsets of the databases you provide here: https://foldcomp.steineggerlab.workers.dev

According to the instructions, I should be able to decompress a subset of a database given an "id_list.txt".

This is how I do it for e.g. A. thaliana:

head -n 1 data/a_thaliana.lookup
0 AF-A0A178UFC4-F1-model_v4.pdb 0

As I understand it, the ID here is "AF-A0A178UFC4-F1-model_v4".

Now, I write this into a file called id_list.txt, then I run the command:
foldcomp decompress --id-list id_list.txt data/a_thaliana

with the response:
Decompressing files in data/a_thaliana using 1 threads
Output directory: data/a_thaliana_pdb/
[Warning] AF-A0A178UFC4-F1-model_v4 not found in database.
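For anyone reproducing this, a minimal sketch of how the two candidate ID lists could be generated from the lookup file. It assumes the second whitespace-separated column of the `.lookup` file holds the entry name (consistent with the line shown above, but not confirmed in this thread). `sample.lookup` and the two output file names are hypothetical stand-ins so the snippet runs without the real database.

```shell
# Stand-in for data/a_thaliana.lookup, using the single line quoted above.
printf '0\tAF-A0A178UFC4-F1-model_v4.pdb\t0\n' > sample.lookup

# Variant 1: names exactly as stored in column 2 (including ".pdb").
awk '{ print $2 }' sample.lookup > id_list_full.txt

# Variant 2: names with a trailing ".pdb" removed (the form tried above).
awk '{ sub(/\.pdb$/, "", $2); print $2 }' sample.lookup > id_list_stripped.txt
```

Either file could then be passed to `--id-list` to see which naming form, if any, the installed Foldcomp version accepts.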

I have tried many different ways of naming the IDs based on what is in a_thaliana.lookup, but nothing seems to work. The same happens when using mmseqs to subset the database:
"""
createsubdb --subdb-mode 0 --id-mode 1 id_list.txt a_thaliana test_sel/output_foldcomp_db

MMseqs Version: ad6dfc66d7bbc4fd626fc19adf10ba587bc137c4
Subdb mode 0
Database ID mode 1
Verbosity 3

Could not find name AF-A0A178UFC4-F1-model_v4 in lookup
Time for merging to output_foldcomp_db: 0h 0m 0s 1ms
Time for processing: 0h 0m 0s 34ms
"""
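As a sanity check before subsetting, the IDs in the list can be compared against the lookup file. The sketch below assumes that names are matched exactly against the second column of the lookup, which this thread does not confirm. `sample.lookup` and `missing.txt` are hypothetical names, and the sample `id_list.txt` is created only to make the snippet self-contained.

```shell
# Stand-in lookup and ID list; in practice use a_thaliana.lookup and your own id_list.txt.
printf '0\tAF-A0A178UFC4-F1-model_v4.pdb\t0\n' > sample.lookup
printf 'AF-A0A178UFC4-F1-model_v4\nAF-A0A178UFC4-F1-model_v4.pdb\n' > id_list.txt

# First pass (lookup): remember every name in column 2.
# Second pass (ID list): print IDs that have no exact match among those names.
awk 'NR == FNR { names[$2] = 1; next } !($1 in names) { print $1 }' \
    sample.lookup id_list.txt > missing.txt

cat missing.txt
```

An empty `missing.txt` would suggest the names are present in the lookup, so a remaining "not found" error points at the tool rather than the ID list.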

Can you please explain what I am doing wrong and how to properly specify the IDs?

Best,

Patrick

@patrickbryant1
Author

I noticed that this seems to work with afdb_rep_v4. Perhaps something is missing from the reference genomes?

khb7840 added a commit that referenced this issue Aug 9, 2023
github-actions bot pushed a commit that referenced this issue Aug 9, 2023
khb7840 added the bug (Something isn't working) label Aug 9, 2023
@khb7840
Member

khb7840 commented Aug 9, 2023

I'm sorry, there was a bug in assigning the mode for database reading. Thank you for reporting this. Please check whether it is solved in the latest version.

@patrickbryant1
Author

Hi,
Great - thanks.
What do you mean by the latest version?

  1. Of the database from https://foldcomp.steineggerlab.workers.dev
  2. Of Foldcomp
  3. Something else(?)

@khb7840
Member

khb7840 commented Aug 9, 2023

The latest version of Foldcomp. Subsetting 'a_thaliana' should work with Foldcomp built from the latest commit.

@patrickbryant1
Author

patrickbryant1 commented Aug 9, 2023

Ok, great. Does this include the binaries you distribute or only the pip installation/git clone?
Do you know why mmseqs2 seems to fail on the same files? Is there something missing in the subsetting instructions there as well?

@khb7840
Member

khb7840 commented Aug 9, 2023

Please use git clone to get the latest update. The Python distribution has not been updated to the latest commit. As for the mmseqs2 part, I'm not sure what happened. I'll check this with the mmseqs2 developers.

@patrickbryant1
Author

Ok, thanks for the help!

2 participants