Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about cdhit.sh generate .clstr file and clustalo.sh #11

Open
xfy9 opened this issue Mar 1, 2022 · 4 comments
Open

Question about cdhit.sh generate .clstr file and clustalo.sh #11

xfy9 opened this issue Mar 1, 2022 · 4 comments

Comments

@xfy9
Copy link

xfy9 commented Mar 1, 2022

Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them.
And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them.
I'd appreciate it if you helped me.

@lxie21
Copy link
Collaborator

lxie21 commented Mar 1, 2022 via email

@thomasly
Copy link
Collaborator

thomasly commented Mar 1, 2022

Hi,
You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json


class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0	23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0	1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster
            
            
hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)
    
short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k
    
written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)

@xfy9
Copy link
Author

xfy9 commented Mar 2, 2022

Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json


class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0	23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0	1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster
            
            
hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)
    
short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k
    
written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)

Hi,thank you for your patience to answer me. But I have a problem with the script you provided. How do I generate filtered_fasta.json, I have found it in thesis-works and MicrobiomeMeta, but I am not find the method that generate filtered_fasta.json. I want to konw the content of filtered_fasta.json, and how do I generate filtered_fasta.json.
I would be very grateful if you answer my questions.

@thomasly
Copy link
Collaborator

thomasly commented Mar 2, 2022

It is a json file with protein IDs as keys and fasta sequences as values. "Filtered" simply means we did a filtration for the sequences to keep only the sequences we want for our experiments. You should have your own script to generate that data in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants