Question about cdhit.sh generate .clstr file and clustalo.sh #11

xfy9 · 2022-03-01T12:49:28Z

Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them.
And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them.
I'd appreciate it if you helped me.

lxie21 · 2022-03-01T16:16:13Z

Hi, You need to perform multiple sequence alignment after the clustering. Best, Lei

…

On Tue, Mar 1, 2022 at 7:49 AM xfy9 ***@***.***> wrote: Hi, I have ran cdhit.sh and generated hmdb_cluster30.clstr,hmdb_cluster60.clstr,hmdb_cluster90.clstr,hmdb_cluster9060.clstr,hmdb_cluster906030.clstr, but I am confused about how to use them. Can you tell me how to use them. And I also confued clustalo.sh , I want to know what file that HMDB_clustered_fasta stored, how to get them. I'd appreciate it if you helped me. — Reply to this email directly, view it on GitHub <#11>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABZSBCSSUQJ53IQSZZSPAMDU5YG6PANCNFSM5PT6SS5A> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>. You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

thomasly · 2022-03-01T17:09:41Z

Hi,
You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json


class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0	23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0	1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster
            
            
hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)
    
short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k
    
written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)

xfy9 · 2022-03-02T02:15:14Z

Hi, You may need a script as below to retrieve clusters from the cluster906030 file. I will add the script to our repo later. Thank you for pointing this out.

import os
import json


class ClstrAnalyzer:
    def __init__(self, handler):
        """ Analyze the cluster file output from cd-hit

        Args:
            handler (File object): the file handler of the .clstr file
        """
        self.handler = handler

    def get_clusters(self, id_sep="..."):
        """ Get the clusters in the file.

        Args:
            id_sep (str): the string used to separate the id from the cluster line. E.g.
                in case of "0	23217aa, >HMPREF1486_05052... *", the separator should
                be "...", while in case of "0	1364aa, >A0A1I7RXC9_BURXY/84-1447|A0A1I7
                RXC9_BURXY|84-1447|A0A1I7RXC9.1... *", the separator should be "|".
                Default is "...".

        Return (iterator):
            The method returns a iterator of the entry ids in lists.
        """
        ids = list()
        for line in self.handler:
            if line.startswith(">"):
                if len(ids) > 0:
                    yield ids
                ids = list()
            else:
                ids.append(line.split(">")[1].split(id_sep)[0])
        if len(ids) > 0:
            yield ids  # the last cluster
            
            
hmp_root = os.path.join("..", "Data", "HMP_metagenome")
saving_root = os.path.join(hmp_root, "HMP_clustered_fasta")
filtered_hmp_fastas = os.path.join(hmp_root, "filtered_fasta.json")
clusters_file = os.path.join(
    hmp_root,
    "HMP_clustered_corpora",
    "hmp_cdhit_clusters906030",
    "filtered_hmp_cluster906030.clstr",
)

with open(filtered_hmp_fastas, "r") as f:
    fastas = json.load(f)
    
short_keys = {}
for k in fastas.keys():
    short = k.split()[0]
    short_keys[short] = k
    
written = set()
with open(clusters_file, "r") as f:
    clusters = ClstrAnalyzer(f).get_clusters(id_sep="...")
    for i, cluster in enumerate(clusters):
        with open(os.path.join(saving_root, f"cluster{i}.fasta"), "w") as outf:
            for prot_id in cluster:
                if prot_id not in written:
                    full_key = short_keys[prot_id]
                    outf.write(">"+full_key+"\n")
                    outf.write(fastas[full_key]+"\n")
                    written.add(prot_id)

Hi,thank you for your patience to answer me. But I have a problem with the script you provided. How do I generate filtered_fasta.json, I have found it in thesis-works and MicrobiomeMeta, but I am not find the method that generate filtered_fasta.json. I want to konw the content of filtered_fasta.json, and how do I generate filtered_fasta.json.
I would be very grateful if you answer my questions.

thomasly · 2022-03-02T15:18:36Z

It is a json file with protein IDs as keys and fasta sequences as values. "Filtered" simply means we did a filtration for the sequences to keep only the sequences we want for our experiments. You should have your own script to generate that data in practice.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Question about cdhit.sh generate .clstr file and clustalo.sh #11

Question about cdhit.sh generate .clstr file and clustalo.sh #11

xfy9 commented Mar 1, 2022

lxie21 commented Mar 1, 2022 via email

thomasly commented Mar 1, 2022

xfy9 commented Mar 2, 2022

thomasly commented Mar 2, 2022

Question about cdhit.sh generate .clstr file and clustalo.sh #11

Question about cdhit.sh generate .clstr file and clustalo.sh #11

Comments

xfy9 commented Mar 1, 2022

lxie21 commented Mar 1, 2022 via email

thomasly commented Mar 1, 2022

xfy9 commented Mar 2, 2022

thomasly commented Mar 2, 2022