How to deal with a big gene dataset? #136

Open
B-1991-ing opened this issue Jan 26, 2023 · 4 comments

Comments

@B-1991-ing

Dear cdhit development team,

I have a big dataset. I predicted genes for each metagenome FASTA file with Prodigal, which gave me 16 annotated gene files of around 40 GB each. If I merge all 16 .faa files into one, the result would be around 640 GB. Is it possible to use cd-hit to remove the duplicated genes from a dataset this big?

If that is not possible, could you give me some suggestions for removing duplicate genes from big gene files?

Thank you very much.

Best,

Bing
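
For readers with the same question: one cheap step that can be tried before any clustering is to drop exact duplicate sequences while streaming through the file. The sketch below is only an illustration under stated assumptions, not part of cd-hit. The names `read_fasta` and `drop_exact_duplicates` are hypothetical, and it removes only identical sequences (case-insensitive), not the similar ones cd-hit would cluster. It also keeps one 16-byte digest per distinct sequence in memory, which may itself be too much for hundreds of millions of genes.

```python
import hashlib
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def drop_exact_duplicates(
    records: Iterable[Tuple[str, str]],
) -> Iterator[Tuple[str, str]]:
    """Keep the first record seen for each distinct sequence (exact match only)."""
    seen = set()
    for header, seq in records:
        # Store a fixed-size digest instead of the full sequence to save memory.
        digest = hashlib.blake2b(seq.upper().encode(), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield header, seq


# Usage sketch: pass an open file handle, which is read line by line.
# with open("genes.faa") as handle:          # hypothetical file name
#     for header, seq in drop_exact_duplicates(read_fasta(handle)):
#         print(f">{header}\n{seq}")
```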

@B-1991-ing
Author

Update

I tried to remove duplicated genes from another big dataset (589 GB), but an error occurred after two hours of running.
Screenshot 2023-01-27 at 23 56 59

@unavailable-2374

Hello,

Why don't you split the file, run cd-hit on each piece one by one, and then cat the results together?

In my opinion, doing so introduces no bias.
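
The splitting step suggested above could be sketched roughly as follows. This is a minimal illustration, not a tool shipped with cd-hit: the function name `split_fasta`, the `chunk_NNNN.faa` naming, and the `records_per_chunk` parameter are all assumptions made for the example. It streams the input line by line, so memory use stays small even for very large files.

```python
import os
from typing import List


def split_fasta(in_path: str, out_dir: str, records_per_chunk: int) -> List[str]:
    """Split a FASTA file into chunk files holding at most
    `records_per_chunk` records each. Returns the chunk paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths: List[str] = []
    out = None
    count = 0
    with open(in_path) as src:
        for line in src:
            if line.startswith(">"):
                # Start a new chunk file at a record boundary when needed.
                if out is None or count == records_per_chunk:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, f"chunk_{len(paths):04d}.faa")
                    out = open(path, "w")
                    paths.append(path)
                    count = 0
                count += 1
            # Lines before the first header are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Each returned path can then be passed to cd-hit separately. Records are never split across chunks, so concatenating the chunk files reproduces the input.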

@B-1991-ing
Author

Thanks for the reply.

Even if I split my 600 GB file, dereplicate each piece, and then combine the dereplicated files, the combined file will still need to be dereplicated, and it will still be too big, let's say 200 GB.

Bing
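
A toy example may make this concern concrete. It uses exact string matching in place of cd-hit's similarity clustering, and the sequences and the `dedupe` helper are invented for illustration. Duplicates that sit in different chunks survive per-chunk dereplication, so the concatenated result needs one more pass.

```python
def dedupe(seqs):
    """Remove exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(seqs))


# Two hypothetical chunks that share the sequence "MAA".
chunk_a = ["MKV", "MAA", "MKV"]
chunk_b = ["MAA", "MTT"]

# Dereplicating each chunk on its own and concatenating the results...
combined = dedupe(chunk_a) + dedupe(chunk_b)
# ...still leaves the cross-chunk duplicate "MAA" in place.
print(combined)  # ['MKV', 'MAA', 'MAA', 'MTT']

# A second pass over the combined list is needed to remove it.
final = dedupe(combined)
print(final)  # ['MKV', 'MAA', 'MTT']
```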

@KJ-Ma

KJ-Ma commented Oct 29, 2023

Hello

I have the same problem. My total gene set is about 200 GB. After 12 hours no error had occurred, but the run was too slow to finish: only 1,500,000 of 275,299,588 sequences had been processed.

So, have you solved this problem? If so, how?

Thanks
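
To put the progress figures above in perspective, here is a naive back-of-the-envelope extrapolation. It assumes the processing rate stays constant for the whole run, which a real cd-hit run need not do, so the result is only a rough order of magnitude. The helper name `linear_eta_hours` is made up for this sketch.

```python
def linear_eta_hours(done: int, total: int, elapsed_hours: float) -> float:
    """Estimate remaining hours, assuming a constant processing rate."""
    rate = done / elapsed_hours  # sequences per hour so far
    return (total - done) / rate


done, total, elapsed = 1_500_000, 275_299_588, 12.0

fraction_done = done / total
remaining_hours = linear_eta_hours(done, total, elapsed)

print(f"{fraction_done:.2%} done")                 # about 0.54%
print(f"{remaining_hours:.0f} hours remaining")    # about 2190 hours
print(f"{remaining_hours / 24:.0f} days remaining")  # about 91 days
```

Under that assumption the run would need roughly three more months, which is consistent with the impression that it is too slow to finish.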


3 participants