How to deal with big gene datasets? #136
Comments
Hello, why don't you try splitting the file, running cd-hit on each piece one by one, and then concatenating the results? In my opinion, there is no bias if you do so.
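A minimal sketch of that split → cluster → concatenate → re-cluster idea, assuming cd-hit is on the PATH, the input is a Prodigal protein FASTA, and a 95% identity threshold (-c 0.95, -n 5) is acceptable. The file names (big_geneset.faa, chunk_*.faa), the chunk count of 16, and the memory/thread settings are illustrative placeholders, not anything stated in this thread:

```python
#!/usr/bin/env python3
"""Sketch: split a large FASTA, dereplicate each chunk with cd-hit,
then concatenate the representatives and run one final cd-hit pass.
All file names and parameters below are assumptions for illustration."""
import shutil
import subprocess
from pathlib import Path

INPUT = Path("big_geneset.faa")   # hypothetical merged protein FASTA
N_CHUNKS = 16                     # e.g. one chunk per metagenome
WORKDIR = Path("cdhit_chunks")
WORKDIR.mkdir(exist_ok=True)

# 1) Split the FASTA at record boundaries (never inside a sequence),
#    assigning records to chunks round-robin.
handles = [open(WORKDIR / f"chunk_{i}.faa", "w") for i in range(N_CHUNKS)]
record = -1
with open(INPUT) as fh:
    for line in fh:
        if line.startswith(">"):
            record += 1
        handles[record % N_CHUNKS].write(line)
for h in handles:
    h.close()

# Shared cd-hit settings: 95% identity, word size 5, 16 GB RAM, 8 threads,
# and full-length sequence names in the .clstr output.
COMMON = ["-c", "0.95", "-n", "5", "-M", "16000", "-T", "8", "-d", "0"]

# 2) Dereplicate each chunk independently.
for i in range(N_CHUNKS):
    subprocess.run(["cd-hit",
                    "-i", str(WORKDIR / f"chunk_{i}.faa"),
                    "-o", str(WORKDIR / f"chunk_{i}.nr.faa"),
                    *COMMON], check=True)

# 3) Concatenate the chunk representatives and run one final cd-hit pass
#    to catch duplicates that ended up in different chunks.
merged = WORKDIR / "all_chunks.nr.faa"
with open(merged, "w") as out:
    for i in range(N_CHUNKS):
        with open(WORKDIR / f"chunk_{i}.nr.faa") as part:
            shutil.copyfileobj(part, out)

subprocess.run(["cd-hit", "-i", str(merged), "-o", "geneset.nr.faa", *COMMON],
               check=True)
```

The final pass still has to cluster everything that survived the per-chunk runs, which is exactly the remaining problem raised in the next comment.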
Thanks for the reply. Even if I split my 600 GB of files, dereplicate each piece, and then combine the dereplicated files, the combined file still needs to be dereplicated, and it is still too big, say 200 GB. Bing
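One way to avoid re-clustering the full concatenated file in a single pass is an incremental approach: build the non-redundant set chunk by chunk, using cd-hit-2d to compare each new chunk against the representatives accumulated so far and clustering only the sequences that find no match. The sketch below is an outline under assumptions, not a drop-in script: it presumes per-chunk files chunk_0.faa … chunk_15.faa already exist, reuses the hypothetical 95% threshold, and relies on cd-hit-2d writing to -o the db2 sequences that are not similar to db1.

```python
#!/usr/bin/env python3
"""Sketch: incremental dereplication with cd-hit + cd-hit-2d, so the full
input is never clustered in one pass. File names and thresholds are
placeholder assumptions, not values from this thread."""
import shutil
import subprocess
from pathlib import Path

CHUNKS = [Path(f"chunk_{i}.faa") for i in range(16)]
COMMON = ["-c", "0.95", "-n", "5", "-M", "16000", "-T", "8", "-d", "0"]
nr_db = Path("geneset.nr.faa")   # grows as chunks are absorbed

# Seed the non-redundant database with the first chunk.
subprocess.run(["cd-hit", "-i", str(CHUNKS[0]), "-o", str(nr_db), *COMMON],
               check=True)

for chunk in CHUNKS[1:]:
    novel = chunk.with_suffix(".novel.faa")
    novel_nr = chunk.with_suffix(".novel.nr.faa")

    # 1) Keep only the chunk sequences with no hit in the current database:
    #    cd-hit-2d outputs the db2 (-i2) sequences not covered by db1 (-i).
    subprocess.run(["cd-hit-2d", "-i", str(nr_db), "-i2", str(chunk),
                    "-o", str(novel), *COMMON], check=True)

    # 2) Dereplicate the novel sequences among themselves.
    subprocess.run(["cd-hit", "-i", str(novel), "-o", str(novel_nr), *COMMON],
                   check=True)

    # 3) Append the new representatives to the growing database.
    with open(nr_db, "a") as out, open(novel_nr) as new:
        shutil.copyfileobj(new, out)
```

The appeal of this layout is that each step only handles the current non-redundant set plus one chunk, so peak memory and runtime per step scale with those sizes rather than with the full 600 GB input.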
Hello, I have run into the same problem. My total gene set is about 200 GB. After 12 hours no error had occurred, but it was far too slow to finish: only 1,500,000 of 275,299,588 sequences had been processed. Have you solved this problem? If so, how? Thanks
Dear cdhit development team,
I have a big dataset. I predicted genes for each metagenome FASTA using Prodigal and obtained 16 annotated gene files, each around 40 GB, so merging all 16 .faa files into one would give a file of roughly 640 GB. Is it possible to use cd-hit to remove the duplicated genes from a dataset of this size?
If not, could you give me some suggestions for removing duplicate genes from such large gene files?
Thank you very much.
Best,
Bing