How to deal with big gene datasets? #136
Comments
Hello, why don't you try splitting the file, running cd-hit on each piece one by one, and then concatenating the results? In my opinion, there is no bias if you do so.
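A minimal sketch of that split → cluster → concatenate → re-cluster idea, assuming cd-hit is on the PATH, the input is a Prodigal protein FASTA, and a 95% identity threshold (-c 0.95, -n 5) is acceptable. The file names (big_geneset.faa, chunk_*.faa), the chunk count of 16, and the memory/thread settings are illustrative placeholders, not anything stated in this thread:

```python
#!/usr/bin/env python3
"""Sketch: split a large FASTA, dereplicate each chunk with cd-hit,
then concatenate the representatives and run one final cd-hit pass.
All file names and parameters below are assumptions for illustration."""
import shutil
import subprocess
from pathlib import Path

INPUT = Path("big_geneset.faa")   # hypothetical merged protein FASTA
N_CHUNKS = 16                     # e.g. one chunk per metagenome
WORKDIR = Path("cdhit_chunks")
WORKDIR.mkdir(exist_ok=True)

# 1) Split the FASTA at record boundaries (never inside a sequence),
#    assigning records to chunks round-robin.
handles = [open(WORKDIR / f"chunk_{i}.faa", "w") for i in range(N_CHUNKS)]
record = -1
with open(INPUT) as fh:
    for line in fh:
        if line.startswith(">"):
            record += 1
        handles[record % N_CHUNKS].write(line)
for h in handles:
    h.close()

# Shared cd-hit settings: 95% identity, word size 5, 16 GB RAM, 8 threads,
# and full-length sequence names in the .clstr output.
COMMON = ["-c", "0.95", "-n", "5", "-M", "16000", "-T", "8", "-d", "0"]

# 2) Dereplicate each chunk independently.
for i in range(N_CHUNKS):
    subprocess.run(["cd-hit",
                    "-i", str(WORKDIR / f"chunk_{i}.faa"),
                    "-o", str(WORKDIR / f"chunk_{i}.nr.faa"),
                    *COMMON], check=True)

# 3) Concatenate the chunk representatives and run one final cd-hit pass
#    to catch duplicates that ended up in different chunks.
merged = WORKDIR / "all_chunks.nr.faa"
with open(merged, "w") as out:
    for i in range(N_CHUNKS):
        with open(WORKDIR / f"chunk_{i}.nr.faa") as part:
            shutil.copyfileobj(part, out)

subprocess.run(["cd-hit", "-i", str(merged), "-o", "geneset.nr.faa", *COMMON],
               check=True)
```

The final pass still has to cluster everything that survived the per-chunk runs, which is exactly the remaining problem raised in the next comment.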
Thanks for the reply. Even if I split my 600 GB of files, dereplicate each piece, and then combine the dereplicated files, the combined file still needs to be dereplicated, and it is still too big, say 200 GB. Bing
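One way to avoid re-clustering the full concatenated file in a single pass is an incremental approach: build the non-redundant set chunk by chunk, using cd-hit-2d to compare each new chunk against the representatives accumulated so far and clustering only the sequences that find no match. The sketch below is an outline under assumptions, not a drop-in script: it presumes per-chunk files chunk_0.faa … chunk_15.faa already exist, reuses the hypothetical 95% threshold, and relies on cd-hit-2d writing to -o the db2 sequences that are not similar to db1.

```python
#!/usr/bin/env python3
"""Sketch: incremental dereplication with cd-hit + cd-hit-2d, so the full
input is never clustered in one pass. File names and thresholds are
placeholder assumptions, not values from this thread."""
import shutil
import subprocess
from pathlib import Path

CHUNKS = [Path(f"chunk_{i}.faa") for i in range(16)]
COMMON = ["-c", "0.95", "-n", "5", "-M", "16000", "-T", "8", "-d", "0"]
nr_db = Path("geneset.nr.faa")   # grows as chunks are absorbed

# Seed the non-redundant database with the first chunk.
subprocess.run(["cd-hit", "-i", str(CHUNKS[0]), "-o", str(nr_db), *COMMON],
               check=True)

for chunk in CHUNKS[1:]:
    novel = chunk.with_suffix(".novel.faa")
    novel_nr = chunk.with_suffix(".novel.nr.faa")

    # 1) Keep only the chunk sequences with no hit in the current database:
    #    cd-hit-2d outputs the db2 (-i2) sequences not covered by db1 (-i).
    subprocess.run(["cd-hit-2d", "-i", str(nr_db), "-i2", str(chunk),
                    "-o", str(novel), *COMMON], check=True)

    # 2) Dereplicate the novel sequences among themselves.
    subprocess.run(["cd-hit", "-i", str(novel), "-o", str(novel_nr), *COMMON],
                   check=True)

    # 3) Append the new representatives to the growing database.
    with open(nr_db, "a") as out, open(novel_nr) as new:
        shutil.copyfileobj(new, out)
```

The appeal of this layout is that each step only handles the current non-redundant set plus one chunk, so peak memory and runtime per step scale with those sizes rather than with the full 600 GB input.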
Hello, I have run into the same problem. My total gene set is about 200 GB. After 12 hours no error had occurred, but it was far too slow to finish: only 1,500,000 of 275,299,588 sequences had been processed. Have you solved this problem? If so, how? Thanks
Dear cdhit development team,
I have a big dataset. I predicted genes for each metagenome FASTA using Prodigal and obtained 16 annotated gene files, each around 40 GB, so merging all 16 .faa files into one would give a file of roughly 640 GB. Is it possible to use cd-hit to remove the duplicated genes from a dataset of this size?
If not, could you give me some suggestions for removing duplicate genes from such large gene files?
Thank you very much.
Best,
Bing