
Feature request: output file listing the IDs of duplicate reads and associated "representative reads" #569

Open
charlesfoster opened this issue Jul 3, 2024 · 1 comment


charlesfoster commented Jul 3, 2024

Hi,

Thanks for the useful tool. I would like to request a new feature/option for fastp's deduplication mode: it would be useful if the IDs of duplicate reads could be saved, along with the ID of the 'representative' read that each is a duplicate of. This would mimic a useful feature of another tool, czid-dedup:

In addition to the de-duplicated FASTA or FASTQ outputs, czid-dedup also outputs a cluster file which makes it possible to identify clusters of duplicate reads. The file lists the representative cluster read ID for each initial read ID, where the representative cluster read ID is the read ID that makes it into the output file. If a read is found to be a duplicate of a previous read, it will be filtered out of the FASTA/FASTQ output and paired with the read ID of the previous duplicate read in the cluster output file. Representative cluster read IDs are paired with themselves. The order of the input files is preserved. The representative read will always be the first read of its type.
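To make the requested output concrete, here is a minimal sketch of how such a cluster file could be consumed downstream. The two-column TSV layout (input read ID, representative read ID) is an assumption for illustration; the exact format czid-dedup emits may differ.

```python
from collections import defaultdict
import io

# Hypothetical cluster file: each line pairs an input read ID with the
# representative read ID that survives deduplication. Representatives
# are paired with themselves, as described above.
cluster_file = io.StringIO(
    "read1\tread1\n"
    "read2\tread1\n"
    "read3\tread3\n"
    "read4\tread1\n"
)

clusters = defaultdict(list)  # representative ID -> member read IDs
for line in cluster_file:
    read_id, rep_id = line.rstrip("\n").split("\t")
    clusters[rep_id].append(read_id)

# Duplicates are the cluster members other than the representative itself.
for rep, members in clusters.items():
    dups = [m for m in members if m != rep]
    if dups:
        print(f"{rep} represents duplicates: {dups}")
```

A file like this would let users recover exactly which reads fastp dropped and which read stood in for them, without re-deriving it from the raw data.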

Thanks


Thomieh73 commented Jul 11, 2024

To follow up on the point from @charlesfoster and the issue #528. fastp is very useful, and I do like the option to do deduplication.

I tried outputting the filtered reads with the --failed_out option to see whether they contained the duplicated reads. I found that it only outputs reads filtered for other reasons (too short, too many ambiguous bases, etc.); the output file does not contain the duplicated reads.

For my use case with shotgun metagenomic data, I am not so interested in the number of sequence clusters, but I would be happy if I could see which reads are duplicated. Then I can decide whether that would affect downstream analyses.

Of course I can retrieve the duplicate reads from the raw data by identifying which reads are missing from the clean data and extracting them with seqtk.
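That workaround can be sketched as a set difference over read IDs: any ID present in the raw FASTQ but absent from fastp's deduplicated output must have been removed as a duplicate. The FASTQ contents are inlined here for demonstration; in practice you would open the raw and cleaned files (names are hypothetical), and the parser assumes plain 4-line FASTQ records.

```python
import io

def read_ids(fastq_handle):
    """Yield read IDs from FASTQ header lines (every 4th line, '@'-prefixed)."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0:
            # Strip the leading '@' and any description after the first space.
            yield line[1:].split()[0]

raw = io.StringIO("@r1\nACGT\n+\nFFFF\n@r2\nACGT\n+\nFFFF\n@r3\nTTTT\n+\nFFFF\n")
clean = io.StringIO("@r1\nACGT\n+\nFFFF\n@r3\nTTTT\n+\nFFFF\n")

# IDs in the raw data but not in the deduplicated output.
removed = set(read_ids(raw)) - set(read_ids(clean))
print(sorted(removed))
```

Writing the resulting IDs to a file and passing it to `seqtk subseq raw.fastq ids.txt` would then pull the full records, but this recovers only which reads were dropped, not which representative each duplicated.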

Or I can run the entire dataset through VSEARCH and generate the clusters of sequences in that way.
