
Feature request: output file listing the IDs of duplicate reads and associated "representative reads" #569

Open
charlesfoster opened this issue Jul 3, 2024 · 1 comment


charlesfoster commented Jul 3, 2024

Hi,

Thanks for the useful tool. I would like to request a new feature/option for fastp's deduplication mode: it would be useful if the IDs of duplicate reads could be saved, along with the ID of the 'representative' read that each is a duplicate of. This would mimic a useful feature of another tool, czid-dedup:

In addition to the de-duplicated FASTA or FASTQ outputs, czid-dedup also outputs a cluster file which makes it possible to identify clusters of duplicate reads. The file lists the representative cluster read ID for each initial read ID, where the representative cluster read ID is the read ID that makes it into the output file. If a read is found to be a duplicate of a previous read, it will be filtered out of the FASTA/FASTQ output and paired with the read ID of the previous duplicate read in the cluster output file. Representative cluster read IDs are paired with themselves. The order of the input files is preserved. The representative read will always be the first read of its type.
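To make the requested output concrete, here is a minimal sketch of how such a cluster file could be consumed downstream. The two-column TSV layout (input read ID, representative read ID) is an assumption for illustration; the exact format czid-dedup emits may differ.

```python
from collections import defaultdict
import io

# Hypothetical cluster file: each line pairs an input read ID with the
# representative read ID that survives deduplication. Representatives
# are paired with themselves, as described above.
cluster_file = io.StringIO(
    "read1\tread1\n"
    "read2\tread1\n"
    "read3\tread3\n"
    "read4\tread1\n"
)

clusters = defaultdict(list)  # representative ID -> member read IDs
for line in cluster_file:
    read_id, rep_id = line.rstrip("\n").split("\t")
    clusters[rep_id].append(read_id)

# Duplicates are the cluster members other than the representative itself.
for rep, members in clusters.items():
    dups = [m for m in members if m != rep]
    if dups:
        print(f"{rep} represents duplicates: {dups}")
```

A file like this would let users recover exactly which reads fastp dropped and which read stood in for them, without re-deriving it from the raw data.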

Thanks


Thomieh73 commented Jul 11, 2024

To follow up on the point from @charlesfoster and the issue #528. fastp is very useful, and I do like the option to do deduplication.

I tried outputting the filtered reads with the --failed_out option to see whether they contained the duplicated reads. I found that it only outputs reads filtered for other reasons (too short, too many ambiguous bases, etc.); the output file does not contain the duplicated reads.

For my use case with shotgun metagenomic data, I am not so interested in the number of sequence clusters, but I would be happy if I could see which reads are duplicated. Then I can decide whether that would affect downstream analyses.

Of course I can retrieve the duplicate reads from the raw data by identifying which reads are missing from the clean data and extracting them with seqtk.
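That workaround can be sketched as a set difference over read IDs: any ID present in the raw FASTQ but absent from fastp's deduplicated output must have been removed as a duplicate. The FASTQ contents are inlined here for demonstration; in practice you would open the raw and cleaned files (names are hypothetical), and the parser assumes plain 4-line FASTQ records.

```python
import io

def read_ids(fastq_handle):
    """Yield read IDs from FASTQ header lines (every 4th line, '@'-prefixed)."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 0:
            # Strip the leading '@' and any description after the first space.
            yield line[1:].split()[0]

raw = io.StringIO("@r1\nACGT\n+\nFFFF\n@r2\nACGT\n+\nFFFF\n@r3\nTTTT\n+\nFFFF\n")
clean = io.StringIO("@r1\nACGT\n+\nFFFF\n@r3\nTTTT\n+\nFFFF\n")

# IDs in the raw data but not in the deduplicated output.
removed = set(read_ids(raw)) - set(read_ids(clean))
print(sorted(removed))
```

Writing the resulting IDs to a file and passing it to `seqtk subseq raw.fastq ids.txt` would then pull the full records, but this recovers only which reads were dropped, not which representative each duplicated.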

Or I can run the entire dataset through VSEARCH and generate the clusters of sequences in that way.
