-
Notifications
You must be signed in to change notification settings - Fork 123
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is there a way to filter fastq reads based on Qscore before paired end merging #575
Comments
Hi @simmimourya in your python script, the option With
As quality tends to decrease towards the 3' end of reads, the usual approach is to trim reads when the cumulated expected error (EE) drops below a certain threshold, or after the first position with a Q-value equal or lesser than a certain threshold. For example, you can trim reads (and read pairs) after the first low Q-value with vsearch \
--fastx_filter <(printf "@fwd\nAA\n+\nI?\n") \
--reverse <(printf "@rev\nTT\n+\nII\n") \
--quiet \
--fastq_truncqual 30 \
--fastqout - \
--fastqout_rev -
A Q-value of If you decide to go that way and to use vsearch \
--fastq_mergepairs <(printf "@fwd\nAAAAAAAATA\n+\nIIIIIIIIIH\n") \
--reverse <(printf "@rev\nTATTTTTTTT\n+\nIIIIIIIIII\n") \
--fastq_truncqual 39 \
--fastqout -
|
Thank you for your response. I was able to come up with something similar yesterday, glad to see that our approaches align. I saw documentation about truncqual here and came up with this. def filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False):
filtered_r1_path = os.path.join(output_dir, "filtered_r1.fastq")
filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
# ---------------------- Code change -------------------------------------------
# Filter R1 and R2 reads
subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_truncqual {qscore_threshold}", shell=True, check=True)
subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_truncqual {qscore_threshold}", shell=True, check=True)
# ---------------------- Code change -------------------------------------------
# Merge filtered reads
merged_output_prefix = os.path.join(output_dir, "merged")
merged_output_file = f"{merged_output_prefix}.fastq"
if not os.path.exists(merged_output_file) or merge_override:
subprocess.run(f"vsearch --fastq_mergepairs {filtered_r1_path} --reverse {filtered_r2_path} --fastqout {merged_output_file}", shell=True, check=True)
else:
print(f"Using existing merged file: {merged_output_file}")
return merged_output_file
# Example usage
r1_path = "forward.fastq.gz"
r2_path = "reverse.fastq.gz"
output_dir = '../results/'
filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False) Which helped me filter the reads but I'm seeing this issue now:
Filtering before merging can lead to issues due to unequal read lengths, which can prevent successful merging. Instead, I switched to merging first and then filtering, and this approach works really well! Now, I want to do a sanity check on some of the filtered sequences to assess metrics like Qscore distribution, cumulative percentage (AccPct), mean quality, etc. I’ve already generated the stats using --fastq-stats. What key metrics would you recommend examining to verify the quality of the FASTQ data after filtering and merging? # generate stats
def fastq_stats(fastq_file, output_log, fastq_ascii='sanger', fastq_qmin=0, fastq_qmax=40):
command = (
f"vsearch --fastq_stats {fastq_file} "
f"--log {output_log}"
)
subprocess.run(command, shell=True, check=True)
print(f"Statistics generated and saved to {output_log}")
# Example usage
fastq_file = '../results/merged_filtered.fastq'
output_log = '../results/merged_filtered_output_stats.log'
fastq_stats(fastq_file, output_log) |
This is not exactly what caused the second issue When working with pairs of fastq files, vsearch \
--fastx_filter <(printf "@fwd\nAA\n+\nI?\n") \
--reverse <(printf "@rev\nTT\n+\nII\n") \
--quiet \
--fastq_truncqual 30 \
--fastqout - \
--fastqout_rev -
There are several available quality metrics (see FastQC). I work mostly with metabarcoding data, and I tend to focus on the expected error (EE). It is a measure of the risk of errors in a given sequence. I use it to filter out sequences that are both rare and probably erroneous (high EE, or high (EE / length), as EE is length-dependent). |
I have forward and reverse FASTQ files from Illumina paired-end sequencing, and I'm merging these reads using
vsearch
, which works great. However, I need to filter the reads based on a quality score (Qscore) threshold of 30 before merging. Specifically, within each forward and reverse FASTQ file, I want to discard any reads that have a quality score below this threshold. Once the reads are filtered, I want to perform a paired-end merge on the filtered reads.Did I miss something? Also, is this the right way to approach the problem?
Here’s the code I’m using and the error I encountered during filtering:
This is the error I'm seeing upon running the script:
The text was updated successfully, but these errors were encountered: