Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is there a way to filter fastq reads based on Qscore before paired end merging #575

Open
simmimourya opened this issue Oct 2, 2024 · 3 comments
Labels

Comments

@simmimourya
Copy link

simmimourya commented Oct 2, 2024

I have forward and reverse FASTQ files from Illumina paired-end sequencing, and I'm merging these reads using vsearch, which works great. However, I need to filter the reads based on a quality score (Qscore) threshold of 30 before merging. Specifically, within each forward and reverse FASTQ file, I want to discard any reads that have a quality score below this threshold. Once the reads are filtered, I want to perform a paired-end merge on the filtered reads.

Did I miss something? Also, is this the right way to approach the problem?

Here’s the code I’m using and the error I encountered during filtering:

def filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False):

    filtered_r1_path = os.path.join(output_dir, "filtered_r1.fastq")
    filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
    
    # Filter R1 and R2 reads
    subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
    subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
    
    # Merge filtered reads
    merged_output_prefix = os.path.join(output_dir, "merged")
    merged_output_file = f"{merged_output_prefix}.fastq"
    
    if not os.path.exists(merged_output_file) or merge_override:
        subprocess.run(f"vsearch --fastq_mergepairs {filtered_r1_path} --reverse {filtered_r2_path} --fastqout {merged_output_file}", shell=True, check=True)
    else:
        print(f"Using existing merged file: {merged_output_file}")
    
    return merged_output_file

# Example usage
r1_path = "forward.fastq.gz"
r2_path = "reverse.fastq.gz"
output_dir = '../results/'
filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)

This is the error I'm seeing upon running the script:

vsearch v2.28.1_linux_x86_64, 124.5GB RAM, 16 cores
https://github.com/torognes/vsearch

Reading input file

Fatal error: FASTQ quality value (16) below qmin (30)
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[35], line 36
     34 r2_path = "reverse.fastq.gz"
     35 output_dir = '../results/'
---> 36 filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)

Cell In[35], line 19, in filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold, merge_override)
     16 filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
     18 # Filter R1 and R2 reads
---> 19 subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
     20 subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_qmin {qscore_threshold}", shell=True, check=True)
     22 # Merge filtered reads

File /opt/conda/lib/python3.9/subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)

CalledProcessError: Command 'vsearch --fastq_filter forward.fastq.gz --fastqout ../results/filtered_r1.fastq --fastq_qmin 30' returned non-zero exit status 1.
@frederic-mahe
Copy link
Collaborator

Hi @simmimourya

in your python script, the option --fastq_qmin 30 tells vsearch to stop with an error if a fastq file contains positions with Q<30. Hence the error you get when running it on your data.

With vsearch, you can filter reads based on length, number of unknown positions (Ns), expected error (EE), or the number of copies:

The sequences may be filtered using the options --fastq_maxee,
--fastq_maxee_rate, --fastq_maxlen, --fastq_maxns, --fastq_minlen
(default 1), --fastq_trunclen, --maxsize, and --min‐size. Sequences
not satisfying the requirements are discarded. For pairs of sequences,
both sequences in a pair must satisfy the requirements, otherwise both
are discarded.

As quality tends to decrease towards the 3' end of reads, the usual approach is to trim reads when the cumulated expected error (EE) drops below a certain threshold, or after the first position with a Q-value equal or lesser than a certain threshold. For example, you can trim reads (and read pairs) after the first low Q-value with --fastq_truncqual 30. Here is a toy-example:

vsearch \
    --fastx_filter <(printf "@fwd\nAA\n+\nI?\n") \
    --reverse <(printf "@rev\nTT\n+\nII\n") \
    --quiet \
    --fastq_truncqual 30 \
    --fastqout - \
    --fastqout_rev -
@fwd
A
+
I
@rev
TT
+
II

A Q-value of ? (30) is equal to our threshold, so fwd is truncated.

If you decide to go that way and to use --fastq_truncqual, you can use it directly when merging. It will truncate the reads before trying to merge them. Here is a toy-example:

vsearch \
    --fastq_mergepairs <(printf "@fwd\nAAAAAAAATA\n+\nIIIIIIIIIH\n") \
    --reverse <(printf "@rev\nTATTTTTTTT\n+\nIIIIIIIIII\n") \
    --fastq_truncqual 39 \
    --fastqout -
Pairs that failed merging due to various reasons:
         1  overlap too short

H (Q=39), so fwd is truncated, and is not long enough anymore to be merged (merging requires a minimal overlap of 10 nucleotides). With --fastq_truncqual 38, there is no trimming and reads are merged.

@simmimourya
Copy link
Author

simmimourya commented Oct 3, 2024

Thank you for your response. I was able to come up with something similar yesterday, glad to see that our approaches align. I saw documentation about truncqual here and came up with this.

def filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False):
    filtered_r1_path = os.path.join(output_dir, "filtered_r1.fastq")
    filtered_r2_path = os.path.join(output_dir, "filtered_r2.fastq")
    # ---------------------- Code change -------------------------------------------
    # Filter R1 and R2 reads
    subprocess.run(f"vsearch --fastq_filter {r1_path} --fastqout {filtered_r1_path} --fastq_truncqual {qscore_threshold}", shell=True, check=True)
    subprocess.run(f"vsearch --fastq_filter {r2_path} --fastqout {filtered_r2_path} --fastq_truncqual {qscore_threshold}", shell=True, check=True)
    # ---------------------- Code change -------------------------------------------
    # Merge filtered reads
    merged_output_prefix = os.path.join(output_dir, "merged")
    merged_output_file = f"{merged_output_prefix}.fastq"
    if not os.path.exists(merged_output_file) or merge_override:
        subprocess.run(f"vsearch --fastq_mergepairs {filtered_r1_path} --reverse {filtered_r2_path} --fastqout {merged_output_file}", shell=True, check=True)
    else:
        print(f"Using existing merged file: {merged_output_file}")
    return merged_output_file
# Example usage
r1_path = "forward.fastq.gz"
r2_path = "reverse.fastq.gz"
output_dir = '../results/'
filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)

Which helped me filter the reads but I'm seeing this issue now:

375836 sequences kept (of which 138630 truncated), 10387 sequences discarded.
vsearch v2.28.1_linux_x86_64, 124.5GB RAM, 16 cores
https://github.com/torognes/vsearch
Reading input file 100%
360321 sequences kept (of which 176234 truncated), 25902 sequences discarded.
vsearch v2.28.1_linux_x86_64, 124.5GB RAM, 16 cores
https://github.com/torognes/vsearch
Merging reads
Fatal error: More forward reads than reverse reads
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[36], line 38
     36 r2_path = "reverse.fastq.gz"
     37 output_dir = '../results/'
---> 38 filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold=30, merge_override=False)
Cell In[36], line 29, in filter_and_merge_reads(r1_path, r2_path, output_dir, qscore_threshold, merge_override)
     26 merged_output_file = f"{merged_output_prefix}.fastq"
     28 if not os.path.exists(merged_output_file) or merge_override:
---> 29     subprocess.run(f"vsearch --fastq_mergepairs {filtered_r1_path} --reverse {filtered_r2_path} --fastqout {merged_output_file}", shell=True, check=True)
     30 else:
     31     print(f"Using existing merged file: {merged_output_file}")
File /opt/conda/lib/python3.9/subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command 'vsearch --fastq_mergepairs ../results/filtered_r1.fastq --reverse ../results/filtered_r2.fastq --fastqout ../results/merged.fastq' returned non-zero exit status 1.

Filtering before merging can lead to issues due to unequal read lengths, which can prevent successful merging. Instead, I switched to merging first and then filtering, and this approach works really well! Now, I want to do a sanity check on some of the filtered sequences to assess metrics like Qscore distribution, cumulative percentage (AccPct), mean quality, etc. I’ve already generated the stats using --fastq-stats. What key metrics would you recommend examining to verify the quality of the FASTQ data after filtering and merging?

# generate stats
def fastq_stats(fastq_file, output_log, fastq_ascii='sanger', fastq_qmin=0, fastq_qmax=40):
    command = (
        f"vsearch --fastq_stats {fastq_file} "
        f"--log {output_log}"
    )

    subprocess.run(command, shell=True, check=True)
    print(f"Statistics generated and saved to {output_log}")

# Example usage
fastq_file = '../results/merged_filtered.fastq'
output_log = '../results/merged_filtered_output_stats.log'
fastq_stats(fastq_file, output_log)

@frederic-mahe
Copy link
Collaborator

Filtering before merging can lead to issues due to unequal read lengths, which can prevent successful merging.

This is not exactly what caused the second issue Fatal error: More forward reads than reverse reads. By filtering your forward and reverse datasets independently, you desync'ed them. Trimming probably resulted in empty fastq entries (zero-length reads) in your reverse dataset, and empty reads are discarded by vsearch. The trimmed reverse dataset now contains less entries than the trimmed forward dataset, which triggers an error when trying to merge the two.

When working with pairs of fastq files, vsearch can process them simultaneously, to preserve synchronization (note the --reverse):

vsearch \
    --fastx_filter <(printf "@fwd\nAA\n+\nI?\n") \
    --reverse <(printf "@rev\nTT\n+\nII\n") \
    --quiet \
    --fastq_truncqual 30 \
    --fastqout - \
    --fastqout_rev -

What key metrics would you recommend examining to verify the quality of the FASTQ data after filtering and merging?

There are several available quality metrics (see FastQC). I work mostly with metabarcoding data, and I tend to focus on the expected error (EE). It is a measure of the risk of errors in a given sequence. I use it to filter out sequences that are both rare and probably erroneous (high EE, or high (EE / length), as EE is length-dependent).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants