Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--fastq_ascii and --fastq_qmax must be no more than 126 #564

Open
goodcouscous opened this issue Jun 14, 2024 · 1 comment
Open

--fastq_ascii and --fastq_qmax must be no more than 126 #564

goodcouscous opened this issue Jun 14, 2024 · 1 comment
Labels

Comments

@goodcouscous
Copy link

if (opt_fastq_ascii + opt_fastq_qmax > 126)

what happens to reads with a q20 score when i set fastq_max to 90?

using this command
vsearch --fastq_filter file_q20.fastq --fastq_qmax 90 -fastaout file_q20.fasta

@torognes
Copy link
Owner

Hi

i'll try to explain:

In a FASTQ file with sequencing reads the quality of each base is represented by a value that indicates the probability that the base is wrong.

The quality (Q) of a base is calculated by the formula Q = - 10 * log10( p ), where p is the error probability. For example, if the probability that the base is wrong is 0.01 (or 1%), the quality value Q = - 10 * log10( 0.01) = - 10 * -2 = 20.

This quality value is represented in the FASTQ file by a character (on the fourth line in each entry). For this character to be printable and visible in an ordinary plain text file, the character must be in the range from the character "!" (exclamation point) (ascii value 33) to "~" (tilde) (ascii value 126). To convert from the quality value Q to the character shown in the FASTQ file, the value 33 is usually added to the quality value. A quality value of 20 will therefore be represented by character number 53 in the ascii table, which is the character "5".

Wikipedia has a nice page about this: https://en.wikipedia.org/wiki/FASTQ_format

The "magic" number 33 that is added to the quality value is the same as the value that is specified with the "--fastq_ascii" option in VSEARCH. (In some old FASTQ files the value may be 64.)

To keep the character that represents the quality within the usable range (33-126), the qualty value (Q) must be kept within the range 0 to 93. Usually Q values above 41 are not used with Illumina sequencers, and some software will not handle higher Q values. However, some new sequencing technologies (Nanopore, PacBio) or merging of reads may give higher quality sequences and corresponding higher Q values. The maximum Q value that will be used by VSEARCH can be limited with the "--fastq_qmax" and "--fastq_qmax_out" options. The default is 41, but you may increase it to 93 if you want to, usually without problems. However, this is rarely necessary. We have considered increasing the default to 93, but have so far left it as it is.

Using the "--fastq_qmax_out" option in VSEARCH will adjust down any quality values at the upper bound specified, a Q value of 50 will be adjusted down to 41 by default. If a higher Q value than specified with "--fastx_qmax" is found in the input file, I think VSEARCH will issue a warning.

When you run "vsearch --fastx_filter" with ordinary reads you usually don't have to specify neither the "--fastq_ascii " nor the "--fastq_qmax" options. If you have unusually high quality reads, you should specify "--fastq_qmax 93".

The "--fastx_filter" in VSEARCH does not filter directly on Q values, but you can truncate sequences (strip off bad quality sequence in the 3' end) with the "--fastq_truncqual" option. However, you can discard sequences with an overall low quality using the "--fastq_maxee" or "--fastq_maxee_rate" options. For instance "--fastq_maxee 1" will discard sequences with a total expected error above 1. The "--fastq_maxee_rate 0.01" will discard sequences with an average of 1% error per base. This roughly corresponds to an average Q value of 20.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants