coverage-based instead of counter-based normalisation #71
This pull request addresses normalisation problems we encountered while experimenting with sequencing SARS-CoV-2 using long amplicons (https://www.biorxiv.org/content/10.1101/2020.05.28.122648v3) and rapid sequencing kits. In these cases, the amplicon coverage essentially follows a normal distribution, and counter-based normalisation often leads to low-coverage terminal regions close to the overlap of two amplicons.
Instead of simply counting the number of reads for each primer pair, the coverage of both strands is tracked in terms of the start and end points of alignments. A read is dropped only if the strand-specific coverage at every position in its aligned region is already equal to or above the requested normalisation threshold. In most cases, this should only marginally influence the behaviour of the align_trim script: because a read is kept as long as at least one of its positions is still below the threshold, the normalisation threshold becomes a lower bound on coverage (where enough reads are available) instead of an upper bound.
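To make the drop rule concrete, here is a minimal sketch of it in plain Python. It is not the code from this pull request: the function name `normalise_by_coverage`, the `(start, end, is_reverse)` tuple layout with an exclusive `end`, and the per-position depth lists are all assumptions chosen for illustration, and the actual implementation may track start and end points more efficiently.

```python
from typing import Iterable, List, Tuple

# Hypothetical alignment record: (start, end, is_reverse), with end exclusive.
Alignment = Tuple[int, int, bool]


def normalise_by_coverage(
    alignments: Iterable[Alignment], threshold: int, ref_length: int
) -> List[Alignment]:
    """Keep a read unless every position it covers is already at or above
    the threshold on the read's own strand."""
    # One depth list per strand (False = forward, True = reverse).
    coverage = {False: [0] * ref_length, True: [0] * ref_length}
    kept: List[Alignment] = []
    for start, end, is_reverse in alignments:
        depth = coverage[is_reverse]
        # Drop the read only if the whole aligned region is saturated.
        if all(d >= threshold for d in depth[start:end]):
            continue
        # Otherwise keep it and update the strand-specific coverage.
        for pos in range(start, end):
            depth[pos] += 1
        kept.append((start, end, is_reverse))
    return kept


reads = [
    (0, 10, False),
    (0, 10, False),
    (0, 10, False),  # dropped: positions 0-9 already at depth 2 (forward)
    (5, 15, False),  # kept: positions 10-14 are still uncovered
    (0, 10, True),   # kept: the reverse strand is tracked separately
]
kept = normalise_by_coverage(reads, threshold=2, ref_length=20)
```

In this toy input the fourth read is kept even though positions 5 to 9 are already at the threshold, so their forward-strand depth ends up at 3. This is the sense in which the threshold acts as a lower rather than an upper bound.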
While the coverage is tracked for each strand individually, it is currently not tracked individually for each amplicon in overlap regions. I cannot think of a scenario where this might be problematic, but I wanted to mention it in case it matters for any use case.