Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MarkDuplicates strategy of flow based reads that looks only at the qualities close to the end of the read #1942

Merged

Conversation

ilyasoifer
Copy link
Contributor

@ilyasoifer ilyasoifer commented Feb 10, 2024

Description

  1. We found that when the duplication rate is very high, we sometimes also get variability at the start position (although usually it is more precise than the end position). In this version we added a parameter that also allows for an uncertainty in the read start position to be counted as duplicate. We introduced a parameter UNPAIRED_START_UNCERTAINTY in flow mode.
  2. When duplication rate is very high for large cluster of reads when all reads (start,end) are (A,B) we will sometimes get read with (start,end) at (A-1,B). In the default configuration, we will select the read with the best quality sum over all the reads. However, we found that if it is desirable to select read that starts and ends on (A,B) it is better to only look at the qualities close to the end. We thus introduced a new strategy: FLOW_END_QUALITY_STRATEGY.

Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on github actions

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

.gitignore Show resolved Hide resolved
public enum FLOW_DUPLICATE_SELECTION_STRATEGY {
FLOW_QUALITY_SUM_STRATEGY,
FLOW_END_QUALITY_STRATEGY
}
@Argument(doc = "enable parameters and behavior specific to flow based reads.", optional = true)
public boolean FLOW_MODE = false;

@Argument(doc = "Use specific quality summing strategy for flow based reads. The strategy ensures that the same " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this doc still up to date with the new enum?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -236,8 +273,8 @@ static protected int getFlowSumOfBaseQualities(final SAMRecord rec, final int th
final byte[] bases = rec.getReadBases();

// create iteration range and direction
final int startingOffset = !rec.getReadNegativeStrandFlag() ? 0 : bases.length;
final int endOffset = !rec.getReadNegativeStrandFlag() ? bases.length : 0;
final int startingOffset = !rec.getReadNegativeStrandFlag() ? 0 : bases.length - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this previously a bug? I don't see how this change is related to allowing a range in the start position or changing the quality score calculation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was a bug indeed

@@ -59,6 +60,7 @@ public class MarkDuplicatesForFlowHelper implements MarkDuplicatesHelper {
public static final char[] CLIPPING_TAG_CONTAINS_AQ = {'A', 'Q'};
public static final char[] CLIPPING_TAG_CONTAINS_QZ = {'Q', 'Z'};

public static final int DIST_FROM_END = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth mentioning that the distance from the end is hard coded as 10 bases for the FLOW_END_QUALITY_STRATEGY

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mentioned in the parameter description. Is this OK?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, perfect

Copy link
Contributor Author

@ilyasoifer ilyasoifer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meganshand - comments addressed, please take another look

Copy link
Contributor

@meganshand meganshand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -59,6 +60,7 @@ public class MarkDuplicatesForFlowHelper implements MarkDuplicatesHelper {
public static final char[] CLIPPING_TAG_CONTAINS_AQ = {'A', 'Q'};
public static final char[] CLIPPING_TAG_CONTAINS_QZ = {'Q', 'Z'};

public static final int DIST_FROM_END = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, perfect

@kockan kockan merged commit c026f94 into broadinstitute:master Feb 14, 2024
6 checks passed
@ilyasoifer ilyasoifer deleted the ilyasoifer/md.improvements.flow branch June 1, 2024 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants