-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
repeatSubunitLength usage clarification #339
Comments
I also am observing some behavior in non-repetitive regions that may not be quite right. This is a single base deletion, not in a repeating region. It has a state with length=0 and repeatSubunitLength=1.
A single base deletion where the up or downstream base is the same nucleotide is also treated as a repeat (which makes a little more sense):
|
Following this discussion. Here are the current rules
|
Thanks for the link to the spec @korikuzma, it matches the current implementation. I still think it leaves two questions for me:
@ahwagner and @larrybabb would like to hear your thoughts. |
Hey @ehclark.
@larrybabb proposed a length 0 referenceLengthExpression as a means of clarifying the distinction between "this is a sequence deletion" and "I accidentally referenced an empty string here". I think it makes sense, though the most important thing is consistency in implementation–this or an empty string would be okay by me.
Is this the documentation you are referencing? If so, would it be clearer to change:
to
This new class is still in alpha, and if |
Thanks for the additional context @ahwagner. I am probably opening up already well trodden ground here, so I appreciate in advance your patience. When I look at
The derived properties do provide some additional validation by checking concordance. But then why not have a As far as the third item, differentiating unambiguous and ambiguous alleles I think is valuable. But the use of Lastly, I think it would be helpful to avoid the term "repeat", which I think will be misunderstood by many. Ambiguous vs non-ambiguous to me is much clearer. Especially since the normalization/expansion process will often include partial patterns that flank the repeated region, e.g. "acTGTTGTTGaa". My proposal would be:
|
@ehclark I think you have the gist of what the class is expressing, and I appreciate you digging in and asking questions. For the first two defining characteristics you list, I will flip them to give you a sense about how we think about using this class: The This unique feature of The optional The third point you raised about this concept as the normalized (disambiguated) form of multiple different sequence changes (as opposed to there being only a single sequence change that can result in this allele) is true, but not a difference between RLE and LSE. In fact, an RLE is identical to a VOCA normalized Hopefully this is all clear. Your other proposal: Use |
Thanks for the detailed explanation @ahwagner. I agree that supporting large repetitive regions without having to include the literal sequence is an important feature. And the RLE to LSE conversion code explains why using "repeat" as part of the property name makes sense. One follow up here on this point:
My observation is that
Or is my example flawed in some way? |
@ehclark My rationale for choosing RLE over LSE for unambiguous deletions was first based on the idea that we should only pick one way to represent these deletions (which I assume we all agree with). That said, I think we discussed whether it was important or useful to express a I saw an opportunity to use the RLE as a mechanism for unambigous deletions, which I am willing to back off of in favor of using LSE as long as we are clear that we cannot use RLE for 0 length expressions. @ahwagner maybe you have some thoughts in terms of where to draw the line? Would it be reasonable to consider deletions that are the result of the loss of a repeating subunit to be expressed as an RLE with 0 length and all others to be LSE's with an empty string sequence? |
Yes, I agree that the most important concept here is consistency in our representation conventions. The nice thing about the current solution is that it already has an understandable consistency to it: anytime a variant can be derived solely from the reference, you use an RLE. And optionally, you can populate the However, it would add considerable complexity to draw a line / find a middle ground where representation of a complete deletion differs between states like "complete deletion of a repeating region" against "complete deletion of a non-repeating region". For example, how would we treat the deletion of a As I said, I don't have any strong opinions about this (though I slightly favor the status quo), and it sounds like @larrybabb also doesn't have a strong opinion so long as consistency is upheld. It might be worthwhile, however, to express the benefits of the LSE |
I think the real issue here is that the documentation is not in sync with the the RLE concept. When I first encountered the RLE, I went and looked at the documentation in the model, which states that an RLE is:
In that frame of reference, using an RLE for an unambiguous deletion is an improper mixing of concepts. And now with @ahwagner further clarifying that an RLE is:
That definition aligns the concept and the implementation and provides a clear distinction between RLE and LSE. So I think that all we need here is to update the documentation. The implementation is good. |
Cool! Thanks @ehclark. I'm closing this thread and will make a note about this specific component of RLE documentation at (ga4gh/vrs#426). |
I am seeing something I did not expect and wondering if this is the expected behavior. It is related to the ‘repeatSubunitLength’ property of the ReferenceLengthExpression. I am generating VRS objects for insertions and deletions in a microsatellite with a trinucleotide repeat. My expectation is that the repeatSubunitLength should be the same regardless of the number of repeats I add or remove. But that is not what I am seeing in the output. Instead repeatSubunitLength is equal to the number of bases inserted or deleted.
The text was updated successfully, but these errors were encountered: