Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repeatSubunitLength usage clarification #339

Closed
ehclark opened this issue Feb 5, 2024 · 12 comments
Closed

repeatSubunitLength usage clarification #339

ehclark opened this issue Feb 5, 2024 · 12 comments
Assignees

Comments

@ehclark
Copy link
Contributor

ehclark commented Feb 5, 2024

I am seeing something I did not expect and wondering if this is the expected behavior. It is related to the ‘repeatSubunitLength’ property of the ReferenceLengthExpression. I am generating VRS objects for insertions and deletions in a microsatellite with a trinucleotide repeat. My expectation is that the repeatSubunitLength should be the same regardless of the number of repeats I add or remove. But that is not what I am seeing in the output. Instead repeatSubunitLength is equal to the number of bases inserted or deleted.

>>> allele = tlr.translate_from("NC_000023.11:g.147912058_147912060del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.Cx3MdA7VamLE17Eqafa4v9b296QpuGYS', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 28, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 3}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912077_147912078insCGG", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.ULNfkzfUQ7Kp4t8dg1d_oiV56S9cToF5', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 34, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 3}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912058_147912063del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.EmkBPxnh8j492hYzvtgbM0P8TsZvOzFL', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 25, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 6}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912059_147912060insCGGCGG", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.pBsVgXLMnj6eUOHtGL1mWdW_SyD9mf4h', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 37, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 6}}
@ehclark
Copy link
Contributor Author

ehclark commented Feb 5, 2024

I also am observing some behavior in non-repetitive regions that may not be quite right.

This is a single base deletion, not in a repeating region. It has a state with length=0 and repeatSubunitLength=1.

>>> allele = tlr.translate_from("NC_000011.10:g.47331510del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.Xp2OgO2bHJRdlMeIwraBzMDbGCvgzloN', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.Ve1J7qpOWKbedMNCKKAlRwCQWCDYAbwg', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.2NkFm8HK88MqeNkCgj78KidCAXgnsfV1'}, 'start': 47331509, 'end': 47331510}, 'state': {'type': 'ReferenceLengthExpression', 'length': 0, 'sequence': '', 'repeatSubunitLength': 1}}

A single base deletion where the up or downstream base is the same nucleotide is also treated as a repeat (which makes a little more sense):

>>> allele = tlr.translate_from("NC_000011.10:g.47331507del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.WSzSnW_eavgwBgxHQ_XpzbtDLHdbvzOw', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.8kYza2lLqBNFHCqWpMoXvjsgqYEDhbHe', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.2NkFm8HK88MqeNkCgj78KidCAXgnsfV1'}, 'start': 47331506, 'end': 47331509}, 'state': {'type': 'ReferenceLengthExpression', 'length': 2, 'sequence': 'CC', 'repeatSubunitLength': 1}}

@korikuzma
Copy link
Contributor

Following this discussion. Here are the current rules

2c
one is empty, the input Allele is an insertion (empty reference sequence) or a deletion (empty alternate sequence). Store the length of the non-empty sequence: this is the Repeat Subunit Length.
...
4b
Otherwise, return a new Allele using a reference length expression, using a Location specified by the coordinates of the left_roll_bound and right_roll_bound, a length specified by the length of the alternate allele, and a repeat subunit length as determined in step 2c.

@ehclark
Copy link
Contributor Author

ehclark commented Feb 5, 2024

Thanks for the link to the spec @korikuzma, it matches the current implementation.

I still think it leaves two questions for me:

  1. If a LiteralSequenceExpression is used for an unambiguous insertion, why isn't it used for an unambiguous deletion?
  2. According to the spec, the repeatSubunitLength is actually the number of inserted or deleted bases. It seems to me that no matter how many repetitions a repeated region is expanded or contracted by it should have the same repeatSubunitLength value. Or maybe the "repeat" term is implying something here that it should not be?

@ahwagner and @larrybabb would like to hear your thoughts.

@ahwagner
Copy link
Member

ahwagner commented Feb 5, 2024

Hey @ehclark.

  1. If a LiteralSequenceExpression is used for an unambiguous insertion, why isn't it used for an unambiguous deletion?

@larrybabb proposed a length 0 referenceLengthExpression as a means of clarifying the distinction between "this is a sequence deletion" and "I accidentally referenced an empty string here". I think it makes sense, though the most important thing is consistency in implementation–this or an empty string would be okay by me.

  1. According to the spec, the repeatSubunitLength is actually the number of inserted or deleted bases. It seems to me that no matter how many repetitions a repeated region is expanded or contracted by it should have the same repeatSubunitLength value. Or maybe the "repeat" term is implying something here that it should not be?

Is this the documentation you are referencing? If so, would it be clearer to change:

The number of residues in the repeat subunit.

to

The length of the residue subunit that is expanded or contracted to length.

This new class is still in alpha, and if repeatSubunitLength would be clearer as subunitLength to accompany the above definition change, I think that would be an improvement and I would support it.

@ehclark
Copy link
Contributor Author

ehclark commented Feb 5, 2024

Thanks for the additional context @ahwagner. I am probably opening up already well trodden ground here, so I appreciate in advance your patience.

When I look at LiteralSequenceExpression vs ReferenceLengthExpression, I see 3 things that set apart the latter:

  1. The length property, which can be derived from the sequence property
  2. The repeatSubunitLength property, which can be derived via abs(location.end - location.start - length(state.sequence))
  3. The concept that this allele is the normalized (disambiguated) form of multiple different sequence changes (as opposed to there being only a single sequence change that can result in this allele)

The derived properties do provide some additional validation by checking concordance. But then why not have a length property on the LiteralSequenceExpression and SequenceLocation? Seems inconsistent. Or they could be properties that are not serialized and digested to create the ID? That would eliminate the derived properties as a source of compatibility problems in the future as things evolve.

As far as the third item, differentiating unambiguous and ambiguous alleles I think is valuable. But the use of ReferenceLengthExpression for unambiguous deletions makes that less straightforward.

Lastly, I think it would be helpful to avoid the term "repeat", which I think will be misunderstood by many. Ambiguous vs non-ambiguous to me is much clearer. Especially since the normalization/expansion process will often include partial patterns that flank the repeated region, e.g. "acTGTTGTTGaa".

My proposal would be:

  • Drop the length and repeatSubunitLength properties from ReferenceLengthExpression
  • Use LiteralSequenceExpression for unambiguous deletions

@ahwagner
Copy link
Member

ahwagner commented Feb 6, 2024

@ehclark I think you have the gist of what the class is expressing, and I appreciate you digging in and asking questions. For the first two defining characteristics you list, I will flip them to give you a sense about how we think about using this class:

The length and repeatSubunitLength properties may be used with the SequenceLocation in the parent object to derive sequence.

This unique feature of ReferenceLengthExpression allows us to express VRS alleles compactly (omitting sequence) when dealing with ambiguous changes in large (sometimes very large) repeating regions. It also lets us fix some dissatisfying behaviors around HGVS-style repeating sequences and provides a means for variable-length repeats (see here for writeup).

The optional sequence attribute in a ReferenceLengthExpression is only used for passing the state explicitly if so desired (i.e. when length is small), but it is omitted from the object when serializing to compute the digest and is therefore never required. This is, in essence, an implementation of the inverse of your first proposal (Drop the length and repeatSubunitLength properties from ReferenceLengthExpression).

The third point you raised about this concept as the normalized (disambiguated) form of multiple different sequence changes (as opposed to there being only a single sequence change that can result in this allele) is true, but not a difference between RLE and LSE. In fact, an RLE is identical to a VOCA normalized Allele using a LiteralSequenceExpression. Here is some example code that demonstrates how to derive a VOCA-normalized LSE from the RLE (without RLE sequence).

Hopefully this is all clear.

Your other proposal: Use LiteralSequenceExpression for unambiguous deletions I will leave to @larrybabb to comment on; I have no strong feelings one way or the other.

@ehclark
Copy link
Contributor Author

ehclark commented Feb 6, 2024

Thanks for the detailed explanation @ahwagner. I agree that supporting large repetitive regions without having to include the literal sequence is an important feature. And the RLE to LSE conversion code explains why using "repeat" as part of the property name makes sense.

One follow up here on this point:

...but it (sequence) is omitted from the object when serializing to compute the digest and is therefore never required

My observation is that sequence is being included in the digest:

>>> allele = tlr.translate_from("NC_000001.11:g.3069064_3069065insCGGC", "hgvs")
>>> ga4gh_digest(allele)
'4qmFaYtg5WumZyfZyXwTLpT0Ep6JcB1G'
>>> ga4gh_serialize(allele)
b'{"location":"h-hgjB9jx0D8PfLIjJA9PoZ34u7NNf83","state":{"length":8,"repeatSubunitLength":4,"sequence":"CGGCCGGC","type":"ReferenceLengthExpression"},"type":"Allele"}'
>>> sha512t24u(b'{"location":"h-hgjB9jx0D8PfLIjJA9PoZ34u7NNf83","state":{"length":8,"repeatSubunitLength":4,"sequence":"CGGCCGGC","type":"ReferenceLengthExpression"},"type":"Allele"}')
'4qmFaYtg5WumZyfZyXwTLpT0Ep6JcB1G'

Or is my example flawed in some way?

@ahwagner
Copy link
Member

ahwagner commented Feb 6, 2024

That is a good observation @ehclark; this is a known issue (#337) and has an open PR to address (#336).

@ahwagner ahwagner changed the title Problem with repeatSubunitLength calculation repeatSubunitLength usage clarification Feb 6, 2024
@larrybabb
Copy link
Contributor

Your other proposal: Use LiteralSequenceExpression for unambiguous deletions I will leave to @larrybabb to comment on; I have no strong feelings one way or the other.

@ehclark My rationale for choosing RLE over LSE for unambiguous deletions was first based on the idea that we should only pick one way to represent these deletions (which I assume we all agree with). That said, I think we discussed whether it was important or useful to express a repeatingSubunitLength and sequence in situations whereby the repeated subunit was completely deleted. I also was considering the idea that expressing a length of 0 was much more explicit in terms of positively expressing a deletion. Finally, we had wrestled a while ago with the notion of whether an LSE sequence should be NULL or an empty string "" and whether that might cause concern since all other attributes require a value.

I saw an opportunity to use the RLE as a mechanism for unambigous deletions, which I am willing to back off of in favor of using LSE as long as we are clear that we cannot use RLE for 0 length expressions.

@ahwagner maybe you have some thoughts in terms of where to draw the line? Would it be reasonable to consider deletions that are the result of the loss of a repeating subunit to be expressed as an RLE with 0 length and all others to be LSE's with an empty string sequence?

@ahwagner
Copy link
Member

ahwagner commented Feb 7, 2024

Yes, I agree that the most important concept here is consistency in our representation conventions.

The nice thing about the current solution is that it already has an understandable consistency to it: anytime a variant can be derived solely from the reference, you use an RLE. And optionally, you can populate the .sequence attribute with the derived sequence. This works for complete deletions, repeat region extension/retraction, and reference-match alleles. An alternative that is nearly as simple (adding only a short constant-time operation) is the alternative @ehclark proposed: treat all complete (unambiguous) deletions as LSEs with state: "". I think if we are going to depart from our current solution, the proposed alternative is sound.

However, it would add considerable complexity to draw a line / find a middle ground where representation of a complete deletion differs between states like "complete deletion of a repeating region" against "complete deletion of a non-repeating region". For example, how would we treat the deletion of a CGC, when it is technically a length 3 extension of a CG repeat subunit? Or treat deletion of AA if it is a length 2 extension of an A? We quickly would find ourselves in the murky waters of assigning arbitrary thresholds to repeatSubunitLength and length to know when we should use an RLE or an LSE.

As I said, I don't have any strong opinions about this (though I slightly favor the status quo), and it sounds like @larrybabb also doesn't have a strong opinion so long as consistency is upheld. It might be worthwhile, however, to express the benefits of the LSE.sequence: "" solution before committing to a change. @ehclark would you mind elaborating a bit on how changing this representation would help with your implementation, to help us weigh the pros and cons? Would populating the RLE.sequence attribute with "" for unambiguous deletions meet your use case?

@ehclark
Copy link
Contributor Author

ehclark commented Feb 7, 2024

I think the real issue here is that the documentation is not in sync with the the RLE concept. When I first encountered the RLE, I went and looked at the documentation in the model, which states that an RLE is:

An expression of a length of a sequence from a repeating reference.

In that frame of reference, using an RLE for an unambiguous deletion is an improper mixing of concepts. And now with @ahwagner further clarifying that an RLE is:

anytime a variant can be derived solely from the reference, you use an RLE

That definition aligns the concept and the implementation and provides a clear distinction between RLE and LSE.

So I think that all we need here is to update the documentation. The implementation is good.

@ahwagner
Copy link
Member

ahwagner commented Feb 7, 2024

Cool! Thanks @ehclark. I'm closing this thread and will make a note about this specific component of RLE documentation at (ga4gh/vrs#426).

@ahwagner ahwagner closed this as completed Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants