repeatSubunitLength usage clarification #339

ehclark · 2024-02-05T15:51:42Z

I am seeing something I did not expect and wondering if this is the expected behavior. It is related to the ‘repeatSubunitLength’ property of the ReferenceLengthExpression. I am generating VRS objects for insertions and deletions in a microsatellite with a trinucleotide repeat. My expectation is that the repeatSubunitLength should be the same regardless of the number of repeats I add or remove. But that is not what I am seeing in the output. Instead repeatSubunitLength is equal to the number of bases inserted or deleted.

>>> allele = tlr.translate_from("NC_000023.11:g.147912058_147912060del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.Cx3MdA7VamLE17Eqafa4v9b296QpuGYS', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 28, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 3}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912077_147912078insCGG", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.ULNfkzfUQ7Kp4t8dg1d_oiV56S9cToF5', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 34, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 3}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912058_147912063del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.EmkBPxnh8j492hYzvtgbM0P8TsZvOzFL', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 25, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 6}}

>>> allele = tlr.translate_from("NC_000023.11:g.147912059_147912060insCGGCGG", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.pBsVgXLMnj6eUOHtGL1mWdW_SyD9mf4h', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.YOp1PlTsk_ZizADZJQ_kVREeSMYPrNIR', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP'}, 'start': 147912049, 'end': 147912080}, 'state': {'type': 'ReferenceLengthExpression', 'length': 37, 'sequence': 'GCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGGCGG', 'repeatSubunitLength': 6}}

The text was updated successfully, but these errors were encountered:

ehclark · 2024-02-05T18:20:18Z

I also am observing some behavior in non-repetitive regions that may not be quite right.

This is a single base deletion, not in a repeating region. It has a state with length=0 and repeatSubunitLength=1.

>>> allele = tlr.translate_from("NC_000011.10:g.47331510del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.Xp2OgO2bHJRdlMeIwraBzMDbGCvgzloN', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.Ve1J7qpOWKbedMNCKKAlRwCQWCDYAbwg', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.2NkFm8HK88MqeNkCgj78KidCAXgnsfV1'}, 'start': 47331509, 'end': 47331510}, 'state': {'type': 'ReferenceLengthExpression', 'length': 0, 'sequence': '', 'repeatSubunitLength': 1}}

A single base deletion where the up or downstream base is the same nucleotide is also treated as a repeat (which makes a little more sense):

>>> allele = tlr.translate_from("NC_000011.10:g.47331507del", "hgvs")
>>> allele.model_dump(exclude_none=True)
{'id': 'ga4gh:VA.WSzSnW_eavgwBgxHQ_XpzbtDLHdbvzOw', 'type': 'Allele', 'location': {'id': 'ga4gh:SL.8kYza2lLqBNFHCqWpMoXvjsgqYEDhbHe', 'type': 'SequenceLocation', 'sequenceReference': {'type': 'SequenceReference', 'refgetAccession': 'SQ.2NkFm8HK88MqeNkCgj78KidCAXgnsfV1'}, 'start': 47331506, 'end': 47331509}, 'state': {'type': 'ReferenceLengthExpression', 'length': 2, 'sequence': 'CC', 'repeatSubunitLength': 1}}

korikuzma · 2024-02-05T18:29:59Z

Following this discussion. Here are the current rules

2c
one is empty, the input Allele is an insertion (empty reference sequence) or a deletion (empty alternate sequence). Store the length of the non-empty sequence: this is the Repeat Subunit Length.
...
4b
Otherwise, return a new Allele using a reference length expression, using a Location specified by the coordinates of the left_roll_bound and right_roll_bound, a length specified by the length of the alternate allele, and a repeat subunit length as determined in step 2c.

ehclark · 2024-02-05T19:19:02Z

Thanks for the link to the spec @korikuzma, it matches the current implementation.

I still think it leaves two questions for me:

If a LiteralSequenceExpression is used for an unambiguous insertion, why isn't it used for an unambiguous deletion?
According to the spec, the repeatSubunitLength is actually the number of inserted or deleted bases. It seems to me that no matter how many repetitions a repeated region is expanded or contracted by it should have the same repeatSubunitLength value. Or maybe the "repeat" term is implying something here that it should not be?

@ahwagner and @larrybabb would like to hear your thoughts.

ahwagner · 2024-02-05T21:21:38Z

Hey @ehclark.

If a LiteralSequenceExpression is used for an unambiguous insertion, why isn't it used for an unambiguous deletion?

@larrybabb proposed a length 0 referenceLengthExpression as a means of clarifying the distinction between "this is a sequence deletion" and "I accidentally referenced an empty string here". I think it makes sense, though the most important thing is consistency in implementation–this or an empty string would be okay by me.

According to the spec, the repeatSubunitLength is actually the number of inserted or deleted bases. It seems to me that no matter how many repetitions a repeated region is expanded or contracted by it should have the same repeatSubunitLength value. Or maybe the "repeat" term is implying something here that it should not be?

Is this the documentation you are referencing? If so, would it be clearer to change:

The number of residues in the repeat subunit.

to

The length of the residue subunit that is expanded or contracted to length.

This new class is still in alpha, and if repeatSubunitLength would be clearer as subunitLength to accompany the above definition change, I think that would be an improvement and I would support it.

ehclark · 2024-02-05T22:52:59Z

Thanks for the additional context @ahwagner. I am probably opening up already well trodden ground here, so I appreciate in advance your patience.

When I look at LiteralSequenceExpression vs ReferenceLengthExpression, I see 3 things that set apart the latter:

The length property, which can be derived from the sequence property
The repeatSubunitLength property, which can be derived via abs(location.end - location.start - length(state.sequence))
The concept that this allele is the normalized (disambiguated) form of multiple different sequence changes (as opposed to there being only a single sequence change that can result in this allele)

The derived properties do provide some additional validation by checking concordance. But then why not have a length property on the LiteralSequenceExpression and SequenceLocation? Seems inconsistent. Or they could be properties that are not serialized and digested to create the ID? That would eliminate the derived properties as a source of compatibility problems in the future as things evolve.

As far as the third item, differentiating unambiguous and ambiguous alleles I think is valuable. But the use of ReferenceLengthExpression for unambiguous deletions makes that less straightforward.

Lastly, I think it would be helpful to avoid the term "repeat", which I think will be misunderstood by many. Ambiguous vs non-ambiguous to me is much clearer. Especially since the normalization/expansion process will often include partial patterns that flank the repeated region, e.g. "acTGTTGTTGaa".

My proposal would be:

Drop the length and repeatSubunitLength properties from ReferenceLengthExpression
Use LiteralSequenceExpression for unambiguous deletions

ahwagner · 2024-02-06T01:54:12Z

@ehclark I think you have the gist of what the class is expressing, and I appreciate you digging in and asking questions. For the first two defining characteristics you list, I will flip them to give you a sense about how we think about using this class:

The length and repeatSubunitLength properties may be used with the SequenceLocation in the parent object to derive sequence.

This unique feature of ReferenceLengthExpression allows us to express VRS alleles compactly (omitting sequence) when dealing with ambiguous changes in large (sometimes very large) repeating regions. It also lets us fix some dissatisfying behaviors around HGVS-style repeating sequences and provides a means for variable-length repeats (see here for writeup).

The optional sequence attribute in a ReferenceLengthExpression is only used for passing the state explicitly if so desired (i.e. when length is small), but it is omitted from the object when serializing to compute the digest and is therefore never required. This is, in essence, an implementation of the inverse of your first proposal (Drop the length and repeatSubunitLength properties from ReferenceLengthExpression).

The third point you raised about this concept as the normalized (disambiguated) form of multiple different sequence changes (as opposed to there being only a single sequence change that can result in this allele) is true, but not a difference between RLE and LSE. In fact, an RLE is identical to a VOCA normalized Allele using a LiteralSequenceExpression. Here is some example code that demonstrates how to derive a VOCA-normalized LSE from the RLE (without RLE sequence).

Hopefully this is all clear.

Your other proposal: Use LiteralSequenceExpression for unambiguous deletions I will leave to @larrybabb to comment on; I have no strong feelings one way or the other.

ehclark · 2024-02-06T14:38:20Z

Thanks for the detailed explanation @ahwagner. I agree that supporting large repetitive regions without having to include the literal sequence is an important feature. And the RLE to LSE conversion code explains why using "repeat" as part of the property name makes sense.

One follow up here on this point:

...but it (sequence) is omitted from the object when serializing to compute the digest and is therefore never required

My observation is that sequence is being included in the digest:

>>> allele = tlr.translate_from("NC_000001.11:g.3069064_3069065insCGGC", "hgvs")
>>> ga4gh_digest(allele)
'4qmFaYtg5WumZyfZyXwTLpT0Ep6JcB1G'
>>> ga4gh_serialize(allele)
b'{"location":"h-hgjB9jx0D8PfLIjJA9PoZ34u7NNf83","state":{"length":8,"repeatSubunitLength":4,"sequence":"CGGCCGGC","type":"ReferenceLengthExpression"},"type":"Allele"}'
>>> sha512t24u(b'{"location":"h-hgjB9jx0D8PfLIjJA9PoZ34u7NNf83","state":{"length":8,"repeatSubunitLength":4,"sequence":"CGGCCGGC","type":"ReferenceLengthExpression"},"type":"Allele"}')
'4qmFaYtg5WumZyfZyXwTLpT0Ep6JcB1G'

Or is my example flawed in some way?

ahwagner · 2024-02-06T17:07:26Z

That is a good observation @ehclark; this is a known issue (#337) and has an open PR to address (#336).

larrybabb · 2024-02-06T21:26:01Z

Your other proposal: Use LiteralSequenceExpression for unambiguous deletions I will leave to @larrybabb to comment on; I have no strong feelings one way or the other.

@ehclark My rationale for choosing RLE over LSE for unambiguous deletions was first based on the idea that we should only pick one way to represent these deletions (which I assume we all agree with). That said, I think we discussed whether it was important or useful to express a repeatingSubunitLength and sequence in situations whereby the repeated subunit was completely deleted. I also was considering the idea that expressing a length of 0 was much more explicit in terms of positively expressing a deletion. Finally, we had wrestled a while ago with the notion of whether an LSE sequence should be NULL or an empty string "" and whether that might cause concern since all other attributes require a value.

I saw an opportunity to use the RLE as a mechanism for unambigous deletions, which I am willing to back off of in favor of using LSE as long as we are clear that we cannot use RLE for 0 length expressions.

@ahwagner maybe you have some thoughts in terms of where to draw the line? Would it be reasonable to consider deletions that are the result of the loss of a repeating subunit to be expressed as an RLE with 0 length and all others to be LSE's with an empty string sequence?

ahwagner · 2024-02-07T02:08:58Z

Yes, I agree that the most important concept here is consistency in our representation conventions.

The nice thing about the current solution is that it already has an understandable consistency to it: anytime a variant can be derived solely from the reference, you use an RLE. And optionally, you can populate the .sequence attribute with the derived sequence. This works for complete deletions, repeat region extension/retraction, and reference-match alleles. An alternative that is nearly as simple (adding only a short constant-time operation) is the alternative @ehclark proposed: treat all complete (unambiguous) deletions as LSEs with state: "". I think if we are going to depart from our current solution, the proposed alternative is sound.

However, it would add considerable complexity to draw a line / find a middle ground where representation of a complete deletion differs between states like "complete deletion of a repeating region" against "complete deletion of a non-repeating region". For example, how would we treat the deletion of a CGC, when it is technically a length 3 extension of a CG repeat subunit? Or treat deletion of AA if it is a length 2 extension of an A? We quickly would find ourselves in the murky waters of assigning arbitrary thresholds to repeatSubunitLength and length to know when we should use an RLE or an LSE.

As I said, I don't have any strong opinions about this (though I slightly favor the status quo), and it sounds like @larrybabb also doesn't have a strong opinion so long as consistency is upheld. It might be worthwhile, however, to express the benefits of the LSE.sequence: "" solution before committing to a change. @ehclark would you mind elaborating a bit on how changing this representation would help with your implementation, to help us weigh the pros and cons? Would populating the RLE.sequence attribute with "" for unambiguous deletions meet your use case?

ehclark · 2024-02-07T13:53:21Z

I think the real issue here is that the documentation is not in sync with the the RLE concept. When I first encountered the RLE, I went and looked at the documentation in the model, which states that an RLE is:

An expression of a length of a sequence from a repeating reference.

In that frame of reference, using an RLE for an unambiguous deletion is an improper mixing of concepts. And now with @ahwagner further clarifying that an RLE is:

anytime a variant can be derived solely from the reference, you use an RLE

That definition aligns the concept and the implementation and provides a clear distinction between RLE and LSE.

So I think that all we need here is to update the documentation. The implementation is good.

ahwagner · 2024-02-07T18:39:20Z

Cool! Thanks @ehclark. I'm closing this thread and will make a note about this specific component of RLE documentation at (ga4gh/vrs#426).

ehclark assigned ahwagner and larrybabb Feb 5, 2024

ahwagner changed the title ~~Problem with repeatSubunitLength calculation~~ repeatSubunitLength usage clarification Feb 6, 2024

ahwagner closed this as completed Feb 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

repeatSubunitLength usage clarification #339

repeatSubunitLength usage clarification #339

ehclark commented Feb 5, 2024

ehclark commented Feb 5, 2024

korikuzma commented Feb 5, 2024

ehclark commented Feb 5, 2024

ahwagner commented Feb 5, 2024 •

edited

Loading

ehclark commented Feb 5, 2024

ahwagner commented Feb 6, 2024 •

edited

Loading

ehclark commented Feb 6, 2024

ahwagner commented Feb 6, 2024

larrybabb commented Feb 6, 2024

ahwagner commented Feb 7, 2024 •

edited

Loading

ehclark commented Feb 7, 2024

ahwagner commented Feb 7, 2024

repeatSubunitLength usage clarification #339

repeatSubunitLength usage clarification #339

Comments

ehclark commented Feb 5, 2024

ehclark commented Feb 5, 2024

korikuzma commented Feb 5, 2024

ehclark commented Feb 5, 2024

ahwagner commented Feb 5, 2024 • edited Loading

ehclark commented Feb 5, 2024

ahwagner commented Feb 6, 2024 • edited Loading

ehclark commented Feb 6, 2024

ahwagner commented Feb 6, 2024

larrybabb commented Feb 6, 2024

ahwagner commented Feb 7, 2024 • edited Loading

ehclark commented Feb 7, 2024

ahwagner commented Feb 7, 2024

ahwagner commented Feb 5, 2024 •

edited

Loading

ahwagner commented Feb 6, 2024 •

edited

Loading

ahwagner commented Feb 7, 2024 •

edited

Loading