Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add cigar utility method get_alignment_offsets #188

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
44 changes: 43 additions & 1 deletion fgpyo/sam/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -368,7 +368,7 @@ class CigarOp(enum.Enum):
code (int): The `~pysam` cigar operator code.
character (int): The single character cigar operator.
consumes_query (bool): True if this operator consumes query bases, False otherwise.
consumes_target (bool): True if this operator consumes target bases, False otherwise.
consumes_reference (bool): True if this operator consumes reference bases, False otherwise.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yfarjoun This uses target intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.

Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively

Screenshot 2024-09-19 at 11 23 12 AM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% on different PR. Though I'll also argue there for maintaining "target".

Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛

"""

M = (0, "M", True, True) #: Match or Mismatch the reference
Expand Down Expand Up @@ -547,6 +547,48 @@ def length_on_target(self) -> int:
"""Returns the length of the alignment on the target sequence."""
return sum([elem.length_on_target for elem in self.elements])

def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is quite ambiguous given that we're dealing with an alignment between a query and a target. How about one of the following:

get_query_alignment_offsets
get_query_alignment_bounds
get_query_alignment_range

Also: consider returning a range object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean a python range ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat idea!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions prefixed with get_ are usually reserved for true property getters.

Can we drop the prefix too?

Suggested change
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:
def alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]:

https://google.github.io/styleguide/pyguide.html#315-getters-and-setters

"""
Get the starting and ending offsets for the alignment based on the CIGAR string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole docstring needs to be out-dented to start at the same column as the """.

Copy link
Contributor Author

@yfarjoun yfarjoun Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this and mkdocs responds: 🤮

<SNIP>
^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
    {% extends "_base/children.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 73, in top-level template code
    {% include class|get_template with context %}
^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/class.html.jinja", line 1, in top-level template code
    {% extends "_base/class.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 105, in top-level template code
    {% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 179, in block 'contents'
    {% block children scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 186, in block 'children'
    {% include "children"|get_template with context %}
^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
    {% extends "_base/children.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 94, in top-level template code
    {% include function|get_template with context %}
^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/function.html.jinja", line 1, in top-level template code
    {% extends "_base/function.html.jinja" %}
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 105, in top-level template code
    {% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'contents'
    {% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'docstring'
    {% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/jinja2/environment.py", line 487, in getattr
    return getattr(obj, attribute)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/functools.py", line 993, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 119, in parsed
    return self.parse()
           ^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 136, in parse
    return parse(self, parser or self.parser, **(options or self.parser_options))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/parsers.py", line 41, in parse
    return parsers[parser](docstring, **options)  # type: ignore[operator]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 809, in parse
    section, offset = reader(docstring, offset=offset + 1, **options)  # type: ignore[operator]
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 474, in _read_returns_section
    annotation = annotation.slice.elements[index]
                 ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Get the starting and ending offsets for the alignment based on the CIGAR string.
Get the 0-based, end-exclusive positions of the first and last aligned base in the query.


Args:
reverse: If True, count from the end of the read sequence.
Otherwise (default behavior), count from the beginning of the read sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear to me - does it just count from the end? Or does it also swap the order of "start" vs. "end"? I think you're saying that:

cigar.get_alignment_offsets(reverse=True) == cigar.reverse.get_alignment_offsets()

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes.


Returns:
A tuple (start, end), defining the start and end offsets of the aligned part
of the read. These offsets are 0-based and open-ended, with respect to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
of the read. These offsets are 0-based and open-ended, with respect to the
of the query. These offsets are 0-based and open-ended, with respect to the

beginning of the read sequence. (If 'reverse' is True, the offsets are with
respect to the end of the read sequence.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
respect to the end of the read sequence.)
respect to the end of the query sequence.)

If the Cigar contains no alignment operators that consume sequence bases, or
only clipping operators, the start and end offsets will be the same value
(indicating an empty region). This shared value will be the offset to the first
base consumed by a non-clipping operator or the length of the read sequence if
there is no such base.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that either a) this function should return Optional[...] and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue in favor of Optional vs raising an error, but would be comfortable either way


#
""" # TODO: figure out how to remove the '#' from the documentation above without
# breaking the build
start_offset: int = 0
end_offset: int = 0
element: CigarElement
alignment_began = False
elements = self.elements if not reverse else reversed(self.elements)
for element in elements:
if element.operator.is_clipping and not alignment_began:
# We are in the clipping operators preceding the alignment
start_offset += element.length_on_query
end_offset += element.length_on_query
elif not element.operator.is_clipping:
# We are within the alignment
alignment_began = True
end_offset += element.length_on_query
Comment on lines +618 to +621
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So ... I think you have to consider edge-case cigars here. What does this function report for:

  • 76D
  • 76I
  • 10P76S
  • 50S1000N50S

else:
# We have exited the alignment and are in the clipping operators after the alignment
break

return start_offset, end_offset

def __getitem__(
self, index: Union[int, slice]
) -> Union[CigarElement, Tuple[CigarElement, ...]]:
Expand Down
48 changes: 48 additions & 0 deletions tests/fgpyo/sam/test_cigar.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,51 @@ def test_bad_index_raises_index_error(index: int) -> None:
def test_bad_index_raises_type_error(index: int) -> None:
with pytest.raises(TypeError):
cigar[index]


@pytest.mark.parametrize(
("cigar_string", "start", "end"),
{
("10M", 0, 10),
("10M10I", 0, 20),
("10X10I", 0, 20),
("10X10D", 0, 10),
("10=10D", 0, 10),
("10S10M", 10, 20),
("10H10M", 0, 10),
("10H10S10M", 10, 20),
("10H10S10M5S", 10, 20),
("10H10S10M5S10H", 10, 20),
msto marked this conversation as resolved.
Show resolved Hide resolved
("10H", 0, 0),
("10S", 10, 10),
("10S10H", 10, 10),
("5H10S10H", 10, 10),
msto marked this conversation as resolved.
Show resolved Hide resolved
},
)
def test_get_alignments(cigar_string: str, start: int, end: int) -> None:
cig = Cigar.from_cigarstring(cigar_string)
assert Cigar.get_alignment_offsets(cig, reverse=False) == (start, end)


@pytest.mark.parametrize(
("cigar_string", "start", "end"),
{
("10M", 0, 10),
("10M10I", 0, 20),
("10X10I", 0, 20),
("10X10D", 0, 10),
("10=10D", 0, 10),
("10S10M", 0, 10),
("10H10M", 0, 10),
("10H10S10M", 0, 10),
("10H10S10M5S", 5, 15),
("10H10S10M5S10H", 5, 15),
msto marked this conversation as resolved.
Show resolved Hide resolved
("10H", 0, 0),
("10S", 10, 10),
("10S10H", 10, 10),
("5H10S10H", 10, 10),
},
)
def test_get_alignments_reversed(cigar_string: str, start: int, end: int) -> None:
cig = Cigar.from_cigarstring(cigar_string)
assert Cigar.get_alignment_offsets(cig, reverse=True) == (start, end)
Loading