-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add cigar utility method get_alignment_offsets #188
base: main
Are you sure you want to change the base?
Conversation
yfarjoun
commented
Sep 19, 2024
- add cigar utility method get_alignment_offsets to find the offsets of the aligned segment in the read
- tests included
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, once the ruff/mypy CI stuff passes
fgpyo/sam/__init__.py
Outdated
Get the starting and ending offset for the alignment based on the CIGAR string. | ||
|
||
Arguments: | ||
reverse: bool Whether to count from the end of the read sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor nit: reverse (bool): Whether...
for consistency with other docstrings
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #188 +/- ##
==========================================
+ Coverage 89.35% 89.41% +0.06%
==========================================
Files 18 18
Lines 2094 2126 +32
Branches 464 471 +7
==========================================
+ Hits 1871 1901 +30
- Misses 146 148 +2
Partials 77 77 ☔ View full report in Codecov by Sentry. |
not sure what's going on the mkdocs. I had to comment out some lines to get it to compile.... if someone knows how to fix this please let me know... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a good addition to the package, thanks for adding it!
I think the documentation could be a bit clearer, and it looks like there's a bug when you have adjacent hard/soft clips
fgpyo/sam/__init__.py
Outdated
0-based, open-ended offsets (to the read sequence.) | ||
If 'reverse' is True, the offsets count from the end of the read sequence. | ||
|
||
# TODO: FIGURE OUT WHY the following three lines causes mkdocs to fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FIXME please fix and remove the TODO before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it like this because I was unable to resolve...when I uncomment mkdocs barfs....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 474, in _read_returns_section
annotation = annotation.slice.elements[index]
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
…of the aligned segment in the read - tests included
Co-authored-by: Matt Stone <[email protected]>
Co-authored-by: Matt Stone <[email protected]>
36be5bb
to
f371b6c
Compare
Thanks for the detailed review @msto , much appreciated! mkdocs is giving me grief and I was unable to make it play nice.... if someone knows why it barfs when I change the indentation or remove the final "comment" line... please let me know. this is the best (passing) state I could put it into.... |
fgpyo/sam/__init__.py
Outdated
@@ -368,7 +368,7 @@ class CigarOp(enum.Enum): | |||
code (int): The `~pysam` cigar operator code. | |||
character (int): The single character cigar operator. | |||
consumes_query (bool): True if this operator consumes query bases, False otherwise. | |||
consumes_target (bool): True if this operator consumes target bases, False otherwise. | |||
consumes_reference (bool): True if this operator consumes reference bases, False otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an unrelated fix. if deemed imporatant I could extract it to a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun This uses target
intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne While not all alignments are between a read and a reference, the CIGAR specification uses the terms "query" and "reference". I agree with Yossi, I find it clearer to have our documentation consistent with spec.
Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
alignment to step along the query sequence and the reference sequence respectively
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I was correcting a broken link.... if this is intentional, I'll bring it back and let's haved the discussion in a different PR (if at all)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% on different PR. Though I'll also argue there for maintaining "target".
Little known fact is that CIGAR actually predates SAM/BAM and Google believes it was first documented as part of exonerate which uses "query" and "target" 😛
fgpyo/sam/__init__.py
Outdated
@@ -368,7 +368,7 @@ class CigarOp(enum.Enum): | |||
code (int): The `~pysam` cigar operator code. | |||
character (int): The single character cigar operator. | |||
consumes_query (bool): True if this operator consumes query bases, False otherwise. | |||
consumes_target (bool): True if this operator consumes target bases, False otherwise. | |||
consumes_reference (bool): True if this operator consumes reference bases, False otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yfarjoun This uses target
intentionally - not all alignments are between a read and a reference - what if you align two reads vs. each other, or two contigs vs. each other?
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,48 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]: | |||
""" | |||
Get the starting and ending offsets for the alignment based on the CIGAR string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole docstring needs to be out-dented to start at the same column as the """
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this and mkdocs responds: 🤮
<SNIP>
^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
{% extends "_base/children.html.jinja" %}
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 73, in top-level template code
{% include class|get_template with context %}
^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/class.html.jinja", line 1, in top-level template code
{% extends "_base/class.html.jinja" %}
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 105, in top-level template code
{% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 179, in block 'contents'
{% block children scoped %}
^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/class.html.jinja", line 186, in block 'children'
{% include "children"|get_template with context %}
^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/children.html.jinja", line 1, in top-level template code
{% extends "_base/children.html.jinja" %}
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/children.html.jinja", line 94, in top-level template code
{% include function|get_template with context %}
^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/function.html.jinja", line 1, in top-level template code
{% extends "_base/function.html.jinja" %}
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 105, in top-level template code
{% block contents scoped %}
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'contents'
{% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/mkdocstrings_handlers/python/templates/material/_base/function.html.jinja", line 112, in block 'docstring'
{% block docstring scoped %}
^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/jinja2/environment.py", line 487, in getattr
return getattr(obj, attribute)
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/functools.py", line 993, in __get__
val = self.func(instance)
^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 119, in parsed
return self.parse()
^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/dataclasses.py", line 136, in parse
return parse(self, parser or self.parser, **(options or self.parser_options))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/parsers.py", line 41, in parse
return parsers[parser](docstring, **options) # type: ignore[operator]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 809, in parse
section, offset = reader(docstring, offset=offset + 1, **options) # type: ignore[operator]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/yossifarjoun/micromamba/envs/fgpyo/lib/python3.12/site-packages/griffe/docstrings/google.py", line 474, in _read_returns_section
annotation = annotation.slice.elements[index]
~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
IndexError: list index out of range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,48 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is quite ambiguous given that we're dealing with an alignment between a query and a target. How about one of the following:
get_query_alignment_offsets
get_query_alignment_bounds
get_query_alignment_range
Also: consider returning a range
object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean a python range ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions prefixed with get_
are usually reserved for true property getters.
Can we drop the prefix too?
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]: | |
def alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]: |
https://google.github.io/styleguide/pyguide.html#315-getters-and-setters
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,48 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def get_alignment_offsets(self, reverse: bool = False) -> Tuple[int, int]: | |||
""" | |||
Get the starting and ending offsets for the alignment based on the CIGAR string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get the starting and ending offsets for the alignment based on the CIGAR string. | |
Get the 0-based, end-exclusive positions of the first and last aligned base in the query. |
fgpyo/sam/__init__.py
Outdated
reverse: If True, count from the end of the read sequence. | ||
Otherwise (default behavior), count from the beginning of the read sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unclear to me - does it just count from the end? Or does it also swap the order of "start" vs. "end"? I think you're saying that:
cigar.get_alignment_offsets(reverse=True) == cigar.reverse.get_alignment_offsets()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes.
fgpyo/sam/__init__.py
Outdated
A tuple (start, end), defining the start and end offsets of the aligned part | ||
of the read. These offsets are 0-based and open-ended, with respect to the | ||
beginning of the read sequence. (If 'reverse' is True, the offsets are with | ||
respect to the end of the read sequence.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
respect to the end of the read sequence.) | |
respect to the end of the query sequence.) |
fgpyo/sam/__init__.py
Outdated
If the Cigar contains no alignment operators that consume sequence bases, or | ||
only clipping operators, the start and end offsets will be the same value | ||
(indicating an empty region). This shared value will be the offset to the first | ||
base consumed by a non-clipping operator or the length of the read sequence if | ||
there is no such base. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that either a) this function should return Optional[...]
and return None when invoked on a cigar without alignment operations, or raise an error in that situation. There is no aligned start/end if there are no alignment operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue in favor of Optional
vs raising an error, but would be comfortable either way
elif not element.operator.is_clipping: | ||
# We are within the alignment | ||
alignment_began = True | ||
end_offset += element.length_on_query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So ... I think you have to consider edge-case cigars here. What does this function report for:
76D
76I
10P76S
50S1000N50S
6ab8245
to
b07c3e4
Compare
fgpyo/sam/__init__.py
Outdated
@@ -547,6 +547,47 @@ def length_on_target(self) -> int: | |||
"""Returns the length of the alignment on the target sequence.""" | |||
return sum([elem.length_on_target for elem in self.elements]) | |||
|
|||
def query_alignment_offsets(self, reverse: bool = False) -> Optional[range]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne I disagree with the suggestion to use a range
object.
- This is an unconventional way to use
range
- The stated purpose of the function is to return the start and end positions of the alignment within the query sequence. I assume that in most contexts where this function will be used, the start and end positions will be used to slice the query sequence.
range
cannot be used to slice a string, nor does it support tuple unpacking.
I think it would be more ergonomic to return a tuple[int, int]
, though I wouldn't mind seeing a use case where range
provides advantages. (Nor would I be opposed to using a NamedTuple
with start
and end
attributes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opinionated here, but am noticing that range does have start-stop access, so in that regard it will be equivalent to the named NamedTuple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(almost stop!=end)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you can't unpack the range
i.e.
# this is valid
from typing import NamedTuple
class Offsets(NamedTuple):
start: int
end: int
offsets = Offsets(start=1, end=200)
start, end = offsets
# this is invalid
offsets = range(1, 200)
start, end = offsets
So when calling the method, you can either accept the tuple or unpack it immediately, depending on which makes more sense in your context
e.g. I imagine most use cases will look like the following
start, end = cigar.alignment_offsets()
aligned_sequence = query[start:end]
Args: | ||
reverse: If True, count from the end of the query. i.e. find the offsets | ||
using the reversed elements of the cigar. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -360,6 +361,46 @@ class _CigarOpUtil: | |||
} | |||
|
|||
|
|||
@dataclass(frozen=True) | |||
class RangeOfBases: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is clever but I still think returning a tuple[int, int]
is simpler and easier for the user