Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend path algorithm (extend_edgesV2) #2938

Merged
merged 1 commit into from
Sep 18, 2024

Conversation

hfr1tz3
Copy link
Contributor

@hfr1tz3 hfr1tz3 commented Apr 25, 2024

New extension (pun intended) to the extend_edges algorithm now called extend_paths (subject to change). We noticed that certain examples within unary regions would not be extended simply with extend edges. We propose this algorithm to handle examples like the following:
For edge path $a \to b \to c$ in tree $T_1$ and edge $a \to d \to c$ in $T_2$, assuming $d\notin T_1$ and $b\notin T_2$ and $t(b)>t(d)$ we should have extended path $a\to b \to d \to c$ in tree $T_2$.
Unit tests still need to be built, and the algorithm needs to be cleaned.

We hypothesize that using this alongside extend edges could be the optimal strategy for minimizing edges in tree sequences.

Copy link

codecov bot commented Apr 25, 2024

Codecov Report

Attention: Patch coverage is 89.85507% with 35 lines in your changes missing coverage. Please review.

Project coverage is 89.72%. Comparing base (4170933) to head (659218e).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
c/tskit/trees.c 89.73% 20 Missing and 15 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2938      +/-   ##
==========================================
+ Coverage   89.71%   89.72%   +0.01%     
==========================================
  Files          29       29              
  Lines       31355    31567     +212     
  Branches     6077     6113      +36     
==========================================
+ Hits        28130    28324     +194     
- Misses       1840     1852      +12     
- Partials     1385     1391       +6     
Flag Coverage Δ
c-tests 86.55% <89.73%> (+0.06%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 89.01% <100.00%> (+0.01%) ⬆️
python-tests 99.01% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/_tskitmodule.c 89.01% <100.00%> (+0.01%) ⬆️
python/tskit/trees.py 98.78% <100.00%> (ø)
c/tskit/trees.c 90.45% <89.73%> (-0.02%) ⬇️

... and 1 file with indirect coverage changes

@petrelharp
Copy link
Contributor

petrelharp commented May 15, 2024

Proposal for the basis for a test: here is a definition of what should be extendable by this method; so we can check if there remain no extendable edges (when run until no further changes are made).

A right extendable path from $x$ to $y$ in a tree $T_k$ is a path $x &lt;_k a_1 &lt;_k \cdots &lt;_k a_n &lt;_k y$ (where $&lt;_k$ denotes "below in tree $T_k$" that has the properties:

  1. $x <{k+1} b_1 <{k+1} \cdots <{k+1} b_m <{k+1} y$ for some $b_i$ (the endpoints of the path are in the next tree)
  2. for $1 &lt; j &lt; n$, $a_j$ is not in $T_{k+1}$, and for $1 &lt; j &lt; m$, $b_j$ is not in $T_{k}$ (none of the intermediate nodes from the two paths are in the other tree).

The nodes $a_i$ are right extendable, and the nodes $b_i$ are left extendable.

Why is this the right definition? Well, first it's clear we can't extend nodes that are in the next tree. Next, suppose that there is a $b_j$ on the path from $x$ to $y$ in $T_{k+1}$ that is also in $T_k$, but not on the path from $x$ to $y$. If $b_j &lt;_k x$ then we can take $b_j$ as the other endpoint of the extendable path instead of $y$. If $b_j \nless_k x$, then that implies ancestry above $b_j$ was inherited along a different path back to the time of $x$, so we can't use inheritance from $x$ as a criteria for extending. A similar argument holds if $y &lt;_k b_j$ or and $y \nless_k b_j$.

@petrelharp
Copy link
Contributor

I've added two things:

  1. a simple check based on the previous comment for whether a ts is no longer extendable; so if our algorithm terminates before max_iter, then that test should agree; and
  2. an implementation of extend paths that just rebuilds the whole tree sequence after every tree, so is a lot simpler. (I called it "naive", but it's not actually simple enough to be 'naive'; I had to deal with quite a few fiddly things for it.) It's real slow, so can't be used on anything at all big.

Next step: compare it to the non-naive version.

@petrelharp
Copy link
Contributor

This is hopefully ready for you to look at, @nspope!

One question is that currently max_iter is required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)?

@hfr1tz3
Copy link
Contributor Author

hfr1tz3 commented Sep 5, 2024

One question is that currently max_iter is required; but, should we provide a default? Should we do a warning if it hasn't terminated (probably)?
I think that makes sense. I haven't checked if we each reach the max_iter value for the long sequences we look at, but I feel like it terminates before that value in most cases.

@petrelharp
Copy link
Contributor

Okay - now it's ready for reals, @nspope. The 'partial coverage' things that codecov flags are cases where we're like

if (A && B) { ... }

but then not all four combinations of A and B ever get seen. This let me identify one place that could be simplified, but I've looked at the others and I don't think any of the others can. They are getting hit by the python tests, and constructing C tests for all of them would be a Big Pain.

Copy link
Contributor

@nspope nspope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, caught a few places where return codes should be checked

c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Outdated Show resolved Hide resolved
c/tskit/trees.c Show resolved Hide resolved
@petrelharp petrelharp marked this pull request as ready for review September 13, 2024 20:10
@petrelharp
Copy link
Contributor

Okay - @jeromekelleher, I think this is ready for a quick look (and then we can merge it! 🎉)

@petrelharp
Copy link
Contributor

Wait two more things:

  • provide some kind of feedback for max_iter
  • one remaining uncovered code bit

c/tskit/trees.c Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

Okay - now I've legit hit everything we can with tests (and simplified some things along the way). @jeromekelleher - could you just throw a quick eyeball at this?

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Very nice, I like the new design. I spotted a few small problems and a couple of suggestions on layout.

If you're looking for performance, I think there's a modest gain on the table from taking pointers to stuff within a function rather than lots of self->x->y within loops. I wouldn't bother for now though, except in places where it'll also improve readability.

c/tskit/trees.c Outdated

tsk_memset(&edge_list_heap, 0, sizeof(edge_list_heap));
tsk_memset(&tree_pos, 0, sizeof(tree_pos));
double *sites_position = tables->sites.position;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should probably be const?

c/tskit/trees.c Outdated
self->edges_out_head = NULL;
self->edges_out_tail = NULL;

tsk_memset(&self->edge_list_heap, 0, sizeof(self->edge_list_heap));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to memset the edge_list_heap to 0 here as its already been done above when you zero'd the top level struct.

c/tskit/trees.c Outdated
return ret;
}

// static void
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to make the print_state a part of the public interface so that it can be tested and it stays up to date. I guess the struct here is not exposed in the header so there's no way to access it. We worked around this for simplify by adding a TSK_DEBUG option here

} else {
self->near_side = self->edges->right;
self->far_side = self->edges->left;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe allocate the new edge list node here and pass to edge_list_append_entry rather than passing around the edge_list_heap? It's not much extra code, and will save you some a few unreachable lines. It seemed odd to me on reading to pass the allocator in rather than the actual pointer.

c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.c Outdated

p_out = self->parent_out[c];
p_in = self->parent_in[c];
t_out = self->ts->tables->nodes.time[p_out];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - you don't want to be chasing 3 levels of pointers here in a loop. Grab a const pointer at the start of the function and the compiler will be able to do a way better job.

c/tskit/trees.c Outdated
edge_list_t *ex_in;
tsk_id_t e_in, c, e;
tsk_size_t num_edges;
tsk_bool_t *keep;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to initialise keep to NULL

@petrelharp
Copy link
Contributor

petrelharp commented Sep 18, 2024

Thanks for the pointer pointers! I think this is ready to merge (assuming tests pass).
Screenshot from 2024-09-17 22-08-27

@jeromekelleher
Copy link
Member

Excellent! I think you just need to cast to (int) for the printfs, give the commits a bit of a squash and it's ready to go in.

@petrelharp
Copy link
Contributor

Argh, yes - I just missed one cast. Ready now, hopefully!

renamed to extend_haplotypes
@petrelharp petrelharp added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 18, 2024
@jeromekelleher jeromekelleher merged commit dcee409 into tskit-dev:main Sep 18, 2024
19 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants