Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unweighted_rf_distance (symmetric_distance) function and tests #2643

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Billyzhang1229
Copy link

Add Unweighted Robinson–Foulds Distance (symmetric difference of two trees) function with tests

See also:

@codecov
Copy link

codecov bot commented Nov 21, 2022

Codecov Report

Merging #2643 (16b0767) into main (fd72573) will decrease coverage by 9.02%.
The diff coverage is 12.50%.

❗ Current head 16b0767 differs from pull request most recent head 9894a2f. Consider uploading reports for the commit 9894a2f to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2643      +/-   ##
==========================================
- Coverage   89.83%   80.82%   -9.02%     
==========================================
  Files          30       29       -1     
  Lines       28613    27798     -815     
  Branches     5587     1316    -4271     
==========================================
- Hits        25705    22467    -3238     
- Misses       1654     5235    +3581     
+ Partials     1254       96    -1158     
Flag Coverage Δ
c-tests 92.26% <ø> (+6.01%) ⬆️
lwt-tests 89.05% <ø> (+8.91%) ⬆️
python-c-tests 71.11% <12.50%> (+3.93%) ⬆️
python-tests ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 45.91% <12.50%> (-52.85%) ⬇️

... and 24 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd72573...9894a2f. Read the comment docs.

@benjeffery
Copy link
Member

benjeffery commented Jan 23, 2023

@Billyzhang1229 The docs error here is fixed in main you just need to rebase. I can do that if needed - just ask.

@Billyzhang1229
Copy link
Author

@Billyzhang1229 The docs error here is fixed in main you just need to rebase. I can do that if needed - just ask.

@benjeffery Thank you so much! I think I could do the rebase part. But I have question about the coverage report, the report shows some decrease for the files that I didn't change, do I need to worry about it?

@benjeffery
Copy link
Member

@Billyzhang1229 The docs error here is fixed in main you just need to rebase. I can do that if needed - just ask.

@benjeffery Thank you so much! I think I could do the rebase part. But I have question about the coverage report, the report shows some decrease for the files that I didn't change, do I need to worry about it?

codecov has been pretty unreliable in reporting numbers. The comments on the PR are usually good. I'll double check this before merge, so don't worry about it.

@Billyzhang1229 Billyzhang1229 marked this pull request as ready for review January 23, 2023 22:53
@jeromekelleher
Copy link
Member

This looks great @Billyzhang1229, thanks!

I think the main thing we need here is a bit more testing to reassure ourselves that what we're computing is the same as what'd get using existing methods. How about a small simulation based test like:

@pytest.mark.parametrize("n", [2, 3, 5, 10, 20])
def test_rf_distance_against_dendropy(n):
      trees = []
      for seed in [42, 43]:
           ts = msprime.sim_ancestry(n, ploidy=1, random_seed=seed)
           trees.append(ts.first())
       rf1 = trees[0].rf_distance(trees[1])
       rf2 = dendropy_rf_distance(*trees)
       assert rf1 == rf2

Other than that, we just need to think about the naming and be a bit more precise about the documentation - but I think myself and @hyanwong can help with that. @hyanwong - thoughts about function naming here?

@Billyzhang1229
Copy link
Author

@jeromekelleher Thanks for the feedback! I agree that additional testing would be beneficial to ensure the accuracy of our computations.

Naming

I am currently using symmetric_distance() and unweighted_rf_distance() for function names, the proper name should be unweighted_rf_distance. The term symmetric_distance is a made-up term used to represent the symmetric difference between trees.

I'm not sure which one is better for naming. So I just wrapped the symmetric_distance() with unweighted_rf_distance().

def unweighted_rf_distance(self, other):
    return self.symmetric_distance(other)

In the future, we may also want to consider adding a weighted_rf_distance() function or simply rf_distance()?


If there is anything else you would like me to add or change, please let me know.

@jeromekelleher
Copy link
Member

I think rf_distance is a good because it aligns with our existing Tree.kc_distance method. We can document that it's the unweighted distance, and if we like later can allow for the weighted version via a weighted argument.

I don't think symmetric_distance as a synonym helps much and I would drop it, as people are familiar with RF distance.

We will need to be quite precise about what we're computing here in the documentation, but I can do that in a follow-on commit.

@hyanwong
Copy link
Member

I agree with Jerome. rf_distance allows us to implement rf_distance(weighted=True) later. Everyone calls it the RF distance. We can mention the term "symmetric_distance" in the docstring to make it searchable.

@hyanwong
Copy link
Member

In fact, if we want to, we could allow the weighted parameter and simply raise a NotImplementedError if it's True, simply to give us a reminder to implement it every time we use it! But that's probably an abuse of the API :)

@jeromekelleher
Copy link
Member

In fact, if we want to, we could allow the weighted parameter and simply raise a NotImplementedError if it's True, simply to give us a reminder to implement it every time we use it! But that's probably an abuse of the API :)

I don't see much point in this - it's forward compatible by not having the argument and we don't need it for this PR.

Add more tests using other existing package
@hyanwong
Copy link
Member

Sorry that we haven't got around to merging this yet @Billyzhang1229. It would be great to get this in, and I see you've added extra tests. Do you know how this performs in the case of trees with polytomies? There's an interesting publication at https://onlinelibrary.wiley.com/doi/full/10.1111/cla.12540 that sounds like it might give some useful pointers in this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants