Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-order terms in division to avoid periodic decimal and rounding issues in Centroid#add, fixing out-of-order centroids #209

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kinow
Copy link
Contributor

@kinow kinow commented May 8, 2023

Closes #169

Hi @tdunning 👋

I thought this issue was interesting and would allow me to understand more about the t-digest. Tweaking the values used in the tests, I found that changing the compression to 5 (in the factory() method) and using this would always reproduce the bug.

    public void testSorted2() {
        final TDigest digest = factory().create();
        ((AVLTreeDigest)digest).gen.setSeed(84);
        int[] weights = new int[] { 7 };
        double[] xs = new double[] {
                0.8046947099707735 };
        assertEquals(weights.length, xs.length);
        for (int i = 0; i < weights.length; ++i) {
            int w = weights[i];
            double x = xs[i];
            for (int j = 0; j < w; j++) {
                digest.add(x);
            }
        }
        Centroid previous = null;
        for (Centroid centroid : digest.centroids()) {
            if (previous != null) {
                assertTrue(previous.mean() <= centroid.mean());
            }
            previous = centroid;
        }
        System.out.println("OK!");
    }

Debugging a little more, I realized the issue was happening when the count[] array was updated, and the value of a centroid became 3 (search the line with count = 3, note how the value is out of order).

Screenshot from 2023-05-08 14-28-17

A little more debugging and I think I found what was happening. Apparently the Centroid#add method could end up having to calculate this centroid += w * (x - centroid) / count;, which would not always return the expected value (see the third/sixth/eleventh lines)

1 * (0.8046947099707735 - 0.0) / 1 = 0.8046947099707735
2 * (0.8046947099707735 - 0.0) / 2 = 0.8046947099707735
3 * (0.8046947099707735 - 0.0) / 3 = 0.8046947099707736
4 * (0.8046947099707735 - 0.0) / 4 = 0.8046947099707735
5 * (0.8046947099707735 - 0.0) / 5 = 0.8046947099707735
6 * (0.8046947099707735 - 0.0) / 6 = 0.8046947099707736
7 * (0.8046947099707735 - 0.0) / 7 = 0.8046947099707735
8 * (0.8046947099707735 - 0.0) / 8 = 0.8046947099707735
9 * (0.8046947099707735 - 0.0) / 9 = 0.8046947099707735
10 * (0.8046947099707735 - 0.0) / 10 = 0.8046947099707735
11 * (0.8046947099707735 - 0.0) / 11 = 0.8046947099707734

I believe it's due to multiplying by 3 to then divide by 3 not being computed in a stable way. I shuffled the terms in the expression, hopefully without changing what it was doing, and I think at least that error should now be fixed? I also tried to add a simple test to reproduce it.

Fixing this should produce centroids with the correct values and order, thus fixing #169

Cheers

Bruno

…al and rounding issues in Centroid#add, fixing out-of-order centroids
@kinow kinow changed the title Close #169 Re-order terms in division to avoid periodic decimal and rounding issues in Centroid#add, fixing out-of-order centroids Re-order terms in division to avoid periodic decimal and rounding issues in Centroid#add, fixing out-of-order centroids May 8, 2023
@kinow
Copy link
Contributor Author

kinow commented May 8, 2023

Not sure if the Windows errors are legit or just an issue with Windows GH actions.

@tdunning
Copy link
Owner

tdunning commented May 8, 2023

Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example. The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended. The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root.

In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit.

@kinow
Copy link
Contributor Author

kinow commented May 9, 2023

Your change might help, but there is a fundamental problem in AVLTreeDigest that order is not absolutely guaranteed due to an entire class of problems of which the one that you point out is only an example.

👍

The problem is far worse when you have several centroids with identical means. In such a case, any change to the mean should result in potentially large changes to the order in the tree, but the code doesn't handle this well. This (and my inability to find enough time to dig in on the problem) is one of the two major reasons that MergingDigest is strongly recommended.

Good to know!!

The other reason is that AVLTreeDigest hasn't had the serious attention to detail regarding interpolation special cases that MergingDigest has had. Again, scarcity of time is at the root

In any case, AVLTreeDigest requires dynamic allocation and there is no chance of a zero serialization reader or writer. These are secondary issues, but still weigh in a bit.

I believe that could explain why the other implementations I've seen so far appear to be significantly simpler than both AVL & Merge digests here. Looks to me like FaceBook's C++ impl, Arrow, CamDavidsonPilon/tdigest, the other one for Postgres, etc., have adapted and also simplified the code.

But any way, happy to help with the AVL if there are easy issues, otherwise I will probably start playing with the merge t-digest now 🙂

Let me know if you'd like me to close this PR, or leave it open (maybe as draft?) for some future work.

Thanks @tdunning !

@tdunning
Copy link
Owner

tdunning commented May 10, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AVLTreeDigest can have centroids out of order
2 participants