Add ICDF for the Kumaraswamy distribution #6642

gokuld · 2023-03-31T20:44:54Z

What is this PR about?
Adding the ICDF function for the Kumaraswamy distribution.
Issue: #6612
Creation of tests for the added ICDF function.

Reference: https://en.wikipedia.org/wiki/Kumaraswamy_distribution
SciPy does not have an implementation for the Kumaraswamy distribution so this ICDF was coded explicitly.

Checklist

Explain important implementation details 👆
Make sure that the pre-commit linting/style checks pass.
Link relevant issues (preferably in nice commit messages)
Are the changes covered by tests and docstrings?
Fill out the short summary sections 👇

Major / Breaking Changes

...

New features

ICDF function for the Kumaraswamy distribution.

Bugfixes

...

Documentation

...

Maintenance

Extended test_continuous.py with corresponding tests.

📚 Documentation preview 📚: https://pymc--6642.org.readthedocs.build/en/6642/

codecov · 2023-03-31T20:56:52Z

Codecov Report

Merging #6642 (a82fe4d) into main (f3df36b) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head a82fe4d differs from pull request most recent head 729c289. Consider uploading reports for the commit 729c289 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6642      +/-   ##
==========================================
+ Coverage   91.99%   92.01%   +0.02%     
==========================================
  Files          95       95              
  Lines       16205    16246      +41     
==========================================
+ Hits        14907    14948      +41     
  Misses       1298     1298

Impacted Files	Coverage Δ
pymc/distributions/continuous.py	`97.74% <100.00%> (+<0.01%)`	⬆️
pymc/testing.py	`92.64% <100.00%> (+0.61%)`	⬆️

ricardoV94 · 2023-04-01T08:15:45Z

tests/distributions/test_continuous.py

+        check_icdf(
+            pm.Kumaraswamy,
+            {"a": Rplus, "b": Rplus},
+            lambda q, a, b: (1 - (1 - q) ** (1 / b)) ** (1 / a),


Doesn't scipy have a ppf function? The point of this test is more to compare with an external reference to be sure we implemented it correctly.

I see you mention it doesn't. In that case an alternative would be to check that for some values exp(logcdf(dist, icdf(dist, value))) == value

Or something along those lines, basically testing that the inverse property holds.

You could add that as a util test function in testing.py.

We do something similar for the consistency between discrete logp and logcdf helper.

It's a bit more work but it will be very helpful for testing other distributions whose ppf is missing in scipy.

No, SciPy doesn't have a ppf function for the Kumaraswamy distribution.

Sure, I will look into adding the util test function for testing when scipy doesn't have an implementation. I will get back on this if I need help, or if not, with the next commit.

@ricardoV94 I added a util function for testing the inverse of the icdf matches exp logcdf;
However I found it is failing tests for small values of distribution parameter $b$ due to numerical issues.

Example test case where this is failing:

$a=1$, $b=0.01$, $value=0.4$
The icdf expression is:
$(1 - (1 - value) ^ \frac{1}{b}) ^ \frac{1}{a} $

For the above values for $a$ and $b$, the inner term:
$t = (1 - value) ^ \frac{1}{b}) = 0.6^{100} = 6.533186235000685e-23$

For such a small value of $t$, $1 - t$ is now represented as 1, due to numerical issues with adding 1 to such a small number. Thus the result of the ICDF expression, $(1 - t)^\frac{1}{a}$ is also 1.

This fails the self consistency test (between icdf and cdf) for small values of $b$.

Do you have any idea to make this numerically stable? Working with the logarithms of numbers doesn't make sense as there is addition happening in the ICDF expression which also causes the numerical stability issue.

The other alternative is to simply skip the tests for small values of $b$. I would like to avoid that if another way is possible.

Edit: I am now thinking on the lines of using scipy.special.powm1 or scipy.special.log1p.

How are you testing? We don't want to test equality directly, but using something like np.testing.assert_almost_equal or something that allows for some wiggle room.

It may make sense to compare que log of the icdf value if that's more stable.

But yes, we don't need to test values very close to 0 or 1. It would be good to check if the distributions that already have icdf pass as well.

I am already using np.testing.assert_almost_equal. This still doesn't work because the probability returned by the cdf of the value returned by the icdf, is $1$, instead of a value 0.4, because of the numerical issue.

It may make sense to compare que log of the icdf value if that's more stable.

log of the icdf does make sense, but we have to implement a function like logicdf everywhere. log(icdf) will not work as icdf already returns the less precise value, so it has to be implemented directly as logicdf.

But yes, we don't need to test values very close to 0 or 1.

Okay, in that case I will change the domain of testing to exclude small values of $b$.

It would be good to check if the distributions that already have icdf pass as well.

The other continuous distributions (uniform and normal) with icdf are passing the icdf self consistency test, even with small values of the distribution parameters.
For the discrete case, the test passed for the discrete uniform distribution, but failed for the geometric distribution for small values of the distribution parameter ($0.001$).

We don't need to implement log-icdf, that's just the log of icdf which we already have.

My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value), if that makes sense. Instead of exponentiating the logcdf

My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value)

No that doesn't work, I already tried it before I narrowed down to the root cause.

The numerical error starts at the output of icdf itself for very small parameters for distributions involving power terms with this parameter. This the log outside does not help.

I will proceed now by excluding very small values of the distribution parameters for the self consistency tests for icdf, if that sounds good.

Well actually I see that for some of the domains, domains excluding small parameters is already defined.
For example, using Rplusbig instead of Rplus for the a and b distribution parameters for the Kumaraswamy distribution already passed the test.

I could simply use these domains for the tests. Does this sound good?

Oh while this inverse icdf -cdf self test works for continuous distributions, it doesn't work in the same way for discrete distributions.
This is because there is a band of probability for which the icdf has the same value and this this need not match the probability from the cdf.

I will now implement another util test function for discrete distributions which checks that value = icdf(dist, exp(logcdf(dist, value))) for a sample of integer values in the domain of the discrete distribution.

ricardoV94

Shaping up really well!

Some suggestions below

ricardoV94 · 2023-04-02T05:59:27Z

pymc/distributions/continuous.py

@@ -1287,6 +1287,16 @@ def logcdf(value, a, b):
            msg="a > 0, b > 0",
        )

+    def icdf(value, a, b):
+        res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a)


Looking at this, would it help to compute this icdf on the log scale (to get rid of these exponents) and only exponentiate the final result?

Suggested change

res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a)

res = exp(log(1 - (1 - value) * pt.reciprocal(b) * pt.reciprocal(a))

Or something along those lines (might have made a mistake)

ricardoV94 · 2023-04-02T06:00:48Z

pymc/testing.py

@@ -212,6 +212,7 @@ def RandomPdMatrix(n):
 Rplusbig = Domain([0, 0.5, 0.9, 0.99, 1, 1.5, 2, 20, np.inf])
 Rminusbig = Domain([-np.inf, -2, -1.5, -1, -0.99, -0.9, -0.5, -0.01, 0])
 Unit = Domain([0, 0.001, 0.1, 0.5, 0.75, 0.99, 1])
+Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1])


The edges aren't included, do to test 0.1 you need to add something before

Suggested change

Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1])

Unitbig = Domain([0, 0.1, 0.5, 0.75, 0.99, 1])

Unitbig isn't used now, so I am discarding this.

It was added to test the self-consistency of ICDF-logcdf for the uniform distribution, but we are not using this as they have scipy ppf functions available.

ricardoV94 · 2023-04-02T06:03:27Z

tests/distributions/test_continuous.py

@@ -284,6 +289,10 @@ def test_normal(self):
            {"mu": R, "sigma": Rplus},
            lambda q, mu, sigma: st.norm.ppf(q, mu, sigma),
        )
+        check_selfconsistency_icdf(


We don't need to test this for scipy distributions that have ppf, I meant just for you to confirm it's working well

Done.
I've removed these tests where scipy ppf functions are available.

ricardoV94

Sorry, I forgot about this one.

Is #6669 a blocker for this PR?

pymc/distributions/continuous.py

ricardoV94 · 2023-04-13T10:48:17Z

pymc/testing.py

+    dist_logcdf_fn = compile_pymc(list(inputvars(dist_logcdf)), dist_logcdf)
+
+    domains = paramdomains.copy()
+    domains["value"] = Domain(np.linspace(0.1, 1, 100))


Probably enough?

Suggested change

domains["value"] = Domain(np.linspace(0.1, 1, 100))

domains["value"] = Domain(np.linspace(0, 1, 10))

gokuld · 2023-04-13T11:17:53Z

Sorry, I forgot about this one.

Is #6669 a blocker for this PR?

It is a blocker for discrete ICDF self consistency checks, whose code I haven't pushed yet.

It is not a blocker for code I have pushed so far in this PR, which includes continuous ICDF self consistency checks.

ricardoV94 · 2023-04-13T11:21:00Z

@gokuld How do you prefer to proceed? Get this done or wait and also get the discrete cases sorted out?

I would suggest we wait, but up to you :)

gokuld · 2023-04-13T11:21:34Z

Let us wait, no problem.

gokuld · 2023-04-29T06:06:55Z

@ricardoV94 I've added self consistency tests for discrete ICDF, now that the blocker #6671 is closed.

Though SciPy PPFs are available for the discrete uniform and discrete geometric distributions and used with check_icdf, I have still added the icdf-logcdf self consistency check for these two distributions as it caught a bug in the implementation of discrete uniform ICDF that was not caught by check_icdf.

I have also fixed the bug and pushed the commit here.

gokuld · 2023-04-29T08:04:28Z

pytest -v tests/distributions/test_discrete.py -k 'test_geometric' passes on my machine. Looking into this.

ricardoV94 · 2023-04-29T08:05:53Z

Try setting n_samples=-1 in check_logp and alike helpers to make sure all parameter combinations are tested locally

gokuld · 2023-04-29T08:20:23Z

Try setting n_samples=-1 in check_logp and alike helpers to make sure all parameter combinations are tested locally

It is passing locally even with n_samples=-1.

The failure upstream is happening in check_selfconsistency_discrete_icdf for some reason, for the discrete geometric distribution for p=0.75. The expected ICDF of CDF is 3 but is receiving 4. On my machine it is 3 all the way and passes the test.

ricardoV94 · 2023-04-29T08:28:04Z

Does this branch already include the fixes from the other PR?

gokuld · 2023-04-29T08:34:01Z

Does this branch already include the fixes from the other PR?

Forgot to sync my forked repo. :/ I have synced it now. Do you think this was the issue?

I can see commit history on this branch now, on Github with the fixes from the other PR.

ricardoV94 · 2023-04-29T09:34:26Z

You might want to rebase from main and then force push so only the new commits show up (if that's the problem). Make sure to checkout a backup branch first in case something goes wrong.

ricardoV94 · 2023-04-29T12:01:48Z

pymc/testing.py

+            computed_value = dist_icdf_fn(
+                **point, value=np.exp(dist_logcdf_fn(**point, value=value))
+            )
+            assert (


np.testing.assert_almost_equal will give you a nice printing when it fails out of the box

gokuld · 2023-04-29T17:26:47Z

@ricardoV94 This passed the tests in test_discrete.py now. Do you think we should proceed with the truncation approach, within check_selfconsistency_discrete_icdf ?

Some thoughts:

At the moment I check that ICDF(CDF(value) == value, where value is an integer for discrete distributions. It is possible to replace this test condition (note the swap) by CDF(ICDF(value) == value, here value being a real number 0 <= value <=1. However in this case, value is a real number and not an integer and has to be compared using np.testing.assert_almost_equal which effectively truncates the value to k places and checks. There seems to simple be no way to totally avoid the truncation during testing.
As an example, for the geometric distribution with p=0.75 and n=3, the CDF computed in a straightforward way without log and exp is $0.75 + (1 - 0.75) (0.75) + (1-0.75) ^2 0.75$ which is 0.984375. Especially given that log and exp are involved in the computation of the CDF via logcdf, it is possible to get a numerical value of the CDF like 0.9843750000000001 as obtained in the tests above. Whlie the ICDF of 0.984375 is 3, the ICDF of 0.9843750000000001 is 4, not 3. This is because the ICDF of q is the smallest integer whose CDF is greater than or equal to q .

This reverts commit b0c93b7. Avoid mypy test failure. This is not needed, we have identified the root cause now.

ricardoV94 · 2023-06-16T07:14:33Z

At the moment I check that ICDF(CDF(value) == value, where value is an integer for discrete distributions. It is possible to replace this test condition (note the swap) by CDF(ICDF(value) == value, here value being a real number 0 <= value <=1. However in this case, value is a real number and not an integer and has to be compared using np.testing.assert_almost_equal which effectively truncates the value to k places and checks. There seems to simple be no way to totally avoid the truncation during testing.

What is the disadvantage of the current approach ICDF(CDF)? If it works, let's keep it?

Does this PR include changes that were already committed elsewhere?

gokuld · 2023-06-16T08:26:24Z

What is the disadvantage of the current approach ICDF(CDF)? If it works, let's keep it?

One issue is that after computing the CDF, if the result has the tiniest numerical error, it can create a huge error in the ICDF (exactly a 1 integer size error). This is because ICDF is integer valued but is very sensitive at the places in the CDF graph where there is a vertical jump. (The CDF function has a staircase like appearance with multiple vertical jumps.) This means, ICDF(CDF) is not enough, but we need to truncate it this way: ICDF(truncate(CDF)).

And while truncating we need to decide the number of decimal places which is another magic number / parameter, makes it less elegant.

This kind of error, was the root cause I identified with the issue where it passed tests locally on my machine while failed on the github actions in the cloud. The difference in machine hardware could change the numerical value of the CDF ever so slightly and cause a huge integer change in the ICDF that is applied on top of it. I verified and solved this root cause, by adding truncation after CDF and before ICDF as discussed above ( ICDF(truncate(CDF))).

Does this PR include changes that were already committed elsewhere?

No these changes are made only in this PR.

ricardoV94 · 2023-06-16T08:39:28Z

Then let's do it the other way around. cdf(icdf(value))? You're saying that will be less discontinuous?

For your other question, its' fine to check with allclose and a specific rtol and atol. That's always the case when comparing floats.

gokuld · 2023-06-17T19:46:48Z

Then let's do it the other way around. cdf(icdf(value))? You're saying that will be less discontinuous?

An issue with this other way round cdf(icdf(value)) is that the CDF of discrete distributions lies in a set of discrete values. These values are dependent on the distribution itself (and it's parameters). When using a generic discrete set of CDF values for testing, cdf(icdf(value)) would 'snap' the value to the set of discrete possible values for the particular distribution, and value may not match cdf(icdf(value)). Because of this, the test would fail though it should pass.

gokuld force-pushed the kumaraswamy_icdf branch from bf69389 to b25c46b Compare April 1, 2023 08:13

ricardoV94 reviewed Apr 1, 2023

View reviewed changes

gokuld force-pushed the kumaraswamy_icdf branch from b25c46b to 9f9215b Compare April 1, 2023 08:25

ricardoV94 reviewed Apr 2, 2023

View reviewed changes

gokuld force-pushed the kumaraswamy_icdf branch from b90090e to bf4fb36 Compare April 3, 2023 19:46

ricardoV94 reviewed Apr 13, 2023

View reviewed changes

gokuld mentioned this pull request Apr 27, 2023

Improve numerical precision of discrete uniform and geometric ICDFs #6671

Merged

5 tasks

gokuld force-pushed the kumaraswamy_icdf branch from bf4fb36 to 22d2ad6 Compare April 29, 2023 06:03

gokuld force-pushed the kumaraswamy_icdf branch from 22d2ad6 to f754a9f Compare April 29, 2023 06:14

gokuld force-pushed the kumaraswamy_icdf branch 2 times, most recently from 244bf1e to 59e0c27 Compare April 29, 2023 09:50

ricardoV94 reviewed Apr 29, 2023

View reviewed changes

gokuld added 5 commits June 1, 2023 14:29

Added ICDF for the Kumaraswamy distribution.

69058cf

Added icdf - logcdf consistency tests.

fe5f30b

Tmp testing with np assert_almost_equal print messages.

005edbe

Testing a fix by truncating cdf to precision.

61f2de5

Revert "Tmp testing with np assert_almost_equal print messages."

729c289

This reverts commit b0c93b7. Avoid mypy test failure. This is not needed, we have identified the root cause now.

gokuld force-pushed the kumaraswamy_icdf branch from a82fe4d to 729c289 Compare June 1, 2023 09:07

ricardoV94 changed the title ~~Added ICDF for the Kumaraswamy distribution.~~ Add ICDF for the Kumaraswamy distribution Jun 16, 2023

ricardoV94 mentioned this pull request Jun 16, 2023

Add icdf functions for distributions #6612

Open

38 tasks

	res = (1 - (1 - value) pt.reciprocal(b)) pt.reciprocal(a)
	res = exp(log(1 - (1 - value) * pt.reciprocal(b) * pt.reciprocal(a))

	Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1])
	Unitbig = Domain([0, 0.1, 0.5, 0.75, 0.99, 1])

	domains["value"] = Domain(np.linspace(0.1, 1, 100))
	domains["value"] = Domain(np.linspace(0, 1, 10))

Add ICDF for the Kumaraswamy distribution #6642

Are you sure you want to change the base?

Add ICDF for the Kumaraswamy distribution #6642

Conversation

gokuld commented Mar 31, 2023 • edited by github-actions bot Loading

Major / Breaking Changes

New features

Bugfixes

Documentation

Maintenance

codecov bot commented Mar 31, 2023 • edited Loading

Codecov Report

Choose a reason for hiding this comment

ricardoV94 Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

gokuld Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

gokuld Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

ricardoV94 Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

gokuld Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gokuld Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gokuld Apr 1, 2023 • edited Loading

Choose a reason for hiding this comment

ricardoV94 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ricardoV94 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gokuld commented Apr 13, 2023 • edited Loading

ricardoV94 commented Apr 13, 2023

gokuld commented Apr 13, 2023 • edited Loading

gokuld commented Apr 29, 2023

gokuld commented Apr 29, 2023 • edited Loading

ricardoV94 commented Apr 29, 2023

gokuld commented Apr 29, 2023 • edited Loading

ricardoV94 commented Apr 29, 2023

gokuld commented Apr 29, 2023 • edited Loading

ricardoV94 commented Apr 29, 2023

Choose a reason for hiding this comment

gokuld commented Apr 29, 2023 • edited Loading

ricardoV94 commented Jun 16, 2023

gokuld commented Jun 16, 2023 • edited Loading

ricardoV94 commented Jun 16, 2023

gokuld commented Jun 17, 2023 • edited Loading

gokuld commented Mar 31, 2023 •

edited by github-actions bot

Loading

codecov bot commented Mar 31, 2023 •

edited

Loading

ricardoV94 Apr 1, 2023 •

edited

Loading

gokuld Apr 1, 2023 •

edited

Loading

gokuld Apr 1, 2023 •

edited

Loading

ricardoV94 Apr 1, 2023 •

edited

Loading

gokuld Apr 1, 2023 •

edited

Loading

gokuld Apr 1, 2023 •

edited

Loading

gokuld Apr 1, 2023 •

edited

Loading

ricardoV94 left a comment •

edited

Loading

gokuld commented Apr 13, 2023 •

edited

Loading

gokuld commented Apr 13, 2023 •

edited

Loading

gokuld commented Apr 29, 2023 •

edited

Loading

gokuld commented Apr 29, 2023 •

edited

Loading

gokuld commented Apr 29, 2023 •

edited

Loading

gokuld commented Apr 29, 2023 •

edited

Loading

gokuld commented Jun 16, 2023 •

edited

Loading

gokuld commented Jun 17, 2023 •

edited

Loading