Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ICDF for the Kumaraswamy distribution #6642

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

gokuld
Copy link
Contributor

@gokuld gokuld commented Mar 31, 2023

What is this PR about?
Adding the ICDF function for the Kumaraswamy distribution.
Issue: #6612
Creation of tests for the added ICDF function.

Reference: https://en.wikipedia.org/wiki/Kumaraswamy_distribution
SciPy does not have an implementation for the Kumaraswamy distribution so this ICDF was coded explicitly.

Checklist

Major / Breaking Changes

  • ...

New features

  • ICDF function for the Kumaraswamy distribution.

Bugfixes

  • ...

Documentation

  • ...

Maintenance

  • Extended test_continuous.py with corresponding tests.

📚 Documentation preview 📚: https://pymc--6642.org.readthedocs.build/en/6642/

@codecov
Copy link

codecov bot commented Mar 31, 2023

Codecov Report

Merging #6642 (a82fe4d) into main (f3df36b) will increase coverage by 0.02%.
The diff coverage is 100.00%.

❗ Current head a82fe4d differs from pull request most recent head 729c289. Consider uploading reports for the commit 729c289 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #6642      +/-   ##
==========================================
+ Coverage   91.99%   92.01%   +0.02%     
==========================================
  Files          95       95              
  Lines       16205    16246      +41     
==========================================
+ Hits        14907    14948      +41     
  Misses       1298     1298              
Impacted Files Coverage Δ
pymc/distributions/continuous.py 97.74% <100.00%> (+<0.01%) ⬆️
pymc/testing.py 92.64% <100.00%> (+0.61%) ⬆️

check_icdf(
pm.Kumaraswamy,
{"a": Rplus, "b": Rplus},
lambda q, a, b: (1 - (1 - q) ** (1 / b)) ** (1 / a),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't scipy have a ppf function? The point of this test is more to compare with an external reference to be sure we implemented it correctly.

Copy link
Member

@ricardoV94 ricardoV94 Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you mention it doesn't. In that case an alternative would be to check that for some values exp(logcdf(dist, icdf(dist, value))) == value

Or something along those lines, basically testing that the inverse property holds.

You could add that as a util test function in testing.py.

We do something similar for the consistency between discrete logp and logcdf helper.

It's a bit more work but it will be very helpful for testing other distributions whose ppf is missing in scipy.

Copy link
Contributor Author

@gokuld gokuld Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, SciPy doesn't have a ppf function for the Kumaraswamy distribution.

Sure, I will look into adding the util test function for testing when scipy doesn't have an implementation. I will get back on this if I need help, or if not, with the next commit.

Copy link
Contributor Author

@gokuld gokuld Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricardoV94 I added a util function for testing the inverse of the icdf matches exp logcdf;
However I found it is failing tests for small values of distribution parameter $b$ due to numerical issues.

Example test case where this is failing:

$a=1$, $b=0.01$, $value=0.4$
The icdf expression is:
$(1 - (1 - value) ^ \frac{1}{b}) ^ \frac{1}{a} $

For the above values for $a$ and $b$, the inner term:
$t = (1 - value) ^ \frac{1}{b}) = 0.6^{100} = 6.533186235000685e-23$

For such a small value of $t$, $1 - t$ is now represented as 1, due to numerical issues with adding 1 to such a small number. Thus the result of the ICDF expression, $(1 - t)^\frac{1}{a}$ is also 1.

This fails the self consistency test (between icdf and cdf) for small values of $b$.

Do you have any idea to make this numerically stable? Working with the logarithms of numbers doesn't make sense as there is addition happening in the ICDF expression which also causes the numerical stability issue.

The other alternative is to simply skip the tests for small values of $b$. I would like to avoid that if another way is possible.

Edit: I am now thinking on the lines of using scipy.special.powm1 or scipy.special.log1p.

Copy link
Member

@ricardoV94 ricardoV94 Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are you testing? We don't want to test equality directly, but using something like np.testing.assert_almost_equal or something that allows for some wiggle room.

It may make sense to compare que log of the icdf value if that's more stable.

But yes, we don't need to test values very close to 0 or 1. It would be good to check if the distributions that already have icdf pass as well.

Copy link
Contributor Author

@gokuld gokuld Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am already using np.testing.assert_almost_equal. This still doesn't work because the probability returned by the cdf of the value returned by the icdf, is $1$, instead of a value 0.4, because of the numerical issue.

It may make sense to compare que log of the icdf value if that's more stable.

log of the icdf does make sense, but we have to implement a function like logicdf everywhere. log(icdf) will not work as icdf already returns the less precise value, so it has to be implemented directly as logicdf.

But yes, we don't need to test values very close to 0 or 1.

Okay, in that case I will change the domain of testing to exclude small values of $b$.

It would be good to check if the distributions that already have icdf pass as well.

The other continuous distributions (uniform and normal) with icdf are passing the icdf self consistency test, even with small values of the distribution parameters.
For the discrete case, the test passed for the discrete uniform distribution, but failed for the geometric distribution for small values of the distribution parameter ($0.001$).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to implement log-icdf, that's just the log of icdf which we already have.

My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value), if that makes sense. Instead of exponentiating the logcdf

Copy link
Contributor Author

@gokuld gokuld Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My proposal was to compare logcdf(dist, icdf(dist, value)) = log(value)

No that doesn't work, I already tried it before I narrowed down to the root cause.

The numerical error starts at the output of icdf itself for very small parameters for distributions involving power terms with this parameter. This the log outside does not help.

I will proceed now by excluding very small values of the distribution parameters for the self consistency tests for icdf, if that sounds good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well actually I see that for some of the domains, domains excluding small parameters is already defined.
For example, using Rplusbig instead of Rplus for the a and b distribution parameters for the Kumaraswamy distribution already passed the test.

I could simply use these domains for the tests. Does this sound good?

Copy link
Contributor Author

@gokuld gokuld Apr 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh while this inverse icdf -cdf self test works for continuous distributions, it doesn't work in the same way for discrete distributions.
This is because there is a band of probability for which the icdf has the same value and this this need not match the probability from the cdf.

I will now implement another util test function for discrete distributions which checks that value = icdf(dist, exp(logcdf(dist, value))) for a sample of integer values in the domain of the discrete distribution.

Copy link
Member

@ricardoV94 ricardoV94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shaping up really well!

Some suggestions below

@@ -1287,6 +1287,16 @@ def logcdf(value, a, b):
msg="a > 0, b > 0",
)

def icdf(value, a, b):
res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, would it help to compute this icdf on the log scale (to get rid of these exponents) and only exponentiate the final result?

Suggested change
res = (1 - (1 - value) ** pt.reciprocal(b)) ** pt.reciprocal(a)
res = exp(log(1 - (1 - value) * pt.reciprocal(b) * pt.reciprocal(a))

Or something along those lines (might have made a mistake)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pymc/testing.py Outdated
@@ -212,6 +212,7 @@ def RandomPdMatrix(n):
Rplusbig = Domain([0, 0.5, 0.9, 0.99, 1, 1.5, 2, 20, np.inf])
Rminusbig = Domain([-np.inf, -2, -1.5, -1, -0.99, -0.9, -0.5, -0.01, 0])
Unit = Domain([0, 0.001, 0.1, 0.5, 0.75, 0.99, 1])
Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The edges aren't included, do to test 0.1 you need to add something before

Suggested change
Unitbig = Domain([0.1, 0.5, 0.75, 0.99, 1])
Unitbig = Domain([0, 0.1, 0.5, 0.75, 0.99, 1])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unitbig isn't used now, so I am discarding this.

It was added to test the self-consistency of ICDF-logcdf for the uniform distribution, but we are not using this as they have scipy ppf functions available.

@@ -284,6 +289,10 @@ def test_normal(self):
{"mu": R, "sigma": Rplus},
lambda q, mu, sigma: st.norm.ppf(q, mu, sigma),
)
check_selfconsistency_icdf(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to test this for scipy distributions that have ppf, I meant just for you to confirm it's working well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
I've removed these tests where scipy ppf functions are available.

Copy link
Member

@ricardoV94 ricardoV94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot about this one.

Is #6669 a blocker for this PR?

pymc/distributions/continuous.py Show resolved Hide resolved
pymc/testing.py Outdated
dist_logcdf_fn = compile_pymc(list(inputvars(dist_logcdf)), dist_logcdf)

domains = paramdomains.copy()
domains["value"] = Domain(np.linspace(0.1, 1, 100))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably enough?

Suggested change
domains["value"] = Domain(np.linspace(0.1, 1, 100))
domains["value"] = Domain(np.linspace(0, 1, 10))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@gokuld
Copy link
Contributor Author

gokuld commented Apr 13, 2023

Sorry, I forgot about this one.

Is #6669 a blocker for this PR?

It is a blocker for discrete ICDF self consistency checks, whose code I haven't pushed yet.

It is not a blocker for code I have pushed so far in this PR, which includes continuous ICDF self consistency checks.

@ricardoV94
Copy link
Member

@gokuld How do you prefer to proceed? Get this done or wait and also get the discrete cases sorted out?

I would suggest we wait, but up to you :)

@gokuld
Copy link
Contributor Author

gokuld commented Apr 13, 2023

Let us wait, no problem.

@gokuld
Copy link
Contributor Author

gokuld commented Apr 29, 2023

@ricardoV94 I've added self consistency tests for discrete ICDF, now that the blocker #6671 is closed.

Though SciPy PPFs are available for the discrete uniform and discrete geometric distributions and used with check_icdf, I have still added the icdf-logcdf self consistency check for these two distributions as it caught a bug in the implementation of discrete uniform ICDF that was not caught by check_icdf.

I have also fixed the bug and pushed the commit here.

@gokuld
Copy link
Contributor Author

gokuld commented Apr 29, 2023

pytest -v tests/distributions/test_discrete.py -k 'test_geometric' passes on my machine. Looking into this.

@ricardoV94
Copy link
Member

Try setting n_samples=-1 in check_logp and alike helpers to make sure all parameter combinations are tested locally

@gokuld
Copy link
Contributor Author

gokuld commented Apr 29, 2023

Try setting n_samples=-1 in check_logp and alike helpers to make sure all parameter combinations are tested locally

It is passing locally even with n_samples=-1.

The failure upstream is happening in check_selfconsistency_discrete_icdf for some reason, for the discrete geometric distribution for p=0.75. The expected ICDF of CDF is 3 but is receiving 4. On my machine it is 3 all the way and passes the test.

@ricardoV94
Copy link
Member

Does this branch already include the fixes from the other PR?

@gokuld
Copy link
Contributor Author

gokuld commented Apr 29, 2023

Does this branch already include the fixes from the other PR?

Forgot to sync my forked repo. :/ I have synced it now. Do you think this was the issue?

I can see commit history on this branch now, on Github with the fixes from the other PR.

@ricardoV94
Copy link
Member

You might want to rebase from main and then force push so only the new commits show up (if that's the problem). Make sure to checkout a backup branch first in case something goes wrong.

@gokuld gokuld force-pushed the kumaraswamy_icdf branch 2 times, most recently from 244bf1e to 59e0c27 Compare April 29, 2023 09:50
computed_value = dist_icdf_fn(
**point, value=np.exp(dist_logcdf_fn(**point, value=value))
)
assert (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.testing.assert_almost_equal will give you a nice printing when it fails out of the box

@gokuld
Copy link
Contributor Author

gokuld commented Apr 29, 2023

@ricardoV94 This passed the tests in test_discrete.py now. Do you think we should proceed with the truncation approach, within check_selfconsistency_discrete_icdf ?

Some thoughts:

  • At the moment I check that ICDF(CDF(value) == value, where value is an integer for discrete distributions. It is possible to replace this test condition (note the swap) by CDF(ICDF(value) == value, here value being a real number 0 <= value <=1. However in this case, value is a real number and not an integer and has to be compared using np.testing.assert_almost_equal which effectively truncates the value to k places and checks. There seems to simple be no way to totally avoid the truncation during testing.
  • As an example, for the geometric distribution with p=0.75 and n=3, the CDF computed in a straightforward way without log and exp is $0.75 + (1 - 0.75) (0.75) + (1-0.75) ^2 0.75$ which is 0.984375. Especially given that log and exp are involved in the computation of the CDF via logcdf, it is possible to get a numerical value of the CDF like 0.9843750000000001 as obtained in the tests above. Whlie the ICDF of 0.984375 is 3, the ICDF of 0.9843750000000001 is 4, not 3. This is because the ICDF of q is the smallest integer whose CDF is greater than or equal to q .

@ricardoV94
Copy link
Member

At the moment I check that ICDF(CDF(value) == value, where value is an integer for discrete distributions. It is possible to replace this test condition (note the swap) by CDF(ICDF(value) == value, here value being a real number 0 <= value <=1. However in this case, value is a real number and not an integer and has to be compared using np.testing.assert_almost_equal which effectively truncates the value to k places and checks. There seems to simple be no way to totally avoid the truncation during testing.

What is the disadvantage of the current approach ICDF(CDF)? If it works, let's keep it?

Does this PR include changes that were already committed elsewhere?

@ricardoV94 ricardoV94 changed the title Added ICDF for the Kumaraswamy distribution. Add ICDF for the Kumaraswamy distribution Jun 16, 2023
@gokuld
Copy link
Contributor Author

gokuld commented Jun 16, 2023

What is the disadvantage of the current approach ICDF(CDF)? If it works, let's keep it?

One issue is that after computing the CDF, if the result has the tiniest numerical error, it can create a huge error in the ICDF (exactly a 1 integer size error). This is because ICDF is integer valued but is very sensitive at the places in the CDF graph where there is a vertical jump. (The CDF function has a staircase like appearance with multiple vertical jumps.) This means, ICDF(CDF) is not enough, but we need to truncate it this way: ICDF(truncate(CDF)).

And while truncating we need to decide the number of decimal places which is another magic number / parameter, makes it less elegant.

This kind of error, was the root cause I identified with the issue where it passed tests locally on my machine while failed on the github actions in the cloud. The difference in machine hardware could change the numerical value of the CDF ever so slightly and cause a huge integer change in the ICDF that is applied on top of it. I verified and solved this root cause, by adding truncation after CDF and before ICDF as discussed above ( ICDF(truncate(CDF))).

Does this PR include changes that were already committed elsewhere?

No these changes are made only in this PR.

@ricardoV94
Copy link
Member

Then let's do it the other way around. cdf(icdf(value))? You're saying that will be less discontinuous?

For your other question, its' fine to check with allclose and a specific rtol and atol. That's always the case when comparing floats.

@gokuld
Copy link
Contributor Author

gokuld commented Jun 17, 2023

Then let's do it the other way around. cdf(icdf(value))? You're saying that will be less discontinuous?

An issue with this other way round cdf(icdf(value)) is that the CDF of discrete distributions lies in a set of discrete values. These values are dependent on the distribution itself (and it's parameters). When using a generic discrete set of CDF values for testing, cdf(icdf(value)) would 'snap' the value to the set of discrete possible values for the particular distribution, and value may not match cdf(icdf(value)). Because of this, the test would fail though it should pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants