Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Workflow Surrounding Specification of Y Argument #122

Open
flyaflya opened this issue Oct 11, 2023 · 3 comments
Open

Better Workflow Surrounding Specification of Y Argument #122

flyaflya opened this issue Oct 11, 2023 · 3 comments

Comments

@flyaflya
Copy link

Background

The Modeling Heteroscedasticity with BART example showcases two amazing abilities of PyMC-BART:

  1. The ability to have non-normal error distributions for BART RV's like the pm.Gamma distribution that is used, and
  2. the ability to model functions other than the mean like the modeling of non-constant variance.

While the first ability is easily understood and a small step from existing work, the latter ability is both novel and amazing. I would love to use this ability in my work, I just think the implementation feels a little wonky to me for lack of a better word. Additionally, the documentation in pymc-bart’s paper [Quiroga et al., 2022] does not completely explain how this works.

The Issue

It is my understanding that when growing trees for computing estimated variance, every proposed tree has leaf values drawn from $N(\mu_{pred},\epsilon^2)$ where $\mu_{pred}$ is computed as the mean of the current sum of trees divided by the number of trees $m$. The problem I see is that $\mu_{pred}$ is (at least initially) based off of $Y$ values that seem inappropriate as initial values; the $Y$ values in the example are initial guesses at "sales" and are likely to be very BAD initial guesses for "standard deviation of sales".

For the example, these initial guesses of "sales" are much higher than one would guess for initial guesses of standard deviation. Hence, in this picture:

it is not surprising that the 94% HDI is too wide. We should expect around 12 observations falling outside of the 94% HDI band, but in reality only 2 or 3 observations fall outside of the band. I am guessing the estimate of standard deviation is systematically too high because of the initial conditions.

Thoughts on Implementation

So, with all of the above I am requesting:

  1. Allow for a PYMC RV prior (or multiple priors when size > 1) to be specified on leaf-node values, and/or
  2. Better document the existing functionality on how leaf-nodes are computed so at least the mechanism is more transperent.

Thanks for all the hard-work, plugging BART models in as components of larger probabilistic programs is very much a winning idea and should dominate applied workflows where uncertainty quantification is important. I would love to see this cleaned up a little bit and I am happy to help with documentation or code changes, but need more direction on the math/sampling side of things. Thanks again.

@aloctavodia
Copy link
Member

Thanks for the comments and interest in using and improving PyMC-BART, your feedback is very useful.

I have been thinking about this for some time, and I am still not sure about a good solution. Having to pass Y is already a source of confusion, and as you mention the mean and std of Y are not always good guesses for the initial values of the mean function and standard deviation of the proposal distribution at the leaf nodes. Having said that I remember the results being relatively robust to those initial guesses.

maybe a not-too-difficult-to-implement alternative could be to replace the "Y" argument by two arguments init_mean and init_sd, that can be scalars or vectors

I also agree that we need to improve the documentation. Any help you can offer will be more than welcome.

In the meantime, and if you have time, you can explore passing other values of Y using a made-up vector instead of Y, something like np.log(pz.Gamma(mu=3, sigma=1.5).rvs(200)), instead of np.log(Y).

@flyaflya
Copy link
Author

flyaflya commented Nov 7, 2023

Cool. Thanks for the reply. I will experiment with passing other values of Y to see what happens. I will also see if I can make a contribution to the documentation. Will keep you posted!

@aloctavodia
Copy link
Member

Thanks @flyaflya

As soon as I finish with some urgent tasks I will come back to this issue and rethinking a better, more flexible API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants