Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel sampling error when sampling trace #3884

Closed
nitishp25 opened this issue Apr 18, 2020 · 15 comments
Closed

Parallel sampling error when sampling trace #3884

nitishp25 opened this issue Apr 18, 2020 · 15 comments

Comments

@nitishp25
Copy link

Description of your problem

The following code gives an error when using PyMC3 development branch.

Please provide a minimal, self-contained, and reproducible example.

import numpy as np
import arviz as az
import pymc3 as pm
import time
start_time = time.time()
np.random.seed(42)

true_m = 0.5
true_b = -1.3
true_logs = np.log(0.3)
x = np.sort(np.random.uniform(0, 5, 20000))
y = true_b + true_m * x + np.exp(true_logs) * np.random.randn(len(x))

with pm.Model() as model:
    m = pm.Uniform("m", lower=-5, upper=5)
    b = pm.Uniform("b", lower=-5, upper=5)
    logs = pm.Uniform("logs", lower=-5, upper=5)
    pm.Normal("obs", mu=m*x+b, sd=pm.math.exp(logs), observed=y)

    print("Getting trace...")
    trace = pm.sample(draws=1000, tune=1000, chains=2)

    print("Getting posterior_predictive...")
    posterior_predictive = pm.sample_posterior_predictive(trace)

    print("Getting idata...")
    idata = az.from_pymc3(
        trace=trace, posterior_predictive=posterior_predictive
    )

print("--- %s seconds ---" % (time.time() - start_time))

Please provide the full traceback.

Traceback (most recent call last):
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 128, in run
    self._start_loop()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 182, in _start_loop
    point, stats = self._compute_point()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 209, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 263, in step
    apoint, stats = self.astep(array)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 255, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `logs_interval__`.ravel()[0] is zero.
The derivative of RV `b_interval__`.ravel()[0] is zero.
The derivative of RV `m_interval__`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 128, in run
    self._start_loop()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 182, in _start_loop
    point, stats = self._compute_point()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 209, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 263, in step
    apoint, stats = self.astep(array)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 255, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `logs_interval__`.ravel()[0] is zero.
The derivative of RV `b_interval__`.ravel()[0] is zero.
The derivative of RV `m_interval__`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "logp_benchmarks.py", line 21, in <module>
    trace = pm.sample(draws=1000, tune=1000, chains=2)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/sampling.py", line 498, in sample
    trace = _mp_sample(**sample_args)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/sampling.py", line 1384, in _mp_sample
    for draw in sampler:
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 412, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 314, in recv_draw
    raise error from old_error
RuntimeError: Chain 1 failed.

Please provide any additional information below.

Note that this code originally worked for @OriolAbril on an older version of PyMC3 and still works for me on PyMC3 3.8 but using the development branch causes this error.

Versions and main components

  • PyMC3 Version: 3.8 (development branch)
  • Theano Version: 1.0.4
  • Python Version: 3.6.9
  • Operating system: Linux (Ubuntu)
  • How did you install PyMC3: pip
@AlexAndorra
Copy link
Contributor

AlexAndorra commented Apr 18, 2020

Hi Nitish,
I think this is a modeling error more than a bug: the priors are not regularizing enough. Uniform priors are usually not a good practice, especially when there is an exponential somewhere (numbers get really big, really fast and this messes up the sampler).

Knowing that, this model runs with PyMC3 master branch:

with pm.Model() as model:
    m = pm.Normal("m", 0., 1.)
    b = pm.Normal("b", 0., 1.5)
    sigma = pm.Exponential("sigma", 1.)
    
    pm.Normal("obs", mu=m * x + b, sigma=sigma observed=y)
    trace = pm.sample(chains=2)

Does this work for you too?

@nitishp25
Copy link
Author

nitishp25 commented Apr 18, 2020

Hi Nitish,
I think this is a modeling error more than a bug: the priors are not regularizing enough. Uniform priors are usually not a good practice, especially when there is an exponential somewhere (numbers get really big, really fast and this messes up the sampler).

Knowing that, this model runs with PyMC3 master branch:

with pm.Model() as model:
    m = pm.Normal("m", 0., 1.)
    b = pm.Normal("b", 0., 1.5)
    sigma = pm.Exponential("sigma", 1.)
    
    pm.Normal("obs", mu=m * x + b, sigma=sigma observed=y)
    trace = pm.sample(chains=2)

Does this work for you too? If yes, then we can close this issue.

Hi @AlexAndorra, thanks for the reply!

I understand now and yes this works for me but why was the original model still working in v3.8?

@michaelosthege
Copy link
Member

Could the occurence of ValueError: Mass matrix contains zeros on the diagonal. have something to do with changes related to new QuadPotentialFullAdapt features? @eigenfoo might know more about it

@AlexAndorra
Copy link
Contributor

AlexAndorra commented Apr 18, 2020

I can confirm that the original model runs with 3.8, so it looks like something happened on the dev branch, as Michael pointed out.

EDIT: There indeed seems to be a problem: this model from the documentation doesn't run anymore (same error as above), even with more regularizing priors centered on the true values.

@AlexAndorra
Copy link
Contributor

After exploring further, I noticed the following simple model (which comes from this doc NB) only runs when testval and init="adapt_diag" are supplied:

N = 10000
μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3], [-0.3, 1.]])
x = np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

with pm.Model() as m:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.Exponential.dist(1.0))
    L = pm.expand_packed_triangular(2, packed_L)
    Σ = pm.Deterministic('Σ', L.dot(L.T))
    
    μ = pm.Normal('μ', 0., 10., shape=2, testval=x.mean(axis=0))
    
    obs = pm.MvNormal('obs', μ, chol=L, observed=x)
    trace = pm.sample(init="adapt_diag")
  • Is it normal behavior, or should this model run without having to supply these two args?
  • If you don't supply both args, you get ValueError: Mass matrix contains zeros on the diagonal -- an error that seems to be more frequent on the master branch. Could that be related to recent changes on master?
    As @michaelosthege suggested, @eigenfoo and @junpenglao are probably the most knowledgeable on this. Tell me if anything is unclear, and thanks a lot in advance!

@rpgoldman
Copy link
Contributor

I have seen this many times, as well, to the point where I routinely supply 'adapt_diag' to overwrite the default 'jitter+adapt_diag'. I assume that this happens because PyMC3 can jitter the initial point to a place in the space that's infeasible?

@AlexAndorra
Copy link
Contributor

Yeah I think that's why. What's puzzling to me though, is that this same model runs smoothly in the publicly released 3.8, without having to change the initialization and supply testval 🤔

@junpenglao
Copy link
Member

I cannot reproduce your example @AlexAndorra
Also I dont think we are using QuadPotentialFullAdapt by default right now? it should still be adapt_diag (with full adapt as a new feature)

@AlexAndorra
Copy link
Contributor

@junpenglao, do you mean that for you the model above samples on the master branch without having to specify testval=x.mean(axis=0) and init="adapt_diag"?
For me, if I don't specify those, I get ValueError: Mass matrix contains zeros on the diagonal on the master branch. But it runs on 3.8.

@rpgoldman
Copy link
Contributor

I wonder -- is there some way to take an existing run that has this problem and capture the random seed, so that we can replicate the behavior we are seeeing?

It's frustrating to me that I see this problem all the time, but it can't be replicated, and so cannot be fixed.

@OriolAbril
Copy link
Member

If there is, it would be great to add how to do it here or somewhere else and add the link in arviz-devs/arviz#1083 too

@nkaimcaudle
Copy link

This code fails for me every time I run it.

Chain 2 fails after between 600 and 1400 total samples. I guess the number of completed samples from chains 0, 1 & 3 vary each time I run before the chain 2 failure.

I updated from PyMC3 master yesterday

N = 1000
μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3], [-0.3, 1.]])
x = np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

with pm.Model() as m:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.Exponential.dist(1.0))
    L = pm.expand_packed_triangular(2, packed_L)
    Σ = pm.Deterministic('Σ', L.dot(L.T))
    
    μ = pm.Normal('μ', 0., 10., shape=2)
    
    obs = pm.MvNormal('obs', μ, chol=L, observed=x)
    trace = pm.sample(random_seed=3)
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `μ`.ravel()[0] is zero.
The derivative of RV `μ`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[0] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[2] is zero.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `μ`.ravel()[0] is zero.
The derivative of RV `μ`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[0] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[2] is zero.

The above exception was the direct cause of the following exception:

@AlexAndorra
Copy link
Contributor

I'm closing this for now, because, after some discussions, this seems to be a random error (i.e false positive), but let's be attentive and deal with this if problems arise in the coming 3.9 release.

@NestorESA
Copy link

I'm closing this for now, because, after some discussions, this seems to be a random error (i.e false positive), but let's be attentive and deal with this if problems arise in the coming 3.9 release.

Why is it label as a random error? and how can we solve it? I am having this error, I tried what is mentioned here but still getting same error.

@michaelosthege
Copy link
Member

@NestorESA if you're using the latest version, please open a new issue with a minimal example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants