Parallel sampling error when sampling trace #3884

nitishp25 · 2020-04-18T06:53:47Z

Description of your problem

The following code gives an error when using PyMC3 development branch.

Please provide a minimal, self-contained, and reproducible example.

import numpy as np
import arviz as az
import pymc3 as pm
import time
start_time = time.time()
np.random.seed(42)

true_m = 0.5
true_b = -1.3
true_logs = np.log(0.3)
x = np.sort(np.random.uniform(0, 5, 20000))
y = true_b + true_m * x + np.exp(true_logs) * np.random.randn(len(x))

with pm.Model() as model:
    m = pm.Uniform("m", lower=-5, upper=5)
    b = pm.Uniform("b", lower=-5, upper=5)
    logs = pm.Uniform("logs", lower=-5, upper=5)
    pm.Normal("obs", mu=m*x+b, sd=pm.math.exp(logs), observed=y)

    print("Getting trace...")
    trace = pm.sample(draws=1000, tune=1000, chains=2)

    print("Getting posterior_predictive...")
    posterior_predictive = pm.sample_posterior_predictive(trace)

    print("Getting idata...")
    idata = az.from_pymc3(
        trace=trace, posterior_predictive=posterior_predictive
    )

print("--- %s seconds ---" % (time.time() - start_time))

Please provide the full traceback.

Traceback (most recent call last):
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 128, in run
    self._start_loop()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 182, in _start_loop
    point, stats = self._compute_point()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 209, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 263, in step
    apoint, stats = self.astep(array)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 255, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `logs_interval__`.ravel()[0] is zero.
The derivative of RV `b_interval__`.ravel()[0] is zero.
The derivative of RV `m_interval__`.ravel()[0] is zero.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 128, in run
    self._start_loop()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 182, in _start_loop
    point, stats = self._compute_point()
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 209, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 263, in step
    apoint, stats = self.astep(array)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 144, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 255, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `logs_interval__`.ravel()[0] is zero.
The derivative of RV `b_interval__`.ravel()[0] is zero.
The derivative of RV `m_interval__`.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "logp_benchmarks.py", line 21, in <module>
    trace = pm.sample(draws=1000, tune=1000, chains=2)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/sampling.py", line 498, in sample
    trace = _mp_sample(**sample_args)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/sampling.py", line 1384, in _mp_sample
    for draw in sampler:
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 412, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/home/nitish/env2/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 314, in recv_draw
    raise error from old_error
RuntimeError: Chain 1 failed.

Please provide any additional information below.

Note that this code originally worked for @OriolAbril on an older version of PyMC3 and still works for me on PyMC3 3.8 but using the development branch causes this error.

Versions and main components

PyMC3 Version: 3.8 (development branch)
Theano Version: 1.0.4
Python Version: 3.6.9
Operating system: Linux (Ubuntu)
How did you install PyMC3: pip

AlexAndorra · 2020-04-18T09:06:23Z

Hi Nitish,
I think this is a modeling error more than a bug: the priors are not regularizing enough. Uniform priors are usually not a good practice, especially when there is an exponential somewhere (numbers get really big, really fast and this messes up the sampler).

Knowing that, this model runs with PyMC3 master branch:

with pm.Model() as model:
    m = pm.Normal("m", 0., 1.)
    b = pm.Normal("b", 0., 1.5)
    sigma = pm.Exponential("sigma", 1.)
    
    pm.Normal("obs", mu=m * x + b, sigma=sigma observed=y)
    trace = pm.sample(chains=2)

Does this work for you too?

nitishp25 · 2020-04-18T11:02:38Z

Hi Nitish,
I think this is a modeling error more than a bug: the priors are not regularizing enough. Uniform priors are usually not a good practice, especially when there is an exponential somewhere (numbers get really big, really fast and this messes up the sampler).

Knowing that, this model runs with PyMC3 master branch:
with pm.Model() as model:
    m = pm.Normal("m", 0., 1.)
    b = pm.Normal("b", 0., 1.5)
    sigma = pm.Exponential("sigma", 1.)
    
    pm.Normal("obs", mu=m * x + b, sigma=sigma observed=y)
    trace = pm.sample(chains=2)
Does this work for you too? If yes, then we can close this issue.

Hi @AlexAndorra, thanks for the reply!

I understand now and yes this works for me but why was the original model still working in v3.8?

michaelosthege · 2020-04-18T11:56:10Z

Could the occurence of ValueError: Mass matrix contains zeros on the diagonal. have something to do with changes related to new QuadPotentialFullAdapt features? @eigenfoo might know more about it

AlexAndorra · 2020-04-18T12:41:54Z

I can confirm that the original model runs with 3.8, so it looks like something happened on the dev branch, as Michael pointed out.

EDIT: There indeed seems to be a problem: this model from the documentation doesn't run anymore (same error as above), even with more regularizing priors centered on the true values.

AlexAndorra · 2020-04-21T17:46:47Z

After exploring further, I noticed the following simple model (which comes from this doc NB) only runs when testval and init="adapt_diag" are supplied:

N = 10000
μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3], [-0.3, 1.]])
x = np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

with pm.Model() as m:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.Exponential.dist(1.0))
    L = pm.expand_packed_triangular(2, packed_L)
    Σ = pm.Deterministic('Σ', L.dot(L.T))
    
    μ = pm.Normal('μ', 0., 10., shape=2, testval=x.mean(axis=0))
    
    obs = pm.MvNormal('obs', μ, chol=L, observed=x)
    trace = pm.sample(init="adapt_diag")

Is it normal behavior, or should this model run without having to supply these two args?
If you don't supply both args, you get ValueError: Mass matrix contains zeros on the diagonal -- an error that seems to be more frequent on the master branch. Could that be related to recent changes on master?
As @michaelosthege suggested, @eigenfoo and @junpenglao are probably the most knowledgeable on this. Tell me if anything is unclear, and thanks a lot in advance!

rpgoldman · 2020-04-21T18:11:58Z

I have seen this many times, as well, to the point where I routinely supply 'adapt_diag' to overwrite the default 'jitter+adapt_diag'. I assume that this happens because PyMC3 can jitter the initial point to a place in the space that's infeasible?

AlexAndorra · 2020-04-22T13:11:58Z

Yeah I think that's why. What's puzzling to me though, is that this same model runs smoothly in the publicly released 3.8, without having to change the initialization and supply testval 🤔

junpenglao · 2020-04-22T14:03:50Z

I cannot reproduce your example @AlexAndorra
Also I dont think we are using QuadPotentialFullAdapt by default right now? it should still be adapt_diag (with full adapt as a new feature)

AlexAndorra · 2020-04-22T14:16:54Z

@junpenglao, do you mean that for you the model above samples on the master branch without having to specify testval=x.mean(axis=0) and init="adapt_diag"?
For me, if I don't specify those, I get ValueError: Mass matrix contains zeros on the diagonal on the master branch. But it runs on 3.8.

rpgoldman · 2020-04-22T15:49:50Z

I wonder -- is there some way to take an existing run that has this problem and capture the random seed, so that we can replicate the behavior we are seeeing?

It's frustrating to me that I see this problem all the time, but it can't be replicated, and so cannot be fixed.

OriolAbril · 2020-04-22T16:40:56Z

If there is, it would be great to add how to do it here or somewhere else and add the link in arviz-devs/arviz#1083 too

nkaimcaudle · 2020-04-23T00:16:08Z

This code fails for me every time I run it.

Chain 2 fails after between 600 and 1400 total samples. I guess the number of completed samples from chains 0, 1 & 3 vary each time I run before the chain 2 failure.

I updated from PyMC3 master yesterday

N = 1000
μ_actual = np.array([1, -2])
Σ_actual = np.array([[0.5, -0.3], [-0.3, 1.]])
x = np.random.multivariate_normal(μ_actual, Σ_actual, size=N)

with pm.Model() as m:
    packed_L = pm.LKJCholeskyCov('packed_L', n=2, eta=2., sd_dist=pm.Exponential.dist(1.0))
    L = pm.expand_packed_triangular(2, packed_L)
    Σ = pm.Deterministic('Σ', L.dot(L.T))
    
    μ = pm.Normal('μ', 0., 10., shape=2)
    
    obs = pm.MvNormal('obs', μ, chol=L, observed=x)
    trace = pm.sample(random_seed=3)

ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `μ`.ravel()[0] is zero.
The derivative of RV `μ`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[0] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[2] is zero.
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `μ`.ravel()[0] is zero.
The derivative of RV `μ`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[0] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[1] is zero.
The derivative of RV `packed_L_cholesky-cov-packed__`.ravel()[2] is zero.

The above exception was the direct cause of the following exception:

AlexAndorra · 2020-06-05T14:47:32Z

I'm closing this for now, because, after some discussions, this seems to be a random error (i.e false positive), but let's be attentive and deal with this if problems arise in the coming 3.9 release.

NestorESA · 2023-09-08T09:41:12Z

I'm closing this for now, because, after some discussions, this seems to be a random error (i.e false positive), but let's be attentive and deal with this if problems arise in the coming 3.9 release.

Why is it label as a random error? and how can we solve it? I am having this error, I tried what is mentioned here but still getting same error.

michaelosthege · 2023-09-09T09:47:57Z

@NestorESA if you're using the latest version, please open a new issue with a minimal example.

AlexAndorra mentioned this issue Apr 19, 2020

Simplify MvNormal Cholesky decomposition API #3881

Merged

This was referenced Apr 24, 2020

Update Parameter Estimation NBs for BCM pymc-devs/pymc-resources#82

Merged

Update Model Selection NBs for BCM pymc-devs/pymc-resources#83

Merged

AlexAndorra mentioned this issue May 12, 2020

changed ar1 logp to use ar1 precision instead of innovation precision #3899

Merged

AlexAndorra closed this as completed Jun 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallel sampling error when sampling trace #3884

Parallel sampling error when sampling trace #3884

nitishp25 commented Apr 18, 2020

AlexAndorra commented Apr 18, 2020 •

edited

Loading

nitishp25 commented Apr 18, 2020 •

edited

Loading

michaelosthege commented Apr 18, 2020

AlexAndorra commented Apr 18, 2020 •

edited

Loading

AlexAndorra commented Apr 21, 2020

rpgoldman commented Apr 21, 2020

AlexAndorra commented Apr 22, 2020

junpenglao commented Apr 22, 2020

AlexAndorra commented Apr 22, 2020

rpgoldman commented Apr 22, 2020

OriolAbril commented Apr 22, 2020

nkaimcaudle commented Apr 23, 2020

AlexAndorra commented Jun 5, 2020

NestorESA commented Sep 8, 2023

michaelosthege commented Sep 9, 2023

Parallel sampling error when sampling trace #3884

Parallel sampling error when sampling trace #3884

Comments

nitishp25 commented Apr 18, 2020

Description of your problem

Versions and main components

AlexAndorra commented Apr 18, 2020 • edited Loading

nitishp25 commented Apr 18, 2020 • edited Loading

michaelosthege commented Apr 18, 2020

AlexAndorra commented Apr 18, 2020 • edited Loading

AlexAndorra commented Apr 21, 2020

rpgoldman commented Apr 21, 2020

AlexAndorra commented Apr 22, 2020

junpenglao commented Apr 22, 2020

AlexAndorra commented Apr 22, 2020

rpgoldman commented Apr 22, 2020

OriolAbril commented Apr 22, 2020

nkaimcaudle commented Apr 23, 2020

AlexAndorra commented Jun 5, 2020

NestorESA commented Sep 8, 2023

michaelosthege commented Sep 9, 2023

AlexAndorra commented Apr 18, 2020 •

edited

Loading

nitishp25 commented Apr 18, 2020 •

edited

Loading

AlexAndorra commented Apr 18, 2020 •

edited

Loading