Notebook about the effect of distance metric and dimension of summary statistic on computational cost of ABC model #8

Status: Open. Wants to merge 6 commits into master.
Conversation

@OssiGalkin

Summary:

Notebook about the effect of the distance metric and the dimension of the summary statistic on the computational cost of an ABC model. The notebook is related to the course CS-E4070 at Aalto University.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
Ossi Galkin

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@cagatayyildiz (Contributor)

Overall, the assignment is well-done. The notebook is pretty easy to follow. There are a few details to note though:

  • Why do the ELFI priors cover different regions? Why aren't all of them within, say, [-10, 10]? Also, is there any particular reason why the interval size (4) is that small?

  • I think the simulator could be implemented in such a way that batch_size could be larger than 1. Suppose we are interested in 3D Gaussians. Say the first batch has mean (-1, 0, 1) and the second batch has (-100, 0, 100), and suppose the covariance is 1e-10*np.eye(3). Then the following code gives you a 2x3 matrix whose rows contain multivariate Gaussian random variables with the specified mean and covariance. Of course, the dimensions below should not be set by hand, but it is trivial to extend the code to variable dimensions.

import numpy as np
import scipy.stats as ss

# m1 holds dimension 1 of both batches, m2 dimension 2, and so on.
m1 = np.array([-1, -100])
m2 = np.array([0, 0])
m3 = np.array([1, 100])
cov = 1e-10 * np.eye(3)
# The stacked mean is dimension-major, so the joint covariance is
# kron(cov, eye(2)); for this diagonal cov it equals kron(eye(2), cov).
ss.multivariate_normal.rvs(np.hstack((m1, m2, m3)), np.kron(cov, np.eye(2))).reshape((3, 2)).T
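The snippet above can be generalized to an arbitrary dimension and batch size. Below is a minimal sketch (the helper name batch_mvn_rvs is ours, not from ELFI or the notebook) that stacks the means batch by batch, so the matching joint covariance is kron(eye(b), cov):

```python
import numpy as np
import scipy.stats as ss

def batch_mvn_rvs(means, cov, random_state=None):
    """Draw one d-dimensional Gaussian per row of `means` (shape (b, d))
    in a single call, assuming all batches share the covariance `cov`."""
    means = np.atleast_2d(means)
    b, d = means.shape
    # Batches are independent, so with the means stacked batch by batch
    # the joint covariance is block-diagonal: kron(I_b, cov).
    joint_mean = means.ravel()
    joint_cov = np.kron(np.eye(b), cov)
    sample = ss.multivariate_normal.rvs(joint_mean, joint_cov,
                                        random_state=random_state)
    return sample.reshape(b, d)

samples = batch_mvn_rvs([[-1, 0, 1], [-100, 0, 100]], 1e-10 * np.eye(3))
print(samples.shape)  # (2, 3)
```

With the tiny covariance above, each row lands essentially on its batch mean, which makes the ordering easy to verify.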
  • The plot of the number of simulations vs. dimensions would be more intuitive if the axes were reversed.

  • Because the error is a function of the cost and the dimensionality, and there is already a good code base, plotting error vs. dimensionality should not be that hard. I am wondering if we can empirically confirm error ∼ cost^(−2/(q+4)) as well.

@adhaka

adhaka commented May 7, 2018

The assignment was easy to follow and understand.
I had the following observations (some of which have already been made above).
The interval of the priors was a bit too narrow in my opinion (making it quite an informative prior); testing with non-informative or weakly informative priors could have been an interesting experiment as well, to observe how that affects the number of simulations.
Sampling from a multivariate Gaussian with a nearly isotropic covariance can be done in the way suggested above, so there is no need to sample from a univariate Gaussian multiple times, and that way you can use a larger batch size.

@vuolleko (Member) left a comment

Great and important topic, but the notebook needs more work. Some comments:

  • Remove the checkpoint file
  • Excellent to have theoretical background with references
  • Maybe comment on what "optimal choice of tolerance" means, especially since it is typically unknown
  • "This example follows same conventions that are used in ELFI tutorial." Please be more specific
  • What is setting the tolerance to 0.5 based on?
  • Note elfi.examples.gauss.gauss_nd_mean
  • Perhaps the best way to pass constant arguments to a simulator is using functools.partial (i.e. simulator_with_const = partial(simulator, cov=cov)); other options include a lambda or just giving a constant node as a parent
  • "# NoteToSelf: Statement above is not true, see bug 268" ? You should set ELFI to use the global seed if you want it to use it
  • "# NoteToSelf: y is not a numpy array" ? It should be
  • "# NoteToSelf: if batch_size = 1 elfi uses different dimensions." No, it's the same. In ELFI axis=0 is always assumed the dimension of the batch_size. On the other hand, (in the default case) the output from your Simulator is directly the input to your Summary. So if your simulator does not have a dimension corresponding to batch_size, then it doesn't appear to the Summary either. In this case, if you did use batch_size, the output from your simulator would have shape (batch_size, 30, 6).
  • Very nice usage of OutputPool!
  • It would be informative to see output from elfi.draw
  • From a Bayesian perspective it is a bit confusing to name the parameters priors. What is the posterior mean of prior? Preferably use something like: mu1, mu2 etc.
  • Would be good to see how the inference succeeds in cases with fewer summaries
  • Note that Euclidean distance sums over squares, i.e. the more dimensions you have, the higher the distance typically is. Hence the validity of comparison using a fixed threshold is questionable
  • The statistical power of the different summaries in the 6d Gaussian case is very different, and I think the validity of the comparison should be discussed further. It is unfortunately not easy to come up with a meaningful case to try this with!
  • You should say elfi.new_model() before starting building the graph for MA2 to clear the previous model from memory
  • The MA2 case works very well with a larger batch_size, and is much faster
  • Using only 100 samples from rejection sampling (or SMC) is quite few and prone to much stochasticity
  • Comparison of results from inference seems to be based on posterior means only. Would be informative to see marginal distributions as well
  • Using the same threshold for Euclidean and Manhattan distance is unfounded
  • Autocorrelation is just normalized autocovariance, and for MA2 the difference is small, possibly insignificant as a summary statistic. In any case it's hardly 'silly'.
  • Please explain how ELFI was unstable. Issues with the multiprocessing module are not really issues with ELFI
  • Neither of the mentioned bugs are actually bugs in ELFI (poor documentation is another thing, apologies for that)
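Two of the points above, binding constant arguments with functools.partial and keeping batch_size as axis 0 of the simulator output, can be combined in one sketch. This is a hypothetical stand-alone NumPy simulator (gauss_simulator is our name, not the notebook's code); the 30-observation, 6-dimensional setting mirrors the (batch_size, 30, 6) shape mentioned above:

```python
import functools
import numpy as np

def gauss_simulator(mu, cov, batch_size=1, random_state=None):
    """Toy Gaussian simulator. Returns shape (batch_size, n_obs, ndim):
    axis 0 is always the batch dimension, as ELFI assumes."""
    random_state = random_state or np.random
    mu = np.atleast_2d(mu)               # (batch_size, ndim)
    n_obs = 30
    std = np.sqrt(np.diag(cov))          # assumes a diagonal covariance
    eps = random_state.randn(batch_size, n_obs, mu.shape[1])
    return mu[:, None, :] + std * eps    # broadcast means over n_obs

# Bind the constant covariance so the node only sees the parameters:
cov = np.eye(6)
simulator_with_const = functools.partial(gauss_simulator, cov=cov)
y = simulator_with_const(np.zeros((5, 6)), batch_size=5)
print(y.shape)  # (5, 30, 6)
```

In ELFI this callable would then be wrapped in a Simulator node in the usual way (roughly elfi.Simulator(simulator_with_const, mu, observed=y0), following the ELFI tutorial conventions).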

We will try to improve the documentation to avoid the mentioned confusions. Thanks for bringing these misunderstandings to our knowledge!

@vuolleko (Member)

@cagatayyildiz Great suggestion with the vectorized multivariate_normal. It seems that the code in elfi.examples.gauss.gauss_nd_mean has a for loop. Feel free to open a pull request to the main elfi repo with a tuned version. :)
