Notebook about the effect of distance metric and dimension of summary statistic on computational cost of ABC model #8

Status: Open. Wants to merge 6 commits into master.
Conversation

@OssiGalkin

Summary:

Notebook about the effect of the distance metric and the dimension of the summary statistic on the computational cost of an ABC model. The notebook is related to the course CS-E4070 at Aalto University.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):
Ossi Galkin

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@cagatayyildiz (Contributor)

Overall, the assignment is well-done. The notebook is pretty easy to follow. There are a few details to note though:

  • Why do the ELFI priors cover different regions? Why aren't all of them within, say, [-10, 10]? Also, is there any particular reason why the interval size (4) is that small?

  • I think the simulator could be implemented in such a way that batch_size could be larger than 1. Suppose we are interested in 3D Gaussians. Say the first batch has mean (-1, 0, 1) and the second batch has (-100, 0, 100), and suppose the covariance is 1e-10*np.eye(3). Then the following code gives you a 2x3 matrix whose rows contain multivariate Gaussian random variables with the specified mean and covariance. Of course, the dimensions below should not be set by hand, but it is trivial to extend the code to variable dimensions.

import numpy as np
import scipy.stats as ss

# m1 holds dimension 1 of both batches, m2 dimension 2, and so on.
m1 = np.array([-1, -100])
m2 = np.array([0, 0])
m3 = np.array([1, 100])
cov = 1e-10 * np.eye(3)
# The stacked mean is dimension-major, so the joint covariance is
# kron(cov, eye(2)); for this diagonal cov it equals kron(eye(2), cov).
ss.multivariate_normal.rvs(np.hstack((m1, m2, m3)), np.kron(cov, np.eye(2))).reshape((3, 2)).T
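The snippet above can be generalized to an arbitrary dimension and batch size. Below is a minimal sketch (the helper name batch_mvn_rvs is ours, not from ELFI or the notebook) that stacks the means batch by batch, so the matching joint covariance is kron(eye(b), cov):

```python
import numpy as np
import scipy.stats as ss

def batch_mvn_rvs(means, cov, random_state=None):
    """Draw one d-dimensional Gaussian per row of `means` (shape (b, d))
    in a single call, assuming all batches share the covariance `cov`."""
    means = np.atleast_2d(means)
    b, d = means.shape
    # Batches are independent, so with the means stacked batch by batch
    # the joint covariance is block-diagonal: kron(I_b, cov).
    joint_mean = means.ravel()
    joint_cov = np.kron(np.eye(b), cov)
    sample = ss.multivariate_normal.rvs(joint_mean, joint_cov,
                                        random_state=random_state)
    return sample.reshape(b, d)

samples = batch_mvn_rvs([[-1, 0, 1], [-100, 0, 100]], 1e-10 * np.eye(3))
print(samples.shape)  # (2, 3)
```

With the tiny covariance above, each row lands essentially on its batch mean, which makes the ordering easy to verify.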
  • The plot of the number of simulations vs. dimensions would be more intuitive if the axes were reversed.

  • Because the error is a function of the cost and the dimensionality, and there is already a good code base, plotting error vs. dimensionality should not be that hard. I am wondering if we can empirically confirm error ∼ cost^(−2/(q+4)) as well.

@adhaka

adhaka commented May 7, 2018

The assignment was easy to follow and understand.
I had the following observations (some of which have already been made above).
The interval of the priors was a bit too narrow in my opinion (making it quite an informative prior); testing with non-informative or weakly informative priors could have been an interesting experiment as well, to observe how that affects the number of simulations.
Sampling from a multivariate Gaussian with a nearly isotropic covariance can be done in the way suggested above, so there is no need to sample from a univariate Gaussian multiple times, and that way you can use a larger batch size.

@vuolleko (Member) left a comment

Great and important topic, but the notebook needs more work. Some comments:

  • Remove the checkpoint file
  • Excellent to have theoretical background with references
  • Maybe comment on what "optimal choice of tolerance" means, especially since it is typically unknown
  • "This example follows same conventions that are used in ELFI tutorial." Please be more specific
  • What is setting the tolerance to 0.5 based on?
  • Note elfi.examples.gauss.gauss_nd_mean
  • Perhaps the best way to pass constant arguments to a simulator is using functools.partial (i.e. simulator_with_const = partial(simulator, cov=cov)); other options include a lambda or just giving a constant node as a parent
  • "# NoteToSelf: Statement above is not true, see bug 268" ? You should set ELFI to use the global seed if you want it to use it
  • "# NoteToSelf: y is not a numpy array" ? It should be
  • "# NoteToSelf: if batch_size = 1 elfi uses different dimensions." No, it's the same. In ELFI axis=0 is always assumed the dimension of the batch_size. On the other hand, (in the default case) the output from your Simulator is directly the input to your Summary. So if your simulator does not have a dimension corresponding to batch_size, then it doesn't appear to the Summary either. In this case, if you did use batch_size, the output from your simulator would have shape (batch_size, 30, 6).
  • Very nice usage of OutputPool!
  • It would be informative to see output from elfi.draw
  • From a Bayesian perspective it is a bit confusing to name the parameters priors. What is the posterior mean of prior? Preferably use something like: mu1, mu2 etc.
  • Would be good to see how the inference succeeds in cases with fewer summaries
  • Note that Euclidean distance sums over squares, i.e. the more dimensions you have, the higher the distance typically is. Hence the validity of comparison using a fixed threshold is questionable
  • The statistical power of the different summaries in the 6d Gaussian case is very different, and I think the validity of the comparison should be discussed further. It is unfortunately not easy to come up with a meaningful case to try this with!
  • You should say elfi.new_model() before starting building the graph for MA2 to clear the previous model from memory
  • The MA2 case works very well with a larger batch_size, and is much faster
  • Using only 100 samples from rejection sampling (or SMC) is quite few and prone to much stochasticity
  • Comparison of results from inference seems to be based on posterior means only. Would be informative to see marginal distributions as well
  • Using the same threshold for Euclidean and Manhattan distance is unfounded
  • Autocorrelation is just normalized autocovariance, and for MA2 the difference is small, possibly insignificant as a summary statistic. In any case it's hardly 'silly'.
  • Please explain how ELFI was unstable. Issues with the multiprocessing module are not really issues with ELFI
  • Neither of the mentioned bugs are actually bugs in ELFI (poor documentation is another thing, apologies for that)
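Two of the points above, binding constant arguments with functools.partial and keeping batch_size as axis 0 of the simulator output, can be combined in one sketch. This is a hypothetical stand-alone NumPy simulator (gauss_simulator is our name, not the notebook's code); the 30-observation, 6-dimensional setting mirrors the (batch_size, 30, 6) shape mentioned above:

```python
import functools
import numpy as np

def gauss_simulator(mu, cov, batch_size=1, random_state=None):
    """Toy Gaussian simulator. Returns shape (batch_size, n_obs, ndim):
    axis 0 is always the batch dimension, as ELFI assumes."""
    random_state = random_state or np.random
    mu = np.atleast_2d(mu)               # (batch_size, ndim)
    n_obs = 30
    std = np.sqrt(np.diag(cov))          # assumes a diagonal covariance
    eps = random_state.randn(batch_size, n_obs, mu.shape[1])
    return mu[:, None, :] + std * eps    # broadcast means over n_obs

# Bind the constant covariance so the node only sees the parameters:
cov = np.eye(6)
simulator_with_const = functools.partial(gauss_simulator, cov=cov)
y = simulator_with_const(np.zeros((5, 6)), batch_size=5)
print(y.shape)  # (5, 30, 6)
```

In ELFI this callable would then be wrapped in a Simulator node in the usual way (roughly elfi.Simulator(simulator_with_const, mu, observed=y0), following the ELFI tutorial conventions).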

We will try to improve the documentation to avoid the mentioned confusions. Thanks for bringing these misunderstandings to our knowledge!

@vuolleko (Member)

@cagatayyildiz Great suggestion with the vectorized multivariate_normal. It seems that the code in elfi.examples.gauss.gauss_nd_mean has a for loop. Feel free to open a pull request to the main elfi repo with a tuned version. :)
