
Fortran backend for 1D EM forward modelling and Numpy broadcasting #19

Merged · 10 commits · Sep 26, 2018

Conversation

leonfoks

I've updated the 1D forward modeller to use numpy broadcasting and Fortran for a nice speed-up (see below for timings).

For timings I used the default 19-layer model from the test example scripts.

In the setup.py file I've added the Fortran extension. On my Mac, numpy.distutils would not pick up the extra compile flags passed through the extra_f90_compile_args argument, so I've put them in the link args instead. This shouldn't be a problem.
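For reference, a minimal sketch of what that looks like with numpy.distutils; the extension and file names here are illustrative rather than the exact ones in this PR:

import numpy as np
from numpy.distutils.core import setup
from numpy.distutils.misc_util import Configuration

def configuration(parent_package='', top_path=None):
    config = Configuration('simpegEM1D', parent_package, top_path)
    config.add_extension(
        'rte_fortran',                           # hypothetical extension name
        sources=['simpegEM1D/rte_fortran.f90'],  # hypothetical source file
        # extra_f90_compile_args was ignored on macOS, so the optimization and
        # no-debug flags are passed through the linker arguments instead
        extra_link_args=['-O3', '-g0'],
    )
    return config

if __name__ == '__main__':
    setup(configuration=configuration)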

Timing

Forward Modelling

EM1D - Survey
Two major bottlenecks

  1. projectFields → piecewise_pulse → piecewise_ramp → scipy's fixed_quad
  2. forward → hz_kernel_circular_loop → rTEfunfwd/RTEfun_vec

Solutions

  1. piecewise_ramp had a double loop in Python, and lots of arrays were redundantly recalculated inside the loop via an inline list comprehension.
    By first allocating the memory in C and using broadcasting, the call time of projectFields() alone for 1000 forward models was reduced from 20.6 s to 1.27 s. This could be made faster still by moving it to Fortran; I would just need to write the 1D interpolator in Fortran. Not necessary just yet.

  2. Because of the recursion relation, there's not much you can do other than use numpy broadcasting inside a Python loop, which is what was already done in RTEfun_vec.
    However, this can still be sped up by moving it to Fortran (or C); I chose the former (see the schematic sketch after this list).
    After I did this, the call time of the forward() function for 1000 forward models went from 24.6 s to about 10-11 s. I was not expecting a great speed-up for this one function, given that the only slowdown was the Python loop. I also used plain loops in Fortran, so there is no need for all the extra memory that broadcasting requires.
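As a schematic sketch of point 2 (using a made-up recurrence and made-up shapes, not the actual rTE recursion), the structure looks like this: the layer loop has to stay, but every step is fully vectorized over frequencies and filter abscissae, and that per-step work is what the Fortran loop does without numpy's temporaries.

import numpy as np

n_layer, n_frequency, n_filter = 19, 100, 201        # illustrative sizes
u = np.random.rand(n_layer, n_frequency, n_filter)   # placeholder per-layer terms

r = np.zeros((n_frequency, n_filter))                # recursion state
for i in range(n_layer - 1, -1, -1):                 # the layer loop cannot be vectorized away
    # fully vectorized over frequencies and filter abscissae; moving this to a
    # Fortran loop does the same arithmetic without allocating temporaries
    r = (u[i] + r) / (1.0 + u[i] * r)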

A single forward model now takes about 16 ms instead of ~45 ms, which adds up to a lot of time for our parallel code once you factor in the number of iterations and the number of data.

Sensitivity, i.e. getJ_sigma()

  1. projectFields - same as above.
  2. forward → hz_kernel_circular_loop → rTEfunjac
    Quite a few issues here with redundant, repeated code, if statements inside loops, and multiple appends to Python lists.

Solutions

  1. Used the same functions as above; instead of piecewise_pulse I wrote piecewise_pulse_fast() (see the caching sketch after this list).
    Since there are two calls to piecewise_ramp, the new fast function reduced the time from 46.3 s to 1.76 s for 100 sensitivity calculations.
  2. Recoded the matmul operations wherever I could take advantage of symmetry and negation of the components to minimize flops.
    Used loops in Fortran to remove the memory overhead that broadcasting requires.
    100 calls to forward() went from 16.8 s to 6.66 s.
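On point 1, one simple way to keep the Gauss-Legendre coefficients around between the two pulse evaluations (as the piecewise_pulse_fast commit message below describes) is to cache them by quadrature order. A minimal sketch, not the actual implementation:

import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def gauss_legendre(n):
    # nodes/weights on [-1, 1]; computed once per order and reused for every
    # subsequent pulse/ramp evaluation instead of being re-derived each call
    return np.polynomial.legendre.leggauss(n)

x, w = gauss_legendre(5)   # first call computes, later calls hit the cache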

Some general comments on speed

  1. Avoid inline loops and list building, e.g.
    np.array([fixed_quad(step_func, t, t+t0, n=n)[0] for t in time])
    This is extremely slow compared to operating on whole numpy arrays (see the sketch after this list).

  2. Don't create an empty array assigned to a variable and then immediately reassign that variable, e.g.
    rTE = np.empty((n_frequency, n_layer), dtype=complex)
    rTE = rTEfunfwd(
        n_layer, f, lamda, sig, chi, depth, self.survey.half_switch
    )
    Here rTE gets reassigned to the output of rTEfunfwd, so the np.empty call allocated space for no reason. The overhead is minor, but it's unnecessary. If you want to assign the values into the memory from the first np.empty call, use rTE[:, :] = function(...).
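For point 1, a minimal sketch of the broadcasting alternative (assuming step_func is numpy-vectorized; time, t0 and n mirror the snippet above, but the values here are made up):

import numpy as np

def step_func(t):                        # stand-in for the real integrand
    return np.exp(-t / 1e-3)

time = np.linspace(1e-4, 1e-2, 1000)     # observation times (made up)
t0, n = 1e-3, 5                          # pulse length and quadrature order (made up)

# Gauss-Legendre nodes/weights on [-1, 1], computed once for all times
x, w = np.polynomial.legendre.leggauss(n)

# map the nodes onto every interval [t, t + t0] at once; shape (n_time, n)
nodes = time[:, None] + 0.5 * t0 * (x[None, :] + 1.0)

# each integral is (t0 / 2) * sum_k w_k * f(node_k): one matrix-vector product
out = 0.5 * t0 * (step_func(nodes) @ w)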

Leon Foks added 9 commits August 30, 2018 12:09
…ble. i.e. using x = np.empty() must be followed by x[:, :] in order to take advantage of the allocated memory.
…r 1 pulse. Fast diff uses broadcasting for 2 pulses. Piecewise Pulse fast keeps the gauss legendre coefficients so no need to keep asking for them.
Fixed a bug where j was not assigned in certain cases.
… glue to the Fortran codes. Memory is preallocated before calling the Fortran codes.
…o update the Fortran compile flags using the extra_f90_compile_args option. I had to add them to the link_args instead. Bit hacky, but it should work fine.

prisae commented Aug 31, 2018

Good catch on the overwriting of just-created arrays. I saw that when I worked on the filters and wondered about it, but left it aside at the time.

You still have things like

lambd = np.empty([self.survey.frequency.size, n_filter], order='F')
lambd[:, :], _ = get_spline_values(self.fhtfilt, r, self.hankel_pts_per_dec)

Couldn't you just write

lambd, _ = get_spline_values(self.fhtfilt, r, self.hankel_pts_per_dec)

I might have time on the weekend to go through your code a bit, or otherwise later in September. However, I'll probably wait until @sgkang has gone through it. He knows the code inside and out; I only had a look at the parts related to the Hankel/Fourier transforms.


prisae commented Aug 31, 2018

One comment related to your question in #14: in empymod the entire kernel is in Python, and the calculation is done in one massive matrix that includes all frequencies, wavenumbers, and offsets (if RAM permits and depending on the Hankel/Fourier transform method; otherwise it might loop over frequencies, or over offsets, or both, but it always includes all wavenumbers).

I later profiled the kernel, and the most time-consuming parts were the square roots and the exponential functions (np.sqrt, np.exp) applied to this big matrix. I tried various things to speed it up, such as numba and Cython, but was not really successful. This led me to the conclusion that the most time-consuming part of empymod (I think on the order of 90% of the runtime), the np.sqrt and np.exp calls, simply takes that long to compute and cannot be optimized further (if using conda with MKL etc.). So rewriting the Python code into a Cython loop over frequencies, offsets, and wavenumbers does not improve it, because the square root and the exponential in this (then compiled) loop will still take that time.
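A quick, purely illustrative way to check that on any machine (this is not empymod's actual kernel) is to time the transcendental calls on a matrix of roughly the relevant size:

import timeit
import numpy as np

n_freq, n_wavenumber = 100, 2000   # made-up sizes
gam = np.random.rand(n_freq, n_wavenumber) + 1j * np.random.rand(n_freq, n_wavenumber)

t_sqrt = timeit.timeit(lambda: np.sqrt(gam), number=100)
t_exp = timeit.timeit(lambda: np.exp(-gam), number=100)
print(f"100 calls: sqrt {t_sqrt:.3f} s, exp {t_exp:.3f} s")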

It would therefore be interesting to know if your Fortran implementation is really faster than a pure, vectorized Python implementation (which is what #14 is about).

Of course, that is a luxury problem, more out of interest than anything else, and probably not worth the extra work, as you have already written the Fortran version and it has proven faster than the existing one.


leonfoks commented Aug 31, 2018

First comment re:

lambd = np.empty([self.survey.frequency.size, n_filter], order='F')
lambd[:, :], _ = get_spline_values(self.fhtfilt, r, self.hankel_pts_per_dec)

This was done on purpose. Since lambd is an input to the forward modeller, I had to make sure its ordering was Fortran (column-major). Otherwise, when a numpy array with order='C' is passed to a Fortran subroutine, there is an implicit copy, because of the different row/column-major ordering between C and Fortran; the copy is needed to convert the array to column-major ordering for Fortran. By specifying order='F' up front we avoid that implicit copy.

The first line allocates the memory without initializing the values, so it is very fast. The second line might return a C-ordered numpy array, but assigning it into the pre-allocated array puts it into Fortran order with no issues. Doing it here saves me from digging down into get_spline_values() and changing the ordering there.

If another part of the code base uses lambd, there should not be a dramatic change in efficiency for the Python code that was previously using lambd with C ordering.
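A minimal illustration of that pattern (with a random array standing in for the get_spline_values output):

import numpy as np

n_frequency, n_filter = 100, 201   # made-up sizes

# allocate once in Fortran (column-major) order, then fill in place; the single
# copy happens here instead of implicitly at every call into the Fortran code
lambd = np.empty((n_frequency, n_filter), order='F')
lambd[:, :] = np.random.rand(n_frequency, n_filter)   # stand-in for get_spline_values

print(lambd.flags['F_CONTIGUOUS'])                    # True: no copy at the f2py boundary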

Second comment re:

I agree with what you are saying. There is little to no speed-up gained when migrating a vectorized numpy function to Fortran/C when there is only a single Pythonic loop. By a Pythonic loop I mean something like

for i in range(nFrequencies):
    # do a bunch of vectorized numpy work

The speed is equivalent between numpy broadcasting and loops in Fortran/C: since they are doing the same operations on the back end, the time should not change. The downside of numpy broadcasting is the extra memory (sometimes a lot!) needed to carry out the operations. So I only expected Fortran to be a little faster than the vectorized numpy implementation for the sensitivity matrix. The gains were mainly in memory, and in removing the need to keep instantiating 2D and 3D blocks of memory with every call to the sensitivity matrix. Allocating memory takes time!
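To make the memory point concrete (made-up shapes, not the actual sensitivity arrays): the broadcast form materializes the full 3D temporary before reducing it, while the loop form only ever holds one 2D slice of work.

import numpy as np

n_layer, n_frequency, n_filter = 19, 100, 201
a = np.random.rand(n_layer, n_frequency, n_filter)
b = np.random.rand(n_frequency, n_filter)

# broadcasting: numpy allocates the whole (n_layer, n_frequency, n_filter)
# temporary for a * b before summing it away
out_broadcast = (a * b).sum(axis=0)

# loop: same arithmetic, but peak extra memory stays at one 2D slice, which is
# the pattern the Fortran code follows with plain scalars
out_loop = np.zeros((n_frequency, n_filter))
for i in range(n_layer):
    out_loop += a[i] * b

assert np.allclose(out_broadcast, out_loop)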

If the code were written without numpy (i.e. in pure Python loops), there absolutely are efficiency gains to be had.

However, and I only found this out yesterday, when the Fortran extension is added to the setup.py file and we "pip install simpegEM1D", f2py is called to create the glue code for the Fortran module. Pip then compiles the module using some default compile flags, namely '-Ox -g', where '-O' sets the optimization level and '-g' adds extra symbols for debugging. '-g' alone can slow code down by at least 2x. With the default flags, the Fortran code was the same speed as the numpy version.

If we want efficient code, we need '-O3 -g0'. Once we have tested the code and are confident in it, we can turn off the debug symbols for the Fortran code only with '-g0' and get at least a 2x speed-up. So, in this scenario, and as is the case with this PR, the Fortran extension compiled with -O3 and -g0 was 2x faster than the vectorized numpy function, as well as having far less memory overhead, since I used scalars in a double Fortran loop rather than array operations. I also initially wrote the Fortran code in vectorized (array) notation, and its speed was the same as the version with loops, so I did test some of these ideas.


prisae commented Sep 1, 2018

Thanks for the further explanations @leonfoks, very interesting and insightful. As I said, simpegEM1D is the brainchild of @sgkang; I only contributed a negligible part to it and never looked at the code in detail. So you'll have to wait for his response/opinion.


sgkang commented Sep 26, 2018

Hi @leonfoks, sorry for being so late on this pull request!
I finally have some time to work on this, and I'm excited about your improvements. I'll start posting questions as I review this pull request!

@sgkang sgkang changed the base branch from master to speed_up_leon September 26, 2018 19:11
@sgkang sgkang merged commit 3f76463 into simpeg:speed_up_leon Sep 26, 2018