Fortran backend for 1D EM forward modelling and Numpy broadcasting #19
Conversation
…ble. i.e. using x = np.empty() must be followed by an assignment into x[:, :] in order to take advantage of the allocated memory.
…r 1 pulse. Fast diff uses broadcasting for 2 pulses. Piecewise Pulse fast keeps the Gauss-Legendre coefficients so there is no need to keep asking for them.
Fixed a bug where j was not assigned in certain cases.
… glue to the Fortran codes. Memory is preallocated before calling the Fortran codes.
…o update the Fortran compile flags using the extra_f90_compile_args option. I had to add them to the link_args instead. Bit hacky, but it should work fine.
Good catch on the overwriting of just created arrays. I saw that when I worked on the filters and wondered about it, but then left it aside. You still have things like
Couldn't you just write
I might have time on the weekend to go through your code a bit, or later in September. However, I will probably wait until @sgkang has gone through it. He knows the code inside and out; I just had a look related to the Hankel/Fourier transform.
One comment related to your question in #14: I later profiled the kernel, and the most time-consuming parts were the square roots and the exponential functions. It would therefore be interesting to know if your Fortran implementation is really faster than a pure, vectorized python implementation (which is what #14 is about). Of course, that is a luxury problem more out of interest than anything else, and probably not worth the extra work, as you have already done the Fortran version and it proves to be faster than the existing version.
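As a rough illustration of that profiling observation, a toy benchmark along these lines shows the transcendental functions dominating a vectorized kernel evaluation (the kernel shape, sizes, and constants below are made up for illustration, not taken from the actual code):

```python
# Hypothetical micro-benchmark: how much of a vectorized "kernel" evaluation
# is spent in sqrt/exp versus the surrounding arithmetic.
import timeit
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(1e-3, 1e3, size=(100, 201))   # made-up wavenumber grid
k2 = 1e-6 + 1e-4j                               # made-up squared wavenumber

def full_kernel(lam, k2):
    gamma = np.sqrt(lam**2 - k2)    # square root: expensive
    return np.exp(-2.0 * gamma)     # exponential: expensive

def arithmetic_only(lam, k2):
    return -2.0 * (lam**2 - k2)     # same array shapes, no sqrt/exp

t_full = timeit.timeit(lambda: full_kernel(lam, k2), number=1000)
t_arith = timeit.timeit(lambda: arithmetic_only(lam, k2), number=1000)
print(f"full kernel: {t_full:.3f} s, arithmetic only: {t_arith:.3f} s")
```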
First comment re:

lambd = np.empty([self.survey.frequency.size, n_filter], order='F')
lambd[:, :], _ = get_spline_values(self.fhtfilt, r, self.hankel_pts_per_dec)

This was done on purpose. Since lambd is an input to the forward modeller, I had to make sure that its ordering was Fortran. Otherwise, when a numpy array with order='C' is passed to a Fortran subroutine, there is an implicit copy due to the different row/column major ordering between C and Fortran; the copy is necessary in that case so that the array ends up in column major (Fortran) ordering. By specifying order='F' we no longer cause the implicit copy. The first line instantiates the memory without specifying a value, so it is very fast. While the second line might return a C ordered numpy array, we can put it into Fortran order with no issues. Doing this here saves me from digging down into get_spline_values() and changing the ordering there. If another part of the code base uses lambd, there should not be a dramatic change in efficiency for those Python codes that were once using lambd with C ordering.

Second comment re: I agree with what you are saying. There is little to no speed up gained when migrating a vectorized Numpy function to Fortran/C when there is a single Pythonic loop. By a pythonic loop I mean something like

for i in range(nFrequencies):
    # do a bunch of vectorized Numpy stuff

The speed is equivalent between Numpy broadcasting and loops in Fortran/C. Since they are doing the same operations on the backend, the time should not change. The downside with Numpy broadcasting is the extra memory (sometimes a lot!) needed to carry out the operations. So I only expected Fortran to be a little faster than the vectorized Numpy implementation for the sensitivity matrix. The gains were mainly the memory, and removing the need to keep instantiating 2D and 3D blocks of memory with every call to the sensitivity matrix. Allocation of memory takes time! If the code was written without Numpy (i.e. pythonic), there absolutely are efficiency gains.

However, and I found this out yesterday, when the Fortran extension is added to the setup.py file and we "pip install simpegEM1D", f2py is called, which creates the glue code to the Fortran module. Pip then compiles the module using some default compile flags, namely '-Ox -g', where '-O' is the optimization level and '-g' adds in extra symbols for debugging. -g alone can slow codes down by at least 2x. With the default flags, the Fortran code was the same speed as the Numpy version. If we want efficient code, we need '-O3 -g0'. If we test the code, and are confident in it, we can turn off debug symbols for the Fortran code only with '-g0' and get at least a 2x speed up.

So, in this scenario and as is the case with this PR, the Fortran extension with -O3 and -g0 was 2x faster than the vectorized Numpy function, as well as having far less memory overhead since I used scalars in a double Fortran loop rather than array operations. Also, I initially wrote the Fortran code in vectorized notation, and the speed was the same as the version with loops, so I did test some of these ideas.
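A minimal sketch of the ordering point above (the array names, shapes, and the stand-in for get_spline_values() are illustrative, not taken from the code):

```python
import numpy as np

n_frequency, n_filter = 100, 201          # illustrative sizes

# Allocate once in Fortran (column major) order; no values are set yet.
lambd = np.empty((n_frequency, n_filter), order='F')

# A function returning a C-ordered result can still fill that memory in
# place; the slice assignment copies values but keeps the F ordering.
lambd[:, :] = np.random.rand(n_frequency, n_filter)  # stand-in for get_spline_values()
print(lambd.flags['F_CONTIGUOUS'])   # True: no implicit copy when passed to f2py

# By contrast, rebinding the name discards the preallocated F-ordered block
# and hands a C-ordered array to the Fortran wrapper, triggering a copy.
lambd = np.random.rand(n_frequency, n_filter)
print(lambd.flags['F_CONTIGUOUS'])   # False: C ordered, so f2py would copy it
```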
Hi @leonfoks, sorry for being so late about this pull request!
I've updated the 1D forward modeller to use numpy broadcasting and Fortran for a nice speed up (see below for timings)
For timings I used the default 19 layer model from the test example scripts.
In the setup.py file I've added the Fortran extension. On my Mac, numpy.distutils would not add the extra compile flags passed via the extra_f90_compile_args option, so I've stuck them in the link args instead. This shouldn't be a problem.
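For reference, a sketch of the kind of numpy.distutils configuration being described (the extension name, source file, and exact flags below are placeholders, not necessarily what this PR uses):

```python
# setup.py sketch (hypothetical names): a Fortran extension built through
# numpy.distutils/f2py, with the optimization flags also pushed through the
# link arguments as a workaround when extra_f90_compile_args is not honoured.
from numpy.distutils.core import setup, Extension

ext = Extension(
    name='simpegEM1D.m_rTE_Fortran',        # placeholder module name
    sources=['simpegEM1D/rTE.f90'],         # placeholder source file
    extra_f90_compile_args=['-O3', '-g0'],  # preferred route: compile flags
    extra_link_args=['-O3', '-g0'],         # workaround used here
)

setup(name='simpegEM1D', ext_modules=[ext])
```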
Timing
Forward Modelling
EM1D - Survey
Two major bottlenecks
Solutions
Had a double loop in python in piecewise_ramp and lots of redundantly recalculated arrays inside the loop, built via a pythonic inline list expansion.
By first expanding memory in C and utilizing broadcasting, the call time for projectFields() alone for 1000 forward models was reduced from 20.6 s to 1.27 s. This could be made faster by dropping down to Fortran; I would just need to write the 1D interpolator in Fortran. Not necessary just yet.
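A sketch of the broadcasting idea (step_func below is a placeholder integrand and the sizes are illustrative; the actual piecewise_ramp code differs): precompute the Gauss-Legendre nodes and weights once, then evaluate every quadrature point for every ramp interval in a single vectorized call instead of calling fixed_quad per time inside a list comprehension.

```python
import numpy as np
from scipy.integrate import fixed_quad

def step_func(t):
    # placeholder standing in for the interpolated step response
    return np.exp(-t) * np.sqrt(t + 1.0)

time = np.linspace(1e-5, 1e-2, 41)   # illustrative ramp times
t0, n = 1e-3, 9                      # ramp length and quadrature order

# Slow pattern: one Python-level quadrature call per time, built in a list.
slow = np.array([fixed_quad(step_func, t, t + t0, n=n)[0] for t in time])

# Broadcast version: reuse one set of Gauss-Legendre nodes/weights for all
# intervals [t, t + t0] and evaluate the integrand on a 2-D grid at once.
x, w = np.polynomial.legendre.leggauss(n)             # nodes/weights on [-1, 1]
pts = 0.5 * t0 * (x[None, :] + 1.0) + time[:, None]   # map nodes to each interval
fast = 0.5 * t0 * (step_func(pts) * w[None, :]).sum(axis=1)

assert np.allclose(slow, fast)
```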
Because of the recursion relation, there's not much you can do other than use numpy broadcasting inside a python loop, which is what was already done in RTEfun_vec.
However, this can be sped up by moving it to Fortran (or C); I chose the former.
After I did this, the call time for the forward() function for 1000 forward models went from 24.6 s to about 10-11 s. I was not expecting a great speed up for this one function, given that the only slow down was the python loop. Also, I used a loop, so none of the extra memory required when using numpy broadcasting is needed.
A single forward model is about 16 ms instead of ~45 ms, which adds up to a lot of time for our parallel code once you factor in the number of iterations and the number of data.
Sensitivity, i.e. getJ_sigma()
Quite a few issues here with redundant, repeated code, if statements inside loops, and multiple appends to pythonic lists.
Solutions
Since there are two calls to piecewise_ramp, the new fast function reduced the time from 46.3 s to 1.76 s for 100 sensitivity calculations.
Used loops in Fortran to remove memory overhead that broadcasting requires
100 calls to forward() went from 16.8s to 6.66s
Some general comments on speed
Avoid inline loops and list building, e.g.
np.array([fixed_quad(step_func, t, t+t0, n=n)[0] for t in time])
It's extremely slow compared to vectorized numpy operations.
Don't create empty arrays assigned to a variable and then immediately reassign that variable. e.g.
rTE = np.empty((n_frequency, n_layer), dtype=complex)
rTE = rTEfunfwd(
n_layer, f, lamda, sig, chi, depth, self.survey.half_switch
)
rTE gets reassigned to the output of rTEfunfwd, so the np.empty statement just allocated space for no reason. The overhead is minor, but it's unnecessary. If you want to assign the values to the memory of the first empty call, use rTE[:, :] = function(...).
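A small illustration of the difference (toy_kernel and the sizes below are placeholders, not the actual rTEfunfwd):

```python
import numpy as np

def toy_kernel(n_frequency, n_layer):
    # stand-in for rTEfunfwd: returns a freshly allocated complex array
    return np.ones((n_frequency, n_layer), dtype=complex)

n_frequency, n_layer = 10, 19

# Rebinding: the empty allocation is thrown away immediately.
rTE = np.empty((n_frequency, n_layer), dtype=complex)
buf = rTE                      # keep a handle on the original allocation
rTE = toy_kernel(n_frequency, n_layer)
print(rTE is buf)              # False: a new array replaced the old one

# In-place fill: the values are copied into the preallocated memory.
rTE = np.empty((n_frequency, n_layer), dtype=complex)
buf = rTE
rTE[:, :] = toy_kernel(n_frequency, n_layer)
print(rTE is buf)              # True: the original allocation is reused
```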