Speeding up Grid via Numba/Cython #1583

rht · 2023-01-16T07:19:56Z

Update 1: We have decided to go with Cython because:

Numba @jitclass requires all methods to be strictly in nopython mode. This greatly limits the implementation. I can't even pass a pos (a tuple[int, int]) to a method; if I were to write the Numba code, I would have to wrap a converter in an external function (defined outside the class definition).
The error messages are more cryptic than Cython's. Sometimes the error doesn't even refer to the relevant line number, e.g. "TypeError: Failed in nopython mode pipeline (step: fix up args) Signature mismatch: 1 argument types given, but function takes 2 arguments", with no line number.
The supposed benefits of Numba, i.e. typed list and typed dict, are slow when used from non-Numba code.

Numba will likely still evolve a lot in the next few years. But for our use case, Cython is a better fit.

@Tortar and I have been experimenting with reimplementing Grid in Numba/Cython. So far we have tried to speed up get_neighborhood. I will summarize the findings.

This is _Grid.get_neighborhood((10, 10), True, include_center=True, radius=10) with torus=False in various implementations:

default 146.35 μs
empty_function 0.399 μs
cython np.ndarray 6.122 μs
cython list 61.702 μs
numba np.ndarray 6.35 μs
numba typed list 57.719 μs
cython array 209.88 μs

"Cython array" is basically Cython with Python's array.array. "Cython list" is Cython but with a Python list. Numba/Cython with np.ndarray is 23-24x faster than "default" (the vanilla get_neighborhood).

Julia:

get_neighborhood:
  4.438 μs (2 allocations: 11.62 KiB)
Empty function:
  5.620 ns (0 allocations: 0 bytes)

The Julia implementation is ~1.4x faster than Numba/Cython np.ndarray (4.438 μs vs. 6.122-6.35 μs).
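
A minimal sketch of how per-call timings like the Python ones above could be taken, using only the standard timeit module. The function neighborhood_py is a hypothetical plain-Python stand-in for the non-torus get_neighborhood, and the 100x100 grid size is an assumption for illustration; this is not the benchmark code from the repository.

```python
import timeit


def neighborhood_py(pos, moore, include_center, radius, width, height):
    """Plain-Python, non-torus neighborhood (hypothetical stand-in)."""
    x, y = pos
    cells = []
    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
        for ny in range(max(0, y - radius), min(height, y + radius + 1)):
            # Von Neumann case: keep only cells within Manhattan distance.
            if not moore and abs(nx - x) + abs(ny - y) > radius:
                continue
            if not include_center and (nx, ny) == (x, y):
                continue
            cells.append((nx, ny))
    return cells


def time_per_call_us(func, *args, number=1000, repeat=5):
    """Best-of-`repeat` average time per call, in microseconds."""
    timer = timeit.Timer(lambda: func(*args))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


if __name__ == "__main__":
    # Assumed setup: position (10, 10), Moore, center included, radius 10.
    args = ((10, 10), True, True, 10, 100, 100)
    print(f"python {time_per_call_us(neighborhood_py, *args):.3f} μs")
```

Taking the minimum over several repeats reduces noise from other processes; absolute numbers will differ from machine to machine.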

However, the story is different if we have to convert the array to a Python list at the end of the computation:

default 175.889 μs
empty_function 0.709 μs
cython np.ndarray 931.432 μs
cython list 97.376 μs
numba np.ndarray 135.829 μs
numba typed list 3613.393 μs
cython array 220.017 μs
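
For reference, the speedups implied by the two tables can be recomputed with a few lines of Python. This is only arithmetic on the timings quoted in this post; the dictionary and function names are made up for the sketch. A value below 1 means the implementation is slower than "default".

```python
# Timings in microseconds, copied from the two tables in this post.
no_conversion = {
    "default": 146.35,
    "cython np.ndarray": 6.122,
    "cython list": 61.702,
    "numba np.ndarray": 6.35,
    "numba typed list": 57.719,
    "cython array": 209.88,
}
with_conversion = {
    "default": 175.889,
    "cython np.ndarray": 931.432,
    "cython list": 97.376,
    "numba np.ndarray": 135.829,
    "numba typed list": 3613.393,
    "cython array": 220.017,
}


def speedups(timings):
    """Speedup of each implementation relative to the 'default' row."""
    base = timings["default"]
    return {name: base / t for name, t in timings.items() if name != "default"}


if __name__ == "__main__":
    for label, table in (("no conversion", no_conversion),
                         ("with conversion", with_conversion)):
        print(label)
        for name, s in speedups(table).items():
            print(f"  {name:<20} {s:6.2f}x")
```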

This leaves us 2 options:

1) Get at most "only" a 2x speedup via Cython + Python list
2) Get a 23x speedup if the output never gets converted to a Python list, but, most of the time, only a very slight speedup if the output has to be converted to a Python list

I think option 2 is more promising. But this means that Mesa users will have to code their model in Numba/Cython if they want this 23x speedup. This might be a viable use case for people who already have written their model in Mesa/Python, but find it too time-consuming to migrate their code to Julia/Agents.jl.

This summary is only for get_neighborhood. But we should expect an order of magnitude speedup for iter_cell_list_contents as well.

Code to reproduce this can be found at https://github.com/rht/mesa-perf.

Tortar · 2023-01-16T14:57:07Z

I think we have a third option, but to be sure about it I need to carry out some experiments (I will do so soon). If they turn out well I will write this option down in more depth, but in summary I think we may be able to get both of your 1) and 2) at once. If possible, this would be the best solution! (imho)

Tortar · 2023-01-16T16:56:58Z

The experiment was a success: with array.array we get the same perf as with numpy (without the conversion), and with the conversion the perf is the same as with Cython + list. This is the code:

cython array.array

import cython
from cython.cimports.cpython import array
import array

@cython.boundscheck(False)
def compute_neighborhood(pos, moore: cython.bint,
                         include_center: cython.bint, radius: cython.int, torus: cython.bint,
                         width: cython.int, height: cython.int) -> tuple[array.array, array.array]:

    # Typecode 'i' (C int) matches the cython.int[:] memoryviews below;
    # 'l' (C long) is 8 bytes on 64-bit Linux/macOS and would not match.
    neighborhood_x: array.array = array.array('i')
    neighborhood_y: array.array = array.array('i')

    array.resize(neighborhood_x, width * height)
    array.resize(neighborhood_y, width * height)

    p_neighborhood_x: cython.int[:]
    p_neighborhood_y: cython.int[:]
    p_neighborhood_x = neighborhood_x
    p_neighborhood_y = neighborhood_y
    
    x: cython.int; y: cython.int; i: cython.int
    i = 0
    x, y = pos
    
    if torus:
        x_max_radius, y_max_radius = width // 2, height // 2

        x_radius: cython.int; y_radius: cython.int
        x_radius, y_radius = min(radius, x_max_radius), min(radius, y_max_radius)

        xdim_even, ydim_even = (width + 1) % 2, (height + 1) % 2
        kx: cython.int = 1 if x_radius == x_max_radius and xdim_even else 0
        ky: cython.int = 1 if y_radius == y_max_radius and ydim_even else 0

        dx: cython.int; dy: cython.int
        nx: cython.int; ny: cython.int
        for dx in range(-x_radius, x_radius + 1 - kx):
            for dy in range(-y_radius, y_radius + 1 - ky):

                if not moore and abs(dx) + abs(dy) > radius:
                    continue

                nx = (x + dx) % width
                ny = (y + dy) % height

                if nx == x and ny == y and not include_center:
                    continue
                
                p_neighborhood_x[i] = nx
                p_neighborhood_y[i] = ny
                i += 1
    else:
        min_x_range: cython.int = max(0, x - radius)
        max_x_range: cython.int = min(width, x + radius + 1)
        min_y_range: cython.int = max(0, y - radius)
        max_y_range: cython.int = min(height, y + radius + 1)

        nx: cython.int; ny: cython.int
        for nx in range(min_x_range, max_x_range):
            for ny in range(min_y_range, max_y_range):

                if not moore and abs(nx - x) + abs(ny - y) > radius:
                    continue

                if nx == x and ny == y and not include_center:
                    continue
                
                p_neighborhood_x[i] = nx
                p_neighborhood_y[i] = ny
                i += 1

    # Only the first i entries were written; slice off the unused tail.
    return neighborhood_x[:i], neighborhood_y[:i]

##    neighborhood: list = [0]*i
##
##    for k in range(i):
##        neighborhood[k] = (p_neighborhood_x[k], p_neighborhood_y[k])
##        
##    return neighborhood

But then I realized that this experiment was unnecessary xD, since the same can probably be achieved using only numpy, because the conversion will cost the same (we need to check, but it seems probable). In any case, I think we can have both options 1) and 2) at once. This is how:

Implement a _BaseGrid class with only Cython code (or Numba :-), but I'm for using Cython) that passes around numpy arrays and defines the internal grid data structure as a numpy array containing only the ids of agents (which have to be int), as suggested by @rht. This class could be used directly by users if they really want to.
Make _Grid (and SingleGrid, MultiGrid) handle only the conversion. This can be done by initializing a dict inside the _Grid class with ids as keys and agents as values.

This achieves both 1) and 2) at the same time. Also, handling only the conversion can sometimes be quite fast: for example, if there are not many neighbors, the list of ints that gets passed will be short. What do you think about this solution? It would be beneficial for all users.
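
The two-layer idea above could look roughly like the sketch below. It is written in plain Python/NumPy rather than Cython, it stores at most one agent per cell, and every name in it (BaseGrid, Grid, EMPTY, place, get_cell_ids, the tiny Agent class) is hypothetical and only meant to illustrate the split between an id-only core and a conversion layer.

```python
from dataclasses import dataclass

import numpy as np

EMPTY = -1  # hypothetical sentinel meaning "no agent in this cell"


@dataclass
class Agent:
    unique_id: int


class BaseGrid:
    """Low-level layer: stores only integer agent ids in a NumPy array."""

    def __init__(self, width, height):
        self.ids = np.full((width, height), EMPTY, dtype=np.int64)

    def place(self, agent_id, pos):
        self.ids[pos] = agent_id  # pos is an (x, y) tuple

    def get_cell_ids(self, positions):
        """Return the ids found at `positions`, skipping empty cells."""
        positions = np.asarray(positions, dtype=np.int64).reshape(-1, 2)
        found = self.ids[positions[:, 0], positions[:, 1]]
        return found[found != EMPTY]


class Grid(BaseGrid):
    """User-facing layer: only converts ids back to agent objects."""

    def __init__(self, width, height):
        super().__init__(width, height)
        self._agents = {}  # id -> agent

    def place_agent(self, agent, pos):
        self._agents[agent.unique_id] = agent
        self.place(agent.unique_id, pos)

    def get_cell_list_contents(self, positions):
        ids = self.get_cell_ids(positions)
        return [self._agents[i] for i in ids.tolist()]
```

With this split, code that can work on ids alone calls get_cell_ids and never pays for the dict lookups, while ordinary Python models keep getting agent objects back.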

6 replies

Tortar Jan 16, 2023

I performed some experiments: the only way to keep the performance (or at least what we can keep of it) when converting numpy arrays is to use tolist(), but this produces a list of 2-element lists instead of a list of tuples :(.

But we can go even further: the Cython class can return a numpy array built from the two array.array objects, and the Python grid can return a list. This means implementing two conversion functions instead of one. (I think this would be the only place where it is needed, because in all other cases we have just one-dimensional lists, and the conversion functions should be almost one-liners anyway.)
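
A small illustration of the tolist() behaviour mentioned in this reply, using a made-up 3x2 array of positions. It only shows the shape of the result; the extra pass that turns the inner lists into tuples has its own cost, which is not measured here.

```python
import numpy as np

# Three (x, y) positions stored as rows of a 2-D array.
neighborhood = np.array([[0, 1], [1, 0], [1, 1]])

# tolist() converts every level to Python lists and the scalars to
# Python ints, so each position comes back as a list, not a tuple.
as_lists = neighborhood.tolist()

# Getting tuples requires one more pass over the result.
as_tuples = [tuple(cell) for cell in as_lists]
```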

rht Jan 16, 2023
Author

I thought the NumPy version could be fast if you construct the Python list within Cython? Is this slower than ndarray.tolist()?

Tortar Jan 17, 2023

Yes, it is slower, but maybe I'm doing something wrong, I don't know. Can you try it too?

edit: I was probably wrong. I did something like ls_neigh[i] = tuple(neighborhood[i]), but I think the right way is ls_neigh[i] = (neighborhood[i,0], neighborhood[i,1]). I haven't tried it yet, though.

Tortar Jan 17, 2023

Tried it, and indeed it works now. The point is that the unpacking has to be done in Cython.

rht Jan 17, 2023
Author

The (neighborhood[i,0], neighborhood[i,1]) approach works for both Numba and Cython. The only way to tie-break the two seems to be to compare their get_cell_list_contents implementations.
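
The two conversion variants discussed in this exchange, written out as a plain-Python sketch with hypothetical function names. Both produce the same list of tuples. The performance difference reported above applies to compiled Cython/Numba code, where indexing each element explicitly avoids creating a NumPy row object per position; the sketch does not reproduce those timings.

```python
import numpy as np


def to_tuples_via_tuple(neighborhood):
    """Variant first tried in the thread: tuple() on each row object."""
    n = neighborhood.shape[0]
    ls_neigh = [None] * n
    for i in range(n):
        ls_neigh[i] = tuple(neighborhood[i])
    return ls_neigh


def to_tuples_via_indexing(neighborhood):
    """Variant that worked well: index both coordinates explicitly."""
    n = neighborhood.shape[0]
    ls_neigh = [None] * n
    for i in range(n):
        ls_neigh[i] = (neighborhood[i, 0], neighborhood[i, 1])
    return ls_neigh
```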

rht · 2023-01-18T15:46:51Z

New findings. I ported get_cell_list_contents to Cython, with 2 versions, and got this result:

default 341.865 μs
cython np.ndarray 18.906 μs
cython list-of-list 47.554 μs

Both versions return a list of agents, and so can easily be consumed by other Python functions.
This is a surprising ~18x speedup, much better than for get_neighborhood, which is ~2x in the worst case. But the result above was obtained on an Intel machine.

On Ryzen:

default 15.596 μs
cython np.ndarray 5.364 μs
cython list-of-list 10.657 μs

It's still a significant speedup regardless.
Implementation: https://github.com/rht/mesa_perf/tree/main/get_cell_list_contents

4 replies

Tortar Jan 18, 2023

@rht, my bad, see the Element chat :(

rht Jan 19, 2023
Author

For reference:

I had a Mesa version installed with my optimization of iter_cell_list_contents xD
that's why I saw only that difference in performance

Actual benchmark result on Ryzen:

default 51.615 μs
cython np.ndarray 5.274 μs
cython list-of-list 10.907 μs

rht Jan 20, 2023
Author

@Tortar can you post the Cython memoryview result here? Is it 40x speedup?

Tortar Jan 24, 2023

These are the results with the memoryview versions included. The speedups are big, but not as big as 40x :-)

default

python grid init 780.232 μs
python get_neighborhood 26.999 μs
python get_cell_list_contents 69.572 μs

timings with the map

cython with map grid init 4.186 μs --> speedup 186.41
cython with map get_neighborhood_mview 1.569 μs --> speedup 17.21
cython with map get_cell_mview_contents 7.382 μs --> speedup 9.42
cython with map get_neighborhood 14.199 μs --> speedup 1.9
cython with map get_cell_list_contents 13.770 μs --> speedup 5.05

timings without the map

cython no map grid init 27.012 μs --> speedup 28.89
cython no map get_neighborhood_mview 1.540 μs --> speedup 17.53
cython no map get_cell_mview_contents 7.381 μs --> speedup 9.43
cython no map get_neighborhood 13.159 μs --> speedup 2.05
cython no map get_cell_list_contents 13.769 μs --> speedup 5.05

These timings are for the two implementations, with and without a map, on a 100x100 grid, with radius=10 in get_neighborhood in the Moore case (which gives less speedup than the von Neumann case) and with a list of roughly 30 positions passed to get_cell_list_contents.
