shared memory accumulation #298

Answered by daedalus5
toaster-robotics asked this question in Q&A
import warp as wp
import numpy as np

device = "cuda:0"

snippet = """
    __shared__ int sum[256];

    int index = j * 16 + i;

    sum[index] = arr[index];
    __syncthreads();

    for (int stride = 128; stride > 0; stride >>= 1) {
        if (index < stride) {
            sum[index] += sum[index + stride];
        }
        __syncthreads();
    }

    if (index == 0) {
        out[0] = sum[0];
    }
    """

# Bind the CUDA snippet as a native Warp function; the Python definition
# only supplies the type signature, the body is the snippet above.
@wp.func_native(snippet)
def reduce(arr: wp.array2d(dtype=int), out: wp.array(dtype=int), i: int, j: int):
    ...

@wp.kernel
def reduce_kernel(arr: wp.array2d(dtype=int), out: wp.array(dtype=int)):
    i, j = wp.tid()  # 2D thread index over the launch grid
    reduce(arr, out, i, j)

N = 16
row = np.arange(N, dtype=…
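
The driver code is cut off at the end of the question. A minimal sketch of how it might continue, assuming the 16x16 input is built by tiling `row` and the kernel is launched over an (N, N) grid; the array construction, dtype, and verification step are assumptions, only `N` and `row` appear in the original:

# Hypothetical completion -- the original is truncated after "row = np.arange(N, dtype=".
# (On older Warp versions, wp.init() may be required before creating arrays.)
row = np.arange(N, dtype=np.int32)                              # assumed dtype
arr = wp.array(np.tile(row, (N, 1)), dtype=int, device=device)  # assumed 16x16 input
out = wp.zeros(1, dtype=int, device=device)

# dim=(N, N) flattens to 256 threads; with Warp's default block size of 256 they
# all land in one CUDA block, which the snippet's __syncthreads() requires.
wp.launch(reduce_kernel, dim=(N, N), inputs=[arr, out], device=device)

print(out.numpy()[0])                   # sum computed by the shared-memory reduction
print(int(np.tile(row, (N, 1)).sum()))  # NumPy reference for comparison

However wp.tid() linearizes (i, j), `index = j * 16 + i` maps the 256 threads one-to-one onto [0, 256), so each element is read exactly once and the final sum is unaffected.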

Replies: 1 comment, 1 reply

Answer selected by shi-eric