Supposedly the permutation kernels, even though they are mostly memory-bound, can reduce the number of integer divisions through thread coarsening, and by using a 2D or 3D grid they can avoid doing any division in the kernel itself.
Integer divisions are really expensive, but I don't think they will matter much in a kernel as memory-bound as this. I guess the first thing to do would be some thread coarsening, so that the divisions are amortized, and possibly a 2D or 3D grid, so that you don't have to do the divisions at all and can just read off the individual coordinates from threadIdx and blockIdx.
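To make the suggestion concrete, here is a minimal sketch of the two indexing styles. It is not the repository's actual kernel: the names `permute_flat` and `permute_grid3d`, the `[B, N, H, C]` to `[B, H, N, C]` layout, and the launch parameters are all assumptions chosen for illustration. The first kernel recovers coordinates from a flat index with div/mod; the second reads them from `blockIdx` and uses a strided loop over the innermost dimension as a simple form of thread coarsening.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical baseline: flat 1D index, coordinates recovered with div/mod.
// Computes out[b][h][n][c] = in[b][n][h][c].
__global__ void permute_flat(float* out, const float* in, int B, int N, int H, int C) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= B * H * N * C) return;
    int b = idx / (H * N * C);
    int rest = idx % (H * N * C);
    int h = rest / (N * C);
    rest = rest % (N * C);
    int n = rest / C;
    int c = rest % C;
    out[idx] = in[((b * N + n) * H + h) * C + c];
}

// Hypothetical 3D-grid variant: (n, h, b) come straight from blockIdx, and each
// thread strides over c, so the kernel contains no integer division at all.
// Assumes H and B fit within the grid's y/z dimension limits.
__global__ void permute_grid3d(float* out, const float* in, int N, int H, int C) {
    int n = blockIdx.x, h = blockIdx.y, b = blockIdx.z;
    const float* src = in + ((size_t)(b * N + n) * H + h) * C;
    float* dst = out + ((size_t)(b * H + h) * N + n) * C;
    for (int c = threadIdx.x; c < C; c += blockDim.x) {
        dst[c] = src[c];
    }
}

int main() {
    // Small illustrative sizes, not taken from the issue.
    const int B = 2, N = 5, H = 3, C = 8, total = B * N * H * C;
    const size_t bytes = total * sizeof(float);
    float* h_in = (float*)malloc(bytes);
    float* h_a = (float*)malloc(bytes);
    float* h_b = (float*)malloc(bytes);
    for (int i = 0; i < total; i++) h_in[i] = (float)i;

    float *d_in, *d_a, *d_b;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    permute_flat<<<(total + 255) / 256, 256>>>(d_a, d_in, B, N, H, C);
    permute_grid3d<<<dim3(N, H, B), 32>>>(d_b, d_in, N, H, C);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);

    // Both kernels should produce the same permuted tensor.
    int mismatches = 0;
    for (int i = 0; i < total; i++) {
        if (h_a[i] != h_b[i]) mismatches++;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
    free(h_in); free(h_a); free(h_b);
    return mismatches != 0;
}
```

Whether the 3D-grid version is actually faster would need benchmarking: the kernel is memory-bound, and launching one block per `(n, h, b)` triple changes occupancy and access patterns, so the saved divisions may or may not show up in the timings.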
Creating this issue to track progress on this
This wouldn't necessarily add more lines of code; it would just reorganize where the calculations are done. From a theoretical standpoint this should speed things up, since it reduces the number of calculations by a factor of how many kernels are used.
Looking into this on the advice of @ngc92, quoted above.