Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix uncoalesced memory reads #1910

Open
1 task done
Tracked by #2632
charleskawczynski opened this issue Jul 31, 2024 · 2 comments · May be fixed by #1950
Open
1 task done
Tracked by #2632

Fix uncoalesced memory reads #1910

charleskawczynski opened this issue Jul 31, 2024 · 2 comments · May be fixed by #1950
Assignees
Labels
enhancement New feature or request performance

Comments

@charleskawczynski
Copy link
Member

charleskawczynski commented Jul 31, 2024

Broadcast operations on our most simple expressions result in uncoalesced memory reads for VIJFH datalayouts, which was observed via nsight compute.

Issues are being analyzed/outlined in the following benchmark scripts:

  • benchmarks/scripts/index_swapping.jl
  • benchmarks/scripts/indexing_and_static_ndranges.jl
  • benchmarks/scripts/thermo_bench_bw.jl
  • benchmarks/scripts/benchmark_offset.jl

Tasks

  1. Performance monitoring 🔍🚀
@charleskawczynski
Copy link
Member Author

charleskawczynski commented Aug 15, 2024

This is an attempt to summarize the information gathered from benchmarks, and experimental development branches. Here is a high-level summary of the numbers. The bw column is the percentage of peak bandwidth reached for Float32 and Float64 data on NVIDIA A100s. The details of the kernels are discussed below.

indexing_and_static_ndranges.jl

@. y1 = foo(x1, x2, x3); foo(x1, x2, x3) = zero(x1) # kernel (0 reads, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                                     │ bw% F32 │ bw% F64 │
├───────────────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                        │ 14.4882 │ 28.8217 │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                      │ 72.1366 │ 70.4848 │
│ BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm)        │ 76.943  │ 78.1221 │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)            │ 76.9247 │ 78.1975 │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)│ 37.3141 │ 73.3654 │ # ~ClimaCore kernels
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm) │ 76.9613 │ 78.1095 │ K1
└───────────────────────────────────────────────────────────────────────────┴─────────┴─────────┘

thermo_bench_bw.jl

@. ρ_write = ρ_read # singlefield_bc! kernel (1 read, 1 write)
@. ts = PhaseEquil(ρ,p,e_int,q_tot,T) # thermo_func_bc! kernel (5 reads, 5 writes)
[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                          │ bw% F32 │ bw% F64 │
├────────────────────────────────────────────────┼─────────┼─────────┤
│ TBB.singlefield_bc!(x_soa, us; nreps=100, bm)  │ 29.4429 │ 36.5653 │
│ TBB.singlefield_bc!(x_aos, us; nreps=100, bm)  │ 28.5556 │ 32.1501 │
│ TBB.thermo_func_bc!(x, us; nreps=100, bm)      │ 12.4798 │ 19.0568 │ # ==ClimaCore kernels
│ TBB.thermo_func_sol!(x_vec, us; nreps=100, bm) │ 75.873  │ 77.477  │
└────────────────────────────────────────────────┴─────────┴─────────┘

Offset benchmark

Y[I1] = add3(X[I1], X[I2], X[I3]); add3(x1, x2, x3) = x1 + x2 + x3 # kernel (3 reads, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                          │ bw% F32 │ bw% F64 │
├────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BO.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 57.6793 │ 57.7908 │
│ BO.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 68.489  │ 68.4046 │ K2
│ BO.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 70.2858 │ 70.3113 │
│ BO.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 59.1188 │ 59.2089 │
└────────────────────────────────────────────────────────────────┴─────────┴─────────┘

index_swapping.jl

@. y1 = foo(x1, x2, x3); foo(x1, x2, x3) = x1 # kernel (1 read, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                                │ bw% F32 │ bw% F64 │
├──────────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 57.4574 │ 66.791  │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 32.939  │ 62.905  │ K4
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 29.2034 │ 49.4142 │ K3 # pointwise kernels pattern
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 32.9329 │ 62.9142 │    # stencil kernels pattern
└──────────────────────────────────────────────────────────────────────┴─────────┴─────────┘

Discussion / summary

  • One conclusion we learned early on from indexing_and_static_ndranges.jl was that statically computed offsets, or avoiding dynamic offsets via launch configurations, were necessary to reach peak bandwidth. We've since added Nv and Nh into the type space and now the size of our data is fully type-known. Therefore, we no longer need to care about the results of dynamic offsets, since we can trivially make them all static. So, I've removed the dynamic results from the indexing_and_static_ndranges.jl table for simplicity.

  • Our memory access is currently uncoalesced for many pointwise broadcast expressions.

  • K1 suggests that linearly indexing into a multidimensional array (MD) is all we need for maximum performance. However, K1 has no offsets, since all fields in those benchmarks were scalars.

  • Looking at K2, it seems to also suggest that we can reach near peak mem bandwidth. However, if we change the kernel in K2 to be: add3(x1, x2, x3) = x1, then our bandwidth efficiency drops to [ 48% (F32), 68% (F64)]. So, it appears that unused variables in expressions with linear indexing + offsets see performance degradation.

  • Index swapping has a big performance penalty for F32, but not so much for F64 (K4), and removing index swapping is not sufficient to reach maximum performance for F32.

  • I think that we can conclude, from the above bullet points, that linear indexing + offsets, alone, will not improve our kernels, since offsets for field variables in the VIJFH datalayouts have non-constant strides (i.e., strides are different before and after the fields dimension).

  • Therefore, it seems that we have two possible solutions to reach maximum performance for pointwise kernels:

    • S1) Remove the field dimension, and store tuples of fields (this was already tested and showed good performance, as discussed here)
    • S2) Swap the field and horizontal element dimension
  • Comparing S1) and S2)

    • S1) will likely be a breaking change, since users may index into the parent array. However, we may be able to make this somewhat less breaking if we make the replacement data structure array-like.
    • S1) MPI does not support isend/irecv with tuples, so we'll need to copy this data into a MD array before communicating data between processors.
    • S2) This will likely also be breaking in that we'll need to change the names of many of our datalayouts, but internal use may not be quite as breaking, since we'll just need to permute indexing.
    • S2) We may still want to copy data into a buffer array for MPI, to reduce strided data access.

@charleskawczynski
Copy link
Member Author

I need to take a look at a new nsight report. But this may have been fixed by #1969.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant