Fix uncoalesced memory reads #1910

charleskawczynski · 2024-07-31T18:52:46Z

Broadcast operations on our most simple expressions result in uncoalesced memory reads for VIJFH datalayouts, which was observed via nsight compute.

Issues are being analyzed/outlined in the following benchmark scripts:

benchmarks/scripts/index_swapping.jl
benchmarks/scripts/indexing_and_static_ndranges.jl
benchmarks/scripts/thermo_bench_bw.jl
benchmarks/scripts/benchmark_offset.jl

Tasks

Give feedback

Add offset benchmark #1926

Performance monitoring 🔍🚀
Options

The text was updated successfully, but these errors were encountered:

charleskawczynski · 2024-08-15T13:27:48Z

This is an attempt to summarize the information gathered from benchmarks, and experimental development branches. Here is a high-level summary of the numbers. The bw column is the percentage of peak bandwidth reached for Float32 and Float64 data on NVIDIA A100s. The details of the kernels are discussed below.

indexing_and_static_ndranges.jl

@. y1 = foo(x1, x2, x3); foo(x1, x2, x3) = zero(x1) # kernel (0 reads, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                                     │ bw% F32 │ bw% F64 │
├───────────────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BSR.at_dot_call!(X_array, Y_array; nreps=1000, bm)                        │ 14.4882 │ 28.8217 │
│ BSR.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                      │ 72.1366 │ 70.4848 │
│ BSR.custom_sol_kernel!(X_vector, Y_vector, Val(N); nreps=1000, bm)        │ 76.943  │ 78.1221 │
│ BSR.custom_kernel_bc!(X_vector, Y_vector, uss; nreps=1000, bm)            │ 76.9247 │ 78.1975 │
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=false, nreps=1000, bm)│ 37.3141 │ 73.3654 │ # ~ClimaCore kernels
│ BSR.custom_kernel_bc!(X_array, Y_array, uss; use_pw=true, nreps=1000, bm) │ 76.9613 │ 78.1095 │ K1
└───────────────────────────────────────────────────────────────────────────┴─────────┴─────────┘

thermo_bench_bw.jl

@. ρ_write = ρ_read # singlefield_bc! kernel (1 read, 1 write)
@. ts = PhaseEquil(ρ,p,e_int,q_tot,T) # thermo_func_bc! kernel (5 reads, 5 writes)
[ Info: device = ClimaComms.CUDADevice()
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                          │ bw% F32 │ bw% F64 │
├────────────────────────────────────────────────┼─────────┼─────────┤
│ TBB.singlefield_bc!(x_soa, us; nreps=100, bm)  │ 29.4429 │ 36.5653 │
│ TBB.singlefield_bc!(x_aos, us; nreps=100, bm)  │ 28.5556 │ 32.1501 │
│ TBB.thermo_func_bc!(x, us; nreps=100, bm)      │ 12.4798 │ 19.0568 │ # ==ClimaCore kernels
│ TBB.thermo_func_sol!(x_vec, us; nreps=100, bm) │ 75.873  │ 77.477  │
└────────────────────────────────────────────────┴─────────┴─────────┘

Offset benchmark

Y[I1] = add3(X[I1], X[I2], X[I3]); add3(x1, x2, x3) = x1 + x2 + x3 # kernel (3 reads, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                          │ bw% F32 │ bw% F64 │
├────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BO.aos_cart_offset!(X_aos_ref, Y_aos_ref, us; bm, nreps = 100) │ 57.6793 │ 57.7908 │
│ BO.aos_lin_offset!(X_aos, Y_aos, us; bm, nreps = 100)          │ 68.489  │ 68.4046 │ K2
│ BO.soa_linear_index!(X_soa, Y_soa, us; bm, nreps = 100)        │ 70.2858 │ 70.3113 │
│ BO.soa_cart_index!(X_soa, Y_soa, us; bm, nreps = 100)          │ 59.1188 │ 59.2089 │
└────────────────────────────────────────────────────────────────┴─────────┴─────────┘

index_swapping.jl

@. y1 = foo(x1, x2, x3); foo(x1, x2, x3) = x1 # kernel (1 read, 1 write)
[ Info: ArrayType = CuArray
Problem size: (63, 4, 4, 1, 5400), device_bandwidth_GBs=2039
┌──────────────────────────────────────────────────────────────────────┬─────────┬─────────┐
│ funcs                                                                │ bw% F32 │ bw% F64 │
├──────────────────────────────────────────────────────────────────────┼─────────┼─────────┤
│ BIS.at_dot_call!(X_vector, Y_vector; nreps=1000, bm)                 │ 57.4574 │ 66.791  │
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=0, nreps=1000, bm) │ 32.939  │ 62.905  │ K4
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=1, nreps=1000, bm) │ 29.2034 │ 49.4142 │ K3 # pointwise kernels pattern
│ BIS.custom_kernel_bc!(X_array, Y_array, uss; swap=2, nreps=1000, bm) │ 32.9329 │ 62.9142 │    # stencil kernels pattern
└──────────────────────────────────────────────────────────────────────┴─────────┴─────────┘

Discussion / summary

One conclusion we learned early on from indexing_and_static_ndranges.jl was that statically computed offsets, or avoiding dynamic offsets via launch configurations, were necessary to reach peak bandwidth. We've since added Nv and Nh into the type space and now the size of our data is fully type-known. Therefore, we no longer need to care about the results of dynamic offsets, since we can trivially make them all static. So, I've removed the dynamic results from the indexing_and_static_ndranges.jl table for simplicity.
Our memory access is currently uncoalesced for many pointwise broadcast expressions.
K1 suggests that linearly indexing into a multidimensional array (MD) is all we need for maximum performance. However, K1 has no offsets, since all fields in those benchmarks were scalars.
Looking at K2, it seems to also suggest that we can reach near peak mem bandwidth. However, if we change the kernel in K2 to be: add3(x1, x2, x3) = x1, then our bandwidth efficiency drops to [ 48% (F32), 68% (F64)]. So, it appears that unused variables in expressions with linear indexing + offsets see performance degradation.
Index swapping has a big performance penalty for F32, but not so much for F64 (K4), and removing index swapping is not sufficient to reach maximum performance for F32.
I think that we can conclude, from the above bullet points, that linear indexing + offsets, alone, will not improve our kernels, since offsets for field variables in the VIJFH datalayouts have non-constant strides (i.e., strides are different before and after the fields dimension).
Therefore, it seems that we have two possible solutions to reach maximum performance for pointwise kernels:
- S1) Remove the field dimension, and store tuples of fields (this was already tested and showed good performance, as discussed here)
- S2) Swap the field and horizontal element dimension
Comparing S1) and S2)
- S1) will likely be a breaking change, since users may index into the parent array. However, we may be able to make this somewhat less breaking if we make the replacement data structure array-like.
- S1) MPI does not support isend/irecv with tuples, so we'll need to copy this data into a MD array before communicating data between processors.
- S2) This will likely also be breaking in that we'll need to change the names of many of our datalayouts, but internal use may not be quite as breaking, since we'll just need to permute indexing.
- S2) We may still want to copy data into a buffer array for MPI, to reduce strided data access.

charleskawczynski · 2024-09-09T20:24:16Z

I need to take a look at a new nsight report. But this may have been fixed by #1969.

charleskawczynski added enhancement New feature or request performance labels Jul 31, 2024

charleskawczynski mentioned this issue Jul 31, 2024

Performance roadmap CliMA/ClimaAtmos.jl#2632

Open

charleskawczynski self-assigned this Jul 31, 2024

This was referenced Aug 9, 2024

Add thermo benchmark bandwidth script #1917

Merged

Fuse some similar pointwise broadcast expressions CliMA/ClimaAtmos.jl#2983

Draft

Add nsight report to more jobs CliMA/ClimaAtmos.jl#2936

Closed

This was referenced Aug 15, 2024

Implement and extend data-specific cartesian index #1902

Open

Forward fill! to parent #1888

Closed

This was referenced Aug 23, 2024

Drop field dimension, support linear indexing and empty fields #1953

Draft

Add field-last benchmark script #1950

Open

Refactor DSS to better leverage DataLayouts #1958

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix uncoalesced memory reads #1910

Fix uncoalesced memory reads #1910

charleskawczynski commented Jul 31, 2024 •

edited

Loading

Tasks

charleskawczynski commented Aug 15, 2024 •

edited

Loading

charleskawczynski commented Sep 9, 2024

Fix uncoalesced memory reads #1910

Fix uncoalesced memory reads #1910

Comments

charleskawczynski commented Jul 31, 2024 • edited Loading

Tasks

charleskawczynski commented Aug 15, 2024 • edited Loading

indexing_and_static_ndranges.jl

thermo_bench_bw.jl

Offset benchmark

index_swapping.jl

Discussion / summary

charleskawczynski commented Sep 9, 2024

charleskawczynski commented Jul 31, 2024 •

edited

Loading

charleskawczynski commented Aug 15, 2024 •

edited

Loading