Refactor stencil kernels to show mem bandwidth #1968
Can we extract memory bandwidth information from
It's theoretically possible, but I don't think it's necessary, and I'd like to preserve the generated table in this PR. Nsight Compute can report the achieved and theoretical memory bandwidths, but it cannot tell us the number of expected reads and writes per operator, which I think is the best thing to compare against. Otherwise, Nsight Compute may report good memory bandwidth even though we are reading or writing too many variables (e.g., metric terms). We could (and perhaps should) use Nsight Compute to measure the actual number of reads and writes and compare it against the expected number, but I'll leave that as a separate effort.
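To make the distinction concrete, here is a minimal Python sketch with made-up numbers. The function name, array size, timing, and read/write counts are all hypothetical and not taken from this PR. It compares the bandwidth computed from the traffic a kernel actually moves (what a profiler would see) against the bandwidth computed from the reads and writes the operator is expected to need.

```python
def bandwidths(n_elems, bytes_per_elem, seconds, actual_rw, expected_rw):
    """Return (measured, effective) bandwidth in bytes per second.

    measured  : based on the number of arrays the kernel actually touches
    effective : based on the number of arrays the operator needs to touch
    """
    bytes_per_array = n_elems * bytes_per_elem
    measured = bytes_per_array * actual_rw / seconds
    effective = bytes_per_array * expected_rw / seconds
    return measured, effective


# Hypothetical kernel: it needs 1 read + 1 write, but also loads 3 extra arrays.
measured, effective = bandwidths(
    n_elems=1_000_000,
    bytes_per_elem=4,
    seconds=1e-5,
    actual_rw=5,
    expected_rw=2,
)
print(f"measured:  {measured / 1e9:.0f} GB/s")
print(f"effective: {effective / 1e9:.0f} GB/s")
```

With these invented inputs the measured figure is about 2.5 times the effective one, so a profiler-only view would look healthy while more than half of the traffic is avoidable.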
Here is what the new table looks like:
Problem size: (4, 4, 1, 63, 5400), N-reps: 1, Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs │ time per call │ bw % │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none) │ 63 microseconds, 850 nanoseconds │ 31.1511 │ 635.17 │ 2 │
│ (op_GradientF2C!, :SetValue, :SetValue) │ 88 microseconds, 100 nanoseconds │ 22.5765 │ 460.334 │ 2 │
│ (op_GradientC2F!, :SetGradient, :SetGradient) │ 93 microseconds, 379 nanoseconds │ 21.2999 │ 434.305 │ 2 │
│ (op_GradientC2F!, :SetValue, :SetValue) │ 82 microseconds, 939 nanoseconds │ 23.9811 │ 488.974 │ 2 │
│ (op_DivergenceF2C!, :none) │ 130 microseconds, 90 nanoseconds │ 22.9339 │ 467.622 │ 3 │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate) │ 226 microseconds, 530 nanoseconds │ 13.1703 │ 268.542 │ 3 │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence) │ 129 microseconds, 440 nanoseconds │ 23.0491 │ 469.97 │ 3 │
│ (op_InterpolateF2C!, :none) │ 64 microseconds, 370 nanoseconds │ 30.8994 │ 630.039 │ 2 │
│ (op_InterpolateC2F!, :SetValue, :SetValue) │ 64 microseconds, 431 nanoseconds │ 30.8702 │ 629.443 │ 2 │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate) │ 78 microseconds, 590 nanoseconds │ 25.3081 │ 516.033 │ 2 │
│ (op_broadcast_example0!, :none) │ 75 microseconds, 79 nanoseconds │ 39.7374 │ 810.247 │ 3 │
│ (op_broadcast_example1!, :none) │ 134 microseconds, 670 nanoseconds │ 29.5386 │ 602.292 │ 4 │
│ (op_broadcast_example2!, :none) │ 134 microseconds, 689 nanoseconds │ 29.5342 │ 602.202 │ 4 │
│ (op_LeftBiasedC2F!, :SetValue) │ 60 microseconds, 189 nanoseconds │ 33.0453 │ 673.794 │ 2 │
│ (op_LeftBiasedF2C!, :none) │ 61 microseconds, 609 nanoseconds │ 32.2837 │ 658.264 │ 2 │
│ (op_LeftBiasedF2C!, :SetValue) │ 65 microseconds, 521 nanoseconds │ 30.3566 │ 618.971 │ 2 │
│ (op_RightBiasedC2F!, :SetValue) │ 61 microseconds, 349 nanoseconds │ 32.4205 │ 661.054 │ 2 │
│ (op_RightBiasedF2C!, :none) │ 61 microseconds, 130 nanoseconds │ 32.5372 │ 663.433 │ 2 │
│ (op_RightBiasedF2C!, :SetValue) │ 64 microseconds, 160 nanoseconds │ 31.0006 │ 632.102 │ 2 │
│ (op_CurlC2F!, :SetCurl, :SetCurl) │ 99 microseconds, 760 nanoseconds │ -9.96875 │ -203.263 │ -1 │
│ (op_CurlC2F!, :SetValue, :SetValue) │ 142 microseconds, 339 nanoseconds │ -6.98672 │ -142.459 │ -1 │
│ (op_UBPC2F!, :SetValue, :SetValue) │ 142 microseconds, 730 nanoseconds │ -6.96763 │ -142.07 │ -1 │
│ (op_UBPC2F!, :Extrapolate, :Extrapolate) │ 152 microseconds, 138 nanoseconds │ -6.53671 │ -133.284 │ -1 │
│ (op_divO3UBPC2F!, :1SidedO3, :1SidedO3, :SetValue, :SetValue) │ 501 microseconds, 387 nanoseconds │ -1.98347 │ -40.4429 │ -1 │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none) │ 300 microseconds, 917 nanoseconds │ 9.91452 │ 202.157 │ 3 │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence) │ 115 microseconds, 970 nanoseconds │ 25.726 │ 524.554 │ 3 │
│ (op_div_interp_CC!, :SetValue, :SetValue, :none) │ 209 microseconds, 69 nanoseconds │ -4.75672 │ -96.9895 │ -1 │
│ (op_div_interp_FF!, :none, :SetValue, :SetValue) │ 162 microseconds, 898 nanoseconds │ -6.10494 │ -124.48 │ -1 │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate) │ 274 microseconds, 658 nanoseconds │ -3.62082 │ -73.8284 │ -1 │
│ (op_divgrad_uₕ!, :none, :SetValue, :SetValue) │ 252 microseconds, 719 nanoseconds │ -3.93515 │ -80.2377 │ -1 │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
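For anyone checking the `N reads-writes` column, the derived columns can be recomputed from the header line and the timings. The Python sketch below is inferred from the numbers in the table, not taken from the benchmark source, and the function names are my own. It assumes 4 bytes per element for Float32 and divides by 2^30, which is the divisor that reproduces the printed `achieved bw` values.

```python
from math import prod

GIB = 2**30  # assumed divisor; it reproduces the table's "achieved bw" column


def achieved_bw(problem_size, bytes_per_elem, n_reads_writes, seconds):
    """Bytes moved for n_reads_writes full arrays, per second, divided by 2^30."""
    n_elems = prod(problem_size)
    return n_elems * bytes_per_elem * n_reads_writes / GIB / seconds


def bw_percent(achieved, device_bandwidth):
    """Achieved bandwidth as a percentage of the device bandwidth."""
    return 100 * achieved / device_bandwidth


size = (4, 4, 1, 63, 5400)  # "Problem size" from the header line
device_bw = 2039            # "Device_bandwidth_GBs" from the header line

# (op_GradientF2C!, :none): 63 microseconds, 850 nanoseconds, 2 reads-writes
bw = achieved_bw(size, 4, 2, 63.850e-6)
pct = bw_percent(bw, device_bw)

# (op_CurlC2F!, :SetCurl, :SetCurl): 99 microseconds, 760 nanoseconds, -1
curl_bw = achieved_bw(size, 4, -1, 99.760e-6)

print(bw, pct, curl_bw)
```

The results agree with the table to within the precision of the printed timings. The negative rows follow from the same formula with `N reads-writes = -1`, which suggests that -1 marks kernels whose count has not been filled in yet, though that is my reading rather than something stated above.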
I'm going to shorten some names so that the table isn't quite so wide. Done.
This PR refactors the stencil kernels to print the memory bandwidth (and efficiency) of each stencil. I've also split up the keys for the column and sphere cases, so that less output is duplicated.
@dennisYatunin and @sriharshakandala, can you please carefully review `n_reads_writes` for each of the kernels?