Refactor stencil kernels to show mem bandwidth #1968

Merged: 1 commit, Sep 3, 2024


charleskawczynski
Member

This PR refactors the stencil kernels to print the memory bandwidth (and efficiency) for the stencils. I've also split up the keys for the column and sphere cases, so that there's less duplicate printing.

@dennisYatunin and @sriharshakandala, can you please carefully review n_reads_writes for each of the kernels?

@sriharshakandala
Member

Can we extract memory bandwidth information from Nsight Compute?

@charleskawczynski
Member Author

charleskawczynski commented Sep 3, 2024

> Can we extract memory bandwidth information from Nsight Compute?

It's theoretically possible, but I don't think it's necessary and I'd like to preserve the generated table in this PR.

Also, Nsight Compute can report the achieved and theoretical memory bandwidths, but it cannot tell us the expected number of reads and writes per operator, which I think is the best baseline to compare against. Otherwise, Nsight Compute may report good memory bandwidth even though the kernel reads or writes more variables than necessary (e.g., metric terms).

We could (and perhaps should) use Nsight Compute to measure the actual number of reads and writes and compare it against the expected count, but I'll leave that as a separate effort.
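
To illustrate the point about comparing against expected reads and writes, here is a minimal Python sketch. It is not code from this PR: the function name and all numbers are hypothetical, and it assumes every array has the same size, so traffic scales with the number of arrays touched.

```python
def useful_bandwidth_fraction(profiler_fraction, expected_arrays, actual_arrays):
    """Fraction of peak bandwidth spent on traffic the operator actually needs.

    Hypothetical helper: assumes all arrays are the same size, so memory
    traffic is proportional to the number of arrays read or written.
    """
    if expected_arrays < 0 or actual_arrays <= 0:
        raise ValueError("expected_arrays must be >= 0 and actual_arrays > 0")
    return profiler_fraction * expected_arrays / actual_arrays


# Made-up numbers: the profiler reports 80% of peak bandwidth, but the kernel
# touches 4 arrays (e.g. extra metric terms) where only 2 are expected.
print(useful_bandwidth_fraction(0.80, expected_arrays=2, actual_arrays=4))
```

In this made-up case the profiler figure looks healthy, yet only about half of that traffic is work the operator needs, which is the gap a raw bandwidth measurement would hide.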

@charleskawczynski
Member Author

charleskawczynski commented Sep 3, 2024

Here is what the new table looks like:

Problem size: (4, 4, 1, 63, 5400), N-reps: 1,  Float_type = Float32, Device_bandwidth_GBs=2039
┌───────────────────────────────────────────────────────────────┬───────────────────────────────────┬──────────┬─────────────┬────────────────┐
│ funcs                                                         │ time per call                     │ bw %     │ achieved bw │ N reads-writes │
├───────────────────────────────────────────────────────────────┼───────────────────────────────────┼──────────┼─────────────┼────────────────┤
│ (op_GradientF2C!, :none)                                      │ 63 microseconds, 850 nanoseconds  │ 31.1511  │ 635.17      │ 2              │
│ (op_GradientF2C!, :SetValue, :SetValue)                       │ 88 microseconds, 100 nanoseconds  │ 22.5765  │ 460.334     │ 2              │
│ (op_GradientC2F!, :SetGradient, :SetGradient)                 │ 93 microseconds, 379 nanoseconds  │ 21.2999  │ 434.305     │ 2              │
│ (op_GradientC2F!, :SetValue, :SetValue)                       │ 82 microseconds, 939 nanoseconds  │ 23.9811  │ 488.974     │ 2              │
│ (op_DivergenceF2C!, :none)                                    │ 130 microseconds, 90 nanoseconds  │ 22.9339  │ 467.622     │ 3              │
│ (op_DivergenceF2C!, :Extrapolate, :Extrapolate)               │ 226 microseconds, 530 nanoseconds │ 13.1703  │ 268.542     │ 3              │
│ (op_DivergenceC2F!, :SetDivergence, :SetDivergence)           │ 129 microseconds, 440 nanoseconds │ 23.0491  │ 469.97      │ 3              │
│ (op_InterpolateF2C!, :none)                                   │ 64 microseconds, 370 nanoseconds  │ 30.8994  │ 630.039     │ 2              │
│ (op_InterpolateC2F!, :SetValue, :SetValue)                    │ 64 microseconds, 431 nanoseconds  │ 30.8702  │ 629.443     │ 2              │
│ (op_InterpolateC2F!, :Extrapolate, :Extrapolate)              │ 78 microseconds, 590 nanoseconds  │ 25.3081  │ 516.033     │ 2              │
│ (op_broadcast_example0!, :none)                               │ 75 microseconds, 79 nanoseconds   │ 39.7374  │ 810.247     │ 3              │
│ (op_broadcast_example1!, :none)                               │ 134 microseconds, 670 nanoseconds │ 29.5386  │ 602.292     │ 4              │
│ (op_broadcast_example2!, :none)                               │ 134 microseconds, 689 nanoseconds │ 29.5342  │ 602.202     │ 4              │
│ (op_LeftBiasedC2F!, :SetValue)                                │ 60 microseconds, 189 nanoseconds  │ 33.0453  │ 673.794     │ 2              │
│ (op_LeftBiasedF2C!, :none)                                    │ 61 microseconds, 609 nanoseconds  │ 32.2837  │ 658.264     │ 2              │
│ (op_LeftBiasedF2C!, :SetValue)                                │ 65 microseconds, 521 nanoseconds  │ 30.3566  │ 618.971     │ 2              │
│ (op_RightBiasedC2F!, :SetValue)                               │ 61 microseconds, 349 nanoseconds  │ 32.4205  │ 661.054     │ 2              │
│ (op_RightBiasedF2C!, :none)                                   │ 61 microseconds, 130 nanoseconds  │ 32.5372  │ 663.433     │ 2              │
│ (op_RightBiasedF2C!, :SetValue)                               │ 64 microseconds, 160 nanoseconds  │ 31.0006  │ 632.102     │ 2              │
│ (op_CurlC2F!, :SetCurl, :SetCurl)                             │ 99 microseconds, 760 nanoseconds  │ -9.96875 │ -203.263    │ -1             │
│ (op_CurlC2F!, :SetValue, :SetValue)                           │ 142 microseconds, 339 nanoseconds │ -6.98672 │ -142.459    │ -1             │
│ (op_UBPC2F!, :SetValue, :SetValue)                            │ 142 microseconds, 730 nanoseconds │ -6.96763 │ -142.07     │ -1             │
│ (op_UBPC2F!, :Extrapolate, :Extrapolate)                      │ 152 microseconds, 138 nanoseconds │ -6.53671 │ -133.284    │ -1             │
│ (op_divO3UBPC2F!, :1SidedO3, :1SidedO3, :SetValue, :SetValue) │ 501 microseconds, 387 nanoseconds │ -1.98347 │ -40.4429    │ -1             │
│ (op_divgrad_CC!, :SetValue, :SetValue, :none)                 │ 300 microseconds, 917 nanoseconds │ 9.91452  │ 202.157     │ 3              │
│ (op_divgrad_FF!, :none, :SetDivergence, :SetDivergence)       │ 115 microseconds, 970 nanoseconds │ 25.726   │ 524.554     │ 3              │
│ (op_div_interp_CC!, :SetValue, :SetValue, :none)              │ 209 microseconds, 69 nanoseconds  │ -4.75672 │ -96.9895    │ -1             │
│ (op_div_interp_FF!, :none, :SetValue, :SetValue)              │ 162 microseconds, 898 nanoseconds │ -6.10494 │ -124.48     │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :Extrapolate)              │ 274 microseconds, 658 nanoseconds │ -3.62082 │ -73.8284    │ -1             │
│ (op_divgrad_uₕ!, :none, :SetValue, :SetValue)                 │ 252 microseconds, 719 nanoseconds │ -3.93515 │ -80.2377    │ -1             │
└───────────────────────────────────────────────────────────────┴───────────────────────────────────┴──────────┴─────────────┴────────────────┘
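
The three numeric columns appear to be related as in the Python sketch below. This is inferred from the printed numbers, not taken from the PR's Julia code: the function names are hypothetical, and dividing by 2^30 bytes is an assumption that happens to reproduce the table values.

```python
import math

BYTES_PER_GIB = 2**30  # assumption: this divisor reproduces the table values


def achieved_bandwidth(problem_size, bytes_per_element, n_reads_writes, time_s):
    """Data moved per second, assuming each read or write touches one full array."""
    n_elements = math.prod(problem_size)
    return n_reads_writes * n_elements * bytes_per_element / BYTES_PER_GIB / time_s


def bandwidth_percent(achieved, device_bandwidth):
    """Achieved bandwidth as a percentage of the device's peak bandwidth."""
    return 100 * achieved / device_bandwidth


# First table row: op_GradientF2C! with :none, Float32 (4 bytes), 2 reads-writes.
size = (4, 4, 1, 63, 5400)
time_s = 63e-6 + 850e-9  # 63 microseconds, 850 nanoseconds
bw = achieved_bandwidth(size, 4, 2, time_s)
print(bw, bandwidth_percent(bw, 2039))
```

For the first row this lands within about 0.01 of the printed 635.17 and 31.1511; the small residual presumably comes from the displayed time being rounded. With `n_reads_writes = -1` both values come out negative, which matches the negative rows. Those -1 entries look like placeholders for kernels whose counts have not been filled in yet, though the thread does not say so.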

@charleskawczynski charleskawczynski force-pushed the ck/refactor_stencil_bm branch 2 times, most recently from be2be3f to a9bec4a Compare September 3, 2024 17:47
@charleskawczynski
Member Author

charleskawczynski commented Sep 3, 2024

I'm going to shorten some names so that the table isn't quite so wide.

Done.

@charleskawczynski charleskawczynski merged commit f36251c into main Sep 3, 2024
22 of 23 checks passed
@charleskawczynski charleskawczynski deleted the ck/refactor_stencil_bm branch September 3, 2024 20:54