Add FieldVector microbenchmarks and improve fieldvector broadcast performance #2070

charleskawczynski · 2024-11-01T19:17:23Z

Closes #2067. This turned out to be pretty easy: any fieldvector operation is embarrassingly parallel. We can leverage NonExtrudedBroadcasted, forward everything to the backing arrays, and just linearly index everywhere. This puts us in pretty good shape for fieldvector operations, and it'll be even better at lower resolution since we parallelize across field variables.
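For readers unfamiliar with the idea, here is a rough, language-agnostic picture of "one linear index across all field variables", written in Python rather than Julia. It is only a sketch under stated assumptions: `per_field_axpy`, `flat_axpy`, and the dict-of-lists layout are hypothetical names invented for illustration and are not ClimaCore's API or its actual implementation.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical sketch: a "field vector" is modelled as a dict that maps a
# field name to a flat list (its backing array). Not ClimaCore's data layout.


def per_field_axpy(y_fields, a, x_fields):
    """y += a * x, one separate loop per field variable."""
    for name, y in y_fields.items():
        x = x_fields[name]
        for j in range(len(y)):
            y[j] += a * x[j]


def flat_axpy(y_fields, a, x_fields):
    """y += a * x, driven by a single linear index over all fields."""
    names = list(y_fields)
    # ends[k] is the linear index one past the last element of field k.
    ends = list(accumulate(len(y_fields[n]) for n in names))
    total = ends[-1] if ends else 0
    # Each iteration touches exactly one element and is independent of the
    # others, which is what makes the operation embarrassingly parallel.
    for i in range(total):
        k = bisect_right(ends, i)          # which field owns index i
        start = ends[k - 1] if k > 0 else 0
        j = i - start                      # position inside that field
        name = names[k]
        y_fields[name][j] += a * x_fields[name][j]


y1 = {"u": [1.0, 2.0, 3.0], "v": [10.0, 20.0]}
y2 = {"u": [1.0, 2.0, 3.0], "v": [10.0, 20.0]}
x = {"u": [1.0, 1.0, 1.0], "v": [2.0, 2.0]}

per_field_axpy(y1, 2.0, x)
flat_axpy(y2, 2.0, x)
```

Both functions produce the same result; the difference is that the flat version exposes one index space covering every field, so small fields no longer need their own loop or kernel launch.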

GPU

main

N reads-writes: 5,  Float_type = Float64, Device_bandwidth_GBs=2039
┌─────────────┬───────────────────────────────────┬─────────┬─────────────┬──────────────┬────────┐
│ funcs       │ time per call                     │ bw %    │ achieved bw │ problem size │ n-reps │
├─────────────┼───────────────────────────────────┼─────────┼─────────────┼──────────────┼────────┤
│ FieldVector │ 1 millisecond, 104 microseconds   │ 54.161  │ 1104.34     │ (32745600,)  │ 4475   │
│ FieldVector │ 307 microseconds, 597 nanoseconds │ 48.6243 │ 991.45      │ (8186400,)   │ 10000  │
└─────────────┴───────────────────────────────────┴─────────┴─────────────┴──────────────┴────────┘

This PR

N reads-writes: 5,  Float_type = Float64, Device_bandwidth_GBs=2039
┌─────────────┬───────────────────────────────────┬─────────┬─────────────┬──────────────┬────────┐
│ funcs       │ time per call                     │ bw %    │ achieved bw │ problem size │ n-reps │
├─────────────┼───────────────────────────────────┼─────────┼─────────────┼──────────────┼────────┤
│ FieldVector │ 757 microseconds, 513 nanoseconds │ 78.9779 │ 1610.36     │ (32745600,)  │ 6491   │
│ FieldVector │ 202 microseconds, 849 nanoseconds │ 73.7335 │ 1503.43     │ (8186400,)   │ 10000  │
└─────────────┴───────────────────────────────────┴─────────┴─────────────┴──────────────┴────────┘

CPU

main

N reads-writes: 5,  Float_type = Float64,
┌─────────────┬──────────────────────────────────┬─────────────┬──────────────┬────────┐
│ funcs       │ time per call                    │ achieved bw │ problem size │ n-reps │
├─────────────┼──────────────────────────────────┼─────────────┼──────────────┼────────┤
│ FieldVector │ 32 milliseconds, 45 microseconds │ 38.0674     │ (32745600,)  │ 33     │
│ FieldVector │ 7 milliseconds, 792 microseconds │ 39.1372     │ (8186400,)   │ 529    │
└─────────────┴──────────────────────────────────┴─────────────┴──────────────┴────────┘

This PR

N reads-writes: 5,  Float_type = Float64,
┌─────────────┬───────────────────────────────────┬─────────────┬──────────────┬────────┐
│ funcs       │ time per call                     │ achieved bw │ problem size │ n-reps │
├─────────────┼───────────────────────────────────┼─────────────┼──────────────┼────────┤
│ FieldVector │ 23 milliseconds, 218 microseconds │ 52.5398     │ (32745600,)  │ 43     │
│ FieldVector │ 5 milliseconds, 752 microseconds  │ 53.0143     │ (8186400,)   │ 828    │
└─────────────┴───────────────────────────────────┴─────────────┴──────────────┴────────┘
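As a sanity check on how the table columns relate, the numbers above appear consistent with achieved bandwidth = (N reads-writes × problem size × bytes per element) / time, and bw % = achieved bandwidth / device bandwidth. The short Python sketch below reproduces the large GPU case. The formula and the division by 2^30 are inferred from the reported figures, not stated in the PR, and `achieved_bandwidth` is a hypothetical helper name.

```python
GIB = 2**30  # assumption: the reported figures match bytes / 2**30 per second


def achieved_bandwidth(n_reads_writes, n_elements, bytes_per_element, seconds):
    """Bytes moved per second, scaled by 2**30 (inferred, not from the PR)."""
    return n_reads_writes * n_elements * bytes_per_element / seconds / GIB


n_rw = 5                   # "N reads-writes: 5"
n_elements = 32_745_600    # problem size of the large case
float64_bytes = 8          # Float_type = Float64
device_bw = 2039           # Device_bandwidth_GBs from the GPU tables

# Large GPU case, this PR: 757 microseconds, 513 nanoseconds per call.
t_pr = 757.513e-6
bw_pr = achieved_bandwidth(n_rw, n_elements, float64_bytes, t_pr)
bw_pct_pr = 100 * bw_pr / device_bw

# Speedups for the large case, taken directly from the reported times.
gpu_speedup = 1104e-6 / 757.513e-6      # main vs. this PR on GPU
cpu_speedup = 32.045e-3 / 23.218e-3     # main vs. this PR on CPU

print(f"achieved bw: {bw_pr:.2f}, bw %: {bw_pct_pr:.2f}")
print(f"GPU speedup: {gpu_speedup:.2f}x, CPU speedup: {cpu_speedup:.2f}x")
```

With these inputs the computed bandwidth lands close to the 1610.36 and 78.98% reported in the table, and the large case comes out roughly 1.46x faster on GPU and 1.38x faster on CPU.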

src/Fields/fieldvector.jl

charleskawczynski · 2024-11-04T19:52:32Z

Interesting, this shows that ClimaLand is using fieldvectors in some untested edge case. I'm going to find out the difference and add some unit tests that exercise this.

charleskawczynski · 2024-11-05T19:52:07Z

This looks done! 🎉

charleskawczynski · 2024-11-05T19:54:32Z

Just to comment on why this took a bit longer to iron out: the land model seems to rely on recursively defined fieldvectors, which we support copyto! for. Doing this correctly is a bit tricky because we need to dispatch to a custom (and device-specific) method for better performance. All that said, I'm happy with the result.

charleskawczynski force-pushed the ck/fieldvector_benchmarks branch from 7a66cf4 to f1e9282 Compare November 2, 2024 18:57

charleskawczynski requested review from dennisYatunin and Sbozzolo and removed request for dennisYatunin November 2, 2024 19:01

charleskawczynski changed the title ~~Add FieldVector microbenchmarks~~ Add FieldVector microbenchmarks and improve fieldvector broadcast performance Nov 2, 2024

charleskawczynski commented Nov 2, 2024

src/Fields/fieldvector.jl (outdated, resolved)

charleskawczynski force-pushed the ck/fieldvector_benchmarks branch from 9087954 to 9e49241 Compare November 2, 2024 19:25

Sbozzolo approved these changes Nov 4, 2024

src/Fields/fieldvector.jl (outdated, resolved)

charleskawczynski mentioned this pull request Nov 4, 2024

Add fieldvector unit tests, update rcompare #2072

Merged

charleskawczynski force-pushed the ck/fieldvector_benchmarks branch 3 times, most recently from da45803 to e0dec11 Compare November 5, 2024 19:50

FieldVectors, add benchmarks, improve perf

1c512be

charleskawczynski force-pushed the ck/fieldvector_benchmarks branch from e0dec11 to 1c512be Compare November 5, 2024 21:13

charleskawczynski merged commit a093d4a into main Nov 6, 2024
32 of 33 checks passed

charleskawczynski deleted the ck/fieldvector_benchmarks branch November 6, 2024 19:34
