RMA WG 11 30 2020

Md. Wasi-ur Rahman, Manjunath Gorentla Venkata, Min Si, David Ozog, Clarete Riana Crasta, Naveen Rav

New API for supporting GPU communication:
- Add a multithreading-resource argument to the existing RMA API so that the runtime can parallelize large PUT/GET operations across available threads.
- Min: would this API also benefit CPU-oriented OpenSHMEM? E.g., at the FUNNELED or SERIALIZED thread level, only a single thread calls SHMEM at a time, so the other threads may be idle and could be used by SHMEM (e.g., for packing strided data).
  - Manju: most SHMEM users don't complain about the performance of strided PUT/GET, so current OpenSHMEM implementations (OSHMEM, Cray SHMEM) do no internal packing for strided RMA; DOE users, however, may want such optimizations, e.g., for halo exchange.
  - Manju: should the program just use the MULTIPLE thread level in order to utilize all idle threads?
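
For reference, the packing idea raised above can be sketched as follows. This is only an illustration under stated assumptions, not something proposed or agreed in the meeting: `pack_strided` is a hypothetical helper (not an OpenSHMEM routine), OpenMP stands in for "available threads", and the `nthreads` parameter mimics the multithreading-resource input under discussion. The helper gathers every `sst`-th element of a strided source into a contiguous staging buffer, which a runtime could then hand to a single contiguous PUT.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of OpenSHMEM): gather nelems elements
 * from src, taking every sst-th element, into the contiguous buffer dst.
 * The iterations are independent, so they can be split across threads.
 * Build with -fopenmp to run in parallel; without it the pragma is
 * ignored and the loop simply runs serially with the same result.
 */
void pack_strided(long *dst, const long *src, ptrdiff_t sst,
                  size_t nelems, int nthreads)
{
    assert(dst != NULL && src != NULL && sst >= 1);
    if (nthreads < 1)
        nthreads = 1; /* always use at least the calling thread */

    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (size_t i = 0; i < nelems; i++)
        dst[i] = src[(ptrdiff_t)i * sst];
}
```

After packing, the staging buffer could be sent with an ordinary contiguous put; whether that beats a direct strided put (e.g., `shmem_long_iput`) depends on the implementation and message size, which is the open question in the discussion above.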
User-level thread performance:
- Wasi: plans to post performance results on the issue ticket first.
