
This page discusses methods for optimizing and troubleshooting the performance of Sandia OpenSHMEM applications.

FAQ

Which combination of settings will get the best performance?

For performance builds, we recommend disabling error checking --disable-error-checking and enabling remote virtual addressing --enable-remote-virtual-addressing. Note that remote virtual addressing is incompatible with ASLR. If ASLR is enabled, disabling position-independent executable code via LDFLAGS may be a workable solution. When taking this route, also configure with --disable-aslr-check. A high-performance OFI provider or Portals 4 build is also required.
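
For example, a performance-oriented configure invocation might look like the following sketch. The -no-pie linker flag is one common way to disable position-independent executables with GCC-compatible toolchains and is shown here as an assumption; consult your compiler's documentation:

$ ./configure --disable-error-checking --enable-remote-virtual-addressing \
              --disable-aslr-check LDFLAGS="-no-pie"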

I'm seeing a lot of noise in my performance numbers.

Portals 4 and some OFI providers utilize communication threads. It is helpful to bind the SHMEM PEs and their companion threads to a set of cores using the --bind-to option from Hydra. For example, oshrun --bind-to core:2 ... assigns two cores to each PE; use oshrun --bind-to hwthread:2 ... for hardware-thread assignment. For more details on binding options, run oshrun --bind-to -h.
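
For example, to launch a hypothetical application ./my_app with 16 PEs and two cores per PE:

$ oshrun --bind-to core:2 -n 16 ./my_app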

The OFI sockets provider allows you to directly control the affinity of the progress thread through the FI_SOCKETS_PE_AFFINITY environment variable. See the output of the fi_info -e command for details.

The PSM2 provider allows you to control the duty cycle and affinity of the progress thread through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables. Refer to the provider manpage for additional details.

If you are using Portals 4 revision 32c452f or later, you can set the environment variable PTL_PROGRESS_NOSLEEP=1 to prevent the Portals progress thread from sleeping. This eliminates noise from waking up the progress thread, but requires that the progress thread is assigned its own hardware thread.

Additional noise can come from bounce buffering in the SOS runtime system. Bounce buffering can be disabled by setting the environment variable SMA_BOUNCE_SIZE=0.
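
For example, the settings above can be combined in a single run (the application name and PE count are placeholders):

$ export SMA_BOUNCE_SIZE=0                    # disable bounce buffering
$ oshrun --bind-to hwthread:2 -n 16 ./my_app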

OFI Performance Considerations:

Sockets Provider

Performance variation/degradation in the sockets provider can be due to the affinity of the progress thread. The OFI sockets provider allows you to directly control the affinity of the progress thread through the FI_SOCKETS_PE_AFFINITY environment variable. See the output of the fi_info -e command for details.
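
For instance, to pin the sockets progress thread to a single core (the core number below is illustrative; the exact affinity syntax is documented in the fi_info -e output):

$ export FI_SOCKETS_PE_AFFINITY=3   # bind the provider's progress thread to core 3
$ oshrun -n 4 ./my_app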

PSM2 Provider

Note: PSM2 users should use libfabric 1.6 or later to achieve the best performance and most efficient resource utilization in Sandia OpenSHMEM applications.

Progress

Performance variation/degradation in the PSM2 provider can be due to the affinity and polling interval of the progress thread. The PSM2 provider allows you to control the duty cycle and affinity of the progress thread through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables. Refer to the provider manpage for additional details (latest version is here: https://ofiwg.github.io/libfabric/master/man/fi_psm2.7.html).
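
As an illustrative sketch (the interval and core values below are placeholders; the manpage linked above documents the exact units and syntax):

$ export FI_PSM2_PROG_INTERVAL=1000   # progress-thread polling interval
$ export FI_PSM2_PROG_AFFINITY=3      # pin the progress thread to core 3
$ oshrun -n 4 ./my_app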

SOS can also improve progress performance with its "manual progress" mode. To enable manual progress, pass --enable-manual-progress at configure time. Manual progress mode makes intermittent libfabric calls to read a receive counter, forcing the runtime to make progress. This setting is particularly effective with the PSM2 provider when using multiple threads in SHMEM_THREAD_MULTIPLE mode.
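
For example, add the flag to your usual configure options:

$ ./configure --enable-manual-progress ...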

Shareable Transmit Contexts (STXs)

More information about STXs is below. For PSM2, it may be necessary for the system administrator to configure the HFI for a larger number of contexts. For example, if you want 80 contexts, first remove the hfi1 module:

$ rmmod hfi1

Then reload it with a different parameter:

$ modprobe hfi1 num_user_contexts=80

To make this the default, create a file named hfi1.conf under /etc/modprobe.d with the following content:

options hfi1 num_user_contexts=80

The Intel® Omni-Path Architecture 100 HFI supports up to 160 contexts, but you may find optimal performance with 80.

Portals 4 Performance Considerations:

If you are using Portals 4 revision 32c452f or later, you can set the environment variable PTL_PROGRESS_NOSLEEP=1 to prevent the Portals progress thread from sleeping. This eliminates noise from waking up the progress thread, but requires that the progress thread is assigned its own hardware thread.
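
For example (the binding and PE count are illustrative; the point is to leave the progress thread a hardware thread of its own):

$ export PTL_PROGRESS_NOSLEEP=1
$ oshrun --bind-to hwthread:2 -n 8 ./my_app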

Shareable Transmit Contexts (STXs)

The OFI transport layer in SOS makes use of shareable transmit contexts (STXs) for managing communication. An STX is a libfabric software object representing a resource that is shared across multiple transport endpoints. Ideally, each STX would be associated with a transmit hardware resource within the host fabric interface (HFI) or network interface card (NIC) on each compute node.

SOS provides the SHMEM_OFI_STX_AUTO environment variable, which attempts to limit the maximum number of STX objects to the number of outbound command queues optimally supported by the provider. In addition, when SHMEM_OFI_STX_AUTO is enabled, SOS partitions the STX resources evenly across the PEs that share a compute node. Setting SHMEM_DEBUG=1 before running an SOS application will print useful information regarding the STX resources. For example, the following output:

STX[8] = [ 2S 1S 1S 1S 3P 3P 0S 0S ]

shows that this PE uses 8 STX resources. The first STX has 2 shared contexts, the next 3 STXs have 1 shared context each, and the next two have 3 private contexts each. The 7th and 8th STXs are unused.

Setting SHMEM_OFI_STX_AUTO may not achieve optimal performance on its own, so it may be necessary to set additional parameters. When SHMEM_OFI_STX_AUTO is enabled, you may optionally set SHMEM_OFI_STX_NODE_MAX to the desired maximum number of STXs per compute node (instead of the value that SHMEM_OFI_STX_AUTO obtains from libfabric). These STXs are evenly partitioned across PEs that reside on the same compute node. If SHMEM_OFI_STX_AUTO is off, SHMEM_OFI_STX_NODE_MAX has no effect.

Setting SHMEM_OFI_STX_DISABLE_PRIVATE may improve load balance across transmit resources, especially in scenarios where the number of contexts exceeds the number of STXs.

Other STX parameters related to the allocation algorithm (SHMEM_OFI_STX_ALLOCATOR and SHMEM_OFI_STX_THRESHOLD) may also improve the performance of Sandia OpenSHMEM applications. More information on these parameters and others can be found in the README file.
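
As a sketch, the STX parameters described above might be combined as follows (the values are illustrative rather than recommendations; the README documents each parameter's accepted values):

$ export SHMEM_OFI_STX_AUTO=1             # size the STX pool from provider limits
$ export SHMEM_OFI_STX_NODE_MAX=8         # cap STXs per node (requires STX_AUTO)
$ export SHMEM_OFI_STX_DISABLE_PRIVATE=1  # make all STXs shared for better load balance
$ export SHMEM_DEBUG=1                    # print the STX[...] allocation summary
$ oshrun -n 16 ./my_app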