David Ozog edited this page Apr 2, 2018 · 1 revision

Below are some helpful tips for commonly encountered problems.

Bugs and Other Gotchas

I think I found a bug!

Great! Please check the issues page to see if it has already been reported. If not, please file a new issue.

To help us diagnose and correct the error, please include the full output, your build settings, and a test case that reproduces the problem. Enabling the SHMEM_INFO and SHMEM_DEBUG environment variables produces additional output that is helpful to developers.
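
One way to capture this output for a bug report is sketched below. The helper name run_with_shmem_debug, the log file name, and the test program name are hypothetical, and setting both variables to 1 is an assumption about how they are enabled.

```shell
# Hypothetical helper: run a command with SOS debug output enabled and
# save everything it prints (stdout and stderr) to a log file.
# Assumption: a value of 1 enables SHMEM_INFO and SHMEM_DEBUG.
run_with_shmem_debug() {
    log_file="$1"
    shift
    env SHMEM_INFO=1 SHMEM_DEBUG=1 "$@" > "$log_file" 2>&1
}

# Example (not run here):
#   run_with_shmem_debug sos_debug.log oshrun -n 2 ./my_test
```

The resulting log file can then be attached to the issue together with the build settings.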

CMA runs fail with an "Operation Not Permitted" error.

You may need to disable Linux ptrace protection: https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection

In Ubuntu, this can be done by running sudo sysctl kernel.yama.ptrace_scope=0 on the nodes that will execute OpenSHMEM processes.
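
Before changing the setting, it may help to check its current value on each node. The following is a minimal sketch; the helper name check_ptrace_scope is hypothetical, and it assumes the Yama setting is exposed at /proc/sys/kernel/yama/ptrace_scope, where a value of 0 means ptrace is unrestricted.

```shell
# Hypothetical helper: report whether Yama ptrace protection is active.
# Prints "ok" when the scope is 0, "restricted" for any other value,
# and "unknown" when the file cannot be read (e.g. Yama is not built in).
check_ptrace_scope() {
    scope_file="${1:-/proc/sys/kernel/yama/ptrace_scope}"
    if [ ! -r "$scope_file" ]; then
        echo "unknown"
        return 0
    fi
    if [ "$(cat "$scope_file")" = "0" ]; then
        echo "ok"
    else
        echo "restricted"
    fi
}

# Example (not run here):
#   check_ptrace_scope
```

A result of "restricted" suggests that the sysctl command above is worth trying on that node.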

Troubleshooting the process manager

SOS supports a number of different process manager configurations; check configure --help for details.

I'm seeing a failure in the global_exit test case.

SOS implements the shmem_global_exit() routine using the process manager's PMI_Abort() functionality. In older versions of Hydra, this was not implemented properly. Please update to Hydra 3.2 or newer.

The oshrun wrapper is not choosing the right launcher.

Set the OSHRUN_LAUNCHER environment variable to the correct launcher.
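
As a rough sketch, the variable could be set from a list of preferred launchers. The helper name first_available_launcher is hypothetical, the candidate names are examples only, and it is an assumption that OSHRUN_LAUNCHER accepts a command name found in PATH.

```shell
# Hypothetical helper: print the first candidate launcher found in PATH.
# Returns non-zero when none of the candidates is available.
first_available_launcher() {
    for candidate in "$@"; do
        if command -v "$candidate" > /dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# The candidate names below are examples; list the launchers on your system
# in the order you prefer them.
if launcher="$(first_available_launcher mpiexec.hydra mpiexec srun)"; then
    export OSHRUN_LAUNCHER="$launcher"
fi
```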

Troubleshooting the OFI build

OFI is not selecting the right provider.

The SMA_OFI_PROVIDER environment variable can be used to request a specific provider from libfabric, e.g. SMA_OFI_PROVIDER=sockets.

OFI is not selecting the right domain/fabric.

The SMA_OFI_DOMAIN and SMA_OFI_FABRIC environment variables can be used to request a specific domain or fabric from libfabric. To list the available domain and fabric names, run the fi_info program provided with libfabric, for example fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM. See the SOS README file for additional details.
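
The fi_info query can be reduced to a list of candidate names with a small filter. This is a sketch only: the helper name ofi_field_values is hypothetical, and it assumes fi_info prints indented lines of the form "domain: <name>" and "fabric: <name>".

```shell
# Hypothetical helper: read fi_info output on stdin and print the unique
# values of one field ("domain" or "fabric").
# Assumption: fi_info prints lines such as "    domain: <name>".
ofi_field_values() {
    field="$1"
    awk -v key="${field}:" '$1 == key { print $2 }' | sort -u
}

# Only query libfabric when fi_info is actually installed.
if command -v fi_info > /dev/null 2>&1; then
    fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM | ofi_field_values domain
fi
```

One of the printed names can then be assigned to SMA_OFI_DOMAIN; passing fabric instead of domain lists values for SMA_OFI_FABRIC.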

OFI reports "No space left on device" (e.g. on the Cray Aries interconnect)

The following warning suggests that the default maximum number of STX resources (16) is too high:

Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
WARN:  ../../src/transport_ofi.c:585: bind_enable_cntr_ep_resources
       fi_ep_bind STX to CNTR endpoint failed
WARN:  ../../src/transport_ofi.c:1340: shmem_transport_ofi_ctx_init
       context bind/enable CNTR endpoint failed (No space left on device)

Lowering the value of the environment variable SHMEM_OFI_STX_MAX fixes this issue.
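
A suitable value can be found by trial. The sketch below halves the value, starting below the default of 16, until the command succeeds. The helper name find_working_stx_max and the program name are hypothetical, and it assumes the launch exits with a non-zero status when it hits this error.

```shell
# Hypothetical helper: run the given command with SHMEM_OFI_STX_MAX set to
# 8, 4, 2, then 1, and print the first value for which the command succeeds.
# The command's own stdout is sent to stderr so that only the value is
# printed on stdout. Returns non-zero if no value works.
find_working_stx_max() {
    stx=8
    while [ "$stx" -ge 1 ]; do
        if env SHMEM_OFI_STX_MAX="$stx" "$@" >&2; then
            echo "$stx"
            return 0
        fi
        stx=$((stx / 2))
    done
    return 1
}

# Example (not run here):
#   find_working_stx_max oshrun -n 2 ./my_test
```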

Troubleshooting the PSM2 provider

PSM2 error: can't open hfi unit.

If the following error is printed at launch:

hfi_userinit: assign_context command failed: Invalid argument
PSM2 can't open hfi unit: -1 (err=23)

then please set the PSM2_SHAREDCONTEXTS environment variable to 0.

This bug has been filed (http://ibbugzilla.ph.intel.com/bugzilla/show_bug.cgi?id=135318), and the fix should be in libpsm2 versions newer than 10.2.84.

PSM2 error: message from unknown process.

The following error:

Received eager message(s) ptype=0x1 opcode=0xcc from an unknown process

likely means that a previous job with the same PSM2_UUID is still running (i.e. didn’t terminate properly). Killing any latent processes should remove the error.

PSM2 error: assertion failure

The following error:

Assertion failure at [...]/ptl_ips/ips_proto.c:1877: (scb->payload_size & 0x3) == 0

is caused by a bug in an older version of PSM2. Please upgrade your PSM2 library.

PSM2 and SHMEM_THREAD_MULTIPLE

The following error message when initializing SOS in SHMEM_THREAD_MULTIPLE mode:

[0000] WARN:  transport_ofi.c:1182: query_for_fabric
[0000]        OFI transport did not find any valid fabric services (provider=<auto>)
[0000] ERROR: init.c:259: shmem_internal_init
[0000]        Transport init failed (-61)

may occur because the PSM2 provider supports the FI_THREAD_COMPLETION model instead of the FI_THREAD_SAFE mode that SOS assumes by default. To enable support for FI_THREAD_COMPLETION, SOS must be configured with the --enable-thread-completion flag.

Troubleshooting the Portals Build

I'm seeing several failures in the test suite when using the sockets build of Portals 4.

Unfortunately, this is a known issue. For sockets builds, we recommend using OFI.

I got the following error message: ptl_mr.c:456: mr_lookup: Assertion `res == ((void *)0)' failed.

This is caused by a variation in the IB Verbs implementation. Add --enable-zero-mrs to your Portals 4 configuration to correct for it.

I got the following error message: PtlLEAppend of all memory failed: 1.

Some configurations (e.g. if configured with --with-cma) appear to be incompatible with remote virtual addressing. Try reconfiguring without --enable-remote-virtual-addressing.

My system does not have support for ummunotify.

If you’re running without ummunotify or KNEM, you’ll need to set the following environment variable: PTL_DISABLE_MEM_REG_CACHE=1. Note that this will negatively impact performance, but it is needed for correct behavior on your system.

Performance Troubleshooting

Which combination of settings will get the best performance?

For performance builds, we recommend disabling error checking (--disable-error-checking) and enabling remote virtual addressing (--enable-remote-virtual-addressing). Note that remote virtual addressing is incompatible with ASLR. If ASLR is enabled, disabling position-independent executable code via LDFLAGS may be a workable solution. When taking this route, also configure with --disable-aslr-check. A high-performance OFI provider or Portals 4 build is also required.
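
The recommendation above can be sketched as a helper that assembles the configure arguments. The helper name sos_perf_configure_args is hypothetical. It assumes that /proc/sys/kernel/randomize_va_space reports the Linux ASLR setting (0 means disabled) and that -no-pie is the linker flag that disables position-independent executables, which holds for recent GCC but may differ for other toolchains.

```shell
# Hypothetical helper: print the configure arguments for a performance build.
# Assumptions: the ASLR setting is readable from the given file (0 = off),
# and LDFLAGS=-no-pie disables position-independent executables.
sos_perf_configure_args() {
    aslr_file="${1:-/proc/sys/kernel/randomize_va_space}"
    args="--disable-error-checking --enable-remote-virtual-addressing"
    if [ -r "$aslr_file" ] && [ "$(cat "$aslr_file")" != "0" ]; then
        # ASLR is on: build without PIE and skip the ASLR check.
        args="$args --disable-aslr-check LDFLAGS=-no-pie"
    fi
    echo "$args"
}

# Example (not run here):
#   ./configure $(sos_perf_configure_args) ...
```

If the ASLR file cannot be read, the sketch treats ASLR as disabled, so check the setting by hand on systems without that file.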

I'm seeing a lot of noise in my performance numbers.

Portals 4 and some OFI providers utilize communication threads. It is helpful to bind the SHMEM PEs and companion threads to a set of cores, using the --bind-to option from Hydra. For example, oshrun --bind-to core:2 ... assigns two cores to each PE, and oshrun --bind-to hwthread:2 ... assigns two hardware threads to each PE. For more details on binding options, run oshrun --bind-to -h.

The OFI sockets provider allows you to directly control the affinity of the progress thread through the FI_SOCKETS_PE_AFFINITY environment variable. See the output of the fi_info -e command for details.

The PSM2 provider allows you to control the duty cycle and affinity of the progress thread through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables. Refer to the provider manpage for additional details.

If you are using Portals 4 revision 32c452f or later, you can set the environment variable PTL_PROGRESS_NOSLEEP=1 to prevent the Portals progress thread from sleeping. This eliminates noise from waking up the progress thread, but requires that the progress thread is assigned its own hardware thread.

Additional noise can come from bounce buffering in the SOS runtime system. This can be disabled with the following environment variable setting: SMA_BOUNCE_SIZE=0.
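
As a small sketch, the bounce-buffer setting can be applied to a single run without changing the rest of the shell session. The helper name run_without_bounce_buffers and the benchmark name are hypothetical.

```shell
# Hypothetical helper: run a command with SOS bounce buffering disabled,
# leaving the environment of the calling shell untouched.
run_without_bounce_buffers() {
    env SMA_BOUNCE_SIZE=0 "$@"
}

# Example (not run here), combined with the core binding described above:
#   run_without_bounce_buffers oshrun --bind-to core:2 ./my_benchmark
```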

Fortran Users

OpenSHMEM functions like shmem_my_pe are not defined?

That's right! SHMEM implementations have historically waffled on whether these functions are declared in the header file or by the application. The OpenSHMEM specification chose the latter semantic. If you would like a header file that includes all of the function declarations, it can be selected when configuring the build via the --enable-long-fortran-header option.