Troubleshooting
Below are tips for resolving commonly encountered problems.
I think I found a bug!
Great! Please check the issues page to see if it has already been reported. If not, please file a new issue.
Full output, build settings, and a test case that reproduces the problem will help us diagnose and correct the error. Enabling the SHMEM_INFO and SHMEM_DEBUG environment variables produces additional output that is helpful to developers.
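As a rough sketch of collecting this output, the commands below enable both variables and save everything the run prints to a log file. The application name ./my_reproducer, the log file name, and the -np 2 argument are placeholders chosen for illustration; -np assumes a Hydra-style launcher behind oshrun.

```shell
# Hypothetical reproducer binary and log file; substitute your own.
app=./my_reproducer
log=sos_debug.log

# Enable the extra SOS output mentioned above.
export SHMEM_INFO=1
export SHMEM_DEBUG=1

if command -v oshrun >/dev/null 2>&1 && [ -x "$app" ]; then
    # Capture stdout and stderr together so warnings stay in order.
    oshrun -np 2 "$app" 2>&1 | tee "$log"
else
    echo "oshrun or $app not found; nothing was run"
fi
```

Attaching the resulting log file to the issue, along with the configure command line used for the build, covers most of what is asked for above.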
CMA runs fail with an "Operation Not Permitted" error.
You may need to disable Linux ptrace protection: https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection
In Ubuntu, this can be done by running sudo sysctl kernel.yama.ptrace_scope=0 on the nodes that will execute OpenSHMEM processes.
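Before changing the setting, it can help to check its current value on each node. The following is a minimal sketch that only reads the value and does not need root; it assumes a Linux kernel that exposes the Yama setting under /proc.

```shell
# Path of the Yama ptrace setting on Linux kernels built with Yama.
scope_file=/proc/sys/kernel/yama/ptrace_scope

if [ -r "$scope_file" ]; then
    scope=$(cat "$scope_file")
else
    # Kernels without Yama do not expose this file.
    scope=unavailable
fi

case "$scope" in
    0)           echo "ptrace_scope=0: ptrace protection is already disabled" ;;
    unavailable) echo "Yama ptrace_scope is not present on this node" ;;
    *)           echo "ptrace_scope=$scope: CMA may fail; see the sysctl command above" ;;
esac
```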
SOS supports a number of different process manager configurations; run configure --help for details.
I'm seeing a failure in the global_exit test case.
SOS implements the shmem_global_exit() routine using the process manager's PMI_Abort() functionality. Older versions of Hydra did not implement this properly. Please update to Hydra 3.2 or newer.
The oshrun wrapper is not choosing the right launcher.
Set the OSHRUN_LAUNCHER environment variable to the correct launcher.
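A minimal sketch of this setting follows. The launcher name mpiexec.hydra is only an example, and the sketch assumes that oshrun treats the value of OSHRUN_LAUNCHER as the command to run; check the oshrun script in your installation to confirm.

```shell
# Example launcher name; replace with the launcher you actually want.
launcher=mpiexec.hydra

if command -v "$launcher" >/dev/null 2>&1; then
    # Assumption: oshrun uses this value as the launch command.
    export OSHRUN_LAUNCHER="$launcher"
    echo "OSHRUN_LAUNCHER=$OSHRUN_LAUNCHER"
else
    echo "$launcher is not in PATH; OSHRUN_LAUNCHER left unchanged"
fi
```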
OFI is not selecting the right provider.
The SMA_OFI_PROVIDER environment variable can be used to request a specific provider from libfabric, e.g. SMA_OFI_PROVIDER=sockets.
OFI is not selecting the right domain/fabric.
The SMA_OFI_DOMAIN and SMA_OFI_FABRIC environment variables can be used to request a specific domain or fabric from libfabric. The domain and fabric names can be queried by running fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM, a utility provided with libfabric. See the SOS README file for additional details.
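The query and the resulting settings might be combined as in the sketch below. The fabric and domain names are placeholders, not real values; copy the names that fi_info reports on your system.

```shell
# List the fabrics and domains libfabric offers for RMA and atomic traffic.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM || echo "fi_info reported no match"
else
    echo "fi_info not found in PATH"
fi

# Placeholder names below; replace them with values from the fi_info output.
export SMA_OFI_PROVIDER=sockets
export SMA_OFI_FABRIC=example_fabric
export SMA_OFI_DOMAIN=example_domain
```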
OFI reports "No space left on device" (e.g. on the Cray Aries interconnect)
The following warning suggests that the default maximum number of STX resources (16) is too high:
Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
WARN: ../../src/transport_ofi.c:585: bind_enable_cntr_ep_resources
fi_ep_bind STX to CNTR endpoint failed
WARN: ../../src/transport_ofi.c:1340: shmem_transport_ofi_ctx_init
context bind/enable CNTR endpoint failed (No space left on device)
Lowering the value of the environment variable SHMEM_OFI_STX_MAX fixes this issue.
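For instance, the limit could be lowered as sketched here. The value 8 is an arbitrary example below the default of 16, not a recommendation; the right value depends on the resources available on your system.

```shell
# 8 is an arbitrary example value below the default of 16.
export SHMEM_OFI_STX_MAX=8
echo "SHMEM_OFI_STX_MAX=$SHMEM_OFI_STX_MAX"
```

If the warning persists, try a smaller value.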
PSM2 error: can't open hfi unit.
If the following error is printed at launch:
hfi_userinit: assign_context command failed: Invalid argument
PSM2 can't open hfi unit: -1 (err=23)
then please set the PSM2_SHAREDCONTEXTS environment variable to 0.
This bug has been filed (http://ibbugzilla.ph.intel.com/bugzilla/show_bug.cgi?id=135318), and the fix should be in libpsm2 versions newer than 10.2.84.
PSM2 error: message from unknown process.
The following error:
Received eager message(s) ptype=0x1 opcode=0xcc from an unknown process
likely means that a previous job with the same PSM2_UUID is still running (i.e. didn’t terminate properly). Killing any latent processes should remove the error.
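One way to look for such leftover processes is sketched below. The application name my_shmem_app is hypothetical, and the check has to be run on every node that hosted the earlier job.

```shell
# Hypothetical application name; pgrep -x matches the exact process name.
app_name=my_shmem_app

# List this user's leftover processes from the earlier job, if any.
leftover=$(pgrep -u "$(id -u)" -x "$app_name" 2>/dev/null || true)

if [ -n "$leftover" ]; then
    echo "Leftover PIDs: $leftover"
    # Uncomment after reviewing the list above:
    # pkill -u "$(id -u)" -x "$app_name"
else
    echo "No leftover $app_name processes on this node"
fi
```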
PSM2 error: assertion failure
The following error:
Assertion failure at [...]/ptl_ips/ips_proto.c:1877: (scb->payload_size & 0x3) == 0
is caused by a bug in an older version of PSM2. Please upgrade your PSM2 library.
PSM2 and SHMEM_THREAD_MULTIPLE
The following error message when initializing SOS in SHMEM_THREAD_MULTIPLE mode:
[0000] WARN: transport_ofi.c:1182: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider=<auto>)
[0000] ERROR: init.c:259: shmem_internal_init
[0000] Transport init failed (-61)
may occur because the PSM2 provider supports the FI_THREAD_COMPLETION model instead of the default FI_THREAD_SAFE model that is assumed by SOS. To enable support for FI_THREAD_COMPLETION, SOS must be configured with the --enable-thread-completion flag.
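A configure invocation with this flag might look like the sketch below. It assumes it is run from the top of an SOS source tree, and the libfabric install prefix is a placeholder.

```shell
# Placeholder libfabric install prefix; point this at your own installation.
ofi_prefix=/opt/libfabric
config_args="--with-ofi=$ofi_prefix --enable-thread-completion"

if [ -x ./configure ]; then
    # Intentionally unquoted so the arguments are split on spaces.
    ./configure $config_args
else
    echo "Would run: ./configure $config_args"
fi
```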
I'm seeing several failures in the test suite when using the sockets build of Portals 4.
Unfortunately, this is a known issue. For sockets builds, we recommend using OFI.
I got the following error message: ptl_mr.c:456: mr_lookup: Assertion `res == ((void *)0)' failed.
This is caused by a variation in the IB Verbs implementation. Add --enable-zero-mrs to your Portals 4 configuration to correct for it.
I got the following error message: PtlLEAppend of all memory failed: 1.
Some configurations (e.g. if configured with --with-cma) appear to be incompatible with remote virtual addressing. Try reconfiguring without --enable-remote-virtual-addressing.
My system does not have support for ummunotify.
If you’re running without ummunotify or KNEM, you’ll need to set the following environment variable: PTL_DISABLE_MEM_REG_CACHE=1. Note that this will reduce performance, but it will ensure correct behavior on your system.
Which combination of settings will get the best performance?
For performance builds, we recommend disabling error checking (--disable-error-checking) and enabling remote virtual addressing (--enable-remote-virtual-addressing). Note that remote virtual addressing is incompatible with ASLR. If ASLR is enabled, disabling position-independent executable code via LDFLAGS may be a workable solution. When taking this route, also configure with --disable-aslr-check. A high-performance OFI provider or Portals 4 build is also required.
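The recommended flags can be assembled as in the following sketch, which assumes it is run from the top of an SOS source tree. The aslr_workaround variable is a hypothetical switch introduced only for this example; the exact LDFLAGS needed to disable position-independent executables depend on your toolchain and are not shown.

```shell
# Flags recommended above for performance builds.
config_args="--disable-error-checking --enable-remote-virtual-addressing"

# Set to "yes" only if ASLR is enabled and PIE has been disabled via LDFLAGS.
aslr_workaround=no
if [ "$aslr_workaround" = yes ]; then
    config_args="$config_args --disable-aslr-check"
fi

if [ -x ./configure ]; then
    # Intentionally unquoted so the arguments are split on spaces.
    ./configure $config_args
else
    echo "Would run: ./configure $config_args"
fi
```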
I'm seeing a lot of noise in my performance numbers.
Portals 4 and some OFI providers utilize communication threads. It is helpful to bind the SHMEM PEs and companion threads to a set of cores using the --bind-to option from Hydra. For example, oshrun --bind-to core:2 ... assigns two cores to each PE, and oshrun --bind-to hwthread:2 ... assigns two hardware threads to each PE. For more details on binding options, run oshrun --bind-to -h.
The OFI sockets provider allows you to directly control the affinity of the progress thread through the FI_SOCKETS_PE_AFFINITY environment variable. See the output of the fi_info -e command for details.
The PSM2 provider allows you to control the duty cycle and affinity of the progress thread through the FI_PSM2_PROG_INTERVAL and FI_PSM2_PROG_AFFINITY environment variables. Refer to the provider manpage for additional details.
If you are using Portals 4 revision 32c452f or later, you can set the environment variable PTL_PROGRESS_NOSLEEP=1 to prevent the Portals progress thread from sleeping. This eliminates noise from waking up the progress thread, but requires that the progress thread is assigned its own hardware thread.
Additional noise can come from bounce buffering in the SOS runtime system. This can be disabled with the following environment variable setting: SMA_BOUNCE_SIZE=0.
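Several of these settings can be combined for a benchmark run, as in the sketch below. The benchmark name ./my_benchmark and the -np 2 argument are placeholders, and the sketch assumes a Hydra-based oshrun and a Portals 4 build recent enough to honor PTL_PROGRESS_NOSLEEP.

```shell
# Noise-reduction settings described above.
export PTL_PROGRESS_NOSLEEP=1   # Portals 4 only; needs a spare hardware thread
export SMA_BOUNCE_SIZE=0        # disable bounce buffering in SOS

# Hypothetical benchmark binary; substitute your own.
app=./my_benchmark

if command -v oshrun >/dev/null 2>&1 && [ -x "$app" ]; then
    # Two cores per PE leaves room for the PE and its progress thread.
    oshrun -np 2 --bind-to core:2 "$app"
else
    echo "oshrun or $app not found; nothing was run"
fi
```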
OpenSHMEM functions like shmem_my_pe are not defined?
That's right! SHMEM implementations have historically waffled on whether these functions are declared in the header file or by the application. The OpenSHMEM specification chose the latter semantic. If you would like a header file that includes all of the function declarations, it can be selected when configuring the build via the --enable-long-fortran-header option.