Troubleshooting
Below are some helpful tips for commonly encountered problems.
Great! Please check the issues page to see if it has already been reported. If not, please file a new issue.
Full output (enabling the SHMEM_INFO and SHMEM_DEBUG environment variables will produce additional output that is helpful to developers), build settings, and a test case that reproduces it will help us to diagnose and correct the error.
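As a rough sketch of how that output could be collected for an issue report, the script below sets both variables and captures stdout and stderr in one log file. The value 1, the oshrun command line, the reproducer name ./my_test, and the log file name are all assumptions for illustration; adapt them to your launcher and test case.

```shell
#!/bin/sh
# Enable the extra output described above.
# Assumption: a value of 1 turns each option on.
export SHMEM_INFO=1
export SHMEM_DEBUG=1

# Hypothetical launch line and log name; replace with your own.
LAUNCH_CMD="oshrun -np 2 ./my_test"
LOG_FILE="sos_issue_output.log"

if command -v oshrun >/dev/null 2>&1 && [ -x ./my_test ]; then
    # Capture stdout and stderr together so warnings stay in order.
    $LAUNCH_CMD >"$LOG_FILE" 2>&1
    echo "output saved to $LOG_FILE"
else
    # Nothing to launch here, so just show what would be run.
    echo "would run: $LAUNCH_CMD > $LOG_FILE 2>&1"
fi
```

Attach the resulting log to the issue together with your configure options.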
You may need to disable Linux ptrace protection: https://wiki.ubuntu.com/SecurityTeam/Roadmap/KernelHardening#ptrace_Protection
In Ubuntu, this can be done by running sudo sysctl kernel.yama.ptrace_scope=0 on the nodes that will execute OpenSHMEM processes.
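Before changing the setting, it can help to check its current value on each node. The sketch below reads the Yama setting from /proc, assuming a Linux system; on kernels without Yama the file does not exist and the script reports that instead.

```shell
#!/bin/sh
# Report whether Yama ptrace protection is active on this node.
SCOPE_FILE=/proc/sys/kernel/yama/ptrace_scope

if [ -r "$SCOPE_FILE" ]; then
    scope=$(cat "$SCOPE_FILE")
    if [ "$scope" = "0" ]; then
        ptrace_status="disabled"
    else
        # Any non-zero value restricts ptrace to some degree.
        ptrace_status="enabled"
    fi
else
    # No Yama support in this kernel, or not a Linux system.
    ptrace_status="unavailable"
fi

echo "Yama ptrace protection: $ptrace_status"
```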
SOS supports a number of different process manager configurations; check configure --help for details.
The system’s processes do not assume that the binary is in the current working directory, therefore you must pass the path to the binary (rather than simply the binary’s name) as an argument to the process manager.
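A small sketch of the idea: if the program is given as a bare name, prefix it with ./ before handing it to the launcher. The binary name hello_shmem and the oshrun command line are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical program name; replace with your own binary.
BINARY_NAME="hello_shmem"

# If the name contains no slash, make it an explicit relative path.
case "$BINARY_NAME" in
    */*) BINARY_PATH="$BINARY_NAME" ;;
    *)   BINARY_PATH="./$BINARY_NAME" ;;
esac

# Show the launch line that passes a path rather than a bare name.
echo "oshrun -np 2 $BINARY_PATH"
```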
SOS implements the shmem_global_exit() routine using the process manager's PMI_Abort() functionality. In older versions of Hydra, this was not implemented properly. Please update to Hydra 3.2 or newer.
Set the OSHRUN_LAUNCHER environment variable to the correct launcher.
Either the launcher is not supported by SOS, or the launcher requires a particular PMI that needs to be enabled in the SOS configuration (e.g., using --with-pmi). When in doubt, please use a recent version of the hydra launcher, because hydra is tested with all SOS releases.
The SMA_OFI_PROVIDER environment variable can be used to request a specific provider from libfabric, e.g. SMA_OFI_PROVIDER=sockets.
The SMA_OFI_DOMAIN and SMA_OFI_FABRIC environment variables can be used to request a specific domain or fabric from libfabric. The domain and fabric names can be queried with the fi_info program provided with libfabric, by running fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM. See the SOS README file for additional details.
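The variables and the query can be combined as in the sketch below. It requests the sockets provider as in the example above and runs the fi_info query only if the tool is on the PATH. The domain and fabric lines are left commented out because their values depend on what fi_info reports on your system.

```shell
#!/bin/sh
# Request a specific libfabric provider (value taken from the example above).
export SMA_OFI_PROVIDER=sockets

# List the domains and fabrics that support RMA and atomics on RDM endpoints.
if command -v fi_info >/dev/null 2>&1; then
    fi_info -c "FI_RMA|FI_ATOMIC" -t FI_EP_RDM || echo "fi_info reported no match"
else
    echo "fi_info not found on PATH"
fi

# Once you know the names, fill them in (placeholders, not real values):
# export SMA_OFI_DOMAIN=<domain-name>
# export SMA_OFI_FABRIC=<fabric-name>
```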
The following warning suggests that the default maximum number of STX resources (16) is too high:
Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
WARN: ../../src/transport_ofi.c:585: bind_enable_cntr_ep_resources
fi_ep_bind STX to CNTR endpoint failed
WARN: ../../src/transport_ofi.c:1340: shmem_transport_ofi_ctx_init
context bind/enable CNTR endpoint failed (No space left on device)
Lowering the value of the environment variable SHMEM_OFI_STX_MAX fixes this issue.
You may check which socket providers are available to you by invoking the “fi_info” tool, provided by the OFI libfabric libraries. This tool can be found in the bin directory within the SOS install area. It is convenient to set the environment PATH to the bin directory within the libfabric install area so that this tool can be invoked from anywhere without having to pass the full path:
- An example of setting the environment PATH on bash:
$ export PATH=<path-to-libfabric-install>/bin:$PATH
If the following error is printed at launch:
hfi_userinit: assign_context command failed: Invalid argument
PSM2 can't open hfi unit: -1 (err=23)
then please set the PSM2_SHAREDCONTEXTS environment variable to 0.
This bug has been filed: http://ibbugzilla.ph.intel.com/bugzilla/show_bug.cgi?id=135318
and the fix should be in libpsm2 > 10.2.84.
The following error:
Received eager message(s) ptype=0x1 opcode=0xcc from an unknown process
likely means that a previous job with the same PSM2_UUID is still running (i.e. didn’t terminate properly). Killing any latent processes should remove the error.
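One way to look for such leftover processes is sketched below. It uses pgrep with an exact process-name match for the current user; the application name my_shmem_app is a hypothetical placeholder, and the kill line is left commented out so that nothing is terminated by accident.

```shell
#!/bin/sh
# Hypothetical application name; replace with the name of your binary.
APP_NAME="my_shmem_app"

# Collect PIDs of processes with exactly this name owned by the current user.
if command -v pgrep >/dev/null 2>&1; then
    leftover=$(pgrep -u "$(id -un)" -x "$APP_NAME" || true)
else
    leftover=""
fi

if [ -n "$leftover" ]; then
    echo "leftover PIDs:" $leftover
    # Uncomment after reviewing the list:
    # kill $leftover
else
    echo "no leftover $APP_NAME processes found"
fi
```

For a multi-node job, the check has to be repeated on every node that ran the previous job.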
The following error:
Assertion failure at [...]/ptl_ips/ips_proto.c:1877: (scb->payload_size & 0x3) == 0
is caused by a bug in an older version of PSM2. Please upgrade your PSM2 library.
The following error message when initializing SOS in SHMEM_THREAD_MULTIPLE mode:
[0000] WARN: transport_ofi.c:1182: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider=<auto>)
[0000] ERROR: init.c:259: shmem_internal_init
[0000] Transport init failed (-61)
may occur because the PSM2 provider supports the FI_THREAD_COMPLETION model instead of the default FI_THREAD_SAFE model that is assumed by SOS. To enable support for FI_THREAD_COMPLETION, SOS must be configured with the --enable-thread-completion flag.
You may also want to disable the libtool-wrapper as it may interfere with the path to some of the dynamic libraries used by Sandia-OpenSHMEM (such as the infinipath library). This can be done by adding the following option to the Sandia-OpenSHMEM configuration:
--disable-libtool-wrapper
Unfortunately, this is a known issue. For sockets builds, we recommend using OFI.
This is caused by a variation in the IB Verbs implementation. Add --enable-zero-mrs to your Portals 4 configuration to correct for it.
Some configurations (e.g. if configured with --with-cma) appear to be incompatible with remote virtual addressing. Try reconfiguring without --enable-remote-virtual-addressing.
If you’re running without ummunotify or KNEM, you’ll need to set the following environment variable: PTL_DISABLE_MEM_REG_CACHE=1. Note that this will negatively impact performance, but it will provide correctness on your system.
While SOS allows OpenSHMEM routines to be used in conjunction with MPI routines in a hybrid MPI + OpenSHMEM program, the currently supported usage mode is limited to PMI-MPI only. To build and run such programs, configure SOS with --enable-pmi-mpi and CC=mpicc. In addition, the program must initialize MPI before OpenSHMEM and finalize OpenSHMEM before MPI. As an example, the following is a valid MPI + OpenSHMEM program to be built and run with SOS. Any other ordering of initialization and finalize routines may lead to undefined behavior.
#include <mpi.h>
#include <shmem.h>
...
int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);   /* MPI is initialized first */
    shmem_init();
    // Other program code
    shmem_finalize();
    MPI_Finalize();           /* MPI is finalized last */
    return 0;
}
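To build SOS for a program like the one above, the configuration step might look like the following sketch. The --enable-pmi-mpi and CC=mpicc settings come from the text; the --prefix option and the install path are hypothetical, and a real build will usually need further options (for example, transport selection), so check configure --help.

```shell
#!/bin/sh
# Hypothetical install location; choose your own.
SOS_PREFIX="$HOME/sos-install"

# Flags from the text above, plus a standard --prefix option.
CONFIGURE_CMD="./configure --prefix=$SOS_PREFIX --enable-pmi-mpi CC=mpicc"

# Only run inside an SOS source tree with an MPI compiler available.
if [ -x ./configure ] && command -v mpicc >/dev/null 2>&1; then
    $CONFIGURE_CMD && make && make install
else
    echo "would run: $CONFIGURE_CMD"
fi
```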
Please refer to the Performance Tuning wiki page.
That's right! SHMEM implementations have historically waffled on whether these functions are declared in the header file or by the application. The OpenSHMEM specification chose the latter semantic. If you would like a header file that includes all of the function declarations, it can be selected when configuring the build via the --enable-long-fortran-header option.