
bugfix: Default MPI Command #508

Closed
scrasmussen wants to merge 3 commits

Conversation

scrasmussen (Member) commented:

DESCRIPTION OF ISSUES AND CHANGES:

  • On the login nodes of most clusters, the mpiexec or mpirun command will not run, even when only a single process is requested. The default MPI command was mpirun -np 1; it is now changed to '' (empty).
  • Different MPI implementations may use different debugging flags, so defaulting to xterm -e could potentially cause issues. Since it likely won't, the script now only reports that it adds 'xterm -e' to the MPI command when debug mode is turned on and the user did not already pass xterm (see the sketch after the commit messages below).

Commits:

- …in node so default MPI command changed to ''. This might be useful for short runs.
- Script reports that it adds 'xterm -e' to the MPI command if it is not already present and debug mode is turned on. The user might be using a different method for debugging MPI, so this info is useful to report.
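As a rough illustration of the reporting behavior described above, here is a minimal sketch; the function name apply_debug_wrapper and the exact message wording are assumptions, not the actual run_scm.py implementation:

```python
def apply_debug_wrapper(mpi_command: str, debug: bool) -> str:
    """Prepend 'xterm -e' in debug mode, unless the user already passed xterm."""
    if debug and 'xterm' not in mpi_command:
        # Report the change, since the user may be debugging MPI some other way.
        print("Debug mode: adding 'xterm -e' to the MPI command")
        return ('xterm -e ' + mpi_command).strip()
    return mpi_command

print(apply_debug_wrapper('mpirun -np 1', debug=True))  # xterm -e mpirun -np 1
print(apply_debug_wrapper('', debug=False))             # '' (run the executable directly)
```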
mkavulich (Collaborator) left a review comment:

There's no need for type conversions here, since the empty string is evaluated as False in logical checks anyway.

(Two review comments on scm/src/run_scm.py: outdated, resolved.)
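A minimal illustration of the point above, using a hypothetical mpi_command variable rather than the actual run_scm.py code:

```python
mpi_command = ''

# No str() or bool() conversion needed: an empty string is falsy,
# and any non-empty string is truthy.
if not mpi_command:
    print('No MPI command given; running the executable directly')
else:
    print(f'Launching with: {mpi_command}')
```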
grantfirl (Collaborator) commented Sep 5, 2024:

@scrasmussen How has this been tested? Just the CI tests below, or did you also try Hera, Derecho, or another machine with a login node and a compute node?

grantfirl (Collaborator) commented:

FYI, on a Hera login node, using the "vanilla" ./run_scm.py -c bomex -s SCM_GFSv16 command, the run fails:

stdout: ""
stderr: "Abort(1090575) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(143):
MPID_Init(1221)......:
MPIR_pmi_init(168)...: PMI2_Job_GetId returned 14

grantfirl (Collaborator) commented:

@scrasmussen Same error on a compute node.

mkavulich (Collaborator) commented:

@grantfirl I'm not surprised by these results, since these changes make the run command blank by default. If you include --mpi_command 'mpirun -np 1' does the run succeed in those cases?

grantfirl (Collaborator) commented Sep 6, 2024:

> @grantfirl I'm not surprised by these results, since these changes make the run command blank by default. If you include --mpi_command 'mpirun -np 1' does the run succeed in those cases?

Yes. Right now I'm not in favor of making these changes (i.e., merging this PR), since I think more people use Hera than Derecho for the SCM. Why can't we just update the SCM User's Guide to say that users need to know how to run with MPI on their machine, and state what the default is?

Another way to go would be to set an environment variable in the modulefiles for the MPI command to use, and then reference it from the run script (see the sketch below). IMO the code is fine to release without these changes; we can just explain this in the documentation.
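A sketch of what the modulefile idea could look like; SCM_MPI_COMMAND is an assumed variable name, not one the SCM modulefiles currently define:

```python
# In a machine's modulefile (Tcl syntax) one would add something like:
#   setenv SCM_MPI_COMMAND "mpirun -np 1"
# The run script could then fall back to a generic default when it is unset:
import os

DEFAULT_MPI_COMMAND = 'mpirun -np 1'
mpi_command = os.environ.get('SCM_MPI_COMMAND', DEFAULT_MPI_COMMAND)
print(f"Using MPI command: '{mpi_command}'")
```

This would move the machine-specific knowledge into the per-machine modulefiles instead of hard-coding it in the run script.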

(Review comment on scm/src/run_scm.py: outdated, resolved.)
mkavulich (Collaborator) commented Sep 6, 2024:

> Another way to go would be to set an environment variable in the modulefiles for the MPI command to use, and then reference it from the run script.

I actually like this idea best, but I agree it's probably a bit much to get in for this release.

We do need some kind of fix, though, because the current code does not let you leave the MPI command blank: if you use --mpi_command='', the script replaces it with DEFAULT_MPI_COMMAND. I put in a suggested change that I think will achieve the default behavior you want while still allowing a blank MPI command.

@scrasmussen What do you think?
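For reference, a common way to achieve this is a None sentinel that distinguishes "flag omitted" from "flag explicitly empty"; this is only a sketch of the idea, not the exact suggested change in the PR:

```python
import argparse

DEFAULT_MPI_COMMAND = 'mpirun -np 1'

parser = argparse.ArgumentParser()
# default=None marks "flag not given"; an explicit --mpi_command='' stays empty.
parser.add_argument('--mpi_command', default=None,
                    help="MPI launch command; pass '' to run the executable directly")
args = parser.parse_args()

# Fall back to the default only when the flag was omitted entirely.
mpi_command = DEFAULT_MPI_COMMAND if args.mpi_command is None else args.mpi_command
```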

Commit: New default to work on Hera. Derecho will require --mpi_command=''
Co-authored-by: Michael Kavulich <[email protected]>
scrasmussen (Member, Author) commented:

@grantfirl @mkavulich Thanks for testing this on Hera! It's frustrating that the two different MPI implementations/versions handle the linking differently. It seems Derecho's MPI does not call any MPI functions when run serially, so it doesn't need the mpiexec command to know where the libraries are. (You can run nm scm to see which MPI symbols might need defining; resolving them is one of the things the mpiexec/mpirun command handles.)

I committed Mike's recommended change of using mpirun -np 1 as the default, though that makes this PR essentially cosmetic... I'm fine with not merging it and just adding documentation to address the issue.

scrasmussen (Member, Author) commented:

@grantfirl @mkavulich I've created a new branch with documentation changes for --mpi_command. I'm just testing scm_qsub_example.py on Derecho right now, and then I'll submit a new PR.

scrasmussen marked this pull request as draft on September 6, 2024, 18:34.