
Discover issues (follow-up to PR https://github.com/JCSDA/spack-stack/pull/993) #1011

Closed
climbfuji opened this issue Feb 23, 2024 · 20 comments
Labels: bug (Something is not working), INFRA, JEDI Infrastructure

@climbfuji (Collaborator) commented Feb 23, 2024

Describe the bug
In testing PR #993, the following issues were discovered that need to be addressed:

  1. For Discover SCU16 Intel, I had to add LDFLAGS="-L/usr/local/other/gcc/11.2.0/lib64" to the cmake command when building jedi-bundle. A similar error appeared again later when building mapl during the make step.
  2. For Discover SCU17 Intel, I had to load ecflow_ui with LD_PRELOAD="/usr/local/other/gcc/12.3.0/lib64/libstdc++.so" (see the command sketch below).
  3. For Discover SCU17 Intel, the skylab experiment skylab-aero-weather hangs in the variational task (somewhere in bump?), and runs are in general VERY VERY slow.
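
For reference, a minimal shell sketch of the two workarounds; the jedi-bundle build-directory layout is an assumption for illustration, everything else is taken from the items above:

# 1. SCU16 Intel: point the linker at the GCC 11.2.0 runtime libraries
#    (the ../jedi-bundle source path is an assumption; adjust to your build layout)
LDFLAGS="-L/usr/local/other/gcc/11.2.0/lib64" cmake ../jedi-bundle

# 2. SCU17 Intel: preload the newer libstdc++ before starting ecflow_ui
LD_PRELOAD="/usr/local/other/gcc/12.3.0/lib64/libstdc++.so" ecflow_ui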

To Reproduce
See above

Expected behavior
The workarounds shouldn't be necessary, and the skylab runs shouldn't hang or be extremely slow.

System:
Discover SCU16 and SCU17 with Intel

Additional context
n/a

@Dooruk (Collaborator) commented Mar 8, 2024

To test number 3 on my end I need to make changes to jedi_bundle and build a new JEDI on SLES15 which will take a while. I will use the modules here. Yes, I see the temporary warning.

In the meantime @climbfuji will provide some OOPS timing statistics which may help us figure out the problem with the intel compiler.

Tagging @mathomp4 for visibility.

@climbfuji (Collaborator Author):

> To test number 3 on my end I need to make changes to jedi_bundle and build a new JEDI on SLES15 which will take a while. I will use the modules here. Yes, I see the temporary warning.
>
> In the meantime @climbfuji will provide some OOPS timing statistics which may help us figure out the problem with the intel compiler.
>
> Tagging @mathomp4 for visibility.

Please use the modules from PR #1017 - thanks!

@Dooruk (Collaborator) commented Mar 8, 2024

@climbfuji, will there be a miniconda module for SCU17? I see there is a stack-python version, but Dan created GEOS-ESM/jedi_bundle using miniconda/3.9.7. I can't tag Dan here, but I will ask him why that was the case.

@climbfuji (Collaborator Author):

No, there won't be one. We use the native/OS python3 to "drive" spack, and for everything else (building GEOS etc) we use the spack-built python3.

@Dooruk (Collaborator) commented Mar 11, 2024

I created an Intel JEDI build with the following SLES15 modules and confirmed the extreme slowness issue with a high-resolution SOCA 3dvar_diffusion (no bump!) variational test. The same task would take 2-3 minutes on a single node on SLES12, and now it takes more than 30 minutes on a full node on SLES15. I will open an issue with NCCS and see if they can help.

The only difference I notice is that the stack-intel-oneapi-mpi version is 2021.10.0, as opposed to the 2021.6.0 that works for GEOS. For the sake of learning, may I ask why version 2021.10.0 was chosen? Perhaps @mathomp4 can also tell me.

module purge
module use /discover/swdev/gmao_SIteam/modulefiles-SLES15
module use /discover/swdev/jcsda/spack-stack/scu17/modulefiles
module load ecflow/5.11.4

module use /gpfsm/dswdev/jcsda/spack-stack/scu17/spack-stack-20240228/envs/unified-env-intel-2021.10.0/install/modulefiles/Core
module load stack-intel/2021.10.0
module load stack-intel-oneapi-mpi/2021.10.0
module load stack-python/3.10.13

Also, when I load these modules, the pip3 that gets picked up is the native one (tied to Python 3.6), and that is what gets used for the Swell installation. However, when I create a venv with the same modules loaded, my pip uses Python 3.10. That is a bit confusing to me. Perhaps miniconda prevented that for us on SCU16 and prior?
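
A quick diagnostic sketch for checking which pip is being resolved with the modules above (the commands are standard; the interpretation in the comments is my assumption about this setup):

which pip3                  # may still resolve to the OS pip under /usr/bin
pip3 --version              # reports which Python interpreter this pip belongs to
python3 -m pip --version    # uses the pip that belongs to the stack-python python3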

@climbfuji (Collaborator Author):

@Dooruk Regarding Python: I see this too, and it is entirely expected and correct. There's no need for a miniconda install as long as the native Python is "new" enough to drive spack and install the Python we want for the software stack.

dheinzel@discover35:~> module purge
dheinzel@discover35:~> module use /discover/swdev/gmao_SIteam/modulefiles-SLES15
dheinzel@discover35:~> module use /discover/swdev/jcsda/spack-stack/scu17/modulefiles
dheinzel@discover35:~> module load ecflow/5.11.4
dheinzel@discover35:~> which python3
/usr/bin/python3
dheinzel@discover35:~> module use /gpfsm/dswdev/jcsda/spack-stack/scu17/spack-stack-20240228/envs/unified-env-intel-2021.10.0/install/modulefiles/Core
dheinzel@discover35:~> module load stack-intel/2021.10.0
dheinzel@discover35:~> module load stack-intel-oneapi-mpi/2021.10.0
dheinzel@discover35:~> module load stack-python/3.10.13
dheinzel@discover35:~> which python3
/gpfsm/dswdev/jcsda/spack-stack/scu17/spack-stack-20240228/envs/unified-env-intel-2021.10.0/install/intel/2021.10.0/python-3.10.13-i7jpfao/bin/python3
dheinzel@discover35:~> module load jedi-fv3-env soca-env
dheinzel@discover35:~> which python3
/gpfsm/dswdev/jcsda/spack-stack/scu17/spack-stack-20240228/envs/unified-env-intel-2021.10.0/install/intel/2021.10.0/python-3.10.13-i7jpfao/bin/python3
dheinzel@discover35:~> module load ewok-env
dheinzel@discover35:~> which python3
/gpfsm/dswdev/jcsda/spack-stack/scu17/spack-stack-20240228/envs/unified-env-intel-2021.10.0/install/intel/2021.10.0/python-3.10.13-i7jpfao/bin/python3

@climbfuji (Collaborator Author):

Regarding the slowness. Can you set the following environment variables and try again? I got those from @mathomp4 and they will be part of the stack-intel compiler module in any future version of spack-stack (assuming they make things better and not worse):

setenv I_MPI_SHM_HEAP_VSIZE 512
setenv PSM2_MEMORY large
setenv I_MPI_EXTRA_FILESYSTEM 1
setenv I_MPI_EXTRA_FILESYSTEM_FORCE gpfs
setenv I_MPI_FALLBACK 0
setenv I_MPI_FABRICS ofi
setenv I_MPI_OFI_PROVIDER psm3
setenv I_MPI_ADJUST_SCATTER 2
setenv I_MPI_ADJUST_SCATTERV 2
setenv I_MPI_ADJUST_GATHER 2
setenv I_MPI_ADJUST_GATHERV 3
setenv I_MPI_ADJUST_ALLGATHER 3
setenv I_MPI_ADJUST_ALLGATHERV 3
setenv I_MPI_ADJUST_ALLREDUCE 12
setenv I_MPI_ADJUST_REDUCE 10
setenv I_MPI_ADJUST_BCAST 11
setenv I_MPI_ADJUST_REDUCE_SCATTER 4
setenv I_MPI_ADJUST_BARRIER 9
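
For bash/sh-based shells or job scripts, the equivalent export form of the same settings (a direct translation of the setenv lines above, values unchanged) would be:

export I_MPI_SHM_HEAP_VSIZE=512
export PSM2_MEMORY=large
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_FORCE=gpfs
export I_MPI_FALLBACK=0
export I_MPI_FABRICS=ofi
export I_MPI_OFI_PROVIDER=psm3
export I_MPI_ADJUST_SCATTER=2
export I_MPI_ADJUST_SCATTERV=2
export I_MPI_ADJUST_GATHER=2
export I_MPI_ADJUST_GATHERV=3
export I_MPI_ADJUST_ALLGATHER=3
export I_MPI_ADJUST_ALLGATHERV=3
export I_MPI_ADJUST_ALLREDUCE=12
export I_MPI_ADJUST_REDUCE=10
export I_MPI_ADJUST_BCAST=11
export I_MPI_ADJUST_REDUCE_SCATTER=4
export I_MPI_ADJUST_BARRIER=9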

@mathomp4 (Collaborator):

Note: These Intel MPI options are what we've found for Intel MPI + GEOSgcm + SLES15. At the moment SLES15 = Milan, but it's possible that when the Cascade Lakes get on there, we'll have to be even more specific.

Of all of them, it's possible that NCCS might put the PSM3 line in their Intel MPI modulefile, but I don't think they have yet.

@Dooruk (Collaborator) commented Mar 11, 2024

> Regarding the slowness. Can you set the following environment variables and try again? I got those from @mathomp4 and they will be part of the stack-intel compiler module in any future version of spack-stack (assuming they make things better and not worse):

These helped tremendously, thanks @mathomp4! Yesterday, I hit the 1-hour walltime limit without these env variables, and the same variational executable now takes 270 seconds. Silly question, but would these env variables help improve performance if they were set while building JEDI?

Ok, in that case I will use python3 -m pip install to ensure we use the latest pip with installations, or use venv locally.
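
A minimal sketch of that approach, assuming the stack modules above are already loaded; the venv location and package name are placeholders, not something prescribed in this thread:

python3 -m venv $HOME/venvs/jedi-tools        # example venv location (assumption)
source $HOME/venvs/jedi-tools/bin/activate
python3 -m pip install --upgrade pip          # pull in a current pip inside the venv
python3 -m pip install <package>              # placeholder for the actual installation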

@climbfuji (Collaborator Author):

@Dooruk I'll make those env vars default in the milan stack compiler module in my next spack-stack PR. Sounds good about Python/venv, that's what we do, too.

@climbfuji (Collaborator Author):

@Dooruk See #1027 (comment)

@mathomp4 (Collaborator):

> These helped tremendously, thanks @mathomp4! Yesterday, I hit the 1-hour walltime limit without these env variables, and the same variational executable now takes 270 seconds. Silly question, but would these env variables help improve performance if they were set while building JEDI?

If I had to guess, the PSM3 might have been the important one, though dang, that's a biiiig difference. I wonder if other flags made a difference as well? Probably not worth the effort to do a benchmark sweep of every flag :)

@climbfuji (Collaborator Author):

I just built a spack-stack on discover-mil with intel@2021.6.0 instead of 2021.10.0 - will run some tests later today and let you know how/if that changes the runtime.

@climbfuji (Collaborator Author):

right now discover-mil has come to a crawl (gpfs issues again?)

@Dooruk (Collaborator) commented Mar 13, 2024

> I just built a spack-stack on discover-mil with intel@2021.6.0 instead of 2021.10.0 - will run some tests later today and let you know how/if that changes the runtime.

thanks

> right now discover-mil has come to a crawl (gpfs issues again?)

Yes, even my simple pip installs take 10 minutes; it is so frustrating. On Discover, I've noticed there are more people active from Tuesday through Thursday, which slows the system down. I would suggest doing Discover work on Mondays and Fridays 😄

@mathomp4 (Collaborator):

Yeah. We aren't sure why SCU17 sometimes has these issues.

Another fun one can be that sometimes the network out of discover seems to use a weird route. Sometimes clones of MAPL can take 30+ minutes. It's why I've moved to blobless clones when I can, since those can take seconds comparatively!
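
For reference, a blobless clone like the one mentioned above can be done with git's partial-clone filter (the MAPL URL here assumes the public GEOS-ESM repository):

# fetch commits and trees up front; file contents (blobs) are downloaded on demand
git clone --filter=blob:none https://github.com/GEOS-ESM/MAPL.git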

@Dooruk (Collaborator) commented Mar 13, 2024

> If I had to guess, the PSM3 might have been the important one, though dang, that's a biiiig difference. I wonder if other flags made a difference as well? Probably not worth the effort to do a benchmark sweep of every flag :)

You are right. I just tested running with all the other flags but without the following one, and I get the same issue. This one seems to be the magic touch 🪄:

export I_MPI_OFI_PROVIDER=psm3

@climbfuji (Collaborator Author):

@Dooruk I am doing a few timing comparisons. I have a large variational task that, with the I_MPI settings above, finishes in 1900s on discover-mil with intel@2021.10.0 (and the walltime in EWOK is set to 1hr 15min, therefore this looks rather fast). I am going to do the same run on discover (cas) with 2021.5.0 (I think it's 5).

@climbfuji (Collaborator Author):

> @Dooruk I am doing a few timing comparisons. I have a large variational task that, with the I_MPI settings above, finishes in 1900s on discover-mil with intel@2021.10.0 (and the walltime in EWOK is set to 1hr 15min, therefore this looks rather fast). I am going to do the same run on discover (cas) with 2021.5.0 (I think it's 5).

The second cycle with intel 2021.10.0 on scu17 finished even faster (1700s).

@climbfuji (Collaborator Author):

@Dooruk @mathomp4 Here is a poor man's comparison of three different experiments, in which the large variational task (18 nodes, 12 tasks per node, due to memory limitations on SCU16; maybe potential for optimization on SCU17) ran twice. Note: I used the I_MPI settings from @mathomp4 for SCU17.

The takeaways:

  • No difference in runtime between Intel 2021.10.0 and 2021.6.0 on SCU17, but a huge difference in memory footprint!
  • SCU17 is about 22% slower than SCU16
SCU17 with 2021.10.0

OOPS_STATS Run end - Runtime:   1901.65 sec,  Memory: total:  3577.86 Gb, per task: min =    15.42 Gb, max =    21.33 Gb

OOPS_STATS Run end - Runtime:   1613.91 sec,  Memory: total:  3292.10 Gb, per task: min =    14.12 Gb, max =    20.47 Gb


SCU17 with 2021.6.0

OOPS_STATS Run end - Runtime:   2081.32 sec,  Memory: total:  1577.36 Gb, per task: min =     6.24 Gb, max =    11.75 Gb

OOPS_STATS Run end - Runtime:   1592.81 sec,  Memory: total:  1402.77 Gb, per task: min =     5.48 Gb, max =    11.59 Gb


SCU16 with 2021.5.0

OOPS_STATS Run end - Runtime:   1638.15 sec,  Memory: total:  1613.46 Gb, per task: min =     6.43 Gb, max =    11.81 Gb

OOPS_STATS Run end - Runtime:   1308.62 sec,  Memory: total:  1435.14 Gb, per task: min =     5.59 Gb, max =    11.83 Gb

I am going to close this issue as resolved, because the factor-of-many difference in runtime we saw initially is addressed by the I_MPI settings. But I'll open another issue for the 22% runtime difference between SCU16 and SCU17 and the factor-of-2 memory increase on SCU17 with 2021.10.0 vs 2021.6.0.
