
Update to v1.2.0-alpha.5 #555

Merged: 9 commits merged into MPAS-Dev:main on Apr 7, 2023

Conversation

@xylar (Collaborator) commented Mar 8, 2023

Updates mache to v1.14.0, which brings in many updates since v1.10.0 (current compass version)

Explicitly specifies a version of CMake in spack, which Trilinos needs (from #549)

Removes Albany (therefore MALI) support on Anvil, as we haven't been able to create a compatible configuration.

Updates the spack, compass and mpas-standalone locations to be in /usr/projects/e3sm on Chicoma.

Removes support for Cori-Haswell, which will be decommissioned in April.

Removes support for Gnu and OpenMPI on Compy, which has not been working in testing.

Testing

MPAS-Ocean with pr:

MALI with full_integration:

  • Chicoma @xylar
    • gnu and mpich
  • Chrysalis @xylar
    • gnu and openmpi
  • Compy @xylar
    • gnu and openmpi - nearly all tests are failing (invalid memory reference)
  • Perlmutter @xylar
    • gnu and mpich

MPAS-Ocean with nonhydro:

Deployed

MPAS-Ocean with pr:

  • Anvil @xylar
    • intel and impi
    • intel and openmpi
    • gnu and openmpi
    • gnu and mvapich
  • Chicoma @mark-petersen
    • gnu and mpich
  • Chrysalis @xylar
    • intel and openmpi
    • intel and impi
    • gnu and openmpi
  • Compy @xylar
    • intel and impi
  • Perlmutter @mark-petersen
    • gnu and mpich

MALI with full_integration:

MPAS-Ocean with nonhydro:

  • Anvil @darincomeau
    • intel and impi
    • gnu and openmpi
  • Chicoma @xylar
    • gnu and mpich
  • Chrysalis @darincomeau
    • intel and openmpi
    • gnu and openmpi
  • Compy @xylar (@jonbob, I'll do this one since I'm doing the non-petsc on Compy anyway)
    • intel and impi
  • Perlmutter @xylar
    • gnu and mpich

closes #335

@xylar (Collaborator, Author) commented Mar 8, 2023

@mark-petersen, @matthewhoffman, @trhille, @darincomeau and @jonbob,

A need has arisen for new spack environments. The changes @jonbob has made for Chicoma in E3SM-Project/E3SM#5499 also affect spack (E3SM-Project/mache#112).

We also need to solve the problems in #539 and #549, the relevant commits from which I have included here.

@jonbob and I will test a few builds on a few different machines first to make sure there aren't any unpleasant surprises. Then, I will merge E3SM-Project/mache#112 and ask you all to do some more thorough testing (see the PR description). Then, I will create a new mache release and we will deploy (again, see above).

There is nothing for any of the rest of you to do for now. I just wanted to make sure you're aware that I'll be asking for this soon.

@xylar (Collaborator, Author) commented Mar 8, 2023

As a reminder, instructions are here:
https://mpas-dev.github.io/compass/latest/developers_guide/deploying_spack.html

@xylar force-pushed the update_to_1.2.0-alpha.5 branch 2 times, most recently from 27e4061 to 435148c (March 8, 2023, 13:58)
@xylar (Collaborator, Author) commented Mar 8, 2023

A question for everyone (but especially @trhille, @mark-petersen and @darincomeau who would do the work): Is there any point in updating on Cori at this point? Or should we drop support with this update?

@trhille (Collaborator) commented Mar 9, 2023

A question for everyone (but especially @trhille, @mark-petersen and @darincomeau who would do the work): Is there any point in updating on Cori at this point? Or should we drop support with this update?

I'd be fine with leaving Cori out of the update.

@xylar (Collaborator, Author) commented Mar 9, 2023

Okay, given @trhille's blessing, I'm removing Cori (we can always take out that commit).

@xylar added the "in progress" (This PR is not ready for review or merging) and "dependencies and deployment" (Changes relate to creating conda and Spack environments, and creating a load script) labels on Mar 14, 2023
@mark-petersen (Collaborator) commented Mar 21, 2023

This is how I am testing this branch on perlmutter and chicoma:


export  CONDA_BASE=/usr/projects/climate/mpeterse/miconda3 # LANL IC
export SCRATCH_BASE=/lustre/scratch5/mpeterse/ # LANL IC

export  CONDA_BASE=~/miconda3 # elsewhere
export SCRATCH_BASE=/pscratch/sd/m/mpeterse/ # NERSC

./conda/configure_compass_env.py  \
   --conda ${CONDA_BASE}  \
   --update_spack   \
   --spack ${SCRATCH_BASE}/spack_test  \
   --tmpdir ${SCRATCH_BASE}/spack_tmp  \
   --compiler gnu  \
   --mache_fork xylar/mache  \
   --mache_branch update_chicoma_spack  \
   --recreate

Is that the correct mache branch? @xylar is this what I'm supposed to be doing?

On perlmutter it dies with this:

./conda/configure_compass_env.py  \
>    --conda ${CONDA_BASE}  \
>    --update_spack   \
>    --spack ${SCRATCH_BASE}/spack_test  \
>    --tmpdir ${SCRATCH_BASE}/spack_tmp  \
>    --compiler gnu  \
>    --mache_fork xylar/mache  \
>    --mache_branch update_chicoma_spack  \
>    --recreate
Logging to: conda/logs/prebootstrap.log

Doing initial setup


Setting up a conda environment for installing compass

Clone and install local mache

Creating the compass conda environment


 Running:
   source /global/homes/m/mpeterse/miconda3/etc/profile.d/conda.sh
   source /global/homes/m/mpeterse/miconda3/etc/profile.d/mamba.sh
   conda activate compass_bootstrap
   /global/u2/m/mpeterse/repos/compass/pr/conda/bootstrap.py --conda /global/homes/m/mpeterse/miconda3 --update_spack --spack /pscratch/sd/m/mpeterse//spack_test --tmpdir /pscratch/sd/m/mpeterse//spack_tmp --compiler gnu --mache_fork xylar/mache --mache_branch update_chicoma_spack --recreate

Logging to: conda/logs/bootstrap.log

/global/u2/m/mpeterse/miconda3/envs/compass_bootstrap/lib/python3.11/site-packages/mache/discover.py:47: UserWarning: defaulting to pm-cpu.  Explicitly specify pm-gpu as the machine if you wish to run on GPUs.
  warnings.warn('defaulting to pm-cpu.  Explicitly specify '
Configuring environment(s) for the following compilers and MPI libraries:
  gnu, mpich

creating dev_compass_1.2.0-alpha.5
Installing pre-commit

Install local mache


Note: the module "PrgEnv-nvidia" cannot be unloaded because it was not loaded.


Note: the module "craype-accel-host" cannot be unloaded because it was not loaded.


Note: the module "perftools" cannot be unloaded because it was not loaded.


Note: the module "darshan" cannot be unloaded because it was not loaded.


Note: the module "cray-hdf5-parallel" cannot be unloaded because it was not loaded.


Note: the module "cray-netcdf-hdf5parallel" cannot be unloaded because it was not loaded.


Note: the module "cray-parallel-netcdf" cannot be unloaded because it was not loaded.

Cloning into '/pscratch/sd/m/mpeterse//spack_test/spack_for_mache_1.14.0'...
remote: Enumerating objects: 403099, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 403099 (delta 4), reused 1 (delta 1), pack-reused 403093
Receiving objects: 100% (403099/403099), 206.91 MiB | 29.30 MiB/s, done.
Resolving deltas: 100% (160774/160774), done.
Updating files: 100% (10014/10014), done.
creating new environment: dev_compass_1_2_0-alpha_5_gnu_mpich
==> Created environment 'dev_compass_1_2_0-alpha_5_gnu_mpich' in /pscratch/sd/m/mpeterse/spack_test/spack_for_mache_1.14.0/var/spack/environments/dev_compass_1_2_0-alpha_5_gnu_mpich
==> You can activate this environment with:
==>   spack env activate dev_compass_1_2_0-alpha_5_gnu_mpich
==> Bootstrapping clingo from pre-built binaries
==> Fetching https://mirror.spack.io/bootstrap/github-actions/v0.4/build_cache/linux-centos7-x86_64-gcc-10.2.1-clingo-bootstrap-spack-prqkzynv2nwko5mktitebgkeumuxkveu.spec.json
==> Fetching https://mirror.spack.io/bootstrap/github-actions/v0.4/build_cache/linux-centos7-x86_64/gcc-10.2.1/clingo-bootstrap-spack/linux-centos7-x86_64-gcc-10.2.1-clingo-bootstrap-spack-prqkzynv2nwko5mktitebgkeumuxkveu.spack
==> Installing "clingo-bootstrap@spack%[email protected]~docs~ipo+python+static_libstdcpp build_type=Release arch=linux-centos7-x86_64" from a buildcache
==> Error: Package 'armpl' not found.
You may need to run 'spack clean -m'.
Traceback (most recent call last):
  File "/global/u2/m/mpeterse/repos/compass/pr/conda/bootstrap.py", line 987, in <module>
    main()
  File "/global/u2/m/mpeterse/repos/compass/pr/conda/bootstrap.py", line 927, in main
    spack_branch_base, spack_script, env_vars = build_spack_env(
                                                ^^^^^^^^^^^^^^^^
  File "/global/u2/m/mpeterse/repos/compass/pr/conda/bootstrap.py", line 439, in build_spack_env
    make_spack_env(spack_path=spack_branch_base, env_name=spack_env,
  File "/global/u2/m/mpeterse/miconda3/envs/compass_bootstrap/lib/python3.11/site-packages/mache/spack/__init__.py", line 140, in make_spack_env
    subprocess.check_call(f'env -i bash -l {build_filename}', shell=True)
  File "/global/u2/m/mpeterse/miconda3/envs/compass_bootstrap/lib/python3.11/subprocess.py", line 413, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'env -i bash -l build_dev_compass_1_2_0-alpha_5_gnu_mpich.bash' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/global/u2/m/mpeterse/repos/compass/pr/./conda/configure_compass_env.py", line 135, in <module>
    main()
  File "/global/u2/m/mpeterse/repos/compass/pr/./conda/configure_compass_env.py", line 131, in main
    bootstrap(activate_install_env, source_path, local_conda_build)
  File "/global/u2/m/mpeterse/repos/compass/pr/./conda/configure_compass_env.py", line 37, in bootstrap
    check_call(command)
  File "/global/u2/m/mpeterse/repos/compass/pr/conda/shared.py", line 151, in check_call
    raise subprocess.CalledProcessError(process.returncode, commands)
subprocess.CalledProcessError: Command 'source /global/homes/m/mpeterse/miconda3/etc/profile.d/conda.sh && source /global/homes/m/mpeterse/miconda3/etc/profile.d/mamba.sh && conda activate compass_bootstrap && /global/u2/m/mpeterse/repos/compass/pr/conda/bootstrap.py --conda /global/homes/m/mpeterse/miconda3 --update_spack --spack /pscratch/sd/m/mpeterse//spack_test --tmpdir /pscratch/sd/m/mpeterse//spack_tmp --compiler gnu --mache_fork xylar/mache --mache_branch update_chicoma_spack --recreate' returned non-zero exit status 1.

It gets a lot further on chicoma, but the builds are extremely slow (though not hung):

==> Installing esmf-8.2.0-c6wuz6a5c2cdm27csntj7tw7uetoozdj
==> No binary for esmf-8.2.0-c6wuz6a5c2cdm27csntj7tw7uetoozdj found: installing from source
==> Fetching https://mirror.spack.io/_source-cache/archive/36/3693987aba2c8ae8af67a0e222bea4099a48afe09b8d3d334106f9d7fc311485.tar.gz
==> No patches needed for esmf
==> esmf: Executing phase: 'edit'
==> esmf: Executing phase: 'build'

@xylar (Collaborator, Author) commented Mar 21, 2023

@mark-petersen, yes, that's the correct mache branch. You do, indeed, need to run `spack clean -m`. You'll need to figure out the right file to source to get the `spack` executable.

See:
https://mpas-dev.github.io/compass/latest/developers_guide/deploying_spack.html#troubleshooting-spack
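A minimal sketch of that cleanup, assuming the spack clone location implied by the configure command above (the exact path depends on your `--spack` argument, so treat these paths as examples):

```shell
# Example paths: source spack's setup script from the clone the deployment
# created, then clear spack's misc cache, as the
# "You may need to run 'spack clean -m'" message suggests.
SPACK_ROOT=${SCRATCH_BASE}/spack_test/spack_for_mache_1.14.0
if [ -f "${SPACK_ROOT}/share/spack/setup-env.sh" ]; then
    source "${SPACK_ROOT}/share/spack/setup-env.sh"  # puts `spack` on PATH
    spack clean -m  # remove the misc cache that can hold stale package data
else
    echo "spack setup script not found under ${SPACK_ROOT}"
fi
```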

@xylar (Collaborator, Author) commented Mar 21, 2023

I've also been having trouble on chicoma.

@xylar (Collaborator, Author) commented Mar 21, 2023

@mark-petersen, one more thing: you should be able to see the EC issues on main. You don't need to use this branch or any special branch of mache. It's been an ongoing problem for many months.

@mark-petersen (Collaborator) commented

Using the commands in my previous post, chicoma ran to completion - it was just slow. I was then able to

source load_dev_compass_1.2.0-alpha.5_chicoma-cpu_gnu_mpich.sh

compass suite -s -c ocean -t pr \
  -p /usr/projects/climate/mpeterse/repos/E3SM/master/components/mpas-ocean \
  -w $DIR

build MPAS-Ocean, and run successfully with compass run. So that one works!

I still need to figure out perlmutter.

@xylar (Collaborator, Author) commented Mar 21, 2023

@mark-petersen, Chrysalis with Gnu and OpenMPI would be the other to try. That one has been failing reliably for a long time. #500

@mark-petersen (Collaborator) commented

I take that back. Using compass build with this pr on chicoma passes the nightly test suite, but I get these failures with the pr test suite:

00:00 PASS ocean_global_ocean_EC30to60_mesh
00:00 PASS ocean_global_ocean_EC30to60_PHC_init
115:36 FAIL ocean_global_ocean_EC30to60_PHC_performance_test

This one also has trouble:

ocean/isomip_plus/planar/2km/z-star/Ocean0
  * step: process_geom
  * step: planar_mesh
  * step: cull_mesh
  * step: initial_state
  * step: ssh_adjustment

It appears to hang on this line in the log file, but sometimes recovers.

 Reading namelist from file namelist.ocean

This looks like the same error as in #500.

Since this is a long-standing documented error, it does not affect this compass update.

@xylar (Collaborator, Author) commented Mar 22, 2023

@mark-petersen, the problems with ocean/isomip_plus/planar/2km/z-star/Ocean0 are new to me. I wonder if it was a temporary file system issue. I was seeing a lot of random hanging on Chicoma yesterday (even just in building spack, which was going to scratch4). I wonder if it's a scratch issue. I don't think we can assume it's the same issue as #500, which was specifically for EC tests.

The ocean_global_ocean_EC30to60_PHC_performance_test failure does seem likely to be what I'm seeing, presumably #497 rather than #500?

@xylar (Collaborator, Author) commented Mar 22, 2023

Since this is a long-standing documented error, it does not affect this compass update.

I'm inclined to say the same, even though it's frustrating that it's a long-standing error.

@xylar (Collaborator, Author) commented Mar 22, 2023

I am now waiting on:
E3SM-Project/E3SM#5533
and
E3SM-Project/mache#114
before I ask everyone to deploy spack on all the machines.

@darincomeau (Collaborator) commented Apr 4, 2023

@xylar the ./conda/configure_compass_env.py script for both Chrysalis and Anvil completed successfully.

nonhydro (270 and 271) test results (will update as they run):

Chrysalis:
intel and openmpi:

Test Runtimes:
11:07 FAIL ocean_nonhydro_solitary_wave
07:01 PASS ocean_nonhydro_stratified_seiche
Total runtime 18:09
FAIL: 1 test failed, see above.

gnu and openmpi:

Test Runtimes:
10:31 FAIL ocean_nonhydro_solitary_wave
07:02 PASS ocean_nonhydro_stratified_seiche
Total runtime 17:35
FAIL: 1 test failed, see above.

Anvil:
intel and impi

Build error, will revisit when next test completes

gnu and openmpi

Built, compass test in queue

@xylar (Collaborator, Author) commented Apr 5, 2023

@darincomeau, the failed tests on Chrysalis aren't a good sign. Could you point me to where you ran them so I can look at the log files?

@darincomeau (Collaborator) commented

@xylar the four test directories are here:

/lcrc/group/e3sm/ac.dcomeau/compass/test_20230403

Thanks for taking a look!

@xylar (Collaborator, Author) commented Apr 5, 2023

Thanks @darincomeau. The fact that we're seeing state validation errors indicates to me that Sara might have changed something on her branch that has broken the test. This is part of what makes the nonhydro testing so fragile compared to our other workflows, where we are pointing to a static commit (via the submodules). I will make sure I can reproduce the problem with compass 1.2.0-alpha.4, and then try to coordinate with Sara to see if she can reproduce the problem and hopefully fix it.

@trhille (Collaborator) commented Apr 5, 2023

@xylar, any thoughts on how to deal with this?

export CONDA_BASE=~/mambaforge/
export TMPDIR=/pscratch/sd/t/trhille/spack_tmp_20230404
./conda/configure_compass_env.py --conda ${CONDA_BASE} --update_spack --tmpdir ${TMPDIR} --compiler gnu --mpi mpich --with_albany --recreate

and I get the following error:

fatal: detected dubious ownership in repository at '/global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/spack_for_mache_1.14.0'

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, I hadn't seen that before. I don't know why you're seeing it and everyone else isn't. The only solution I've found is the one you've tried. I'll keep looking...

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, reading more, it seems like having others using a git repo that I created is not considered to be safe. The only solution I can come up with (which I was avoiding in the interest of saving disk space and build time) is to have a different spack clone for each spack environment, rather than trying to share one. I can make that change but it will mean we all have to start over with deployment. My preference would be for me to do all the deployment on Perlmutter (and any other machines with this error) for this version and then to fix the problem in a subsequent update when we're not so far along.
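For reference, the standard git workaround for this particular error is a sketch like the following (the path is the one from the error message earlier in the thread; note that every user would need to add their own entry, which is part of why a shared clone stays awkward, and whether this is sufficient here is untested):

```shell
# Git >= 2.35.2 refuses to operate on a repo owned by a different user unless
# it is explicitly marked safe. This adds the shared spack clone (example path
# taken from the error message above) to the current user's allow-list.
SPACK_CLONE=/global/cfs/cdirs/e3sm/software/compass/pm-cpu/spack/spack_for_mache_1.14.0
git config --global --add safe.directory "$SPACK_CLONE"
# Confirm the entry is present:
git config --global --get-all safe.directory
```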

@xylar (Collaborator, Author) commented Apr 5, 2023

... except that @mark-petersen is the owner of the spack clone on Perlmutter. So that complicates things.

@trhille (Collaborator) commented Apr 5, 2023

@xylar is this perhaps something that removing --update_spack would help? I don't want to try that without your go-ahead, since I'm not sure if it would end up with multiple spack builds that could result in confusion.

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, I am currently rebuilding the spack environment, after moving @mark-petersen's build aside. Once that's done, you can test without --update_spack but for now if you try that, building MALI will just fail because the spack environment is incomplete.

@xylar (Collaborator, Author) commented Apr 5, 2023

@darincomeau, I think the solitary_wave test case is broken, independent of this PR. I think you can approve based on stratified_seiche.

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, you can test now on Perlmutter.

@trhille (Collaborator) commented Apr 5, 2023

@xylar, we're getting (hopefully) closer.
On Perlmutter, I executed:

export CONDA_BASE=~/mambaforge/
export TMPDIR=/pscratch/sd/t/trhille/spack_tmp_20230405
./conda/configure_compass_env.py --conda ${CONDA_BASE} --tmpdir ${TMPDIR} --compiler gnu --mpi mpich --with_albany --recreate

(Note: without --update_spack)
This ran without error and created load_dev_compass_1.2.0-alpha.5_pm-cpu_gnu_mpich_albany.sh. However, when I ran the full_integration suite, all tests failed to execute. The case_outputs/*.log files all contain the following error upon trying to run the executable: error while loading shared libraries: libpanzer-expr-eval.so.14: cannot open shared object file: No such file or directory

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, let me see if I can reproduce this.

@xylar (Collaborator, Author) commented Apr 5, 2023

@trhille, I'm getting other errors even earlier. I'll try again tomorrow...

@darincomeau (Collaborator) left a review comment

Deployed and tested nonhydro stratified_seiche for supported compilers on Chrysalis and Anvil.

Thanks for all the guidance @xylar !

@xylar (Collaborator, Author) commented Apr 6, 2023

@trhille, I was able to build MALI just fine on Perlmutter. Did you do the following?

make ALBANY=true gnu-cray

I'm running the test suite now and it seems to be passing so far.

Update: the full_integration suite passed for me on Perlmutter.

@xylar (Collaborator, Author) commented Apr 6, 2023

@trhille, I checked the box but would still appreciate you verifying that things work for you. If not, we need to debug that library error.

@trhille (Collaborator) commented Apr 6, 2023

@trhille, I was able to build MALI just fine on Perlmutter. Did you do the following?

I had no problem building MALI; it was just when I ran full_integration that I ran into trouble.
However, this morning I rebuilt MALI and now full_integration runs just fine. I'm not sure what I did wrong yesterday. Sorry for the false alarm!

@trhille (Collaborator) commented Apr 6, 2023

@xylar, I'm still waiting for your go-ahead to re-test on Chicoma without --update_spack, correct?

@xylar (Collaborator, Author) commented Apr 6, 2023

@trhille, on Chicoma, please use --update_spack and let me know how it goes.

@trhille (Collaborator) commented Apr 6, 2023

@xylar, Chicoma still gives me fatal: detected dubious ownership in repository at '/usr/projects/e3sm/compass/chicoma-cpu/spack/spack_for_mache_1.14.0'

@xylar (Collaborator, Author) commented Apr 6, 2023

Okay, thanks. I'll do all 3 envs on Chicoma, too.

@xylar (Collaborator, Author) commented Apr 7, 2023

@trhille, I've made the --with_albany environment on Chicoma. Please test when you can.

@jonbob, I've made the shared PETSc environment, too. Did you run into the same issue as @trhille on Chicoma and Perlmutter (not being able to make changes to @mark-petersen's spack clone)? Or have you been using a different space?

@xylar (Collaborator, Author) commented Apr 7, 2023

@scalandr suggested the following namelist changes to make solitary_wave run successfully:

config_use_vertMom_del2 = .true.
config_vertMom_del2 = 1.0
config_nonhydrostatic_remove_rhs_mean = .false.

She will make a compass pr to change these. With these changes, I was able to run both nonhydro test cases successfully on both Perlmutter and Chicoma. @jonbob, I'll check the box on those.

@xylar (Collaborator, Author) commented Apr 7, 2023

@trhille, I was able to run full_integration successfully on Chicoma. I'm going to check the box on that.

@xylar (Collaborator, Author) commented Apr 7, 2023

Thanks everyone for your help on this! I know it was a slog. Hopefully, the next time will be smoother because of what we learned this time.

@xylar xylar merged commit 8160b2f into MPAS-Dev:main Apr 7, 2023
@xylar xylar deleted the update_to_1.2.0-alpha.5 branch April 7, 2023 10:24
@jonbob (Collaborator) commented Apr 7, 2023

@xylar - I had run into that issue and more. I finally gave up, or at least stopped pushing on it...

Successfully merging this pull request may close these issues:

ocean/global_ocean/QU240/PHC/files_for_e3sm failing on Cori