Move to contrib installation of spack-stack on Jet #2878


InnocentSouopgui-NOAA
Contributor

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA commented Aug 29, 2024

Description

Migrates Global Workflow to use contrib installation of spack-stack on Jet.
Following the failure of the /lfs4 storage on Jet, the spack-stack installation moved to /contrib.
All software relying on spack-stack on Jet needs to be updated.

Resolves #2841
Refs NOAA-EMC/gfs-utils#78
Refs NOAA-EMC/GSI#786
Refs NOAA-EMC/GSI-Monitor#143
Refs NOAA-EMC/GSI-utils#51
Refs ufs-community/UFS_UTILS#977
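
A migration like this usually comes down to finding every file that still points at the old location. The following is a minimal sketch, not the method used in this PR: the function name `find_stale_references` and the default search string are illustrative assumptions, and only the `/lfs4` prefix comes from the description above.

```python
from pathlib import Path


def find_stale_references(root, needle="/lfs4"):
    """Return (path, line_number, line) for each text line under root containing needle.

    `needle` defaults to the retired storage prefix named in the PR description;
    the actual spack-stack paths are not given there, so none are assumed here.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

Running it over a checkout's modulefile and version-file directories would list the lines that still need to be switched to the /contrib installation.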

Type of change

  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

How has this been tested?

Example:

  • Clone and build on Jet

  • Cycled experiments (48+ hours) at resolutions

    • 96/48 on kjet
    • 192/96 on kjet
    • 384/192 on kjet
  • Forecast only experiment (48+ hours) at resolutions

    • 48
    • 96
    • 192
    • 384

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@InnocentSouopgui-NOAA
Contributor Author

I am getting errors of the form below in the forecast steps (tasks gdasfcst_seg0 and all enkfgdasfcst_mem###).
This is while running C192/C96 on Jet. It happened from the very first cycle, so not a single cycle completed.

@DavidHuber-NOAA

21: Warn_K=   6 (i,j)=   87   12 (lon,lat)=123.209 -43.765 VA = 264.64157
21:      K=   5    338.73022
21:      K=   7    217.40834
21: Warn_K=   6 (i,j)=   88   13 (lon,lat)=121.423 -43.705 VA = 250.69765
21:      K=   5    297.46832
21:      K=   7    213.49741
21: Warn_K=   6 (i,j)=   84   16 (lon,lat)=122.124 -46.976 VA = 251.07759
21:      K=   5    256.00562
21:      K=   7    241.60887
 0: PASS: fcstRUN phase 2, n_atmsteps =               27 time is         1.264142
 5:
 5: FATAL from PE     5: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
 5:
13:
13: FATAL from PE    13: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
13:
21:
21: FATAL from PE    21: NaN in input field of mpp_reproducing_sum(_2d), this indicates numerical instability
21:
13: Image              PC                Routine            Line        Source
13: ufs_model.x        00000000086C51A7  Unknown               Unknown  Unknown
13: ufs_model.x        00000000078AD1B9  mpp_mod_mp_mpp_er          72  mpp_util_mpi.inc
13: ufs_model.x        0000000007B61AB6  mpp_efp_mod_mp_mp         195  mpp_efp.F90
13: ufs_model.x        0000000007AB2D99  mpp_domains_mod_m         143  mpp_global_sum.fh
13: ufs_model.x        0000000003E9EB6A  fv_grid_utils_mod        3077  fv_grid_utils.F90
13: ufs_model.x        0000000003F2ED3E  fv_mapz_mod_mp_la         794  fv_mapz.F90
13: libiomp5.so        0000146B48A6CBB3  __kmp_invoke_micr     Unknown  Unknown
13: libiomp5.so        0000146B489E8FAC  __kmp_fork_call       Unknown  Unknown
13: libiomp5.so        0000146B489AACB5  __kmpc_fork_call      Unknown  Unknown
13: ufs_model.x        0000000003F2A129  fv_mapz_mod_mp_la         683  fv_mapz.F90
13: ufs_model.x        0000000003E2EE61  fv_dynamics_mod_m         771  fv_dynamics.F90
13: ufs_model.x        0000000003C9236C  atmosphere_mod_mp         688  atmosphere.F90
13: ufs_model.x        0000000003A3490D  atmos_model_mod_m         879  atmos_model.F90
13: ufs_model.x        00000000035F688C  module_fcst_grid_        1335  module_fcst_grid_comp.F90
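
When a log like the one above is long, it can help to pull out just the FATAL lines and see which PEs reported them. This is a rough sketch that assumes the `NN: FATAL from PE NN: message` layout shown in the excerpt; the function name `collect_fatal_messages` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as " 5: FATAL from PE     5: NaN in input field ..."
FATAL_RE = re.compile(r"^\s*\d+:\s*FATAL from PE\s+(\d+):\s*(.+?)\s*$")


def collect_fatal_messages(log_text):
    """Map each distinct FATAL message to the sorted list of PEs that reported it."""
    pes_by_message = defaultdict(set)
    for line in log_text.splitlines():
        match = FATAL_RE.match(line)
        if match:
            pes_by_message[match.group(2)].add(int(match.group(1)))
    return {message: sorted(pes) for message, pes in pes_by_message.items()}
```

Applied to the excerpt above, this yields a single message (the NaN in `mpp_reproducing_sum(_2d)`) reported by PEs 5, 13, and 21.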

@InnocentSouopgui-NOAA
Contributor Author

It seems to be just a bad start date.

@InnocentSouopgui-NOAA
Contributor Author

I built initial conditions for other days and the experiment cycled smoothly.

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA marked this pull request as ready for review September 5, 2024 19:57
@RussTreadon-NOAA
Contributor

GSI PR #787 has been merged into GSI develop. Done at 9f44c87.

The sorc/gsi_enkf.fd hash in InnocentSouopgui-NOAA:migration-jet-contrib must be updated to 9f44c87 to bring these changes into g-w.

@InnocentSouopgui-NOAA InnocentSouopgui-NOAA marked this pull request as draft September 6, 2024 16:13
@InnocentSouopgui-NOAA
Contributor Author

A check is failing with the following message,

fatal: Fetched in submodule path 'ufs_utils.fd', but it did not contain 0426bf793051530794ec8f182e04f5cf129d0a90. Direct fetching of that commit failed.

How can I diagnose what is going on?

@DavidHuber-NOAA
Contributor

@InnocentSouopgui-NOAA I suspect that the hash you are pointing to is for your own branch. Update the hash instead to ufs-community/UFS_UTILS@06eec5b.
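
One way to investigate this kind of failure is to list the commit each submodule records and then check whether that commit exists in the upstream repository, for example with `git branch -r --contains <sha>` in an upstream clone. The sketch below only parses the text printed by `git submodule status`; the helper name `parse_submodule_status` is hypothetical and it assumes paths without spaces.

```python
import re

# `git submodule status` prints: <flag><40-hex sha> <path> [(<describe>)]
STATUS_RE = re.compile(r"^([ +\-U])([0-9a-f]{40}) (\S+)(?: \((.*)\))?$")

FLAG_MEANING = {
    " ": "in sync",
    "-": "not initialized",
    "+": "checked-out commit differs from the recorded one",
    "U": "merge conflict",
}


def parse_submodule_status(text):
    """Return {path: (sha, state)} parsed from `git submodule status` output."""
    entries = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            flag, sha, path, _describe = match.groups()
            entries[path] = (sha, FLAG_MEANING[flag])
    return entries
```

A recorded hash that cannot be found in the upstream remote, as in the error above, typically means it exists only on a fork or a local branch.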

@InnocentSouopgui-NOAA
Contributor Author

Thanks @DavidHuber-NOAA, that was the issue.

@DavidHuber-NOAA
Contributor

@InnocentSouopgui-NOAA Just a heads up, there is a bug in the newest GSI-utils that will cause the gdasanalcalc job to fail when performing GDASApp analyses (i.e. the C96C48_ufs_hybatmDA CI test) as noted in #2819 (comment).

@InnocentSouopgui-NOAA
Contributor Author

@KateFriedman-NOAA, @DavidHuber-NOAA The automated tests failed on WCOSS2. Can you take a look and investigate further? The error says "CPU oversubscription detected for application". I do not have access to WCOSS2 to check what is going on.

That looks similar to a problem that cropped up in PR #2895; @DavidHuber-NOAA had that PR include updates to the prep job resources to resolve it on WCOSS2. Sync your PR branch with g-w develop and we'll retry the WCOSS2 CI with those updated prep job resources.

I just updated my branch.

@KateFriedman-NOAA KateFriedman-NOAA added CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS and removed CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed labels Sep 26, 2024
@emcbot emcbot added CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS and removed CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

CI Update on Wcoss2 at 09/26/24 01:26:06 PM
============================================
Cloning and Building global-workflow PR: 2878
with PID: 111294 on host: clogin03

@emcbot emcbot added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Thu Sep 26 13:31:59 UTC 2024 on clogin03
---------------------------------------------------
Build: Completed at 09/26/24 02:09:45 PM
Case setup: Completed for experiment C48_ATM_e6c639aa
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_e6c639aa
Case setup: Skipped for experiment C48_S2SWA_gefs_e6c639aa
Case setup: Completed for experiment C48_S2SW_e6c639aa
Case setup: Completed for experiment C96_atm3DVar_extended_e6c639aa
Case setup: Skipped for experiment C96_atm3DVar_e6c639aa
Case setup: Completed for experiment C96C48_hybatmaerosnowDA_e6c639aa
Case setup: Completed for experiment C96C48_hybatmDA_e6c639aa
Case setup: Completed for experiment C96C48_ufs_hybatmDA_e6c639aa

@emcbot emcbot added CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Sep 26, 2024
@emcbot

emcbot commented Sep 26, 2024

Experiment C96_atm3DVar_extended_e6c639aa FAIL on Wcoss2 at 09/26/24 11:14:42 PM

Error logs:

/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f095.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f096.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f097.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f098.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f099.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f100.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f101.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f102.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f103.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f104.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f105.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f106.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f107.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsatmos_prod_f108.log
/lfs/h2/emc/global/noscrub/globalworkflow.ci/GFS_CI_ROOT/PR/2878/RUNTESTS/COMROOT/C96_atm3DVar_extended_e6c639aa/logs/2021122118/gfsfcst_seg0.log

Follow link here to view the contents of the above file(s): (link)
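
A long list of error logs like this is easier to read when grouped by job. The following sketch assumes the `<job>_fNNN.log` naming visible in the paths above; `summarize_error_logs` is an illustrative name, not part of the CI tooling.

```python
import re
from pathlib import PurePosixPath

# Per-forecast-hour logs end in "_fNNN", e.g. "gfsatmos_prod_f095"
HOUR_RE = re.compile(r"^(?P<job>.+)_f(?P<hour>\d{3})$")


def summarize_error_logs(paths):
    """Group log files by job name and collect forecast hours where present."""
    hours_by_job = {}
    for path in paths:
        stem = PurePosixPath(path).stem  # file name without ".log"
        match = HOUR_RE.match(stem)
        if match:
            hours_by_job.setdefault(match.group("job"), []).append(int(match.group("hour")))
        else:
            hours_by_job.setdefault(stem, [])  # job without a forecast hour
    return {job: sorted(hours) for job, hours in hours_by_job.items()}
```

For the list above this collapses to two entries: `gfsatmos_prod` for forecast hours 95 through 108, and `gfsfcst_seg0`, all from cycle 2021122118.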

@InnocentSouopgui-NOAA
Contributor Author

@KateFriedman-NOAA, @DavidHuber-NOAA

WCOSS2 has a disk space issue.

Disk quota exceeded in gfsfcst_seg0.log
FATAL ERROR: write error in gfsatmos_prod_f###.log

@KateFriedman-NOAA
Member

Hmmmmmm the stmp quota on WCOSS2-Cactus is currently at 66%. It's possible it hit 100% and then fell overnight because of the scrubber. We'd need to retry the CI test again.

@KateFriedman-NOAA KateFriedman-NOAA added CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS and removed CI-Wcoss2-Failed **Bot use only** CI testing on WCOSS for this PR has failed labels Sep 27, 2024
@emcbot emcbot added CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS and removed CI-Wcoss2-Ready **CM use only** PR is ready for CI testing on WCOSS labels Sep 27, 2024
@emcbot

emcbot commented Sep 27, 2024

CI Update on Wcoss2 at 09/27/24 02:06:07 PM
============================================
Cloning and Building global-workflow PR: 2878
with PID: 122735 on host: clogin03

@emcbot emcbot added CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress and removed CI-Wcoss2-Building **Bot use only** CI testing is cloning/building on WCOSS labels Sep 27, 2024
@emcbot

emcbot commented Sep 27, 2024

Automated global-workflow Testing Results:

Machine: Wcoss2
Start: Fri Sep 27 14:11:52 UTC 2024 on clogin03
---------------------------------------------------
Build: Completed at 09/27/24 02:49:30 PM
Case setup: Completed for experiment C48_ATM_db407437
Case setup: Skipped for experiment C48mx500_3DVarAOWCDA_db407437
Case setup: Skipped for experiment C48_S2SWA_gefs_db407437
Case setup: Completed for experiment C48_S2SW_db407437
Case setup: Completed for experiment C96_atm3DVar_extended_db407437
Case setup: Skipped for experiment C96_atm3DVar_db407437
Case setup: Completed for experiment C96C48_hybatmaerosnowDA_db407437
Case setup: Completed for experiment C96C48_hybatmDA_db407437
Case setup: Completed for experiment C96C48_ufs_hybatmDA_db407437

@emcbot emcbot added CI-Wcoss2-Passed **Bot use only** CI testing on WCOSS for this PR has completed successfully and removed CI-Wcoss2-Running **Bot use only** CI testing on WCOSS for this PR is in-progress labels Sep 28, 2024
@emcbot

emcbot commented Sep 28, 2024

All CI Test Cases Passed on Wcoss2:

Experiment C48_ATM_db407437 *** SUCCESS *** at 09/27/24 04:21:17 PM
Experiment C48_S2SW_db407437 *** SUCCESS *** at 09/27/24 04:35:12 PM
Experiment C96C48_hybatmDA_db407437 *** SUCCESS *** at 09/27/24 05:28:22 PM
Experiment C96C48_hybatmaerosnowDA_db407437 *** SUCCESS *** at 09/27/24 06:07:28 PM
Experiment C96C48_ufs_hybatmDA_db407437 *** SUCCESS *** at 09/27/24 07:21:21 PM
Experiment C96_atm3DVar_extended_db407437 *** SUCCESS *** at 09/28/24 03:56:31 AM
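
To confirm that every experiment that was set up also finished, the bot's two reports can be cross-checked. This is a minimal sketch that assumes the `Case setup: ...` and `Experiment ... *** SUCCESS ***` line formats shown in the bot comments; the function name `unfinished_experiments` is hypothetical.

```python
import re

SETUP_RE = re.compile(r"^Case setup: (Completed|Skipped) for experiment (\S+)$")
SUCCESS_RE = re.compile(r"^Experiment (\S+) \*\*\* SUCCESS \*\*\* at (.+)$")


def unfinished_experiments(setup_report, result_report):
    """Return experiments whose setup completed but that have no SUCCESS line."""
    completed = set()
    for line in setup_report.splitlines():
        match = SETUP_RE.match(line.strip())
        if match and match.group(1) == "Completed":
            completed.add(match.group(2))
    succeeded = set()
    for line in result_report.splitlines():
        match = SUCCESS_RE.match(line.strip())
        if match:
            succeeded.add(match.group(1))
    return sorted(completed - succeeded)
```

An empty result means every experiment with a completed setup has a matching SUCCESS line; skipped experiments are ignored.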

@WalterKolczynski-NOAA WalterKolczynski-NOAA merged commit 8f0541c into NOAA-EMC:develop Sep 30, 2024
6 checks passed
Labels
CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully CI-Wcoss2-Passed **Bot use only** CI testing on WCOSS for this PR has completed successfully
Development

Successfully merging this pull request may close these issues.

Migrate Jet to /lfs5
7 participants