MET_ensemble_verification_only_vx_time_lag no longer works on Tier 1 machines #900

Closed
natalie-perlin opened this issue Sep 5, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@natalie-perlin (Collaborator)

MET verification tests now use the met and metplus modules from the software stacks on Tier 1 machines, a change implemented in PR #826 (#826).
Since that merge, the MET verification tasks have been affected, and MET_ensemble_verification_only_vx_time_lag no longer works (tested on Hera, Gaea, Orion, and the new platform Derecho).
The tasks get_obs_ccpa, get_obs_mrms, and get_obs_ndas fail.

Log files can be viewed on Hera:
/scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/expt_dirs/MET_ensemble_verification_only_vx_time_lag/log/get_obs_ndas_2021050500.log, get_obs_mrms_2021050500.log, get_obs_ccpa_2021050500.log

Attached are the get_obs_*_2021050500.log files, var_defns.sh and generated FV3LAM_wflow.xml workflow.

Expected behavior

The MET_ensemble_verification_only_vx_time_lag test passes successfully on Hera (intel and gnu), Gaea, Orion, Jet, and Derecho. The tasks
get_obs_ccpa, get_obs_mrms, and get_obs_ndas do not need to be run, as the data is staged on these systems.

Current behavior

Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         DEAD                   0.00
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                   0.00

The tasks that fail are get_obs_ccpa, get_obs_mrms, and get_obs_ndas.

Machines affected

Any system running SRW

Steps To Reproduce

Example for Orion:

git clone -b develop https://github.com/ufs-community/ufs-srweather-app.git
cd ufs-srweather-app/
./manage_externals/checkout_externals 
source etc/lmod-setup.sh orion
module use $PWD/modulefiles
./devbuild.sh -p=orion -c=intel 
cd tests/WE2E
module load wflow_orion
conda activate workflow_tools
./run_WE2E_tests.py -t MET_ensemble_verification_only_vx_time_lag  -m orion -a epic

See the bug:

calling function that monitors jobs, prints summary
Writing information for all experiments to WE2E_tests_20230905195031.yaml
Checking tests available for monitoring...
Starting experiment MET_ensemble_verification_only_vx_time_lag running
Updating database for experiment MET_ensemble_verification_only_vx_time_lag
Setup complete; monitoring 1 experiments
Use ctrl-c to pause job submission/monitoring
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_ccpa, jobid=49135445, in state DEAD (FAILED), ran for 6.0 seconds, exit status=256, try=1 (of 1)
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_mrms, jobid=49135446, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=1 (of 1)
09/05/23 19:50:45 UTC :: FV3LAM_wflow.xml :: Cycle 202105050000, Task get_obs_ndas, jobid=49135447, in state DEAD (FAILED), ran for 5.0 seconds, exit status=256, try=1 (of 1)
Experiment MET_ensemble_verification_only_vx_time_lag is DEAD
Took 0:00:23.103369; will no longer monitor.
All 1 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_time_lag                         DEAD                   0.00
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                   0.00

In MET_ensemble_verification_only_vx_time_lag tests run before PR #826 was merged, the get_obs_ccpa, get_obs_mrms, and get_obs_ndas tasks were not run at all, since all of the data was staged on each machine.

An example of a successful MET_ensemble_verification_only_vx_time_lag test on Hera:
SRW base directory: /scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/srw-dev-met
Experiment directory: /scratch1/NCEPDEV/stmp2/Natalie.Perlin/SRW/INTEL/MET_ensemble_verification_only_vx_time_lag

Detailed Description of Fix (optional)

The issue may be related to the configurations in the parm/wflow/*.yaml and ./ush/machine/verify_*.yaml files.

Additional Information (optional)

There are differences between the machine files used in PR #826 (i.e., how the obs data directories are set) and both the current versions and the earlier versions that predate PR #826. For example, on Hera, PR #826 used the following settings:

  CCPA_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ccpa/proc
  MRMS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/mrms/proc
  NDAS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ndas/proc

The current hera.yaml, as well as the machine file before the merge of PR #826, contains the following:

  TEST_CCPA_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ccpa/proc
  TEST_MRMS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/mrms/proc
  TEST_NDAS_OBS_DIR: /scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/ndas/proc
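
For reference, a quick sanity check (a minimal shell sketch using the Hera paths above; the base path would need to be adjusted on other platforms) to confirm that the staged obs data referenced by these machine-file entries is actually present:

# Check that the staged obs directories from hera.yaml exist and are non-empty
for obs in ccpa mrms ndas; do
  dir="/scratch1/NCEPDEV/nems/role.epic/UFS_SRW_data/develop/obs_data/${obs}/proc"
  if [ -d "${dir}" ] && [ -n "$(ls -A "${dir}" 2>/dev/null)" ]; then
    echo "OK:      ${dir}"
  else
    echo "MISSING: ${dir}"
  fi
done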

Possible Implementation (optional)

Output (optional)

get_obs_ccpa_2021050500.log
get_obs_ndas_2021050500.log
get_obs_mrms_2021050500.log
var_defns.sh.txt
FV3LAM_wflow.xml.txt

natalie-perlin added the bug label on Sep 5, 2023
@natalie-perlin (Collaborator, Author)

@mkavulich - some input on the changes from PR #864 and whether they could have affected this test would be really helpful! I'm not sure of the cause yet.
Some changes that could be relevant:
The verification task names likely changed from get_obs_* to get_verif_obs as part of PR #864, yet the tasks that fail in MET_ensemble_verification_only_vx_time_lag are still named get_obs_ccpa, get_obs_mrms, and get_obs_ndas, despite being part of a verification workflow.

@MichaelLueken (Collaborator)

@natalie-perlin -

I was able to clone the develop branch on Orion, build the SRW App, and then submit the MET_ensemble_verification_only_vx_time_lag test. My test failed because HPSS is not available on Orion. The get_obs_* tasks should be pointing to the JREGIONAL_GET_VERIF_OBS j-job file, i.e., <command>&LOAD_MODULES_RUN_TASK_FP; "get_obs" "&JOBSdir;/JREGIONAL_GET_VERIF_OBS"</command>
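
To confirm this in a generated experiment, one can grep the attached FV3LAM_wflow.xml (a quick check run from the experiment directory; both search strings come from the <command> element shown above):

# Confirm the get_obs_* tasks in the generated workflow call the verif-obs j-job
grep -n 'JREGIONAL_GET_VERIF_OBS' FV3LAM_wflow.xml
grep -n '"get_obs"' FV3LAM_wflow.xml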

The test was fundamentally changed in PR #864 to require the verification data to be pulled from HPSS (please see lines 31-34 of the MET_ensemble_verification_only_vx_time_lag configuration file). The test no longer uses the staged data. With this change, this test can only be run on Hera and Jet. It should also be noted that the data in question appears to contain restricted data. If you aren't a member of the rstprod project, then you will be unable to pull the necessary data from HPSS, resulting in the test failing.
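
As a minimal sketch for checking up front whether an HPSS pull can work on a given machine (this assumes the usual RDHPCS hpss module and hsi client; the module name may differ by platform):

# Is HPSS reachable from this machine? (It is not on Orion, for example.)
module load hpss 2>/dev/null || echo "no hpss module available here"
if command -v hsi >/dev/null 2>&1 && hsi ls / >/dev/null 2>&1; then
  echo "HPSS reachable"
else
  echo "HPSS not reachable"
fi
# rstprod group membership is what gates the restricted tarballs
id -Gn | grep -qw rstprod && echo "rstprod access: yes" || echo "rstprod access: no"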

@MichaelLueken (Collaborator)

@natalie-perlin - I can confirm that removing lines 31-34 of the MET_ensemble_verification_only_vx_time_lag configuration file allows the test to run without issue using the staged data. However, as noted above, the purpose of the test is to pull the verification data from HPSS and then run the verification.

@mkavulich (Collaborator)

@MichaelLueken thanks for jumping in with a reply. Your summary is correct: these two WE2E tests are intended only to check for data on HPSS. I used HPSS data for the MET_ensemble_verification_only_vx_time_lag test because, at the time, data was only staged for that test on Hera, so it couldn't be run on other machines anyway. In addition, the function of the tasks get_obs_ccpa, get_obs_mrms, and get_obs_ndas changed with that PR: they should now be run regardless of whether data is being pulled from HPSS or read from disk (this makes the formatting of config.yaml much easier for most cases); in the latter case, the task checks that all the necessary data is available on disk.
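
As a rough illustration of that on-disk case (a sketch only, not the actual JREGIONAL_GET_VERIF_OBS logic; the *_OBS_DIR variables come from the machine/test configuration, and the date from the 2021050500 test cycle):

# With staged data, the "get" step amounts to confirming obs for the
# verification cycle are present on disk; otherwise HPSS retrieval is needed.
CYCLE_DATE=20210505
for dir in "${CCPA_OBS_DIR}" "${MRMS_OBS_DIR}" "${NDAS_OBS_DIR}"; do
  if [ -n "$(find "${dir}" -path "*${CYCLE_DATE}*" -type f 2>/dev/null | head -n 1)" ]; then
    echo "staged obs found under: ${dir}"
  else
    echo "no staged obs for ${CYCLE_DATE} under: ${dir}"
  fi
done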

If there is a desire to make the time-lag test use staged data, that would be fine, but at least one of the verification tests should be run pulling data from HPSS to test that functionality.

@natalie-perlin (Collaborator, Author) commented Sep 7, 2023

@mkavulich @MichaelLueken - thank you for your comments.

  • Is it possible to test data retrieval from AWS? The test fails even though the data store is set to AWS (EXTRN_MDL_DATA_STORES: aws nomads in hera.yaml).
  • The obs data is staged on all Tier-1 machines accessible by EPIC (Gaea, Hera, Jet, Orion, Cheyenne), unless MET_ensemble_verification_only_vx_time_lag requires additional data.
  • What needs to be done to make this test work again if I do not have an HPSS account? Who could test it with an active HPSS account? Could alternative data sources be set?

@MichaelLueken (Collaborator)

@natalie-perlin - As per today's meeting, please ensure that you log into AIM and request access to the rstprod project. You will be asked to provide justification to be granted permission. If you include:

The Short-Range Weather Application (SRW App) runs workflow end-to-end tests to ensure that modifications to the code don't adversely affect development. Among these tests, there are verification tests that require observational data from HPSS. Unfortunately, these data sets are included in tarballs that also contain restricted data. Due to this, verification tests that need to pull data from HPSS are failing due to the lack of rstprod project access.

access should be granted. Once you have access to rstprod on RDHPCS, you will need to let the HPSS helpdesk know that you have been granted that access so that you can pull the tarballs that contain restricted data from HPSS. The email for the HPSS helpdesk is [email protected]. Skylar Nelson is the lead for the HPSS helpdesk, so including him on the email might expedite the process.

Closing this issue now.
