[develop] Improvements for WE2E tests: script features, additional tests, remove unsupported domains (#871)

Script improvements
 * run_WE2E_tests.py
    * Adds ability to run a group of all tests in a subdirectory of test_configs
    * Replaces --use_cron_to_relaunch with a --launch argument that accepts "python", "cron", or "none" (the last creates the experiments but does not run them)
 * monitor_jobs.py
    * Adds a --mode flag; when set to "advance", the script runs rocotorun once for each experiment and then quits.
 * generate_FV3LAM_wflow.py and set_FV3nml_sfc_climo_filenames.py
    * Add --debug flag that will give more verbose output. No longer read "VERBOSE" variable from config.yaml for this script.
    * Make most prints "debug only"
 * get_crontab_contents.py
    * Overhaul script to remove global variables, add arguments as needed
    * Fixes bug where submitting jobs from cron won't work if your crontab is empty
    * Change functionality so that the script will remove the specified line from crontab if "--remove" flag is provided, otherwise it prints the crontab contents
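
As an illustration of the get_crontab_contents.py behavior described above (remove one line when "--remove" is given, and cope with an empty crontab), here is a minimal sketch. It is not the script's actual code: the function name remove_crontab_line and the sample crontab entries are hypothetical, and the real script may read and write the crontab differently.

```python
def remove_crontab_line(contents: str, line_to_remove: str) -> str:
    """Return crontab text without any line equal to line_to_remove.

    Hypothetical helper for illustration only; comparison ignores
    leading/trailing whitespace on each line.
    """
    target = line_to_remove.strip()
    kept = [ln for ln in contents.splitlines() if ln.strip() != target]
    # An empty crontab stays an empty string instead of becoming a lone newline.
    return "\n".join(kept) + ("\n" if kept else "")


# Made-up crontab entries; the paths are placeholders, not real SRW paths.
sample = (
    "*/3 * * * * /hypothetical/relaunch_expt_a.sh\n"
    "*/3 * * * * /hypothetical/relaunch_expt_b.sh\n"
)
print(remove_crontab_line(sample, "*/3 * * * * /hypothetical/relaunch_expt_a.sh"), end="")
print(repr(remove_crontab_line("", "anything")))
```

Keeping the text manipulation in a pure function like this makes the empty-crontab case easy to test without touching the real crontab.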
New tests
 * Several new custom domain tests are added in a new custom_grids directory. The new custom domains were chosen to span a variety of locations, terrain types, and dates. Aside from custom_ESGgrid_Great_Lakes_snow_8km, these are basic tests that don't use many non-default settings, and so are good candidates for testing new SRW capabilities in the future. See the "Documentation" section for more details.
 * A new "long forecast" test (108 hours) starting at 2023060112 that retrieves FV3GFS grib2 input data from AWS (and so cannot be run on Cheyenne).
Additional changes
 * Removes unsupported domains:
    * EMC_AK
    * EMC_HI
    * EMC_PR
    * EMC_GU
    * GSL_HAFSV0.A_25km
    * GSL_HAFSV0.A_13km
    * GSL_HAFSV0.A_3km
    * GSD_HRRR_AK_50km
 * For HPSS tests, ensure they all are explicitly set to look for HPSS data only
 * Test custom_ESGgrid_Great_Lakes_snow_8km (introduced in [develop] Consolidate verification tasks using retrieve_data.py #864) moved from verification to custom_tests, with a symbolic link (test_configs/verification/config.MET_verification_winter_wx.yaml) left behind. The ICs/LBCs now come from RAP output retrieved from HPSS; this is acceptable because the observation data already requires HPSS access.
 * Make new comprehensive test file for Gaea (symlink to Orion's file) since that machine also does not have HPSS access
 * Various comment fixes
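
For readers unfamiliar with the symlink approach mentioned for the Gaea suite file, the following is a small sketch of creating a relative symbolic link between two suite files. It runs in a temporary directory and assumes a POSIX-like system; the helper name link_suite and the one-line file contents are illustrative assumptions, not taken from the repository.

```python
import os
import tempfile


def link_suite(suite_dir: str, target_name: str, link_name: str) -> str:
    """Create link_name in suite_dir as a relative symlink to target_name."""
    link_path = os.path.join(suite_dir, link_name)
    # A relative target keeps the link valid wherever the directory is cloned.
    os.symlink(target_name, link_path)
    return link_path


with tempfile.TemporaryDirectory() as suite_dir:
    # Stand-in for the Orion suite file; the real file lists many more tests.
    with open(os.path.join(suite_dir, "comprehensive.orion"), "w") as f:
        f.write("community\n")
    link = link_suite(suite_dir, "comprehensive.orion", "comprehensive.gaea")
    link_target = os.readlink(link)
    with open(link) as f:
        linked_text = f.read()

print(link_target, linked_text.strip())
```

With a link in place, both machines read the same test list, so the list only has to be maintained in one file.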
mkavulich authored Sep 8, 2023
1 parent 776c855 commit 98c4106
Showing 47 changed files with 696 additions and 598 deletions.
14 changes: 7 additions & 7 deletions .cicd/Jenkinsfile
@@ -10,11 +10,11 @@ pipeline {
 parameters {
 // Allow job runner to filter based on platform
 // Use the line below to enable all PW clusters
-// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet-epic', 'orion', 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'], description: 'Specify the platform(s) to use')
+// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet', 'orion', 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'], description: 'Specify the platform(s) to use')
 // Use the line below to enable the PW AWS cluster
-// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet-epic', 'orion', 'pclusternoaav2use1'], description: 'Specify the platform(s) to use')
-// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet-epic', 'orion'], description: 'Specify the platform(s) to use')
-choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'gaea', 'hera', 'jet-epic', 'orion'], description: 'Specify the platform(s) to use')
+// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet', 'orion', 'pclusternoaav2use1'], description: 'Specify the platform(s) to use')
+// choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'cheyenne', 'gaea', 'hera', 'jet', 'orion'], description: 'Specify the platform(s) to use')
+choice(name: 'SRW_PLATFORM_FILTER', choices: ['all', 'gaea', 'hera', 'jet', 'orion'], description: 'Specify the platform(s) to use')
 // Allow job runner to filter based on compiler
 choice(name: 'SRW_COMPILER_FILTER', choices: ['all', 'gnu', 'intel'], description: 'Specify the compiler(s) to use to build')
 // Uncomment the following line to re-enable comprehensive tests
@@ -77,8 +77,8 @@ pipeline {
 axes {
 axis {
 name 'SRW_PLATFORM'
-// values 'cheyenne', 'gaea', 'hera', 'jet-epic', 'orion' //, 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'
-values 'gaea', 'hera', 'jet-epic', 'orion' //, 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'
+// values 'cheyenne', 'gaea', 'hera', 'jet', 'orion' //, 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'
+values 'gaea', 'hera', 'jet', 'orion' //, 'pclusternoaav2use1', 'azclusternoaav2eus1', 'gclusternoaav2usc1'
 }

 axis {
@@ -92,7 +92,7 @@ pipeline {
 exclude {
 axis {
 name 'SRW_PLATFORM'
-values 'gaea', 'jet-epic', 'orion' //, 'pclusternoaav2use1' , 'azclusternoaav2eus1', 'gclusternoaav2usc1'
+values 'gaea', 'jet', 'orion' //, 'pclusternoaav2use1' , 'azclusternoaav2eus1', 'gclusternoaav2usc1'
 }

 axis {
7 changes: 6 additions & 1 deletion tests/WE2E/machine_suites/comprehensive
@@ -1,7 +1,12 @@
 2020_CAD
 community
 custom_ESGgrid
+custom_ESGgrid_Central_Asia_3km
 custom_ESGgrid_Great_Lakes_snow_8km
+custom_ESGgrid_IndianOcean_6km
+custom_ESGgrid_NewZealand_3km
+custom_ESGgrid_Peru_12km
+custom_ESGgrid_SF_1p1km
 custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
 custom_GFDLgrid
 deactivate_tasks
@@ -54,6 +59,7 @@ grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
 grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16
 grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot
 GST_release_public_v1
+long_fcst
 MET_ensemble_verification_only_vx
 MET_ensemble_verification_only_vx_time_lag
 MET_verification_only_vx
@@ -64,6 +70,5 @@ nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
 nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson_mynn_lam3km
 nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR
 pregen_grid_orog_sfc_climo
-quilting_false
 specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS
 specify_template_filenames
5 changes: 5 additions & 0 deletions tests/WE2E/machine_suites/comprehensive.cheyenne
@@ -1,5 +1,10 @@
 community
 custom_ESGgrid
+custom_ESGgrid_Central_Asia_3km
+custom_ESGgrid_IndianOcean_6km
+custom_ESGgrid_NewZealand_3km
+custom_ESGgrid_Peru_12km
+custom_ESGgrid_SF_1p1km
 custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
 custom_GFDLgrid
 deactivate_tasks
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/comprehensive.gaea
5 changes: 5 additions & 0 deletions tests/WE2E/machine_suites/comprehensive.orion
@@ -1,6 +1,11 @@
 2020_CAD
 community
 custom_ESGgrid
+custom_ESGgrid_Central_Asia_3km
+custom_ESGgrid_IndianOcean_6km
+custom_ESGgrid_NewZealand_3km
+custom_ESGgrid_Peru_12km
+custom_ESGgrid_SF_1p1km
 custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
 custom_GFDLgrid
 deactivate_tasks
4 changes: 2 additions & 2 deletions tests/WE2E/machine_suites/coverage.cheyenne
@@ -1,9 +1,9 @@
-custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
+custom_ESGgrid_IndianOcean_6km
 grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot
 grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16
 grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR
 grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
 grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR
-#nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
+nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
 pregen_grid_orog_sfc_climo
 specify_template_filenames
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.cheyenne.gnu
@@ -1,3 +1,4 @@
+custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE
 grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
 grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
 grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.gaea
@@ -1,4 +1,5 @@
 community
+custom_ESGgrid_NewZealand_3km
 grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
 grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP
 grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR
5 changes: 3 additions & 2 deletions tests/WE2E/machine_suites/coverage.hera.gnu.com
@@ -1,10 +1,11 @@
+custom_ESGgrid_Peru_12km
 get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200
 get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS
 grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR
 grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
-quilting_false
 grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
 GST_release_public_v1
+long_fcst
 MET_verification_only_vx
-#MET_ensemble_verification_only_vx_time_lag Removed temporarily due to HPSS permissions issue
+MET_ensemble_verification_only_vx_time_lag
 nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.hera.intel.nco
@@ -1,3 +1,4 @@
+custom_ESGgrid_Central_Asia_3km
 get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200
 get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2mems
 get_from_HPSS_ics_HRRR_lbcs_RAP
2 changes: 1 addition & 1 deletion tests/WE2E/machine_suites/coverage.jet
@@ -1,6 +1,6 @@
 community
 custom_ESGgrid
-#custom_ESGgrid_Great_Lakes_snow_8km Removed temporarily due to HPSS permissions issue
+custom_ESGgrid_Great_Lakes_snow_8km
 custom_GFDLgrid
 get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018
 get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h
1 change: 1 addition & 0 deletions tests/WE2E/machine_suites/coverage.orion
@@ -1,3 +1,4 @@
+custom_ESGgrid_SF_1p1km
 deactivate_tasks
 get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems
 grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta
26 changes: 21 additions & 5 deletions tests/WE2E/monitor_jobs.py
@@ -16,20 +16,23 @@
 from utils import calculate_core_hours, write_monitor_file, update_expt_status,\
                   update_expt_status_parallel, print_WE2E_summary

-def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug: bool = False) -> str:
+def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1,
+                 mode: str = 'continuous', debug: bool = False) -> str:
     """Function to monitor and run jobs for the specified experiment using Rocoto
     Args:
         expts_dict (dict): A dictionary containing the information needed to run
             one or more experiments. See example file monitor_jobs.yaml
         monitor_file (str): [optional]
+        mode (str): [optional] Mode of job monitoring
+            continuous (default): monitor jobs continuously until complete
+            advance: increment jobs once, then quit
         debug (bool): [optional] Enable extra output for debugging
     Returns:
-        str: The name of the file used for job monitoring (when script is finished, this
+        str: The name of the file used for job monitoring (when script is finished, this
             contains results/summary)
     """

     monitor_start = datetime.now()
     # Write monitor_file, which will contain information on each monitored experiment
     monitor_start_string = monitor_start.strftime("%Y%m%d%H%M%S")
@@ -52,6 +55,12 @@ def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug

     write_monitor_file(monitor_file,expts_dict)

+    if mode != 'continuous':
+        logging.debug("All experiments have been updated")
+        return monitor_file
+    else:
+        logging.debug("Continuous mode: will monitor jobs until all are complete")
+
     logging.info(f'Setup complete; monitoring {len(expts_dict)} experiments')
     logging.info('Use ctrl-c to pause job submission/monitoring')

@@ -102,7 +111,8 @@ def monitor_jobs(expts_dict: dict, monitor_file: str = '', procs: int = 1, debug
         endtime = datetime.now()
         total_walltime = endtime - monitor_start

-        logging.debug(f"Finished loop {i}\nWalltime so far is {str(total_walltime)}")
+        logging.debug(f"Finished loop {i}")
+        logging.debug(f"Walltime so far is {str(total_walltime)}")
         #Slow things down just a tad between loops so experiments behave better
         time.sleep(5)

@@ -160,6 +170,11 @@ def setup_logging(logfile: str = "log.run_WE2E_tests", debug: bool = False) -> N
     parser.add_argument('-p', '--procs', type=int,
                         help='Run resource-heavy tasks (such as calls to rocotorun) in parallel, '\
                              'with provided number of parallel tasks', default=1)
+    parser.add_argument('-m', '--mode', type=str, default='continuous',
+                        choices=['continuous','advance'],
+                        help='continuous: script will run continuously until all experiments are '\
+                             'finished. '\
+                             'advance: will only advance each experiment one step')
     parser.add_argument('-d', '--debug', action='store_true',
                         help='Script will be run in debug mode with more verbose output')

@@ -175,7 +190,8 @@ def setup_logging(logfile: str = "log.run_WE2E_tests", debug: bool = False) -> N
     #Call main function

     try:
-        monitor_jobs(expts_dict,args.yaml_file,args.procs,args.debug)
+        monitor_jobs(expts_dict=expts_dict,monitor_file=args.yaml_file,procs=args.procs,
+                     mode=args.mode,debug=args.debug)
     except KeyboardInterrupt:
         logging.info("\n\nUser interrupted monitor script; to resume monitoring jobs run:\n")
         logging.info(f"{__file__} -y={args.yaml_file} -p={args.procs}\n")
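
The difference between the "continuous" and "advance" modes in the monitor_jobs.py changes can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the script's implementation: the monitor function, the advance_one callback (a stand-in for one rocotorun call plus a status update), and the "RUNNING"/"COMPLETE" status strings are all hypothetical.

```python
import time
from typing import Callable, Dict


def monitor(expts: Dict[str, str], advance_one: Callable[[str], str],
            mode: str = "continuous", pause: float = 0.0) -> Dict[str, str]:
    """Advance every experiment once, then keep looping only in continuous mode."""
    for name in list(expts):
        expts[name] = advance_one(name)
    if mode != "continuous":
        # "advance" mode: one pass over the experiments, then return.
        return expts
    while any(status != "COMPLETE" for status in expts.values()):
        for name in list(expts):
            if expts[name] != "COMPLETE":
                expts[name] = advance_one(name)
        time.sleep(pause)  # brief pause between passes
    return expts


def make_fake_advancer(steps_needed: int) -> Callable[[str], str]:
    """Build a fake advancer that reports COMPLETE after steps_needed calls."""
    calls: Dict[str, int] = {}

    def advance_one(name: str) -> str:
        calls[name] = calls.get(name, 0) + 1
        return "COMPLETE" if calls[name] >= steps_needed else "RUNNING"

    return advance_one


print(monitor({"expt_a": "CREATED"}, make_fake_advancer(3), mode="advance"))
print(monitor({"expt_a": "CREATED"}, make_fake_advancer(3), mode="continuous"))
```

In this sketch, "advance" leaves the experiment in an intermediate state after a single pass, which suits repeated invocation from an external scheduler, while "continuous" loops until every experiment reports completion.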