[develop] bug fix for processing radar reflectivity task #849

Merged

Conversation

@EdwardSnyder-NOAA (Collaborator) commented Jul 5, 2023

DESCRIPTION OF CHANGES:

While testing the new process_radarref_prod task on AWS and Cheyenne, I noticed that the script returned a success message even though an error was present. See the log snippet below.

ERROR, this run must use           33  or more cores !!!
1234
    Call to executable to run radar refl process returned with nonzero exit code.

========================================================================
RADAR REFL PROCESS completed successfully!!!
Exiting script:  "exregional_process_radarref.sh"
In directory:    "/glade/scratch/esnyder/wflow_test/ufs-srweather-app/scripts"
========================================================================

There are two issues:

  1. The process_radarref_prod task returns a success message when it should error out.
  2. The task needs 33 or more cores, which is more than what is allocated in FV3LAM_wflow.xml.

After some investigation, the first issue was resolved by modifying the print command in scripts/exregional_process_radarref.sh, and the second by adding the envars variable to the radar-processing task in parm/wflow/da_data_preproc.yaml. Adding envars allows the nnodes variable to be updated when the task is called, which results in the right number of processes (nprocs). Presumably we didn't catch this error before because the task was tested on Jet and Hera, which use srun, while AWS and Cheyenne use mpiexec/mpirun, which relies on the nprocs argument.
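
For illustration, the kind of exit-status handling the first fix implies might look like the sketch below; the launcher variable, executable name, and messages are placeholders rather than the actual contents of scripts/exregional_process_radarref.sh:

    # Sketch only: hypothetical variable and executable names, not the actual
    # SRW script. The point is that the executable's exit status is checked
    # before any success banner is printed, so a failure propagates to the
    # workflow instead of being masked.
    ${APRUN} ${EXECdir}/process_radarref.exe    # hypothetical launcher/executable
    rc=$?
    if [ ${rc} -ne 0 ]; then
      echo "FATAL ERROR: radar reflectivity processing exited with code ${rc}." >&2
      exit ${rc}
    fi
    echo "RADAR REFL PROCESS completed successfully!!!"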

The fix was tested by copying config.da_cycling.yaml and adding some DA-cycling variables, such as RAP_OBS_BUFR and NSSLMOSAIC, to the machine file. The case data is present on Hera, Jet, Gaea, Cheyenne, and AWS, and is partially staged on Orion.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

TESTS CONDUCTED:

  • hera.intel
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform): aws
  • Jenkins
  • fundamental test suite on Cheyenne
  • comprehensive tests (specify which if a subset was used)

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

  • Work In Progress
  • bug
  • enhancement
  • documentation
  • release
  • high priority
  • run_ci
  • run_we2e_fundamental_tests
  • run_we2e_comprehensive_tests
  • Needs Cheyenne test
  • Needs Jet test
  • Needs Hera test
  • Needs Orion test
  • help wanted

CONTRIBUTORS (optional):

@EdwardSnyder-NOAA changed the title [develop] bug fix for processing radar ref task [develop] bug fix for processing radar reflectivity task Jul 5, 2023
@EdwardSnyder-NOAA marked this pull request as ready for review July 6, 2023 15:40
@MichaelLueken (Collaborator) left a comment:

@EdwardSnyder-NOAA These changes look good to me!

I also ran the fundamental WE2E tests and the process_obs test on Jet to ensure that the changes pass on the machine where this test is actively run. All tests passed successfully.

Approving this work now.

@mkavulich (Collaborator) commented:

@EdwardSnyder-NOAA quick question about this line in your PR message:

This test was run by copying the config.da_cycling.yaml and adding some da cycling variables to the machine file like RAP_OBS_BUFR and NSSLMOSAIC

Does this mean the da_cycling test should be updated to include these variables?

@EdwardSnyder-NOAA (Collaborator, Author) commented:

@mkavulich - These variables (RAP_OBS_BUFR and NSSLMOSAIC) should be added to the machine files; only the Hera and Jet machine files have them. I didn't include that change in this PR because I was waiting for PR #740 to be merged, which includes the updated way to build and install the SRW App on the NOAA cloud platforms. I do have the da_cycling data staged everywhere else, so in theory I can add the variables in question to the other RDHPCS machine files.

@EdwardSnyder-NOAA (Collaborator, Author) commented:

@mkavulich & @MichaelLueken - For this PR, I've now added the cycling variables (RAP_OBS_BUFR and NSSLMOSAIC) to all the machine files. To toggle between a retrospective observation path and a current observation path, I added the DO_RETRO variable. I also updated Orion's file paths to reflect the data location for our new role account there.

@MichaelLueken (Collaborator) commented:

@EdwardSnyder-NOAA -

Thank you very much for adding the cycling variables to all of the machine files, adding the DO_RETRO variable, and updating the data locations on Orion to use the new role-epic account on that machine!

When PR #853 was merged last Monday, it caused a conflict in ush/valid_param_vals.yaml when you added DO_RETRO valid values to the file. Please merge the latest authoritative develop branch into your bugfix/process_radar branch to correct the conflicted file at your earliest convenience.

@mkavulich (Collaborator) left a comment:

These changes look good; thanks for answering my questions about those variables. Sorry, I've forgotten the conclusion of the conversation at the last meeting: is the plan still to merge this PR despite the RRFS removal?

@MichaelLueken (Collaborator) commented:

@mkavulich and @EdwardSnyder-NOAA -

It looks like the modifications to parm/wflow/prep.yaml and the updated role-epic locations on Orion (ush/machine/orion.yaml) are still desirable fixes.

Since @christinaholtNOAA's PR #893 removes all DA- and RRFS-related work, including the rest of the changes introduced in this PR, would @EdwardSnyder-NOAA be willing to drop those changes here (they will be removed shortly in that PR anyway) and keep only the fixes in parm/wflow/prep.yaml and the location updates in ush/machine/orion.yaml?

@EdwardSnyder-NOAA (Collaborator, Author) commented:

@MichaelLueken I removed the RRFS-related work and kept the changes to prep.yaml and orion.yaml.

@mkavulich (Collaborator) left a comment:

Re-approving, still looks good 👍

@MichaelLueken added the run_we2e_coverage_tests (Run the coverage set of SRW end-to-end tests) label Sep 5, 2023
@MichaelLueken (Collaborator) commented:

The WE2E coverage tests were manually run on Jet (the old jet-epic pipeline is no longer valid, since all tier-1 platforms now use the EPIC allocations and locations). All tests passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              21.15
custom_ESGgrid                                                     COMPLETE              24.34
custom_GFDLgrid                                                    COMPLETE              14.68
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018         COMPLETE              17.87
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h     COMPLETE              90.64
get_from_HPSS_ics_RAP_lbcs_RAP                                     COMPLETE              23.22
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             310.75
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              45.80
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               9.98
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             663.75
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       COMPLETE              12.13
process_obs                                                        COMPLETE               0.43
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1234.74

Additionally, the Hera Intel tests were rerun and all passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200          COMPLETE               6.41
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             791.03
get_from_HPSS_ics_HRRR_lbcs_RAP                                    COMPLETE              14.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE               7.35
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              13.90
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP                 COMPLETE               9.86
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2        COMPLETE               6.75
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2         COMPLETE             242.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16           COMPLETE             305.34
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR            COMPLETE             342.06
pregen_grid_orog_sfc_climo                                         COMPLETE               7.31
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1747.51

The Jenkins tests on Gaea, Hera GNU, and Orion all successfully passed. Moving forward with merging this work now.

@MichaelLueken merged commit dded05b into ufs-community:develop Sep 6, 2023
3 of 5 checks passed