GST_release_public_v1 test fails on Hera in latest develop #874

Closed
mkavulich opened this issue Jul 31, 2023 · 8 comments · Fixed by #906
Labels
bug Something isn't working

Comments

@mkavulich
Collaborator

mkavulich commented Jul 31, 2023

Expected behavior

WE2E test GST_release_public_v1 should run successfully on all platforms.

Current behavior

Currently the test fails at the run_fcst step with the line

FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1

followed by a core dump. This typically indicates a CFL violation/model instability.

The full log file is attached below. The failure occurs in the current develop as well as at hash f9696e1 (July 10), and it likely occurs at earlier hashes too.

Machines affected

Hera. I have not noticed this on other machines, but I cannot be sure whether it is Hera-specific.

Edit: note that this is for the Intel compiler, in community mode (strangely, the GNU compiler seems to succeed). I have not tested in NCO mode.

Steps To Reproduce

  1. Run the GST_release_public_v1 WE2E test on Hera with the Intel compiler.
  2. Observe the failure at the run_fcst step.

Output

run_fcst_mem000_2019061500.log
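
For anyone triaging similar logs, here is a minimal sketch of how the fatal line could be pulled out of a run_fcst log automatically. The regular expression is modeled only on the single message quoted in this issue; the function name `find_fatal_lines` and the assumption that other fatal messages follow the same `FATAL from PE <n>: <routine>: <message>` layout are mine, not something the model or the SRW App guarantees.

```python
import re
import sys

# Modeled on the line quoted in this issue, e.g.
# "FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1"
# The trailing ", nbad= N" part is treated as optional (assumption).
FATAL_RE = re.compile(
    r"FATAL from PE\s+(?P<pe>\d+):\s*(?P<routine>\w+):\s*"
    r"(?P<message>.*?)(?:,\s*nbad=\s*(?P<nbad>\d+))?\s*$"
)


def find_fatal_lines(log_text):
    """Return one dict per FATAL line found in the given log text."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = FATAL_RE.search(line)
        if match:
            hits.append({
                "line": lineno,                      # 1-based line number in the log
                "pe": int(match["pe"]),              # processing element that aborted
                "routine": match["routine"],         # e.g. compute_qs
                "message": match["message"].strip(),
                "nbad": int(match["nbad"]) if match["nbad"] else None,
            })
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_fatal.py run_fcst_mem000_2019061500.log
    with open(sys.argv[1], errors="replace") as handle:
        for hit in find_fatal_lines(handle.read()):
            print(hit)
```

Comparing the reported line numbers between runs is one way to check whether two failures happen at roughly the same point in the forecast.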

@mkavulich mkavulich added the bug Something isn't working label Jul 31, 2023
@MichaelLueken
Collaborator

@mkavulich -

Very interesting. I take it that the test is failing on Hera using the Intel compiler? I ask because the Hera coverage tests are passing, and GST_release_public_v1 is part of the Hera GNU coverage suite. I wonder why the test is failing for Hera Intel, but not Hera GNU.

@mkavulich
Collaborator Author

Yes, sorry for the missing detail: this is for Intel. Here is a link to my working directory for the latest develop: /scratch2/BMC/fv3lam/kavulich/UFS/workdir/test_develop/2023-07-26/expt_dirs/GST_release_public_v1

@MichaelLueken
Collaborator

The GST_release_public_v1 test also fails on Orion, with the same error message:

FATAL from PE 7: compute_qs: saturation vapor pressure table overflow, nbad= 1

at the exact same location (~27 steps).

The link to my working directory for the latest develop on Orion is:
/work/noaa/epic-ps/mlueken/expt_dirs/GST_release_public_v1

@MichaelLueken
Collaborator

MichaelLueken commented Jul 31, 2023

PR #799 (hash 294e18b) appears to be the point at which the GST_release_public_v1 test began failing on Intel systems. In that PR, DT_ATMOS was decreased to address issues with the RRFS_CONUS_25km tests that use the FV3_GFS_v15p2 CCPP physics suite. I will try testing with different DT_ATMOS settings to see if the test can pass again.

@mkavulich
Collaborator Author

Thanks @MichaelLueken, that makes sense, since the failure again looks like model instability. Because this test was written specifically for the v1 release, it might make sense to return to the DT_ATMOS=40 used in that release for this specific test. But a higher value would probably also work.
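
As background on why the time step matters here, the following is a minimal sketch of the textbook advective Courant number, C = u * dt / dx. The grid spacing and wind speed are illustrative assumptions, not values taken from this test's configuration, and the model also sub-steps its dynamics internally, so this is not its actual stability criterion. It only shows how the ratio scales with the time step.

```python
def courant_number(wind_speed_ms, dt_s, dx_m):
    """Advective Courant number C = u * dt / dx (dimensionless)."""
    return wind_speed_ms * dt_s / dx_m


# Illustrative values only (assumptions, not from the test configuration).
DX_M = 25_000.0    # hypothetical horizontal grid spacing in meters
WIND_MS = 100.0    # hypothetical strong upper-level wind in m/s

for dt in (40, 100, 200, 400):
    c = courant_number(WIND_MS, dt, DX_M)
    print(f"dt={dt:4d} s  C={c:.2f}")
```

A simple scaling argument like this cannot account for compiler-dependent behavior, so it should be read as orientation rather than as a diagnosis of this failure.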

@MichaelLueken
Collaborator

@mkavulich -

I tried various DT_ATMOS values (40 to 400) for the GST_release_public_v1 test on Hera Intel, and only setting it to 40 allowed the test to pass. Values higher than 400 led to segfaults in run_fcst. Unfortunately, running the GST_release_public_v1 test on Hera GNU with DT_ATMOS=40 caused the test to fail due to CFL violations:

FATAL from PE 2: compute_qs: saturation vapor pressure table overflow, nbad= 1

So it looks like, for a given DT_ATMOS setting, the test will pass with either the GNU compilers or the Intel compilers, but not both.

Are there other parameters that can be tweaked to try and correct these errors, or will we need to add a GST_release_public_v1_intel and GST_release_public_v1_gnu, set DT_ATMOS=40 for GST_release_public_v1_intel, create comprehensive*gnu suites that use GST_release_public_v1_gnu, and change the current comprehensive suites to use GST_release_public_v1_intel?

@mkavulich
Collaborator Author

mkavulich commented Aug 3, 2023

I don't think a convoluted solution is necessary. This is an old test that uses now-unsupported data and a now-unsupported physics suite. And we don't actually know if it originally worked on Hera GNU, since that wasn't tested regularly until recently.

I am almost of the mind that the test should be removed (for the above reasons) if it can't be fixed for all platforms, but this is something that probably needs wider discussion.

@MichaelLueken MichaelLueken self-assigned this Aug 21, 2023
@MichaelLueken
Collaborator

From the August 3rd SRW App Code Management meeting, @gsketefian noted that the GST_release_public_v1 test was only meant for SRWv1 testing, so it can be removed now.
