
Namelist updates and runtime fixes #502

Merged: 17 commits merged into NCAR:main from the nml_updates branch on Aug 14, 2024

Conversation

grantfirl (Collaborator) commented Aug 10, 2024:

  • fix a bug in GFS_phys_time_vary.scm.F90 that caused an out-of-bounds array access related to the recent lake changes: f0fc5dd
  • fix bug in scm_vgrid single precision changes: 6181de9
  • update RRTMGP root and data file paths: f008ae0, 2f76060
  • don't require NSST input data when lsm_ics = T: de3cab1
  • fix tracer file in suite_info.py for GFS_v17_p8_ugwpv1: dee39db
  • add "ps" version of GFS_v16_RRTMGP suite: e47081b
  • update nmls to fix runtime errors: 2601bae
  • adds vegtype_frac and soiltype_frac to UFS_case_gen.py and the SCM input files (not fully tested, since we could not readily find UFS data containing these fields; the new variables are correctly written to the DEPHY file and the SCM attempts to read them, but they are set to missing values and therefore not used for initialization in the SCM; a sketch of the write step is included after the notes below)

Note: This PR is stacked on the dephy_conversion branch, so #496 should be merged first

Requires NCAR/ccpp-physics#1082
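
A minimal sketch (not part of this PR; only the variable names vegtype_frac and soiltype_frac come from the bullet above, while the file name, dimension names, and category counts are assumptions) of how such fractional fields could be appended to a DEPHY-format netCDF case file with the netCDF4 Python package:

```python
# Sketch: append vegtype_frac and soiltype_frac to an existing DEPHY-format
# netCDF case file. Only the two variable names come from the PR description;
# the file name, dimension names, and category counts are assumptions.
import numpy as np
from netCDF4 import Dataset

MISSING = -9999.0  # assumed missing/fill value

with Dataset("case_input.nc", "a") as nc:      # hypothetical file name
    if "nvegcat" not in nc.dimensions:
        nc.createDimension("nvegcat", 20)      # assumed number of vegetation categories
    if "nsoilcat" not in nc.dimensions:
        nc.createDimension("nsoilcat", 16)     # assumed number of soil categories

    for name, dim in (("vegtype_frac", "nvegcat"), ("soiltype_frac", "nsoilcat")):
        if name not in nc.variables:
            var = nc.createVariable(name, "f4", (dim,), fill_value=MISSING)
            var.units = "fraction"
        # Without UFS source data, the fields are left at the missing value,
        # so the SCM can read them but will not use them for initialization.
        nc.variables[name][:] = np.full(len(nc.dimensions[dim]), MISSING, dtype=np.float32)
```

Because no UFS source data was available, the fields stay at the missing value, which matches the behavior described above: the SCM reads them but skips them during initialization.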

grantfirl force-pushed the nml_updates branch 4 times, most recently from 9224dcb to f008ae0 on August 12, 2024 at 21:02
grantfirl (Collaborator, Author) commented:

@dustinswales @scrasmussen @mkavulich All RTs are running successfully on the GitHub servers, although there are failures associated with nvfortran running without GPU acceleration and within the Docker container. Are these worth debugging before the release? Are we sure that the Docker container is grabbing the latest commit in the PR branch (and doing so recursively)?

grantfirl (Collaborator, Author) commented:

@dustinswales @scrasmussen @mkavulich @ligiabernardet I think all of the CI tests should be passing, but the Nvidia RTs and Docker tests appear to be timing out due to the large aerosol climo file handling. If I create a container with the code in this PR, I'm able to run and pass all tests in rt_test_cases.py, so there are no failing RTs on Docker anymore. We may need to streamline the Nvidia and Docker tests to get them to reliably pass after the release, but I would like to merge this as-is.

dustinswales (Collaborator) commented:

> @dustinswales @scrasmussen @mkavulich @ligiabernardet I think all of the CI tests should be passing, but the Nvidia RTs and Docker tests appear to be timing out due to the large aerosol climo file handling. If I create a container with the code in this PR, I'm able to run and pass all tests in rt_test_cases.py, so there are no failing RTs on Docker anymore. We may need to streamline the Nvidia and Docker tests to get them to reliably pass after the release, but I would like to merge this as-is.

@grantfirl I think it's fine to proceed w/o the NVIDIA CI working. Post-release we should look at "containerizing" the NVIDIA CI tests (and add Intel).

grantfirl (Collaborator, Author) commented:

> @grantfirl I think it's fine to proceed w/o the NVIDIA CI working. Post-release we should look at "containerizing" the NVIDIA CI tests (and add Intel).

OK, thanks. The previous Nvidia CI failures occurred when running the RTs because the climatology data was missing, so what I did should fix them, but the GitHub servers keep hanging when running the tests, so I'm not sure, and I don't have an independent way to check with the Nvidia compilers.

mkavulich (Collaborator) commented:

> Are we sure that the Docker container is grabbing the latest commit in the PR branch (and doing so recursively)?

The Docker container does not grab the PR branch code; it builds from main. This is probably something that should be updated in the test file.

> the Nvidia RTs and Docker tests appear to be timing out due to the large aerosol climo file handling

I didn't realize Soren's PR still included the aerosol climo download. We should look into providing a subset of these files for the RTs, because I don't think this is ever going to work on GitHub runners.

grantfirl (Collaborator, Author) commented:

> The Docker container does not grab the PR branch code; it builds from main. This is probably something that should be updated in the test file.
>
> I didn't realize Soren's PR still included the aerosol climo download. We should look into providing a subset of these files for the RTs, because I don't think this is ever going to work on GitHub runners.

Soren added code to the Dockerfile to pull from the PR branch (recursively) when it is invoked through the CI with a PR number as an argument. I added back the aerosol climo download for the Docker container testing only (it shouldn't be in the image, only downloaded when the image starts up and runs); otherwise, a bunch of RTs fail due to lack of the aerosol data. This must fill up the space allotted for containers, though. There's just no good way to shuffle around GBs of data for the GitHub runners, really.
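
For illustration, a minimal sketch of the download-at-startup idea described above (the URL, target directory, and marker file are placeholders, not the actual ccpp-scm data locations or tooling):

```python
# Sketch: fetch the aerosol climatology when the container starts up, so the
# tarball never lives in a Docker image layer. URL and paths are placeholders.
import os
import tarfile
import urllib.request

CLIMO_URL = "https://example.org/aerosol_climo.tar.gz"   # placeholder URL
TARGET_DIR = "/home/ccpp-scm/scm/data"                   # assumed data path

def fetch_aerosol_climo():
    marker = os.path.join(TARGET_DIR, ".aerosol_climo_done")
    if os.path.exists(marker):
        # A previous container run already downloaded and extracted the data.
        return
    os.makedirs(TARGET_DIR, exist_ok=True)
    tarball, _ = urllib.request.urlretrieve(CLIMO_URL)   # download to a temp file
    with tarfile.open(tarball) as tf:
        tf.extractall(TARGET_DIR)                        # unpack into the data tree
    open(marker, "w").close()                            # record success for next run

if __name__ == "__main__":
    fetch_aerosol_climo()
```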

scrasmussen (Member) commented:

Was typing these answers up, but you beat me to it, lol.

I was thinking maybe we need to add another file like rt_test_cases.py but without the cases that need the aerosol data; then we could just call the new subset of cases when running the CI on GitHub.

Or maybe only a subset of the aerosol data is needed, and we could create a smaller tarball.
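
A minimal sketch of the first idea, a hypothetical rt_test_cases_ci.py that filters the full list; the structure of rt_test_cases.py (a cases list of dicts with a suite key) and the suite name below are assumptions, not the repository's actual contents:

```python
# Hypothetical rt_test_cases_ci.py: reuse the full regression-test list but
# drop the cases that need the large aerosol climatology so GitHub runners
# can skip the big download. Assumes rt_test_cases.py exposes a list named
# "cases" whose entries are dicts with a "suite" key; adjust to the real
# structure as needed.
from rt_test_cases import cases

# Placeholder set of suites assumed to require the aerosol climatology data;
# the real list would have to be curated by hand.
AEROSOL_DEPENDENT_SUITES = {
    "SCM_GFS_v17_p8_ugwpv1",
}

cases_ci = [
    case for case in cases
    if case.get("suite") not in AEROSOL_DEPENDENT_SUITES
]
```

The CI workflow would then point at the reduced list instead of rt_test_cases.py.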

dustinswales (Collaborator) commented:

> This must fill up the space allotted for containers, though. There's just no good way to shuffle around GBs of data for the GitHub runners, really.

@grantfirl I'm not surprised by this issue. The NVIDIA tests require more space than is normally available on the GitHub runners, so much so that I added a step to uninstall some things to make space: https://github.com/NCAR/ccpp-scm/blob/main/.github/workflows/ci_build_scm_ubuntu_22.04_nvidia.yml#L34. This is because we download the entire NVIDIA HPC package, which is huge, and then install it.

grantfirl (Collaborator, Author) commented:

> @grantfirl I'm not surprised by this issue. The NVIDIA tests require more space than is normally available on the GitHub runners, so much so that I added a step to uninstall some things to make space.

OK, excellent. Since this is a known issue, I feel confident that the RTs actually run on every platform/compiler combo at this point. The only remaining issue is to streamline the data component of the RTs, which can happen after the release.

grantfirl merged commit 9570205 into NCAR:main on Aug 14, 2024
22 of 24 checks passed