
C768 gdasfcst runs too slow on WCOSS2 #2891

Closed
DavidHuber-NOAA opened this issue Sep 5, 2024 · 38 comments · Fixed by #2914
Labels: bug (Something isn't working)

Comments

@DavidHuber-NOAA (Contributor) commented Sep 5, 2024

What is wrong?

The C768 gdas forecast takes much longer than expected to run on WCOSS2 (tested on Dogwood). Runtime exceeded 70 minutes with the current configuration, with the bulk of the time spent writing the inline post and atm forecast files. Interestingly, the inline post at odd forecast hours and f000 only took ~30s, while the inline post at even hours took closer to 360s. Increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS from 10 to 15 actually slowed the inline post write times at even hours to ~420s, though the odd hours' inline posts ran faster (~20s).

This is not an issue on Hera. I have not tested on other machines.
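For context, a minimal sketch of the write-component knob exercised in this test, expressed as a shell override. Only the variable name and the 10 -> 15 change come from this report; that it lives in config.ufs and is set via export is an assumption about the workflow layout, not something confirmed here.

  # Hypothetical override sketch -- the file (config.ufs) and export form are assumed.
  export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GDAS=15   # default here was 10; 15 made even-hour inline posts slower (~420s)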

What should have happened?

Runtime should be less than 40 minutes.

What machines are impacted?

WCOSS2

Steps to reproduce

  1. Create a cycled experiment (tested on the 2021122018 half cycle)
  2. Run the first gdasfcst

Additional information

Discovered while testing #2819.

Do you have a proposed solution?

Re-test after the UPP update coming in PR #2877.

@DavidHuber-NOAA added the bug and triage labels Sep 5, 2024
@DavidHuber-NOAA self-assigned this Sep 5, 2024
@DavidHuber-NOAA removed the triage label Sep 5, 2024
@NOAA-EMC deleted a comment Sep 5, 2024
@DavidHuber-NOAA (Contributor Author)

Updating to the newest UPP did not resolve this issue. More investigation will be required.

@DavidHuber-NOAA removed their assignment Sep 10, 2024
@DavidHuber-NOAA (Contributor Author)

@WenMeng-NOAA @junwang-noaa While testing #2819 on WCOSS2, I found that the first C768 half-cycle, ATM-only GDAS forecast ran very slowly. When running on Hera, the forecast took a little over 20 minutes, while on Dogwood it took closer to 70 minutes. The slowdown seems to be coming from the inline post. On Dogwood, the inline post runtime was ~20s for the 0-hour and all odd-hour writes, but over 6 minutes for even-hour writes. On Hera, the inline post executed in less than 30s at all write times.

Would you be able to look into this? I have initial conditions available on Dogwood here: /lfs/h2/emc/global/noscrub/David.Huber/keep/global_ICs/768/2021122018.

@CatherineThomas-NOAA (Contributor)

@RuiyuSun has also experienced this slowdown for the HR4 scout runs at C1152. The 16 day forecast does not complete within 10 hours walltime.

@WenMeng-NOAA (Contributor)

> @WenMeng-NOAA @junwang-noaa While testing #2819 on WCOSS2, I found that the first C768 half-cycle, ATM-only GDAS forecast ran very slowly. When running on Hera, the forecast took a little over 20 minutes, while on Dogwood it took closer to 70 minutes. The slowdown seems to be coming from the inline post. On Dogwood, the inline post runtime was ~20s for the 0-hour and all odd-hour writes, but over 6 minutes for even-hour writes. On Hera, the inline post executed in less than 30s at all write times.
>
> Would you be able to look into this? I have initial conditions available on Dogwood here: /lfs/h2/emc/global/noscrub/David.Huber/keep/global_ICs/768/2021122018.

@DavidHuber-NOAA Do you have runtime logs saved?

@DavidHuber-NOAA (Contributor Author)

@WenMeng-NOAA Yes, I have a partial log here: /lfs/h2/emc/global/noscrub/David.Huber/para/COMROOT/dev_768_upp/logs/2021122018/gdasfcst_seg0.log. It is partial because it ran into the walltime limit of 40 minutes.

I also have a complete log here: /lfs/h2/emc/global/noscrub/David.Huber/para/COMROOT/768_768/logs/2021122018/gdasfcst_seg0.log. However, for this test, I increased the number of write tasks by 1.5x, which actually slowed the inline post down further.

Lastly, I have a Hera log here: /scratch1/NCEPDEV/global/David.Huber/para/COMROOT/C768_2/logs/2023021018/gdasfcst.log.

@RuiyuSun

I was able to complete a 120-hour coupled HR4 forecast experiment. The log file is at /lfs/h2/emc/stmp/ruiyu.sun/ROTDIRS/HR47/logs/2020012600 on Dogwood.

@DavidHuber-NOAA (Contributor Author)

I should clarify that this issue was only present for me for the GDAS forecast. The 120-hour ATM-only GFS forecast did not exhibit this issue.

@RussTreadon-NOAA (Contributor)

@DavidHuber-NOAA, I see that the model is now writing both Gaussian grid [atmfxxx, sfcfxxx] and cubed_sphere_grid [atmfxxx, sfcfxxx] files. Writing more output takes more time. Can we reduce I/O time by adjusting WRITE_GROUP or WRTTASK_PER_GROUP_PER_THREAD_PER_TILE?

@DavidHuber-NOAA (Contributor Author)

@RussTreadon-NOAA I did try increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE from 10 to 15 (which is now what is in develop), but the posts actually ran slower. Increasing the WRITE_GROUP would be a good choice to look at next.

@WenMeng-NOAA (Contributor)

@DavidHuber-NOAA Could you try modifying the setting of WRTTASK_PER_GROUP? Could you also keep the run directory for @junwang-noaa and me to check the inline post?

@junwang-noaa (Contributor) commented Sep 11, 2024

@RussTreadon-NOAA Thanks for finding the issue! @DavidHuber-NOAA Is it required to write out 2 sets of history files, on the Gaussian grid and on the native grid? What is the native grid output used for? This doubles the memory requirement on the IO side. Also, I want to confirm that this configuration (2 sets of history files) could cause IO issues on all platforms unless the machine has huge memory.

@RuiyuSun

> I should clarify that this issue was only present for me for the GDAS forecast. The 120-hour ATM-only GFS forecast did not exhibit this issue.

@DavidHuber-NOAA The GFS fcst is slow too in the coupled configuration. My HR4 GFS forecast experiment didn't complete within the 10-hour walltime. layout_x_gfs=24 and layout_y_gfs=16 were used in this run.
=>> PBS: job killed: walltime 36058 exceeded limit 36000

The log file is gfsfcst_seg0.log.0 at /lfs/h2/emc/ptmp/ruiyu.sun/ROTDIRS/HR46/logs/2020012600.

@RuiyuSun

FHMAX_GFS=384 in the experiment

@WenMeng-NOAA (Contributor)

@RuiyuSun From the log you provided at /lfs/h2/emc/stmp/ruiyu.sun/ROTDIRS/HR47/logs/2020012600/gfsfcst_seg0.log, I saw the following configurations:

+ parsing_model_configure_FV3.sh[30]: local WRITE_GROUP=4
+ parsing_model_configure_FV3.sh[31]: local WRTTASK_PER_GROUP=120
+ parsing_model_configure_FV3.sh[32]: local ITASKS=1
+ parsing_model_configure_FV3.sh[33]: local OUTPUT_HISTORY=.true.
+ parsing_model_configure_FV3.sh[34]: local HISTORY_FILE_ON_NATIVE_GRID=.true.
+ parsing_model_configure_FV3.sh[35]: local WRITE_DOPOST=.true.
+ parsing_model_configure_FV3.sh[36]: local WRITE_NSFLIP=.true.

@junwang-noaa Is 'HISTORY_FILE_ON_NATIVE_GRID' set for writing out model data files on the native grid?

@RussTreadon-NOAA (Contributor)

g-w PR #2792 changed

local HISTORY_FILE_ON_NATIVE_GRID=".false."

to

local HISTORY_FILE_ON_NATIVE_GRID=".true."

in ush/parsing_model_configure_FV3.sh

At the same time, we retain

local OUTPUT_HISTORY=${OUTPUT_HISTORY:-".true."}

@RussTreadon-NOAA (Contributor)

As a test, can we revert to local HISTORY_FILE_ON_NATIVE_GRID=".false." in a working copy of ush/parsing_model_configure_FV3.sh and rerun a gdasfcst to see if/how the wall time changes?
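One possible way to make that one-line change for the test, assuming a checked-out working copy of the workflow (a sketch only; editing the file by hand works just as well):

  # Flip the flag introduced by PR #2792, then rerun the gdasfcst job and
  # compare wall times against the previous run.
  sed -i 's/HISTORY_FILE_ON_NATIVE_GRID=".true."/HISTORY_FILE_ON_NATIVE_GRID=".false."/' \
      ush/parsing_model_configure_FV3.sh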

@CoryMartin-NOAA (Contributor)

> @RussTreadon-NOAA Thanks for finding the issue! @DavidHuber-NOAA Is it required to write out 2 sets of history files, on the Gaussian grid and on the native grid? What is the native grid output used for? This doubles the memory requirement on the IO side. Also, I want to confirm that this configuration (2 sets of history files) could cause IO issues on all platforms unless the machine has huge memory.

@junwang-noaa we only need native grid history when we will be using JEDI for the atmospheric analysis. We will likely have to write both since the Gaussian grid is presumably used for products/downstream?

@junwang-noaa (Contributor) commented Sep 11, 2024

So now the write grid component will do:

  1. UPP
  2. Gaussian history files
  3. Native history files
  4. Restart files

The last two tasks increase the memory requirement and slow down the write grid component, which could further slow down the forecast integration. So both the write tasks per group and the number of write groups need to increase in order to keep up with the forecast.
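For a rough sense of scale, a minimal arithmetic sketch of the write-component footprint implied by the HR4 settings quoted later in this thread (illustrative only, not a tuning recommendation):

  # Values taken from the model_configure excerpt quoted later in this thread.
  WRITE_GROUP=4
  WRTTASK_PER_GROUP=120
  echo $(( WRITE_GROUP * WRTTASK_PER_GROUP ))   # 480 write tasks in total, before any increase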

@CoryMartin-NOAA (Contributor)

Since we only need native history for GDAS fcst (and enkfgdas fcst) when using JEDI for atm, and we don't need that for GFSv17, perhaps we either:

  • revert to gaussian only in develop
  • add an option to only use native grid output if DO_JEDIATMVAR="YES"

Later on, we may want to just write out native grid and regrid to Gaussian offline as needed?

@junwang-noaa (Contributor)

@CoryMartin-NOAA I want to confirm with you: when you say "we only need native history for GDAS fcst", do you also need post products from the model? If yes, then we still need to have Gaussian grid fields on the write grid component for the inline post, unless there is a plan to interpolate from the cubed-sphere grid to the Gaussian grid and then run offline post. We would still increase the memory, but the writing time of the native history can be reduced.

@DavidHuber-NOAA @RuiyuSun I see you have the following in the GFS forecast log:

quilting: .true.
quilting_restart: .true.
write_groups: 4
write_tasks_per_group: 120
itasks: 1
output_history: .true.
history_file_on_native_grid: .true.

So the model is writing out 2 sets of C1152 history files. Also, since it is a coupled case, quilting_restart can be set to .false. because the atm is waiting while other model components write out restart files. So please set the following:

quilting: .true.
quilting_restart: .false.
write_groups: 4
write_tasks_per_group: 120
itasks: 1
output_history: .true.
history_file_on_native_grid: .false.

@CoryMartin-NOAA (Contributor)

@junwang-noaa I'll have to defer to someone like @WenMeng-NOAA for that. I do think we have some 'GDAS' products but I'm not sure.

@WenMeng-NOAA (Contributor)

The gdas forecast products (e.g. gdas.tCCz.master.f and gdas.tCCz.sfluxgrbf*) are generated from inline post.

@RuiyuSun

@junwang-noaa I see. Thanks for the suggestion.

@DavidHuber-NOAA (Contributor Author)

I ran a test case on WCOSS2 with local HISTORY_FILE_ON_NATIVE_GRID=".false.". The gdasfcst completed in ~21.5 minutes. The log file for this test is available here: /lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_no_native_history.log.

@junwang-noaa (Contributor)

@DavidHuber-NOAA Is it OK to turn off HISTORY_FILE_ON_NATIVE_GRID for the GFSv17 implementation? Is the 21.5-minute runtime within the operational window? Also, would you please send us the run directories on Hera and WCOSS2 so that we can investigate a little more?

@DavidHuber-NOAA (Contributor Author) commented Sep 11, 2024

@junwang-noaa I will defer the operational question to @aerorahul. Based on the discussion, I think turning off HISTORY_FILE_ON_NATIVE_GRID for GFSv17 would be the right way to go, but I will run a full cycle to verify.

Unfortunately, my run directories were removed automatically by the workflow. I don't think I can replicate the Hera run as I have updated my working version of the workflow, but I will regenerate the run directory on WCOSS2 at least and set KEEPDATA="YES" to prevent it from being deleted.

@CatherineThomas-NOAA (Contributor)

@junwang-noaa @DavidHuber-NOAA
We do not need the history files on the cubed sphere for GFSv17. They are only needed for JEDI atmospheric DA, which is not a v17 target at this time.

@junwang-noaa (Contributor)

Thanks, Cathy. Is the gdas fcst 21.5-minute runtime OK for the operational GFSv17?

@DavidNew-NOAA (Contributor) commented Sep 11, 2024

Right now the native grid cubed-sphere history files are used as backgrounds for JEDI DA. Eventually (something I'm working on right now), they will be interpolated to the Gaussian grid during post-processing, and the forecast model will only need to write to the native grid, not both. Until then, I would agree that we should only turn HISTORY_FILE_ON_NATIVE_GRID on when using JEDI in the workflow.

@junwang-noaa (Contributor)

@DavidNew-NOAA Thanks for the explanation. Regarding "they will be interpolated to the Gaussian grid during post-processing": do you mean that the post-processing code will read in the native grid model output fields and interpolate these fields to the Gaussian grid?

@DavidNew-NOAA (Contributor)

@junwang-noaa Yes, that's correct

@CatherineThomas-NOAA (Contributor)

@junwang-noaa: Yes, 21.5 minutes is very reasonable for the gdas forecast.

@junwang-noaa (Contributor)

@DavidHuber-NOAA So setting HISTORY_FILE_ON_NATIVE_GRID to .false. will resolve the slowness issue in the gdas fcst and GFS fcst jobs on WCOSS2 without significantly increasing the number of write tasks and write groups. Some work needs to be done, as @DavidNew-NOAA mentioned, before HISTORY_FILE_ON_NATIVE_GRID can be turned back on. I also noticed the slowness of writing the native history files on WCOSS2 (a run directory from this test case would be helpful). We will look into it on the model side, but this is for future implementations when native model history files are required. Please let me know if there is still any issue. Thanks.

@DavidHuber-NOAA (Contributor Author)

@junwang-noaa Thank you for the summary. I have copied the run directory with HISTORY_FILE_ON_NATIVE_GRID disabled to /lfs/h2/emc/global/noscrub/David.Huber/keep/fcst_rundir_no_native_history. I will repeat this case with that option enabled and save the working directory and log file.

@DavidNew-NOAA @CoryMartin-NOAA Just to confirm, the native grid history files are required for GDASApp analyses, correct? If so, I will add a conditional block around local HISTORY_FILE_ON_NATIVE_GRID=".true." for JEDI-based experiments.
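A minimal sketch of what such a conditional could look like inside the existing function in ush/parsing_model_configure_FV3.sh, assuming DO_JEDIATMVAR is visible to that script (the variable plumbing is an assumption; only the flag name and the DO_JEDIATMVAR switch are named in this thread):

  # Hypothetical sketch -- placement inside the existing model_configure
  # function is assumed so that 'local' remains valid.
  if [[ "${DO_JEDIATMVAR:-NO}" == "YES" ]]; then
    local HISTORY_FILE_ON_NATIVE_GRID=".true."    # native-grid history needed as JEDI backgrounds
  else
    local HISTORY_FILE_ON_NATIVE_GRID=".false."   # Gaussian-only output; avoids the WCOSS2 slowdown
  fi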

@CoryMartin-NOAA (Contributor)

@DavidHuber-NOAA yes, that would be perfect if you could do that.

@DavidHuber-NOAA (Contributor Author)

Alright, sounds good @CoryMartin-NOAA.

@junwang-noaa I apologize. The gdasfcst for which I copied data to keep/ failed. I think I know the reason and will rerun shortly. I will let you know when I have finished running with both native grid outputs on and off.

@aerorahul (Contributor) commented Sep 12, 2024

@DavidHuber-NOAA did the work for PR #2914. I just opened the PR after the GFSv17 meeting discussion to get eyes on it.

@DavidHuber-NOAA (Contributor Author)

@junwang-noaa The run directories and log files have now been copied to /lfs/h2/emc/global/noscrub/David.Huber/keep as follows:

  • Writing both native and Gaussian grids: gdasfcst_w_native_rundir and gdasfcst_w_native.log (58:00 runtime)
  • Writing only the Gaussian grid: gdasfcst_no_native_rundir and gdasfcst_no_native.log (22:23 runtime)
