Feature request: signal handling when job is ending #1336

matthewhoffman · 2017-05-22T15:57:25Z

It is possible to have a queue system send a signal when the job is near its end time, e.g.:
https://slurm.schedmd.com/sbatch.html

--signal=[B:]<sig_num>[@<sig_time>]
    When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the 
resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than 
specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have 
an integer value between 0 and 65535. By default, no signal is sent before the job's end time. If a 
sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option 
to signal only the batch shell, none of the other processes will be signaled. By default all job steps 
will be signalled, but not the batch shell itself.

Within Fortran, it is possible to catch a signal, e.g.:
https://www.sharcnet.ca/help/images/4/42/Fortran_Signal_Handling.pdf

Combining these, it would be possible for MPAS to catch a signal if the job is ending and then do things like write a restart and terminate cleanly.

Without thinking about it more carefully, I'm not sure that any Framework changes would be necessary for this to be implemented in a core.

mgduda · 2017-05-22T18:25:23Z

As far as I'm aware, catching signals in Fortran is still non-standard, though many compilers (like GNU: https://gcc.gnu.org/onlinedocs/gfortran/SIGNAL.html) support it. Maybe we can investigate whether it would be feasible to catch the signal in C using the POSIX-standard sigaction, and then call Fortran code from the C signal handler?

mgduda · 2017-05-22T18:27:35Z

It may also be the case that memory allocation in signal handlers is forbidden (https://stackoverflow.com/questions/33619071/signal-handling-and-check-pointing-for-mpif90/33647381), so we may need to do sneaky things like setting a global variable and returning from the handler, then relying on code that checks the global variable periodically to determine whether, e.g., restart files should be written.
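The "set a global variable and return" pattern described above can be sketched in C as follows. This is a minimal illustration, not MPAS code: the names `shutdown_requested`, `install_shutdown_handler`, and `shutdown_was_requested` are hypothetical, and the choice of signal (e.g. SIGUSR1) is assumed to match whatever the batch script requests via `--signal`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Flag set by the handler. volatile sig_atomic_t is the object type the
   C standard guarantees can be written safely from a signal handler. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler only records that the signal arrived: no I/O, no allocation. */
static void on_shutdown_signal(int signum)
{
    (void)signum;
    shutdown_requested = 1;
}

/* Install the handler for signum (hypothetical helper).
   Returns 0 on success and -1 on failure, like sigaction itself. */
int install_shutdown_handler(int signum)
{
    struct sigaction sa;

    assert(signum > 0);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls where possible */
    return sigaction(signum, &sa, NULL);
}

/* Polled from ordinary code (e.g. once per time step), outside signal context. */
int shutdown_was_requested(void)
{
    return shutdown_requested != 0;
}
```

In this sketch the model would call `install_shutdown_handler(SIGUSR1)` once at startup and check `shutdown_was_requested()` after each time step, writing restart files and exiting from normal code rather than from inside the handler.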

matthewhoffman · 2017-05-22T18:36:37Z

@mgduda, thanks for the feedback on this. It was unclear to me whether this is standard Fortran or not. The pdf I linked to seemed to imply it is now standard, but that some compilers still include their previous non-standard way of handling it. (It didn't come out and say that, though, so I still wasn't sure.)

In any case, I suspect this feature may not be super useful in the end. We generally want to protect against unexpected termination anyway, so if we are already writing restarts at a regular frequency, the ability to force a restart at wallclock end might not add much extra value. Getting timer information on a run that times out might be useful, but maybe not worth the hassle here.

I mostly wanted to jot this idea down for posterity, so thanks for adding to it.

matthewhoffman · 2017-05-22T18:44:08Z

@mgduda, reading the link you added more carefully, perhaps a more useful way to get this behavior would be a timing capability that triggers writing a restart and/or model termination after a specified wallclock duration, configurable at runtime. For example, in the namelist you would set shutdown_after_elapsed_time = 15:45:00 if you want the model to shut down cleanly after the first time step that occurs within 15 minutes of the end of a 16 hr submission.

You would have to modify that time with each job to keep it consistent with your job submission script, of course. But my original proposal already required a special line in the submission script with the appropriate time and an arbitrarily chosen signal code, so this isn't much worse.

Another advantage to this approach is that I don't think it would require any framework changes - any core could implement that on their own now.
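A rough sketch of the elapsed-wallclock check proposed above, written in C for illustration. The option name shutdown_after_elapsed_time and its HH:MM:SS format come from the comment; the function names `parse_hms` and `wallclock_limit_reached` are hypothetical, and how the string would be read from the namelist is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Convert "HH:MM:SS" to seconds. Returns -1 if the text is malformed. */
long parse_hms(const char *text)
{
    int h, m, s;
    char extra;

    assert(text != NULL);
    /* The trailing %c catches junk after the seconds field. */
    if (sscanf(text, "%d:%d:%d%c", &h, &m, &s, &extra) != 3)
        return -1;
    if (h < 0 || m < 0 || m > 59 || s < 0 || s > 59)
        return -1;
    return 3600L * h + 60L * m + s;
}

/* Nonzero once at least limit_seconds of wallclock time separate
   start from now. Intended to be called after each time step. */
int wallclock_limit_reached(time_t start, time_t now, long limit_seconds)
{
    return difftime(now, start) >= (double)limit_seconds;
}
```

With the example value from the comment, `parse_hms("15:45:00")` gives 56700 seconds. A driver loop would record `time(NULL)` at startup and, after each step, call `wallclock_limit_reached(start, time(NULL), limit)` to decide whether to write a restart and stop.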

philipwjones · 2017-05-22T19:02:54Z

Indeed, we had this capability in POP for a while, but it was problematic because it was non-standard.
It did work for a while, and I think @maltrud used it to some effect. Given the non-portability across both Fortran implementations and batch schedulers, it might be best to implement this with a generic wrapper around a function that could be heavily ifdef'd to handle specific configs and is a no-op otherwise. Using the standard clock functions might be the better approach.

mgduda · 2017-05-23T16:59:06Z

@matthewhoffman The second option that you proposed (the ability to force-write a restart stream after some specified elapsed wallclock time) would be rather easy to implement in a portable way, I think. The Fortran date_and_time intrinsic can provide the starting time of a run and the current time after each timestep; with some arithmetic, a core could then decide to write its restart stream(s) with forceWriteNow=.true..
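The "some arithmetic" mentioned above mostly amounts to turning two calendar timestamps into an elapsed number of seconds, which has to work across midnight and month boundaries. The sketch below shows that arithmetic in C rather than Fortran. It assumes an 8-element array laid out like the values argument of date_and_time (year, month, day, UTC offset in minutes, hour, minute, second, millisecond), shifted to 0-based indexing; the function names are hypothetical.

```c
#include <assert.h>

/* Days since 1970-01-01 for a proleptic Gregorian date, using the
   well-known era-based civil-date formula. */
static long days_from_civil(long y, int m, int d)
{
    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    y -= (m <= 2); /* treat Jan and Feb as months of the previous year */
    long era = (y >= 0 ? y : y - 399) / 400;
    long yoe = y - era * 400;                                   /* year of era */
    long doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1; /* day of year */
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* day of era  */
    return era * 146097 + doe - 719468;
}

/* Seconds since 1970-01-01 00:00 UTC for a date_and_time-style array:
   v[0]=year, v[1]=month, v[2]=day, v[3]=UTC offset (minutes),
   v[4]=hour, v[5]=minute, v[6]=second, v[7]=millisecond. */
static double values_to_seconds(const int v[8])
{
    long days = days_from_civil(v[0], v[1], v[2]);
    return 86400.0 * days + 3600.0 * v[4] + 60.0 * v[5] + v[6]
           + v[7] / 1000.0 - 60.0 * v[3];
}

/* Elapsed wallclock seconds between two timestamps. */
double elapsed_seconds(const int start[8], const int now[8])
{
    return values_to_seconds(now) - values_to_seconds(start);
}
```

Subtracting the UTC offset keeps the result consistent if the reported offset changes during a run, for example at a daylight-saving transition. A core would compare the result against its configured limit after each timestep before deciding to force a restart write.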

matthewhoffman added the enhancement label May 22, 2017
