This repository has been archived by the owner on Oct 23, 2020. It is now read-only.

Feature request: signal handling when job is ending #1336

Open
matthewhoffman opened this issue May 22, 2017 · 6 comments

Comments

@matthewhoffman
Member

It is possible to have a queue system send a signal when the job is near its end time, e.g.:
https://slurm.schedmd.com/sbatch.html

--signal=[B:]<sig_num>[@<sig_time>]
    When a job is within sig_time seconds of its end time, send it the signal sig_num. Due to the 
resolution of event handling by Slurm, the signal may be sent up to 60 seconds earlier than 
specified. sig_num may either be a signal number or name (e.g. "10" or "USR1"). sig_time must have 
an integer value between 0 and 65535. By default, no signal is sent before the job's end time. If a 
sig_num is specified without any sig_time, the default time will be 60 seconds. Use the "B:" option 
to signal only the batch shell, none of the other processes will be signaled. By default all job steps 
will be signalled, but not the batch shell itself. 

Within Fortran, it is possible to catch a signal, e.g.:
https://www.sharcnet.ca/help/images/4/42/Fortran_Signal_Handling.pdf

Combining these, it would be possible for MPAS to catch a signal if the job is ending and then do things like write a restart and terminate cleanly.

I haven't thought this through carefully, but I don't think any Framework changes would be needed for a core to implement this.

@mgduda
Contributor

mgduda commented May 22, 2017

As far as I'm aware, catching signals in Fortran is still non-standard, though many compilers (like GNU: https://gcc.gnu.org/onlinedocs/gfortran/SIGNAL.html) support it. Maybe we could investigate whether it would be feasible to do this in C using the POSIX-standard sigaction and then call Fortran code from the C signal handler?

@mgduda
Contributor

mgduda commented May 22, 2017

It may also be the case that memory allocation in signal handlers is forbidden (https://stackoverflow.com/questions/33619071/signal-handling-and-check-pointing-for-mpif90/33647381), so we may need to do sneaky things like setting a global variable and returning from the handler, then relying on code that checks the global variable periodically to determine whether, e.g., restart files should be written.
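A minimal C sketch of the "set a global variable and return" idea, assuming a POSIX system. The names `install_shutdown_handler` and `shutdown_was_requested` are hypothetical, not existing MPAS routines; the intent is that a core would poll the flag between time steps (for example through a Fortran `bind(C)` interface) and do its restart writing there rather than inside the handler.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by the model's time-step loop. */
static volatile sig_atomic_t shutdown_requested = 0;

/* The handler only records that the signal arrived:
 * no allocation, no I/O, no calls into Fortran. */
static void on_shutdown_signal(int signum)
{
    (void)signum;
    shutdown_requested = 1;
}

/* Hypothetical helper: install the handler for signum (e.g. SIGUSR1).
 * Returns 0 on success, -1 on failure, like sigaction itself. */
int install_shutdown_handler(int signum)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls where possible */
    return sigaction(signum, &sa, NULL);
}

/* Hypothetical helper: nonzero once the signal has been received. */
int shutdown_was_requested(void)
{
    return shutdown_requested != 0;
}
```

The signal passed to `install_shutdown_handler` would have to match whatever the batch script requests through `--signal`.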

@matthewhoffman
Member Author

@mgduda, thanks for the feedback on this. It was unclear to me whether this is standard Fortran or not. The PDF I linked to seemed to imply that it is now standard but that some compilers still include their earlier non-standard way of handling it. (It didn't say so outright, though, so I still wasn't sure.)

In any case, I suspect this feature may not end up being very useful: we generally want to protect against unexpected termination, so if we are already writing restarts at a regular frequency, the ability to force a restart at the wallclock limit might not add much value. Getting timer information from a run that times out might be useful, but maybe not worth the hassle here.

I mostly wanted to jot this idea down for posterity, so thanks for adding to it.

@matthewhoffman
Member Author

@mgduda, reading the link you added more carefully, perhaps a more useful way to get this behavior would be a timing feature that triggers writing a restart and/or terminating the model after a specified wallclock duration that is configurable at runtime (i.e., in the namelist you set shutdown_after_elapsed_time = 15:45:00 if you want the model to shut down cleanly after the first time step that occurs within 15 minutes of the end of a 16 hr submission).

You would have to modify that time with each job to keep it consistent with your job submission script, of course, but my original proposal already required a special line in the submission script with the appropriate time and an arbitrarily chosen signal code, so this isn't much worse.

Another advantage of this approach is that I don't think it would require any framework changes: any core could implement it on its own now.
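As a rough illustration of the elapsed-time option, here is a C sketch of turning an `HH:MM:SS` setting such as `15:45:00` into seconds and comparing it with the elapsed wallclock time. The function names are hypothetical, and how a core would read `shutdown_after_elapsed_time` from its namelist is not shown.

```c
#include <stdio.h>

/* Hypothetical helper: parse "HH:MM:SS" into a number of seconds.
 * Returns -1 if the text is missing, malformed, or out of range. */
long parse_hms_seconds(const char *text)
{
    int h, m, s;
    char extra;

    if (text == NULL)
        return -1;
    /* Exactly three fields must convert; a fourth means trailing junk. */
    if (sscanf(text, "%d:%d:%d%c", &h, &m, &s, &extra) != 3)
        return -1;
    if (h < 0 || m < 0 || m > 59 || s < 0 || s > 59)
        return -1;
    return (long)h * 3600L + (long)m * 60L + (long)s;
}

/* Hypothetical helper: nonzero once the elapsed wallclock time has reached
 * the limit. A negative limit (unset or invalid) never triggers. */
int shutdown_time_reached(double elapsed_seconds, long limit_seconds)
{
    return limit_seconds >= 0 && elapsed_seconds >= (double)limit_seconds;
}
```

With the value from the example, `parse_hms_seconds("15:45:00")` gives 56700 seconds, and the check would be made once per time step.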

@philipwjones
Contributor

Indeed, we had this capability in POP for a while, but it was problematic because it was non-standard.
It did work for a while, and I think @maltrud used it to some effect. Given the non-portability across both Fortran implementations and batch schedulers, it might be best to implement this as a generic wrapper around a function that could be heavily ifdef'd to handle specific configs and is a no-op otherwise. Using the standard clock functions might be the better approach.
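One possible shape for such a wrapper, sketched in C. `HAVE_POSIX_SIGACTION` is a hypothetical macro that the build system would define only for configurations known to support it, and the function names are likewise made up for illustration; this is not how POP did it.

```c
#include <signal.h>
#include <string.h>

/* Flag polled by the model; stays 0 forever in unsupported builds. */
static volatile sig_atomic_t end_of_job_flag = 0;

#ifdef HAVE_POSIX_SIGACTION   /* hypothetical build-time macro */
static void note_end_of_job(int signum)
{
    (void)signum;
    end_of_job_flag = 1;
}
#endif

/* Returns 1 if a handler was installed, 0 if nothing was done. */
int enable_end_of_job_notice(int signum)
{
#ifdef HAVE_POSIX_SIGACTION
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = note_end_of_job;
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, NULL) == 0 ? 1 : 0;
#else
    (void)signum;   /* unsupported configuration: no-op */
    return 0;
#endif
}

/* Nonzero once an end-of-job signal has been seen. */
int end_of_job_noticed(void)
{
    return end_of_job_flag != 0;
}
```

Callers use the same two functions in every configuration; where the macro is undefined, the model simply never sees an end-of-job notice.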

@mgduda
Contributor

mgduda commented May 23, 2017

@matthewhoffman The second option that you proposed (the ability to force-write a restart stream after some specified elapsed wallclock time) would be rather easy to implement in a portable way, I think. The Fortran date_and_time intrinsic can provide the starting time of a run and the current time after each timestep; with some arithmetic, a core could then decide to write its restart stream(s) with forceWriteNow=.true..
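The "some arithmetic" could look roughly like the following, sketched here in C rather than Fortran. The eight-element array mirrors the `values` argument of `date_and_time` (year, month, day, UTC offset in minutes, hour, minute, second, millisecond); the function names are hypothetical. Subtracting the UTC offset is a design choice so that a daylight-saving change during the run does not distort the elapsed time.

```c
/* Days since 1970-01-01 for a proleptic Gregorian date
 * (a widely used civil-date algorithm). */
static long days_from_civil(long y, int m, int d)
{
    long era, yoe, doy, doe;

    y -= (m <= 2);                       /* treat Jan/Feb as end of previous year */
    era = (y >= 0 ? y : y - 399) / 400;  /* 400-year cycle index */
    yoe = y - era * 400;                 /* year within the cycle, 0..399 */
    doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  /* day of year from Mar 1 */
    doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* day within the cycle */
    return era * 146097 + doe - 719468;
}

/* Hypothetical helper: convert date_and_time-style values to UTC seconds.
 * v[0]=year, v[1]=month, v[2]=day, v[3]=UTC offset in minutes,
 * v[4]=hour, v[5]=minute, v[6]=second, v[7]=millisecond. */
double wallclock_seconds(const int v[8])
{
    double s = (double)days_from_civil(v[0], v[1], v[2]) * 86400.0;
    s += v[4] * 3600.0 + v[5] * 60.0 + v[6] + v[7] / 1000.0;
    s -= v[3] * 60.0;                    /* local time minus offset gives UTC */
    return s;
}

/* Hypothetical helper: elapsed seconds between two readings. */
double elapsed_wallclock(const int start[8], const int now[8])
{
    return wallclock_seconds(now) - wallclock_seconds(start);
}
```

A core would store the reading taken at startup, take a new one after each time step, and compare the difference against its configured limit before deciding whether to force the restart write.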
