System
Custodian version: https://github.com/nwinner/custodian, branch cp2k. The issue is not specific to this branch, but I am using handlers that exist only in my fork rather than in the main repo.
Python version: 3.7
Summary
Consider a handler that checks for convergence. It checks whether convergence has not been reached while the job is running, so it is a monitor. It can happen, however, that the program finishes because the maximum number of steps was reached, and exits before custodian cycles around to another check of the monitors. Custodian will then proceed to the final check.
It appears that, the way custodian is written, if any error was found and stored in the boolean variable has_error, then instead of running the final check with all handlers, only the handlers with is_monitor=False are checked. Because of this, my convergence handler was not included in the final check, so it was bypassed and the job was marked as completed because a different handler had addressed the problem that set has_error.
Maybe I'm not understanding the logic of the code, but this does not seem to be the way it should function. Clarification would be appreciated if this is intended. I'm not sure what to propose as a solution, because running every handler at the end could cause a monitor to be run twice.
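For concreteness, here is a minimal sketch of the kind of handler involved. This is not the actual handler from my fork; the class name, output file, and the string it looks for are placeholders. It only shows the is_monitor=True shape that runs into the behavior described above.

```python
from custodian.custodian import ErrorHandler


class UnconvergedMonitor(ErrorHandler):
    """Hypothetical monitor that watches an output file for a
    'max steps reached without convergence' condition while the job runs."""

    is_monitor = True  # checked periodically while the job is running

    def __init__(self, output_file="run.out"):
        self.output_file = output_file

    def check(self):
        # Placeholder test: flag an error if the output reports that the
        # maximum number of steps was reached without converging.
        try:
            with open(self.output_file) as f:
                return "MAX STEPS REACHED" in f.read()
        except FileNotFoundError:
            return False

    def correct(self):
        # Placeholder correction: a real handler would e.g. set up a restart.
        return {"errors": ["Unconverged"], "actions": []}
```

Because is_monitor is True, a handler like this is dropped from the final _do_check call whenever has_error has already been set.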
Example code
This code snippet is taken from custodian.py:448
```python
# While the job is running, we use the handlers that are
# monitors to monitor the job.
if isinstance(p, subprocess.Popen):
    if self.monitors:
        n = 0
        while True:
            n += 1
            time.sleep(self.polling_time_step)
            if p.poll() is not None:
                break
            terminate = self.terminate_func or p.terminate
            if n % self.monitor_freq == 0:
                has_error = self._do_check(self.monitors,
                                           terminate)
        if terminate is not None and terminate != p.terminate:
            time.sleep(self.polling_time_step)
    else:
        p.wait()
        if self.terminate_func is not None and \
                self.terminate_func != p.terminate:
            self.terminate_func()
            time.sleep(self.polling_time_step)
    zero_return_code = p.returncode == 0

logger.info("{}.run has completed. "
            "Checking remaining handlers".format(job.name))

# Check for errors again, since in some cases non-monitor
# handlers fix the problems detected by monitors
# if an error has been found, not all handlers need to run
if has_error:
    self._do_check([h for h in self.handlers
                    if not h.is_monitor])
else:
    has_error = self._do_check(self.handlers)
```
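To make the trade-off concrete, one possible (untested) change would be to run the full handler list in the final check regardless of has_error. I am not proposing this as the fix, since, as noted above, a monitor whose check() still returns True would then have its correct() applied a second time:

```python
# Hypothetical, untested variant of the final check quoted above: always run
# every handler, monitors included, so that a condition such as
# unconverged-at-max-steps is not silently skipped. The cost is that a monitor
# that already corrected during the run may have correct() applied again here.
has_error = self._do_check(self.handlers) or has_error
```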