Is your feature request related to a problem? Please describe.
We have one satellite dedicated to check_nwc_health checks for a specific type of device.
If for some reason those devices become unreachable, for example due to a network outage, the checks time out and the check queue fills up until the host reaches maximum CPU load or runs out of memory.
At that point the oom-killer tries to free up memory (normal behaviour), but it will likely kill the "wrong" processes such as mysqld or gearmand.
We are now stuck in a loop, and the satellite is no longer able to return to a normal state on its own.
A hard reboot of the satellite does not solve the problem either, as the check queue fills up too fast and the oom-killer kills the wrong processes again.
Describe the solution you'd like
It might be a good idea to make the system-relevant processes more resilient against the oom-killer by adding a parameter to their unit files:
[Service]
OOMScoreAdjust=-1000
We have not tested the setting yet, so we cannot say whether it really resolves the problem.
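As a minimal sketch of how the setting could be applied (untested; mysqld.service is only an example of an affected unit, the same would be done for gearmand and other system-relevant services), a systemd drop-in keeps the change separate from the packaged unit file:

# Create a drop-in override for the service
systemctl edit mysqld.service
# In the editor that opens, add:
#   [Service]
#   OOMScoreAdjust=-1000
systemctl daemon-reload
systemctl restart mysqld.service
# Verify the adjustment took effect for the main process
cat /proc/$(systemctl show -p MainPID --value mysqld.service)/oom_score_adj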
Describe alternatives you've considered
To get the satellite back online, you have to kill check processes until the system is able to process all of them again:
ps -ef | grep 'perl' | grep -v grep | awk '{print $2}' | head -n 300 | xargs -r kill -9
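A slightly narrower variant of this cleanup (a sketch only; it assumes the stuck checks can be identified by the check_nwc_health string in their command line, so other Perl processes are not caught):

# Kill up to 300 stuck check_nwc_health processes per run, repeat until the queue drains
pgrep -f 'check_nwc_health' | head -n 300 | xargs -r kill -9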