Is your feature request related to a problem? Please describe.
We have one satellite dedicated to check_nwc_health checks for a specific type of device.
If for some reason those devices become unreachable, for example due to a network outage, the checks time out and the check queue fills up until the host reaches maximum CPU load or runs out of memory.
At that point the oom-killer tries to free up memory (normal behaviour), but it will likely kill the "wrong" processes such as mysqld or gearmand.
We are now stuck in a loop, and the satellite is no longer able to return to a normal state on its own.
A hard reboot of the satellite does not solve the problem either, as the check queue fills up too fast and the oom-killer kills the wrong processes again.
Describe the solution you'd like
It might be a good idea to make the system-relevant processes more resilient against the oom-killer by adding a parameter to their unit files:
[Service]
OOMScoreAdjust=-1000
We have not tested the setting yet, so we cannot say whether it really resolves the problem.
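As a minimal sketch of how the setting could be applied (untested; mysqld.service is only an example of an affected unit, the same would be done for gearmand and other system-relevant services), a systemd drop-in keeps the change separate from the packaged unit file:

# Create a drop-in override for the service
systemctl edit mysqld.service
# In the editor that opens, add:
#   [Service]
#   OOMScoreAdjust=-1000
systemctl daemon-reload
systemctl restart mysqld.service
# Verify the adjustment took effect for the main process
cat /proc/$(systemctl show -p MainPID --value mysqld.service)/oom_score_adj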
Describe alternatives you've considered
To get the satellite back online, you have to kill check processes until the system is able to process all of them again:
ps -ef | grep 'perl' | grep -v grep | awk '{print $2}' | head -n 300 | xargs -r kill -9
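A slightly narrower variant of this cleanup (a sketch only; it assumes the stuck checks can be identified by the check_nwc_health string in their command line, so other Perl processes are not caught):

# Kill up to 300 stuck check_nwc_health processes per run, repeat until the queue drains
pgrep -f 'check_nwc_health' | head -n 300 | xargs -r kill -9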