Make important services resilient against oom-killer #1686

Open
kbilev opened this issue Jun 12, 2024 · 0 comments

Comments

@kbilev
Contributor

kbilev commented Jun 12, 2024

Is your feature request related to a problem? Please describe.
We have one satellite dedicated to check_nwc_health checks for a specific type of device.
If those devices become unreachable, for example because of a network outage, the checks time out and the check queue fills up until the host reaches maximum CPU load or runs out of memory.
The oom-killer then tries to free memory (normal behaviour), but it will probably kill the "wrong" processes, such as mysqld or gearmand.
At that point we are stuck in a loop, and the satellite cannot return to normal operation.
A hard reboot of the satellite does not solve the problem, because the check queue fills up too fast and the oom-killer kills the wrong processes again.

Describe the solution you'd like
It might be a good idea to make the system-relevant processes more resilient against the oom-killer by adding a parameter to their unit files:

[Service]
OOMScoreAdjust=-1000

We have not tested this setting yet, so we cannot say whether it really resolves the problem.
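For anyone who wants to try the setting without editing the packaged unit files, a systemd drop-in might be one way to do it. The following is only a rough sketch: `write_oom_dropin` is a hypothetical helper, the file name `oom.conf` is arbitrary, and the unit names `mysqld.service` and `gearmand.service` are assumptions that may differ on a given satellite.

```shell
#!/bin/sh
# Sketch only: write_oom_dropin is a hypothetical helper, not part of any
# existing tool. It writes a systemd drop-in that sets OOMScoreAdjust.
#   $1 = unit name (assumed, e.g. gearmand.service)
#   $2 = score, defaults to -1000 (exempts the process from the oom-killer)
#   $3 = base directory, defaults to /etc/systemd/system
write_oom_dropin() {
    unit="$1"
    score="${2:--1000}"
    base="${3:-/etc/systemd/system}"
    dir="$base/$unit.d"

    mkdir -p "$dir" || return 1
    # Same two lines as proposed above, but kept in a separate override file.
    printf '[Service]\nOOMScoreAdjust=%s\n' "$score" > "$dir/oom.conf"
}

# Possible usage as root (unit names are assumptions):
#   write_oom_dropin mysqld.service
#   write_oom_dropin gearmand.service
#   systemctl daemon-reload
#   systemctl restart mysqld.service gearmand.service
# Check that the kernel sees the new value for the main process:
#   cat "/proc/$(systemctl show -p MainPID --value gearmand.service)/oom_score_adj"
```

One thing to keep in mind when testing: `oom_score_adj` is inherited by child processes, so it would be worth checking that nothing started by a protected service ends up protected as well.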

Describe alternatives you've considered
To get the satellite back online, you have to kill check processes until the system is able to process all of them again:
ps -ef | grep 'perl' | grep -v grep | awk '{print $2}' | head -n 300 | xargs -r kill -9
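A slightly gentler variation of this workaround could send SIGTERM first and only fall back to SIGKILL for processes that are still alive after a grace period. This is an untested sketch, not something we run in production: `terminate_pids` is a hypothetical helper, and the 5 second default is an arbitrary assumption.

```shell
#!/bin/sh
# Sketch only: terminate_pids is a hypothetical helper.
# Reads PIDs (one per line) from stdin, sends SIGTERM, waits $1 seconds
# (default 5), then sends SIGKILL to any PID that still exists.
terminate_pids() {
    grace="${1:-5}"
    pids=$(cat)
    [ -n "$pids" ] || return 0

    # Word splitting on $pids is intentional: one argument per PID.
    kill -s TERM $pids 2>/dev/null
    sleep "$grace"

    for pid in $pids; do
        # kill -0 only tests whether the PID can still be signalled.
        if kill -0 "$pid" 2>/dev/null; then
            kill -s KILL "$pid" 2>/dev/null
        fi
    done
    return 0
}

# Possible usage with the same selection pipeline as above:
#   ps -ef | grep 'perl' | grep -v grep | awk '{print $2}' | head -n 300 | terminate_pids 5
```

The selection still matches every process with `perl` in its command line, so the same caution applies as with the original one-liner.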
