Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Prevent runaway memory consumption #4048

Closed
wants to merge 4 commits into from

Conversation

johnbelamaric
Copy link
Contributor

@johnbelamaric johnbelamaric commented Sep 21, 2023

Fixes #4023

  • Abort a poll of a repository if one is already in progress. Previously, these would build up, with a new poll starting every minute for each repository. If they took longer than a minute, they would accumulate and overwhelm them system.
  • Use package revision ResourceVersion to avoid sending unnecessary watch updates for package revisions.
  • Reduce noisy logging.
  • Not clear this fixes the root cause, but it likely will fix the periodic failures.

Signed-off-by: John Belamaric <[email protected]>
@johnbelamaric johnbelamaric changed the title Reduce logging verbosity and increase utility Do not sent watch events if package revisions are unmodified Sep 25, 2023
@johnbelamaric
Copy link
Contributor Author

Summary of analysis:

  • The APF errors were a red herring. APF was being activated because the server was starting to get overwhelmed and time out. This tickled an APF bug. But disabling APF does not fix the problem, it just gets rid of the error messages; the server still crashes.
  • It seems to be a resource leak, or an issue where some back pressure is needed. I tried bumping the resources for the porch-server pod and it lasts longer but still eventually gets OOMKilled. As it is happening, the number of watch events increases, as do the latencies for returning API calls. My guess is there is a feedback loop where a delayed API call of some sort causes clients and/or the server itself to make additional requests, which in turn slows things down more, and then either leaks or a backlog fill up memory until it is killed and restarted.
  • I suspect the sending of extraneous watches is causing a lot of strain on the porch-server, so this PR fixes that "over notification". This may not fix the underlying issue - the system should protect itself so that it does not get into a runaway state and get OOMKilled.
  • Even if it doesn't fix the underlying issue, it should increase stability of the system, since it will reduce the chance that the system gets pushed past the point where the feedback loop kicks in.
  • See Disable APF #4042 (comment) for some graphs.
  • I am running a test now, we can watch it for a day or two and see if stability has increased.
  • We should follow up with load tests to see if we can induce this state even with the change in here.

@johnbelamaric johnbelamaric mentioned this pull request Sep 25, 2023
@johnbelamaric johnbelamaric changed the title Do not sent watch events if package revisions are unmodified Do not send watch events if package revisions are unmodified Sep 25, 2023
@johnbelamaric johnbelamaric changed the title Do not send watch events if package revisions are unmodified Prevent runaway memory consumption Sep 25, 2023
@johnbelamaric
Copy link
Contributor Author

soak test in progress

Signed-off-by: John Belamaric <[email protected]>
@johnbelamaric
Copy link
Contributor Author

/hold

I think I have the root cause, testing the theory.

@johnbelamaric johnbelamaric changed the title Prevent runaway memory consumption [WIP] Prevent runaway memory consumption Sep 27, 2023
@johnbelamaric
Copy link
Contributor Author

Replaced by #4050

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

porch: porch-server crash with "Unable to derive new concurrency limits"
1 participant