porch: Fix watcher duplication bug #4067
Merged
As mentioned in #4050, it probably wasn't the root cause of the porch-server issues. It did help a lot, but after about 8 days we saw a CPU spike and the API server became unavailable.
I found something curious: there were far more "stopping watcher" messages than there should have been watchers. Thousands of them. Poking around, I found the likely reason. The WatcherManager maintains an array of watchers, to which it sends watch events. When a watcher is stopped, its array entry is set to nil, see here:
kpt/porch/pkg/engine/watchermanager.go
Line 97 in 8b5e2f5
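For context, here is a minimal sketch of the bookkeeping described above (illustrative names and types, not the exact porch source):

```go
package engine

import "sync"

// watcher stands in for the real watcher type (channel, filter, etc.).
type watcher struct{}

type watcherManager struct {
	mutex    sync.Mutex
	watchers []*watcher // nil entries are free slots
}

// When a watcher is stopped, its slot is nil-ed out rather than
// removed, leaving a hole for the next registration to reuse.
func (m *watcherManager) stopWatcher(i int) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.watchers[i] = nil
}
```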
When a new watcher is created, it goes through the array and attempts to fill an empty spot, appending if it doesn't find one. That is done in this little gem:
kpt/porch/pkg/engine/watchermanager.go
Line 76 in 8b5e2f5
Notice what's missing in that loop? The `break` statement. Without it, the new watcher gets copied into EVERY empty spot. I suspect this is the root cause. As watchers come and go, spots get freed, but they all get refilled immediately with duplicates. The array thus doesn't grow just to the peak number of watchers; it grows continuously, with at least one new entry for every two new watchers. Eventually the list fills up with duplicates, so any watch event is sent a zillion times to the watchers. Hard to say whether this is the cause of the CPU spike, but it is likely. And then I wonder if the K8s APF bug means that the API server does not recover well after the spike. But that's just speculation.
In any case, this PR should fix it. I don't `break`, but instead continue looping so that I can count all the filled slots in the array, just to log it. An array isn't really the right data structure here...
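Roughly what the fixed loop looks like (again a sketch of the approach, not the literal diff):

```go
// FIXED (sketch): still no break, but only the first free slot is
// filled; the loop keeps going so the live watchers can be counted.
// (Add "log" to the imports in the sketch above.)
func (m *watcherManager) addWatcher(w *watcher) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	inserted := false
	active := 0
	for i := range m.watchers {
		if m.watchers[i] == nil && !inserted {
			m.watchers[i] = w // fill only the first free slot
			inserted = true
		}
		if m.watchers[i] != nil {
			active++ // pre-existing watchers plus the new one
		}
	}
	if !inserted {
		m.watchers = append(m.watchers, w)
		active++
	}
	log.Printf("watcherManager: %d active watchers", active)
}
```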