porch: Fix watcher duplication bug #4067
Merged
As mentioned in #4050, it probably wasn't the root cause of the porch-server issues. It did help a lot, but after about 8 days we saw a CPU spike and the API server became unavailable.
I found something curious: there were far more "stopping watcher" messages than there should have been watchers. Thousands of them. Poking around, I found the likely reason. The WatcherManager maintains an array of watchers, to which it sends watch events. When a watcher is stopped, its array entry is set to nil, see here:
kpt/porch/pkg/engine/watchermanager.go
Line 97 in 8b5e2f5
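For context, here is a minimal sketch of the bookkeeping described above (illustrative names and types, not the exact porch source):

```go
package engine

import "sync"

// watcher stands in for the real watcher type (channel, filter, etc.).
type watcher struct{}

type watcherManager struct {
	mutex    sync.Mutex
	watchers []*watcher // nil entries are free slots
}

// When a watcher is stopped, its slot is nil-ed out rather than
// removed, leaving a hole for the next registration to reuse.
func (m *watcherManager) stopWatcher(i int) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.watchers[i] = nil
}
```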
When a new watcher is created, it goes through the array and attempts to fill an empty spot, appending if it doesn't find one. That is done in this little gem:
kpt/porch/pkg/engine/watchermanager.go
Line 76 in 8b5e2f5
Notice what's missing in that loop? The `break` statement. Without it, the new watcher gets copied into EVERY empty spot. I suspect this is the root cause. As watchers come and go, spots get freed, but they all get refilled immediately with duplicates. The array thus doesn't grow just to the peak number of watchers; it grows continuously, with at least one new entry for every two new watchers. Eventually the list fills up with duplicates, so any watch event is sent a zillion times to the watchers. Hard to say whether this is the cause of the CPU spike, but it is likely. And then I wonder if the K8s APF bug means that the API server does not recover well after the spike. But that's just speculation.
In any case, this PR should fix it. I don't `break`, but instead continue looping so that I can count all the filled slots in the array, just to log it. An array isn't really the right data structure here...
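Roughly what the fixed loop looks like (again a sketch of the approach, not the literal diff):

```go
// FIXED (sketch): still no break, but only the first free slot is
// filled; the loop keeps going so the live watchers can be counted.
// (Add "log" to the imports in the sketch above.)
func (m *watcherManager) addWatcher(w *watcher) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	inserted := false
	active := 0
	for i := range m.watchers {
		if m.watchers[i] == nil && !inserted {
			m.watchers[i] = w // fill only the first free slot
			inserted = true
		}
		if m.watchers[i] != nil {
			active++ // pre-existing watchers plus the new one
		}
	}
	if !inserted {
		m.watchers = append(m.watchers, w)
		active++
	}
	log.Printf("watcherManager: %d active watchers", active)
}
```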