
(fix) registry pods do not come up again after node failure #3366

Merged

Conversation

@anik120 (Contributor) commented Aug 14, 2024:

Description of the change:

PR 3201 attempted to solve the issue by deleting the pods stuck in
Terminating due to an unreachable node. However, the logic to do that was
included in EnsureRegistryServer, which only gets executed if polling is
requested by the user.

This PR moves the logic of checking for dead pods out of EnsureRegistryServer
and puts it in CheckRegistryServer instead. This way, if any dead pods are
detected during CheckRegistryServer, healthy is returned as false, which in
turn triggers EnsureRegistryServer.
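
For illustration, the control flow this change relies on looks roughly like the sketch below. The interface and signatures here are simplified and hypothetical (the real reconciler methods take catalog source and logger arguments); this is not the merged code.

```go
package catalog

// registryReconciler is a simplified stand-in for the real reconciler
// interface; signatures are illustrative only.
type registryReconciler interface {
	CheckRegistryServer() (healthy bool, err error)
	EnsureRegistryServer() error
}

// syncSketch shows why moving the dead-pod check into CheckRegistryServer
// helps: an unhealthy result re-ensures the registry server on every sync,
// not only when catalog polling is requested.
func syncSketch(rec registryReconciler) error {
	healthy, err := rec.CheckRegistryServer()
	if err != nil {
		return err
	}
	if !healthy {
		// Dead pods (deleted but still listed) surface here as "not healthy",
		// so the registry server is rebuilt and the wedged pods are replaced.
		return rec.EnsureRegistryServer()
	}
	return nil
}
```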

Motivation for the change:

Architectural changes:

Testing remarks:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Bug fixes are accompanied by regression test(s)
  • e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
  • tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
  • Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive
  • Tests marked as [FLAKE] are truly flaky and have an issue
  • Code is properly formatted

@perdasilva (Collaborator) commented:

So far it looks pretty good =D should we add any unit tests for the new behaviour? and maybe the test we missed in the original bug fix?

@anik120 (Contributor, Author) commented Aug 14, 2024:

> So far it looks pretty good =D should we add any unit tests for the new behaviour? and maybe the test we missed in the original bug fix?

I was scared you'd ask this. :p

I did think about it, looked into it, and it looks like it's going to be a little bit of effort to write a test for this. Mainly because it's not really unit testable, so an e2e test is the viable option. However, I'm not sure we can mimic a node going down from within our tests.

I just figured the squeeze might not be worth the juice; however, if someone has a better idea (or really insists we include a test for this, even if it'll require some effort), I'm happy to include one in this PR.

@anik120 force-pushed the fix-catalog-pods-node-failure branch from 215fe08 to db34956 on August 14, 2024 17:13
@perdasilva (Collaborator) commented:

> So far it looks pretty good =D should we add any unit tests for the new behaviour? and maybe the test we missed in the original bug fix?
>
> I was scared you'd ask this. :p
>
> I did think about it, looked into it, and it looks like it's going to be a little bit of effort to write a test for this. Mainly because it's not really unit testable, so an e2e test is the viable option. However, I'm not sure we can mimic a node going down from within our tests.
>
> I just figured the squeeze might not be worth the juice; however, if someone has a better idea (or really insists we include a test for this, even if it'll require some effort), I'm happy to include one in this PR.

What's the signal we get in the code that a node has gone down? Are some pods unreachable?

@anik120 (Contributor, Author) commented Aug 15, 2024:

> What's the signal we get in the code that a node has gone down? Are some pods unreachable?

We don't really get any signal that a node has gone down; we just discover pod/s that have been Deleted but still show up in the result of the list pods action. So we say "something's awry, let's force clean this area and move on".

And that "discovery" is done here. We could add a test for that, but it's a pretty simple function.

The really useful test would have been if we could mimic pods "hanging around" in an e2e setting. In fact, the entire syncRegistryServer functionality is mostly e2e tested; this is the only unit test that's there.
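
For context, the kind of discovery-and-cleanup being described might look roughly like this with plain client-go. This is a hedged sketch with a hypothetical helper name, not the function linked above:

```go
package reconciler

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// forceDeleteDeadPods is a hypothetical helper: any pod that already carries a
// DeletionTimestamp but is still returned by the lister is assumed to be stuck
// (e.g. its node became unreachable) and is force-deleted with a zero grace period.
func forceDeleteDeadPods(ctx context.Context, client kubernetes.Interface, pods []*corev1.Pod, namespace string) (found bool, err error) {
	zero := int64(0)
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			continue // pod is alive, leave it alone
		}
		found = true
		if delErr := client.CoreV1().Pods(namespace).Delete(ctx, pod.GetName(), metav1.DeleteOptions{
			GracePeriodSeconds: &zero, // skip graceful termination for the wedged pod
		}); delErr != nil {
			err = delErr // keep the last error; real code would aggregate
		}
	}
	return found, err
}
```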

@anik120 (Contributor, Author) commented Aug 15, 2024:

I may have found a way to e2e test this......working on it now....

@perdasilva (Collaborator) commented:

> ...discover pod/s that have been Deleted

Would it be possible to mock those client responses? e.g. list says they are there, get says they aren't?

@anik120 (Contributor, Author) commented Aug 19, 2024:

Okay I have some tests now.

> I may have found a way to e2e test this......working on it now....

This did not work out. I was thinking about using gracefulDeletionPeriod as a hack, but that's not going to be a very reliable test, since gracefulDeletionPeriod doesn't necessarily force the pod to hang around for the specified period after it's been deleted, like I was hoping it would. Instead, it looks like as soon as the container in the pod stops, the GC comes and deletes it, which in our case is very quick; i.e. the registry server is shut down immediately, so the pod will be cleaned up immediately too.

> Would it be possible to mock those client responses? e.g. list says they are there, get says they aren't?

I went with the unit test route, which mocks these interactions. So the entire feature is tested by 3 components of tests:

  1. The CleanRegistryServer for both Grpc and ConfigMap is tested individually in the reconciler package.
    Note: it's clearer that CleanRegistryServer is getting called because of the presence of a "deleted" pod when the test is run in verbose mode:
go test ./pkg/controller/registry/reconciler -run TestGrpcRegistryCleaner -v
=== RUN   TestGrpcRegistryCleaner
=== RUN   TestGrpcRegistryCleaner/Grpc/ExistingRegistry/DeletedPod
time="2024-08-19T16:42:20-04:00" level=info msg="force deleting dead pod" pod.name= pod.namespace=testns
--- PASS: TestGrpcRegistryCleaner (0.10s)
    --- PASS: TestGrpcRegistryCleaner/Grpc/ExistingRegistry/DeletedPod (0.10s)
PASS
ok  	github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/reconciler	0.939s

The "force deleting dead pod" message is logged.

  2. A level above, in the operator package (which uses the reconciler implementations), a unit test is put in place to check that reconciler.CleanRegistryServer is called.

This is the absolute best we can do. e2e isn't possible without much more digging in, and frankly at this point not necessary at all.
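
For reference, one way to set up that kind of mock with client-go's fake clientset is sketched below. Names and assertions are illustrative assumptions, not the test code that was added in this PR:

```go
package reconciler_test

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// TestDeadPodCleanupSketch seeds the fake clientset with a pod that is already
// marked for deletion (mimicking "list says it's there, but it's really dead"),
// then force-deletes it and verifies the tracker no longer lists it.
func TestDeadPodCleanupSketch(t *testing.T) {
	now := metav1.Now()
	deadPod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:              "registry-pod",
		Namespace:         "testns",
		DeletionTimestamp: &now, // pretend the node went away mid-termination
	}}
	client := fake.NewSimpleClientset(deadPod)

	zero := int64(0)
	if err := client.CoreV1().Pods("testns").Delete(context.TODO(), "registry-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &zero}); err != nil {
		t.Fatalf("force delete failed: %v", err)
	}

	pods, err := client.CoreV1().Pods("testns").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		t.Fatal(err)
	}
	if len(pods.Items) != 0 {
		t.Fatalf("expected no pods after cleanup, got %d", len(pods.Items))
	}
}
```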

@anik120 force-pushed the fix-catalog-pods-node-failure branch from 26a24c7 to be108df on August 21, 2024 18:32
@anik120 (Contributor, Author) commented Aug 21, 2024:

Pivoted the PR to include the logic inside CheckRegistryServer, instead of adding a CleanRegistryServer component to the interface.

continue
}
healthy = false
forceDeletionErrs = append(forceDeletionErrs, pkgerrors.Errorf("found %s in a deleted but not removed state", pod.Name))
Member commented:

Do we need to return an error from this function when we find a wedged pod and successfully delete it? My initial thought is that we would only error from this function if:

  • we failed to determine healthy
  • we failed to delete wedged pods.

@anik120 (Contributor, Author) commented Aug 21, 2024:

I started out thinking the same, but then realized that that'd be an "artificial" error. If we include both of the scenarios you mentioned in the definition of "error", we're sort of sullying the definition, which is "something went wrong". In the case of "we failed to determine healthy", we are already sending that signal through the boolean variable anyway. So I decided not to include both, and only include the second scenario of "we failed to delete the wedged pods".
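
A sketch of that error policy under those assumptions (hypothetical helper names; the force-delete call is abstracted away): the presence of dead pods is reported via the healthy boolean, and only force-deletion failures are aggregated into the returned error.

```go
package reconciler

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// checkPodsSketch returns healthy=false (with a nil error) when dead pods are
// found and cleaned up successfully; an error is returned only if one or more
// force-deletions actually fail.
func checkPodsSketch(pods []*corev1.Pod, forceDelete func(*corev1.Pod) error) (healthy bool, err error) {
	healthy = true
	var deletionErrs []error
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			continue // alive pod
		}
		healthy = false // the "something is wrong" signal rides on the boolean
		if delErr := forceDelete(pod); delErr != nil {
			deletionErrs = append(deletionErrs, fmt.Errorf("force deleting %s: %w", pod.GetName(), delErr))
		}
	}
	// NewAggregate returns nil for an empty slice, so a clean sweep is not an error.
	return healthy, utilerrors.NewAggregate(deletionErrs)
}
```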

@anik120 force-pushed the fix-catalog-pods-node-failure branch from be108df to beeedd5 on August 21, 2024 19:53
logger.WithFields(logrus.Fields{"pod.namespace": sourceNamespace, "pod.name": pod.GetName()}).Debug("pod is alive")
continue
}
foundDeadPod = true
Member commented:

This is trickier than I originally thought. If we have:

  1. at least one alive pod
  2. the alive pods are otherwise deemed healthy
  3. we successfully delete the dead pods

What should we return from CheckRegistryServer? Seems like we could say healthy in that case?

@anik120 (Contributor, Author) commented:

Yea, this is a bit confusing. I'm thinking if we detect even one dead pod, we just force delete all the pods and let EnsureRegistryServer recreate everything. Otherwise we're in a very non-deterministic state...

Member commented:

That seems like it could have significant performance implications. There are a lot of disk/cpu/memory costs that are paid when catalog pods are started, so we should minimize that as much as possible.

I think it is okay to force delete the dead pods, and if there are any alive pods, we just move forward as if those were the only pods that were there to begin with.

@anik120 (Contributor, Author) commented:

That's fair too. I was a bit confused about there being multiple pods in the first place, since my impression is that we only create one registry pod per catalog, but it turns out we deal with a list of pods mainly because we use the lister to list via a label selector. In any case, this should return one pod, unless I'm unaware of some feature that allows us to spin up multiple registry pods per catalog.

Changed it back to just deleting dead pods.

Member commented:

There are definitely multiple pods when we're polling. In that case, we spin up a new pod to see what the digest is, and we compare the digests to see if we need to use the new pod (if it has a new digest) or keep using the old pod (when the digests match).

There may also be multiple pods when the pod spec changes (e.g. when the catalog source image is changed).

@anik120 (Contributor, Author) commented:

Right, both are transitory states though, and we're in that state because EnsureRegistryServer has already been called. But yes I think this is in a good state now.

@anik120 force-pushed the fix-catalog-pods-node-failure branch 3 times, most recently from a9b8cd5 to 20a1716, on August 22, 2024 13:34
@tmshort (Contributor) commented Aug 27, 2024:

@joelanford are you good with this?

service == nil || c.currentServiceAccount(source) == nil {
return false, nil
}

if deadPodsDetected, e := detectAndDeleteDeadPods(logger, c.OpClient, currentPods, source.GetNamespace()); deadPodsDetected {
Member commented:

I still think we should return true, nil from this function in the case where:

  • There is at least one healthy pod
  • There are multiple "dead" pods
  • We deleted all of the "dead" pods successfully.

That sort of behavior would mean that we would short circuit and avoid calling EnsureRegistryServer, which would:

@joelanford (Member) commented Aug 28, 2024:

Essentially, I think both of the following should result in identical behavior:

  1. There are 1 or more alive pods, and no dead pods
  2. There are 1 or more alive pods, and we deleted all of the dead pods (hence: there are no dead pods, so this is actually a variant of (1))

@anik120 (Contributor, Author) commented:

@joelanford thanks for the discussion yesterday. Thought about it a little more and we can do this. I've had to add an additional change though, highlighting it in the next comment...
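
Based on the behavior proposed in this thread, a hedged sketch of what CheckRegistryServer could return once dead-pod cleanup succeeds (illustrative only; the merged implementation and the "additional change" mentioned above live in the PR itself):

```go
package reconciler

import (
	corev1 "k8s.io/api/core/v1"
)

// checkWithDeadPodCleanupSketch: force-delete dead pods, but if at least one
// alive pod remains and every force-delete succeeded, still report healthy so
// EnsureRegistryServer (and a fresh, expensive catalog pod start) is skipped.
func checkWithDeadPodCleanupSketch(pods []*corev1.Pod, forceDelete func(*corev1.Pod) error) (healthy bool, err error) {
	alive := 0
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			alive++
			continue
		}
		if delErr := forceDelete(pod); delErr != nil {
			// Failing to clean up a wedged pod is both unhealthy and an error.
			return false, delErr
		}
	}
	// Healthy only if something is actually left serving after the cleanup.
	return alive > 0, nil
}
```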

@anik120 force-pushed the fix-catalog-pods-node-failure branch from 20a1716 to c8162e7 on August 29, 2024 14:03
@anik120 force-pushed the fix-catalog-pods-node-failure branch 2 times, most recently from 80cd5ae to 507e0de, on August 29, 2024 14:35
@anik120 force-pushed the fix-catalog-pods-node-failure branch from 507e0de to 406fede on August 29, 2024 15:14
@anik120 added this pull request to the merge queue Aug 30, 2024
Merged via the queue into operator-framework:master with commit f243189 Aug 30, 2024
12 checks passed