KUBESAW-187: Adjust ksctl adm restart command to use rollout-restart #79

Open · wants to merge 27 commits into base: master

Conversation

@fbm3307 (Contributor) commented Sep 12, 2024

This adjusts the restart command logic. It does the following:

  1. If the command is run for the host operator, it restarts the whole host operator: it deletes the OLM-based pods (host-operator pods), waits for the new deployment to come up, then uses the rollout-restart command for the non-OLM-based deployment (registration-service).
  2. If the command is run for the member operator, it restarts the whole member operator: it deletes the OLM-based pods (member-operator pods), waits for the new deployment to come up, then uses the rollout-restart command for the non-OLM-based deployments (webhooks).
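
A minimal sketch of that flow, assuming a controller-runtime client; the label selectors are the ones that appear in the review snippets below, and deleteAndWaitForPods / rolloutRestart are hypothetical stand-ins for the PR's actual helpers:

// Sketch only; assumed imports:
//   "context"
//   appsv1 "k8s.io/api/apps/v1"
//   runtimeclient "sigs.k8s.io/controller-runtime/pkg/client"
func restartOperator(ctx context.Context, cl runtimeclient.Client, ns string) error {
	// The OLM-managed operator deployment carries the kubesaw-control-plane label.
	olmDeployments := &appsv1.DeploymentList{}
	if err := cl.List(ctx, olmDeployments, runtimeclient.InNamespace(ns),
		runtimeclient.MatchingLabels{"kubesaw-control-plane": "kubesaw-controller-manager"}); err != nil {
		return err
	}
	for _, d := range olmDeployments.Items {
		// Deleting the pods is enough here: the deployment spec is owned by
		// OLM, so its ReplicaSet recreates the pods right away.
		if err := deleteAndWaitForPods(ctx, cl, d); err != nil { // hypothetical helper
			return err
		}
	}
	// Non-OLM deployments (registration-service on host, webhooks on member).
	nonOlmDeployments := &appsv1.DeploymentList{}
	if err := cl.List(ctx, nonOlmDeployments, runtimeclient.InNamespace(ns),
		runtimeclient.MatchingLabels{"provider": "codeready-toolchain"}); err != nil {
		return err
	}
	for _, d := range nonOlmDeployments.Items {
		// Equivalent of `kubectl rollout restart deployment/<name>`.
		if err := rolloutRestart(ctx, cl, d); err != nil { // hypothetical helper
			return err
		}
	}
	return nil
}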

@MatousJobanek (Contributor) left a comment:

Could you please add a few print info lines for better UX?

@mfrancisc (Contributor) left a comment:

Nice job 👍

I left a few minor comments. Also, I haven't gone through the test code yet since it still looks WIP.

@filariow (Member) left a comment:

Nice, just some minor feedback.

@mfrancisc (Contributor) left a comment:

Looks good overall 👍

I have a few aesthetic comments regarding the code and a few questions/suggestions for the tests.

return err
}

ctx.Printlnf("The deployment was scaled back to '%d'", originalReplicas)
if len(olmDeploymentList.Items) == 0 {
Contributor:

Alternatively, you could simply print the number of deployments of each kind found at the beginning:

fmt.Printf("OLM based deployments found in %s namespace: %d\n", ns, len(olmDeploymentList.Items))
fmt.Printf("NON-OLM based deployments found in %s namespace: %d\n", ns, len(nonOlmDeploymentlist.Items))

and avoid the if statements at lines 93 and 104.

@fbm3307 (Contributor, Author) commented Sep 26, 2024:

For a better user experience, should we let users know that there were 0 deployments and hence we did not go ahead with the restart, rather than going through the rest of the code? That was the reason I wrote the if/else.

Contributor:

I think locally it doesn't change much: either have an else branch that prints that there were no deployments, or print the number of deployments found at the beginning and go through the rest of the code, which will just skip the for loops since there are no items to iterate on. But I'm fine with whatever looks more readable.

fakeClient.MockUpdate = requireDeploymentBeingUpdated(t, fakeClient, namespacedName, 3, &numberOfUpdateCalls)
fakeClient.MockGet = func(ctx context.Context, key runtimeclient.ObjectKey, obj runtimeclient.Object, opts ...runtimeclient.GetOption) error {
return fmt.Errorf("some error")
tests := map[string]struct {
Contributor:

If it's not too tricky, I think we could add some tests/scenarios also for the following (a sketch of the first one is shown after this list):

  • there are some other noise pods in the namespace that should not be restarted
  • the pod/deployment doesn't restart correctly (both OLM and non-OLM)
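
If it helps, a hedged sketch of the first scenario, reusing the NewFakeClients helper visible in the diff below; the pod name, labels, and the elided command wiring are illustrative assumptions:

// Assumed imports: corev1 "k8s.io/api/core/v1",
// metav1 "k8s.io/apimachinery/pkg/apis/meta/v1",
// "k8s.io/apimachinery/pkg/types", "github.com/stretchr/testify/require".

// given: a pod that carries none of the operator labels
noisePod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
	Name:      "some-other-pod", // hypothetical name
	Namespace: "toolchain-host-operator",
	Labels:    map[string]string{"app": "unrelated"},
}}
newClient, fakeClient := NewFakeClients(t, deployment, noisePod)

// when: the restart command is run against newClient (wiring elided)

// then: the noise pod must still exist, untouched
pod := &corev1.Pod{}
err := fakeClient.Get(context.TODO(),
	types.NamespacedName{Namespace: "toolchain-host-operator", Name: "some-other-pod"}, pod)
require.NoError(t, err)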

deployments := &appsv1.DeploymentList{}
if err := cl.List(context.TODO(), deployments, runtimeclient.InNamespace(ns)); err != nil {
return err
func checkRolloutStatus(f cmdutil.Factory, ioStreams genericclioptions.IOStreams, labelSelector string) error {
Contributor:

This is fine for checking the latest rollout status, but it doesn't ensure that we actually have new pods, right? Should we introduce a check and verify that there are newly created pods after the restart?

Contributor:

This may be useful with the OLM-based deployments, since there we directly delete the pods and thus want to make sure new ones are created.
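
For instance (a sketch using the apimachinery wait helpers; oldPods, ns, and the label selector are assumptions taken from the surrounding snippets):

// Record the UIDs of the pods that are about to be deleted, then poll until
// every pod matching the selector carries a UID outside that set.
// Assumed imports: "time", "k8s.io/apimachinery/pkg/types",
// "k8s.io/apimachinery/pkg/util/wait", corev1 "k8s.io/api/core/v1".
before := map[types.UID]bool{}
for _, p := range oldPods.Items {
	before[p.UID] = true
}
err := wait.PollUntilContextTimeout(ctx, time.Second, 2*time.Minute, true,
	func(ctx context.Context) (bool, error) {
		pods := &corev1.PodList{}
		if err := cl.List(ctx, pods, runtimeclient.InNamespace(ns),
			runtimeclient.MatchingLabels{"kubesaw-control-plane": "kubesaw-controller-manager"}); err != nil {
			return false, err
		}
		if len(pods.Items) == 0 {
			return false, nil // replacement pods not created yet
		}
		for _, p := range pods.Items {
			if before[p.UID] {
				return false, nil // an old pod is still around
			}
		}
		return true, nil // only newly created pods remain
	})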

@MatousJobanek (Contributor) left a comment:

We also need to update the permissions in the corresponding Roles:

- kind: Role
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: restart-deployment
    labels:
      provider: ksctl
  rules:
    - apiGroups:
        - apps
      resources:
        - deployments
      verbs:
        - "get"
        - "list"
        - "patch"
        - "update"

- kind: Role
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: restart-deployment
    labels:
      provider: ksctl
  rules:
    - apiGroups:
        - apps
      resources:
        - deployments
      verbs:
        - "get"
        - "list"
        - "patch"
        - "update"

nonOlmDeployments := &appsv1.DeploymentList{}
if err := cl.List(ctx, nonOlmDeployments,
runtimeclient.InNamespace(ns),
runtimeclient.MatchingLabels{"provider": "codeready-toolchain"}); err != nil {
Contributor:

Looking at our code (and the deployments) in the member operator, we use the fully-qualified name of the provider label there:

toolchain.dev.openshift.com/provider: codeready-toolchain

But I see that there is a mismatch in the labels: reg-service uses the short version. Could you please fix it there?

Comment on lines 193 to 195
if err := cl.List(ctx, nonOlmDeployments,
runtimeclient.InNamespace(ns),
runtimeclient.MatchingLabels{"provider": "codeready-toolchain"}); err != nil {
Contributor:

Let's make sure that we don't restart the autoscaler deployment in the member cluster - this is not really needed.

Comment on lines 96 to 115
	for _, olmDeployment := range olmDeploymentList.Items {
		ctx.Printlnf("Proceeding to delete the Pods of %v", olmDeployment)

		if err := deleteAndWaitForPods(ctx, cl, olmDeployment, f, ioStreams); err != nil {
			return err
		}
	}
}
if len(nonOlmDeploymentlist.Items) != 0 {
	for _, nonOlmDeployment := range nonOlmDeploymentlist.Items {

		ctx.Printlnf("Proceeding to restart the non-OLM deployment %v", nonOlmDeployment)

		if err := restartNonOlmDeployments(ctx, nonOlmDeployment, f, ioStreams); err != nil {
			return err
		}
		// check the rollout status
		ctx.Printlnf("Checking the status of the rolled out deployment %v", nonOlmDeployment)
		if err := checkRolloutStatus(ctx, f, ioStreams, "provider=codeready-toolchain"); err != nil {
			return err
Contributor:

Let's try to simplify this. You can group some logic together and make it more generic. As I mentioned in the other comment, when we are getting the list of deployments, we don't have to care whether they are OLM or non-OLM based - we just:

  • list them based on the labels
  • check that the operator deployments are there (if not, then fail)
  • filter out the autoscaler one

then you can iterate over each of the deployments:

  1. check if the deployment has some owner-reference
    i. if it has, then delete the pods
    ii. if not, then use the rollout-restart logic
  2. wait for the rollout status

With this approach the logic gets simpler: only one for-loop, only one if-else statement, and the checkRolloutStatus function is called from only one place.
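
A minimal sketch of that shape, assuming a controller-runtime client; deletePods and labelSelectorOf are hypothetical helpers, and the restartedAt annotation patch is the same mechanism `kubectl rollout restart` uses under the hood:

// Assumed imports: "fmt", "time", "k8s.io/apimachinery/pkg/types",
// runtimeclient "sigs.k8s.io/controller-runtime/pkg/client".
for _, deployment := range deployments.Items {
	d := deployment // avoid capturing the loop variable
	if len(d.OwnerReferences) > 0 {
		// OLM-owned: delete the pods and let the ReplicaSet recreate them.
		if err := deletePods(ctx, cl, &d); err != nil { // hypothetical helper
			return err
		}
	} else {
		// Non-OLM: patch the pod-template annotation, which is what
		// `kubectl rollout restart` does.
		patch := fmt.Sprintf(
			`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
			time.Now().Format(time.RFC3339))
		if err := cl.Patch(ctx, &d,
			runtimeclient.RawPatch(types.StrategicMergePatchType, []byte(patch))); err != nil {
			return err
		}
	}
	// Either way, wait for the rollout to finish before moving on.
	if err := checkRolloutStatus(f, ioStreams, labelSelectorOf(&d)); err != nil { // selector helper assumed
		return err
	}
}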

Comment on lines 142 to 144
ctx.Printlnf("Checking the status of the deleted pod's deployment %v", deployment)
//check the rollout status
if err := checkRolloutStatus(ctx, f, ioStreams, "kubesaw-control-plane=kubesaw-controller-manager"); err != nil {
Contributor:

The pods that the code is killing all belong to a single deployment, right? You don't need to call checkRolloutStatus for each one of them.

@@ -62,5 +62,5 @@ func UnregisterMemberCluster(ctx *clicontext.CommandContext, clusterName string)
 	}
 	ctx.Printlnf("\nThe deletion of the Toolchain member cluster from the Host cluster has been triggered")

-	return restartHostOperator(ctx, hostClusterClient, hostClusterConfig.OperatorNamespace)
+	return restart(ctx, clusterName)
Contributor:

The clusterName represents the member that is being unregistered; we want to restart the host only.

Comment on lines -18 to 21
-	deployment.Labels = map[string]string{"olm.owner.namespace": "toolchain-host-operator"}
+	deployment.Labels = map[string]string{"kubesaw-control-plane": "kubesaw-controller-manager"}

 	newClient, fakeClient := NewFakeClients(t, toolchainCluster, deployment)
 	numberOfUpdateCalls := 0
Contributor:

I'm surprised that this still works without any additional changes - most likely because you restart the member operator (which doesn't contain any deployments).

Comment on lines +200 to 202
type RolloutRestartRESTClient struct {
	*fake.RESTClient
}

Contributor:

Why do we need this wrapper?

Comment on lines 29 to 30
func TestRestartDeployment(t *testing.T) {
// given
Contributor:

Correct me if I'm wrong, but I haven't found anywhere in the tests a verification that the pods were really deleted and that the deployment was modified. Could you add some interceptor for this?
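
For example (a hedged sketch, assuming the fake client exposes a MockDelete hook analogous to the MockGet/MockUpdate hooks used earlier in these tests):

// Record every deletion so the test can assert that the operator pods were removed.
var deletedPods []string
fakeClient.MockDelete = func(ctx context.Context, obj runtimeclient.Object, opts ...runtimeclient.DeleteOption) error {
	deletedPods = append(deletedPods, obj.GetName())
	return fakeClient.Client.Delete(ctx, obj, opts...) // delegate to the embedded client (assumed)
}

// ...run the restart command, then:
require.NotEmpty(t, deletedPods, "expected the operator pods to be deleted")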

Comment on lines +89 to +111
fw := watch.NewFake()
dep := &appsv1.Deployment{}
dep.Name = deployment1.Name
dep.Status = appsv1.DeploymentStatus{
	Replicas:            1,
	UpdatedReplicas:     1,
	ReadyReplicas:       1,
	AvailableReplicas:   1,
	UnavailableReplicas: 0,
	Conditions: []appsv1.DeploymentCondition{{
		Type: appsv1.DeploymentAvailable,
	}},
}
dep.Labels = make(map[string]string)
dep.Labels[tc.labelKey] = tc.labelValue
c, err := runtime.DefaultUnstructuredConverter.ToUnstructured(dep.DeepCopyObject())
if err != nil {
	t.Errorf("unexpected err %s", err)
}
u := &unstructured.Unstructured{}
u.SetUnstructuredContent(c)
go fw.Add(u)
return true, fw, nil
Contributor:

Let's record for which resources (and how many times) this was called.
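
Something like this could work (a sketch that assumes the fake event above is emitted from a watch reactor registered on a client-go fake clientset; fakeClientset, u, and the assertion values are illustrative):

// Count reactor invocations per resource so the test can assert on them later.
// Assumed import: clienttesting "k8s.io/client-go/testing".
watchCalls := map[string]int{}
fakeClientset.PrependWatchReactor("deployments",
	func(action clienttesting.Action) (bool, watch.Interface, error) {
		watchCalls[action.GetResource().Resource]++ // e.g. "deployments"
		fw := watch.NewFake()
		go fw.Add(u) // the same fake deployment event as in the snippet above
		return true, fw, nil
	})

// later, in the assertions:
require.Equal(t, 1, watchCalls["deployments"],
	"the rollout-status check should watch deployments exactly once")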

codecov bot commented Oct 1, 2024

Codecov Report

Attention: Patch coverage is 62.62626% with 37 lines in your changes missing coverage. Please review.

Project coverage is 69.37%. Comparing base (bd2bf12) to head (f5c19de).

Files with missing lines Patch % Lines
pkg/cmd/adm/restart.go 62.24% 23 Missing and 14 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #79      +/-   ##
==========================================
- Coverage   69.66%   69.37%   -0.29%     
==========================================
  Files          43       43              
  Lines        2571     2596      +25     
==========================================
+ Hits         1791     1801      +10     
- Misses        589      596       +7     
- Partials      191      199       +8     
Files with missing lines Coverage Δ
pkg/cmd/adm/unregister_member.go 51.42% <100.00%> (ø)
pkg/cmd/adm/restart.go 55.17% <62.24%> (-4.17%) ⬇️
