KUBESAW-187: Adjust ksctl adm restart command to use rollout-restart #79
base: master
Conversation
Could you please add a few print info lines for better UX?
Nice job 👍
I left a few minor comments. Also, I haven't gone through the test code yet since it still looks WIP.
nice, just some minor feedback
Looks good overall 👍
I have a few aesthetic comments regarding the code and a few questions/suggestions for the tests.
pkg/cmd/adm/restart.go (outdated):
```go
	return err
}

ctx.Printlnf("The deployment was scaled back to '%d'", originalReplicas)
if len(olmDeploymentList.Items) == 0 {
```
alternatively you could simply print both numbers of deployments found at the beginning:
```go
fmt.Printf("OLM based deployments found in %s namespace: %d\n", ns, len(olmDeploymentList.Items))
fmt.Printf("NON-OLM based deployments found in %s namespace: %d\n", ns, len(nonOlmDeploymentlist.Items))
```
and avoid the if statements at lines 93 and 104.
For a better user experience, should we let users know that there were 0 deployments and hence we did not go ahead with the restart, instead of going through the rest of the code? That was the reason I wrote the if-else.
I think locally it doesn't change much - either having an else and printing that there were no deployments, or printing the number of deployments found at the beginning and going through the rest of the code, which will just skip the for-loops since there will be no items to iterate on. But I'm fine with whatever looks more readable.
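For illustration, a minimal sketch of the second variant, reusing the names from the surrounding diff (so treat them as assumptions):
```go
// Print both counts once up front; when a list is empty, the later
// for-loops simply have nothing to iterate on, so no else branch is needed.
ctx.Printlnf("OLM based deployments found in %s namespace: %d", ns, len(olmDeploymentList.Items))
ctx.Printlnf("NON-OLM based deployments found in %s namespace: %d", ns, len(nonOlmDeploymentlist.Items))
```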
```go
fakeClient.MockUpdate = requireDeploymentBeingUpdated(t, fakeClient, namespacedName, 3, &numberOfUpdateCalls)
fakeClient.MockGet = func(ctx context.Context, key runtimeclient.ObjectKey, obj runtimeclient.Object, opts ...runtimeclient.GetOption) error {
	return fmt.Errorf("some error")
tests := map[string]struct {
```
If it's not too tricky, I think we could add some tests/scenarios also for (see the sketch below):
- there are some other noise pods in the namespace that should not be restarted
- the pod/deployment doesn't restart correctly (both OLM and non-OLM)
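A rough sketch of how those scenarios could slot into the table-driven tests - the struct fields and the `noisePod` helper here are hypothetical, not the actual fixture from this PR:
```go
tests := map[string]struct {
	initObjects []runtimeclient.Object // extra objects seeded into the fake client
	expectErr   bool
}{
	"noise pods in the namespace are not restarted": {
		initObjects: []runtimeclient.Object{noisePod("unrelated-app")}, // hypothetical helper
	},
	"olm deployment pods do not come back up": {
		expectErr: true,
	},
	"non-olm rollout restart fails": {
		expectErr: true,
	},
}
```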
pkg/cmd/adm/restart.go
Outdated
deployments := &appsv1.DeploymentList{} | ||
if err := cl.List(context.TODO(), deployments, runtimeclient.InNamespace(ns)); err != nil { | ||
return err | ||
func checkRolloutStatus(f cmdutil.Factory, ioStreams genericclioptions.IOStreams, labelSelector string) error { |
this is fine for checking the latest rollout status, but it doesn't ensure that we actually have new pods, right? Should we introduce a check and verify that there are newly created pods after the restart?
This may be useful with the OLM based deployments, since there we directly delete the pods, so we want to make sure new ones are created.
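One possible shape for such a check - a sketch only; `verifyNewPods` and its arguments are illustrative, assuming `corev1` is `k8s.io/api/core/v1` and `runtimeclient` is the controller-runtime client package:
```go
// verifyNewPods fails if any pod matching the selector still predates the
// restart, i.e. no fresh pods were created after the old ones were deleted.
func verifyNewPods(ctx context.Context, cl runtimeclient.Client, ns string,
	selector map[string]string, restartedAt time.Time) error {
	pods := &corev1.PodList{}
	if err := cl.List(ctx, pods, runtimeclient.InNamespace(ns),
		runtimeclient.MatchingLabels(selector)); err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.CreationTimestamp.Time.Before(restartedAt) {
			return fmt.Errorf("pod %s was not recreated after the restart", pod.Name)
		}
	}
	return nil
}
```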
We also need to update the permissions in the corresponding Roles:

ksctl/resources/roles/host.yaml (lines 7 to 22 in 1b90538):
```yaml
- kind: Role
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: restart-deployment
    labels:
      provider: ksctl
  rules:
    - apiGroups:
        - apps
      resources:
        - deployments
      verbs:
        - "get"
        - "list"
        - "patch"
        - "update"
```
ksctl/resources/roles/member.yaml (lines 7 to 22 in 1b90538):
```yaml
- kind: Role
  apiVersion: rbac.authorization.k8s.io/v1
  metadata:
    name: restart-deployment
    labels:
      provider: ksctl
  rules:
    - apiGroups:
        - apps
      resources:
        - deployments
      verbs:
        - "get"
        - "list"
        - "patch"
        - "update"
```
pkg/cmd/adm/restart.go (outdated):
```go
nonOlmDeployments := &appsv1.DeploymentList{}
if err := cl.List(ctx, nonOlmDeployments,
	runtimeclient.InNamespace(ns),
	runtimeclient.MatchingLabels{"provider": "codeready-toolchain"}); err != nil {
```
Looking at our code (and the deployments) in the member operator, we use the fully-qualified name of the provider label there:
`toolchain.dev.openshift.com/provider: codeready-toolchain`
But I see that there is a mismatch in the labels - reg-service uses the short version. Could you please fix it there?
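For reference, the listing here would then use the fully-qualified key (sketch):
```go
runtimeclient.MatchingLabels{"toolchain.dev.openshift.com/provider": "codeready-toolchain"}
```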
pkg/cmd/adm/restart.go (outdated):
```go
if err := cl.List(ctx, nonOlmDeployments,
	runtimeclient.InNamespace(ns),
	runtimeclient.MatchingLabels{"provider": "codeready-toolchain"}); err != nil {
```
Let's make sure that we don't restart the autoscaler deployment in the member cluster - that is not really needed.
pkg/cmd/adm/restart.go (outdated):
```go
	for _, olmDeployment := range olmDeploymentList.Items {
		ctx.Printlnf("Proceeding to delete the Pods of %v", olmDeployment)

		if err := deleteAndWaitForPods(ctx, cl, olmDeployment, f, ioStreams); err != nil {
			return err
		}
	}
}
if len(nonOlmDeploymentlist.Items) != 0 {
	for _, nonOlmDeployment := range nonOlmDeploymentlist.Items {

		ctx.Printlnf("Proceeding to restart the non-OLM deployment %v", nonOlmDeployment)

		if err := restartNonOlmDeployments(ctx, nonOlmDeployment, f, ioStreams); err != nil {
			return err
		}
		// check the rollout status
		ctx.Printlnf("Checking the status of the rolled out deployment %v", nonOlmDeployment)
		if err := checkRolloutStatus(ctx, f, ioStreams, "provider=codeready-toolchain"); err != nil {
			return err
```
Let's try to simplify this. You can group some logic together and make it more generic. As I mentioned in the other comment, when we are getting the list of deployments, we don't have to care if they are OLM or non-OLM based - we just:
- list them based on the labels
- check that the operator deployments are present (if not, then fail)
- filter out the autoscaler one

Then you can iterate over each of the deployments:
1. if it has an owner-reference, delete the pods
2. if not, use the rollout-restart logic and wait for the rollout status

With this approach the logic is simpler: only one for-loop, only one if-else statement, and the checkRolloutStatus function is called from only one place. A rough sketch follows.
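This sketch is illustrative only - `deletePods` and `rolloutRestart` stand in for the existing helpers in this PR, the `checkRolloutStatus` signature is simplified, and the autoscaler filter and owner-reference check are assumptions based on this thread:
```go
// assumes: appsv1 "k8s.io/api/apps/v1", runtimeclient "sigs.k8s.io/controller-runtime/pkg/client"
func restartDeployments(ctx context.Context, cl runtimeclient.Client, ns string) error {
	deployments := &appsv1.DeploymentList{}
	if err := cl.List(ctx, deployments, runtimeclient.InNamespace(ns)); err != nil {
		return err
	}
	if len(deployments.Items) == 0 {
		return fmt.Errorf("no operator deployments found in namespace %s", ns)
	}
	for _, d := range deployments.Items {
		if strings.Contains(d.Name, "autoscaler") {
			continue // the autoscaler doesn't need to be restarted
		}
		if len(d.OwnerReferences) > 0 {
			// OLM-managed: delete the pods and let OLM recreate them
			if err := deletePods(ctx, cl, d); err != nil {
				return err
			}
		} else {
			// non-OLM: rollout-restart the deployment
			if err := rolloutRestart(ctx, d); err != nil {
				return err
			}
		}
		// the single place where the rollout status is verified
		if err := checkRolloutStatus(ctx, d); err != nil {
			return err
		}
	}
	return nil
}
```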
pkg/cmd/adm/restart.go (outdated):
```go
ctx.Printlnf("Checking the status of the deleted pod's deployment %v", deployment)
// check the rollout status
if err := checkRolloutStatus(ctx, f, ioStreams, "kubesaw-control-plane=kubesaw-controller-manager"); err != nil {
```
the pods that the code is killing all belong to a single deployment, right? You don't need to call checkRolloutStatus for each one of them.
pkg/cmd/adm/unregister_member.go (outdated):
```diff
@@ -62,5 +62,5 @@ func UnregisterMemberCluster(ctx *clicontext.CommandContext, clusterName string)
 	}
 	ctx.Printlnf("\nThe deletion of the Toolchain member cluster from the Host cluster has been triggered")

-	return restartHostOperator(ctx, hostClusterClient, hostClusterConfig.OperatorNamespace)
+	return restart(ctx, clusterName)
```
the clusterName represents the member that is being unregistered - we want to restart the host only
```diff
-	deployment.Labels = map[string]string{"olm.owner.namespace": "toolchain-host-operator"}
+	deployment.Labels = map[string]string{"kubesaw-control-plane": "kubesaw-controller-manager"}

 	newClient, fakeClient := NewFakeClients(t, toolchainCluster, deployment)
 	numberOfUpdateCalls := 0
```
I'm surprised that this still works without any additional changes - most likely because you restart the member operator (which doesn't contain any deployments).
```go
type RolloutRestartRESTClient struct {
	*fake.RESTClient
}
```
Why do we need this wrapper?
```go
func TestRestartDeployment(t *testing.T) {
	// given
```
Correct me if I'm wrong, but I haven't found anywhere in the tests that the pods were really deleted and the deployment was modified. Could you add some interceptor for this?
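One option, sketched here under the assumption that the fake client exposes a MockDelete hook alongside the MockUpdate/MockGet hooks used earlier in this file:
```go
// Record pod deletions so the test can assert they really happened.
deletedPods := []string{}
fakeClient.MockDelete = func(ctx context.Context, obj runtimeclient.Object, opts ...runtimeclient.DeleteOption) error {
	if pod, ok := obj.(*corev1.Pod); ok {
		deletedPods = append(deletedPods, pod.Name)
	}
	return fakeClient.Client.Delete(ctx, obj, opts...) // delegate to the embedded client
}

// after running the restart:
// assert.NotEmpty(t, deletedPods, "expected the OLM pods to be deleted")
```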
```go
fw := watch.NewFake()
dep := &appsv1.Deployment{}
dep.Name = deployment1.Name
dep.Status = appsv1.DeploymentStatus{
	Replicas:            1,
	UpdatedReplicas:     1,
	ReadyReplicas:       1,
	AvailableReplicas:   1,
	UnavailableReplicas: 0,
	Conditions: []appsv1.DeploymentCondition{{
		Type: appsv1.DeploymentAvailable,
	}},
}
dep.Labels = make(map[string]string)
dep.Labels[tc.labelKey] = tc.labelValue
c, err := runtime.DefaultUnstructuredConverter.ToUnstructured(dep.DeepCopyObject())
if err != nil {
	t.Errorf("unexpected err %s", err)
}
u := &unstructured.Unstructured{}
u.SetUnstructuredContent(c)
go fw.Add(u)
return true, fw, nil
```
Let's record for which resources (and how many times) this was called
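For example (a sketch; it assumes the client-go fake clientset is what registers this watch reaction):
```go
// Count watch invocations per resource so the test can assert on them later.
watchCalls := map[string]int{}
client.PrependWatchReactor("deployments", func(action clienttesting.Action) (bool, watch.Interface, error) {
	watchCalls[action.GetResource().Resource]++
	fw := watch.NewFake()
	// ... populate fw with the deployment as above ...
	return true, fw, nil
})
// later: assert.Equal(t, 1, watchCalls["deployments"])
```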
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:
```
@@            Coverage Diff             @@
##           master      #79      +/-   ##
==========================================
- Coverage   69.66%   69.37%   -0.29%
==========================================
  Files          43       43
  Lines        2571     2596      +25
==========================================
+ Hits         1791     1801      +10
- Misses        589      596       +7
- Partials      191      199       +8
```
This is to adjust the restart command logic. It does the following:
- deletes the pods of the OLM based operator deployments and waits for the new deployment to come up
- uses the rollout-restart command for the non-OLM based deployments (registration-service, webhooks)