-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OSPRH-10386] Update logic for workflows #231
base: main
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lpiwowar The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b71be3f0076a40b68f502cf0246bdd43 ✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 28m 25s |
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5ca1eddc2fe14e05b305a4a9c30f1d7f ✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 47m 44s |
5c9e5fb
to
7f6cddb
Compare
Build failed (check pipeline). Post https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7b19afcc3de6489aa1848fd9c82f4347 ❌ openstack-k8s-operators-content-provider FAILURE in 6m 30s |
49f0827
to
d03ca9c
Compare
379e406
to
70f9c29
Compare
There was a race condition in the previous implementation of the workflows that was caused by the dependency of the workflow feature on a ConfigMap that served as counter. This commit simplifies the logic of the Reconcile loop in two ways: 1. Removes the dependency for an external workflow counter 2. Introduces NextAction function which decides what next action should be performed by the Reconcile loop based on the current state of the OpenShift cluster. The input of this function is the instance which is currently being processed and a workflowLength. Using these two arguments it then tells the Reconcile loop which actions it should perform: Wait, CreateFirstJob, CreateNextJob, EndTesting, Failure. This approach should simplify the reconcile logic and move the test-operator towards a unified Reconcile loop.
70f9c29
to
eef85ec
Compare
@lpiwowar: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
if lockReleased, err := r.ReleaseLock(ctx, instance); !lockReleased { | ||
return ctrl.Result{}, err | ||
workflowLength := 0 | ||
nextAction, nextWorkflowStep, err := r.NextAction(ctx, instance, workflowLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
workflowLenght will always be 0 here, is that expected?
In general really good patch IMHO. Some extra things:
I'd like to do a second round when my comments are addressed, but it's IMHO in good shape and near to being merged. Thanks! |
case EndTesting: | ||
if lockReleased, _ := r.ReleaseLock(ctx, instance); !lockReleased { | ||
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName)) | ||
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) Question: If we get this point it's because there was an error releasing the lock and/or there's no lock, right?
Are we loosing here some context by returning nil as error info?
runningJobName := r.GetJobName(instance, externalWorkflowCounter-1) | ||
err = r.Client.Get(ctx, client.ObjectKey{Namespace: instance.GetNamespace(), Name: runningJobName}, runningAnsibleJob) | ||
if err != nil && !k8s_errors.IsNotFound(err) { | ||
switch nextAction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: Good refactor and enhancement, much more clear this way!
Log.Info(InfoWaitingOnJob) | ||
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil | ||
|
||
case EndTesting: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(non-blocking) suggestion: I'd maintain the comment section about releasing the lock to let other instances spawn a job. I see the point of detilling that EndTesting might not be related to any error, and it's just the job was completed.
) | ||
|
||
const ( | ||
// RequeueAfterValue tells how much time should we wait before calling Reconcile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Praise: I like the new placement for the const.
lockAcquired, _ := r.AcquireLock(ctx, instance, helper, false) | ||
if !lockAcquired { | ||
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName)) | ||
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) suggestion: I'd maintain here the error information from Acquirelock.
|
||
// GetLastJob returns job associated with an instance which has the highest value | ||
// stored in the workflowStep label | ||
func (r *Reconciler) GetLastJob( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: this type of function easily gets a bit hard to review. I like this one.
@@ -276,7 +415,7 @@ func (r *Reconciler) EnsureLogsPVCExists( | |||
}, | |||
} | |||
|
|||
timeDuration, _ := time.ParseDuration("2m") | |||
timeDuration, err := time.ParseDuration("2m") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) question: I'm unsure about this new error. I'd say we would need to check that before pvc.NewPvc(testOperatorPvcDef, timeDuration)
It seems that we're not going to avoid any runtimeError by catching the error if we hold on after line ctrlResult, err := testOperatorPvc.CreateOrPatch(ctx, helper) to do something.
@@ -368,6 +508,10 @@ func (r *Reconciler) AcquireLock( | |||
return err == nil, err | |||
} | |||
|
|||
if cm.Data[testOperatorLockOnwerField] == instanceGUID { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: IIUC, we're adding a new behavior of returning True if there's a lock and the lock owner is the same as the one executing AcquireLock.
So basically we're adding idempotency to the method.
I guess that here was the error right?
"instanceName": instance.Name, | ||
"operator": "test-operator", | ||
workflowStepLabel: strconv.Itoa(nextWorkflowStep), | ||
instanceNameLabel: instance.Name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) question: Is not necessary "operator" here?
lockAcquired, _ := r.AcquireLock(ctx, instance, helper, instance.Spec.Parallel) | ||
if !lockAcquired { | ||
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName)) | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(blocking) question: Would be good to return RequeueAfterValue here? If not, could you share some info? Thanks!
There was a race condition in the previous implementation of the
workflows that was caused by the dependency of the workflow feature
on a ConfigMap that served as counter.
This commit simplifies the logic of the Reconcile loop in two ways:
Removes the dependency for an external workflow counter
Introduces NextAction function which decides what next action
should be performed by the Reconcile loop based on the current
state of the OpenShift cluster. The input of this function is the
instance which is currently being processed and a workflowLength.
Using these two arguments it then tells the Reconcile loop which
actions it should perform: Wait, CreateFirstJob, CreateNextJob,
EndTesting, Failure.
This approach should simplify the reconcile logic and move the
test-operator towards a unified Reconcile loop.