Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OSPRH-10386] Update logic for workflows #231

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

lpiwowar
Copy link
Collaborator

@lpiwowar lpiwowar commented Oct 18, 2024

There was a race condition in the previous implementation of the
workflows that was caused by the dependency of the workflow feature
on a ConfigMap that served as counter.

This commit simplifies the logic of the Reconcile loop in two ways:

  1. Removes the dependency for an external workflow counter

  2. Introduces NextAction function which decides what next action
    should be performed by the Reconcile loop based on the current
    state of the OpenShift cluster. The input of this function is the
    instance which is currently being processed and a workflowLength.
    Using these two arguments it then tells the Reconcile loop which
    actions it should perform: Wait, CreateFirstJob, CreateNextJob,
    EndTesting, Failure.

This approach should simplify the reconcile logic and move the
test-operator towards a unified Reconcile loop.

Copy link

openshift-ci bot commented Oct 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lpiwowar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/b71be3f0076a40b68f502cf0246bdd43

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 28m 25s
podified-multinode-edpm-deployment-crc-test-operator TIMED_OUT in 3h 11m 49s

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5ca1eddc2fe14e05b305a4a9c30f1d7f

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 47m 44s
podified-multinode-edpm-deployment-crc-test-operator TIMED_OUT in 3h 13m 05s

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/7b19afcc3de6489aa1848fd9c82f4347

openstack-k8s-operators-content-provider FAILURE in 6m 30s
⚠️ podified-multinode-edpm-deployment-crc-test-operator SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@lpiwowar lpiwowar changed the title Bugfix - incorrect workflow order Update logic for workflows Nov 1, 2024
@lpiwowar lpiwowar changed the title Update logic for workflows [OSPRH-10386] Update logic for workflows Nov 1, 2024
@lpiwowar lpiwowar force-pushed the bugfix/OSPRH-10386 branch 2 times, most recently from 379e406 to 70f9c29 Compare November 1, 2024 15:03
There was a race condition in the previous implementation of the
workflows that was caused by the dependency of the workflow feature
on a ConfigMap that served as counter.

This commit simplifies the logic of the Reconcile loop in two ways:

  1. Removes the dependency for an external workflow counter

  2. Introduces NextAction function which decides what next action
     should be performed by the Reconcile loop based on the current
     state of the OpenShift cluster. The input of this function is the
     instance which is currently being processed and a workflowLength.
     Using these two arguments it then tells the Reconcile loop which
     actions it should perform: Wait, CreateFirstJob, CreateNextJob,
     EndTesting, Failure.

This approach should simplify the reconcile logic and move the
test-operator towards a unified Reconcile loop.
@lpiwowar lpiwowar marked this pull request as ready for review November 13, 2024 12:18
@lpiwowar lpiwowar requested review from evallesp and removed request for kopecmartin and olliewalsh November 13, 2024 12:20
Copy link

openshift-ci bot commented Nov 13, 2024

@lpiwowar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/precommit-check eef85ec link true /test precommit-check

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

if lockReleased, err := r.ReleaseLock(ctx, instance); !lockReleased {
return ctrl.Result{}, err
workflowLength := 0
nextAction, nextWorkflowStep, err := r.NextAction(ctx, instance, workflowLength)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

workflowLenght will always be 0 here, is that expected?

@evallesp
Copy link

In general really good patch IMHO.
It's good to have a single pattern across the different controllers, and I see the point (Apart from fixing a race condition) as a good work of refactoring.

Some extra things:

  • Comments in the switch clauses at ansibletest_controller are also applicable to other controllers.
  • It seems there's room for some refactoring, as the code between controllers' logic of switch clauses is quite similar.

I'd like to do a second round when my comments are addressed, but it's IMHO in good shape and near to being merged. Thanks!

case EndTesting:
if lockReleased, _ := r.ReleaseLock(ctx, instance); !lockReleased {
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName))
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) Question: If we get this point it's because there was an error releasing the lock and/or there's no lock, right?
Are we loosing here some context by returning nil as error info?

runningJobName := r.GetJobName(instance, externalWorkflowCounter-1)
err = r.Client.Get(ctx, client.ObjectKey{Namespace: instance.GetNamespace(), Name: runningJobName}, runningAnsibleJob)
if err != nil && !k8s_errors.IsNotFound(err) {
switch nextAction {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Good refactor and enhancement, much more clear this way!

Log.Info(InfoWaitingOnJob)
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil

case EndTesting:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking) suggestion: I'd maintain the comment section about releasing the lock to let other instances spawn a job. I see the point of detilling that EndTesting might not be related to any error, and it's just the job was completed.

)

const (
// RequeueAfterValue tells how much time should we wait before calling Reconcile

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: I like the new placement for the const.

lockAcquired, _ := r.AcquireLock(ctx, instance, helper, false)
if !lockAcquired {
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName))
return ctrl.Result{RequeueAfter: RequeueAfterValue}, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) suggestion: I'd maintain here the error information from Acquirelock.


// GetLastJob returns job associated with an instance which has the highest value
// stored in the workflowStep label
func (r *Reconciler) GetLastJob(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: this type of function easily gets a bit hard to review. I like this one.

@@ -276,7 +415,7 @@ func (r *Reconciler) EnsureLogsPVCExists(
},
}

timeDuration, _ := time.ParseDuration("2m")
timeDuration, err := time.ParseDuration("2m")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) question: I'm unsure about this new error. I'd say we would need to check that before pvc.NewPvc(testOperatorPvcDef, timeDuration)

It seems that we're not going to avoid any runtimeError by catching the error if we hold on after line ctrlResult, err := testOperatorPvc.CreateOrPatch(ctx, helper) to do something.

@@ -368,6 +508,10 @@ func (r *Reconciler) AcquireLock(
return err == nil, err
}

if cm.Data[testOperatorLockOnwerField] == instanceGUID {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: IIUC, we're adding a new behavior of returning True if there's a lock and the lock owner is the same as the one executing AcquireLock.
So basically we're adding idempotency to the method.

I guess that here was the error right?

"instanceName": instance.Name,
"operator": "test-operator",
workflowStepLabel: strconv.Itoa(nextWorkflowStep),
instanceNameLabel: instance.Name,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) question: Is not necessary "operator" here?

lockAcquired, _ := r.AcquireLock(ctx, instance, helper, instance.Spec.Parallel)
if !lockAcquired {
Log.Info(fmt.Sprintf(InfoCanNotAcquireLock, testOperatorLockName))
return ctrl.Result{}, nil

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(blocking) question: Would be good to return RequeueAfterValue here? If not, could you share some info? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants