fix: fix occasional e2e failure #1614

Open
wants to merge 1 commit into
base: main

jwcesign
Contributor

Fixes #N/A

Description

When I run make e2etests, one failure is: https://github.com/kubernetes-sigs/karpenter/actions/runs/10590933040/job/29347492851

The context is as follows:
nodeclaim:
image

corresponding node:
image

So this PR tries to re-enqueue the item.

How was this change tested?

Before this change, the failure below showed up within about 8 runs. With this change, I ran the suite 20+ times and the failure no longer appeared:

 [FAIL] Performance Provisioning [It] should do complex provisioning and complex drift
  /home/runner/work/karpenter/karpenter/test/suites/perf/scheduling_test.go:134

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 30, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jwcesign
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @jwcesign. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 30, 2024
@jwcesign
Contributor Author

By the way, should we re-enqueue the item if an unexpected error happens, like a broken Internet connection?

For example, the following code:

return reconcile.Result{}, fmt.Errorf("getting node for nodeclaim, %w", err)

I would prefer to re-enqueue them to make the logic more robust.

@coveralls

Pull Request Test Coverage Report for Build 10630261456

Details

  • 1 of 1 (100.0%) changed or added relevant line in 1 file are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 80.753%

Files with Coverage Reduction New Missed Lines %
pkg/controllers/disruption/consolidation.go 4 87.25%
Totals Coverage Status
Change from base Build 10622339666: 0.0%
Covered Lines: 8391
Relevant Lines: 10391

💛 - Coveralls

@@ -50,7 +50,7 @@ func (r *Registration) Reconcile(ctx context.Context, nodeClaim *v1.NodeClaim) (
 	if err != nil {
 		if nodeclaimutil.IsNodeNotFoundError(err) {
 			nodeClaim.StatusConditions().SetUnknownWithReason(v1.ConditionTypeRegistered, "NodeNotFound", "Node not registered with cluster")
-			return reconcile.Result{}, nil
+			return reconcile.Result{Requeue: true}, nil
Contributor

This should be caught with our re-queue on nodes here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/nodeclaim/lifecycle/controller.go#L136-L138

If we don't have a node when we first look at this, the node should be created later, and its creation should re-trigger this reconciliation.
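The mechanism described here, where a node event causes the matching nodeclaim to be reconciled again, can be sketched roughly as follows. This is only an illustration under assumptions: the `Node` and `NodeClaim` types are trimmed-down stand-ins, `requestsForNode` is a hypothetical helper, and matching on provider ID is assumed for the example rather than taken from the linked code.

```go
package main

import "fmt"

// NodeClaim is a hypothetical, trimmed-down stand-in for the real API type.
type NodeClaim struct {
	Name       string
	ProviderID string
}

// Node is a hypothetical, trimmed-down stand-in for a Kubernetes node.
type Node struct {
	Name       string
	ProviderID string
}

// requestsForNode maps a node event to the names of the NodeClaims that
// should be reconciled again. Matching on provider ID is an assumption
// made for this sketch.
func requestsForNode(node Node, claims []NodeClaim) []string {
	var names []string
	for _, c := range claims {
		if c.ProviderID != "" && c.ProviderID == node.ProviderID {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	claims := []NodeClaim{
		{Name: "claim-a", ProviderID: "provider://a"},
		{Name: "claim-b", ProviderID: "provider://b"},
	}

	// When the node for claim-a appears, only claim-a is enqueued again.
	created := Node{Name: "node-a", ProviderID: "provider://a"}
	for _, name := range requestsForNode(created, claims) {
		fmt.Println("enqueue reconcile for nodeclaim:", name)
	}
}
```

If the watch works along these lines, a nodeclaim that found no node on its first pass is picked up again once the node is created, without an explicit `Requeue: true`.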

Contributor Author

It makes sense. Let me check on that.

4 participants