[helm] Job tolerations are ignored #45903

Open
joeybenamy opened this issue Sep 25, 2024 · 17 comments
Labels
area/platform (issues related to the platform), community, team/deployments, type/bug (Something isn't working)

Comments

@joeybenamy

joeybenamy commented Sep 25, 2024

Helm Chart Version

1.0.0

At what step did the error happen?

On deploy

Relevant information

On prior versions of the Helm Chart, tolerations set in Helm values are properly propagated to the job pods. In the new version, the tolerations in Helm values are not added to the job pods. As a result, our jobs cannot be scheduled.

In Helm values:

global:
  jobs:
    kube:
      tolerations:
      - key: "usage"
        operator: "Equal"
        value: "airbyte"
        effect: "NoExecute"
      nodeSelector:
        usage: airbyte

From the job pods:

  nodeSelector:
    usage: airbyte
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
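
For comparison, here is a sketch of the tolerations one would expect on the job pod if the Helm values above had been propagated. It is illustrative rather than captured from a cluster, and it assumes the configured toleration simply appears alongside the two defaults that Kubernetes adds on its own; the ordering is arbitrary.

```yaml
# Expected job pod spec fragment (illustrative, not real cluster output)
nodeSelector:
  usage: airbyte
tolerations:
# From global.jobs.kube.tolerations; this entry is missing in the actual pod
- key: usage
  operator: Equal
  value: airbyte
  effect: NoExecute
# Defaults added by the Kubernetes DefaultTolerationSeconds admission plugin
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```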

Relevant log output

Pod airbyte/source-postgres-check-16279-1-tefxt can't be scheduled on eks-airbyte-uat-20240307203630429500000001-d4c70d50-dd55-3f1f-3a66-11baf39a636f, predicate checking error: node(s) had untolerated taint {usage: airbyte}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {usage: airbyte}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"usage", Value:"airbyte", Effect:"NoExecute", TimeAdded:}}
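
The taint listed in the debugInfo corresponds to a node spec along these lines. This is a sketch reconstructed from the scheduler message, not copied from the node object itself.

```yaml
# Node spec fragment reconstructed from the log line above
spec:
  taints:
  - key: usage
    value: airbyte
    effect: NoExecute
```

A pod can only be scheduled on such a node if it carries a toleration matching this key, value, and effect, or a broader one using operator: Exists.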

@joeybenamy added the area/platform, needs-triage, and type/bug labels on Sep 25, 2024
@marcosmarxm changed the title from "Job tolerations are ignored" to "[helm] Job tolerations are ignored" on Sep 25, 2024
@marcosmarxm
Member

@joeybenamy what was the previous version you were using?

@joeybenamy
Author

joeybenamy commented Sep 25, 2024

> @joeybenamy what was the previous version you were using?

0.344.2

@marcosmarxm
Member

@airbytehq/platform-deployments fyi

@abuchanan-airbyte
Contributor

abuchanan-airbyte commented Sep 27, 2024

This may have been fixed by airbytehq/airbyte-platform@57319f7 (oops, wrong link) airbytehq/airbyte-platform@2ed01e5

@alexremn

Seems like a duplicate of #28389.
@abuchanan-airbyte thank you, awaiting the release!

@joeybenamy
Author

> Seems like a duplicate of #28389. @abuchanan-airbyte thank you, awaiting the release!

Likewise. Thank you!

@joeybenamy
Author

joeybenamy commented Oct 3, 2024

Testing with Helm Chart 1.1.0 and Airbyte platform 1.1.0. Tolerations are still not present on job pods.

@marcosmarxm
Member

@abuchanan-airbyte and @tryangul fyi

@joeybenamy
Author

> @abuchanan-airbyte and @tryangul fyi

Any update on this? Is this a Helm Chart issue or an Airbyte platform issue?

@marcosmarxm
Member

This is a work in progress, @joeybenamy. We hope to have an update by the end of the week.

@talha-naeem1

I am also facing an issue with this. Can someone please confirm if it's fixed now?

@AcidFlow

A fix has been merged to the default branch as far as I've seen, however this is not available yet as part of a release.

We internally built an image of workload-launcher from v1.1.0 with the fix cherry-picked.
I can see the tolerations being propagated to the pod when using our custom image.

See: #28389 (comment)

@abuchanan-airbyte
Contributor

abuchanan-airbyte commented Oct 31, 2024

You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)

@remisalmon
Contributor

> You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)

Has anyone tried this release version with global.jobs.kube.tolerations set on a cluster where all nodes are tainted? I tried on both AWS EKS and a local kind cluster and cannot get a "rce-postgres-check-" (new source) job pod scheduled on either.

@joeybenamy
Author

Just got an update from Airbyte:

> We're working on setting up the 1.2.0 release candidate today. Not sure what the official release date is, but it will be soon. In the meantime, nightly releases are available.

@remisalmon
Contributor

> You might try the latest nightly release version 1.1.0-dev-nightly-1730243169-7e1b11aeac (that's a helm chart version)
>
> Has anyone tried this release version with global.jobs.kube.tolerations set on a cluster where all nodes are tainted? I tried on both AWS EKS and a local kind cluster and cannot get a "rce-postgres-check-" (new source) job pod scheduled on either.

I figured out my issue with those job tolerations: the Helm chart values expect the operator to be set explicitly. That should not be necessary, because per https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ "The default value for operator is Equal."

This is coming from airbytehq/airbyte-platform@2ca3c41#diff-3555dc77946bb010495d4a97b3060553f759452f781e0f5b54b3c8a37394c3b0R227 that was linked in #28389.
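
A minimal sketch of the workaround described above. The key, value, and effect are placeholders reused from the original report in this issue, not the taint from the clusters mentioned in this comment.

```yaml
global:
  jobs:
    kube:
      tolerations:
      # Workaround: set operator explicitly. Plain Kubernetes would default
      # a missing operator to Equal, but the chart values appear to need it.
      - key: "usage"          # placeholder taint key
        operator: "Equal"
        value: "airbyte"      # placeholder taint value
        effect: "NoExecute"
```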

@joeybenamy
Author

joeybenamy commented Nov 7, 2024

In Airbyte 1.2.0 and Helm chart 1.2.0, this issue appears to be fixed, but now using S3 for logs and state seems to be broken: #48407

So I'm still stuck.

8 participants