
Karpenter v1.0.0 chart does not respect Helm values in packaged CRDs leading to ArgoCD upgrade difficulty #6847

Open
booleanbetrayal opened this issue Aug 22, 2024 · 87 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

Comments

@booleanbetrayal

Description

Observed Behavior:

The setting of any webhook values (such as webhook.serviceName) in the karpenter primary chart is not respected. Instead, the chart embeds static CRDs, which are symlinked in this repository, unlike the karpenter-crd chart, which respects these values.

This makes upgrading from v0.37.0 to v1.0.0 with platforms like ArgoCD difficult and error-prone, since ArgoCD manages CRDs and will detect drift if they are updated manually or via a secondary chart such as karpenter-crd.

In our case, the conversion.webhook.clientConfig.service.name value for the conversion webhook (specified in the static CRDs) deployed with the karpenter chart did not match the name the chart generates for the Service itself (using the karpenter.fullName helper), so all webhook calls failed. Likewise, the webhook cannot be disabled via values, and manual edits show up as diff changes in ArgoCD.

Using both charts in ArgoCD results in attempts to create duplicate resources and could cause unexpected issues. There is no way to disable the deployment of CRDs in the karpenter primary chart.

A possible workaround for our specific issue would be to set the nameOverride value to karpenter such that it matches the static serviceName buried in the CRD, but ideally we don't even need these conversion webhooks and would prefer to have them disabled entirely.
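
For illustration, a values sketch of that workaround (assuming, as described above, that the packaged CRDs hardcode conversion.webhook.clientConfig.service.name to karpenter; whether nameOverride or fullnameOverride is the right knob depends on how the chart's fullname helper composes the Service name):

# Sketch only: force the chart-generated Service name to match the static
# CRDs' hardcoded webhook service name (assumed here to be "karpenter").
fullnameOverride: karpenter
# or, depending on the fullname helper and release name:
# nameOverride: karpenter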

Expected Behavior:

The karpenter Helm package respects Helm values in its CRDs, just as the karpenter-crd chart does. If static CRDs are required, they can be referenced elsewhere (for example, directly from the GitHub repository).

Alternatively (and less elegantly), a dual-chart solution with a flag that disables CRDs in the primary chart (or simply not packaging them to begin with) could also be a fix, as it would allow the other chart to install without duplicates.

Reproduction Steps (Please include YAML):

Attempt to upgrade the karpenter chart from v0.37.0 to v1.0.0 using ArgoCD. Specify override Helm values for the webhook properties and observe that no diff reflecting those overrides is presented in the CRD definitions.
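
A sketch of the kind of overrides meant here (the value names follow the karpenter-crd chart's documented webhook.* values; the point of this issue is that the static CRDs bundled in the main chart ignore them):

# Helm values passed to the main karpenter chart during the v1.0.0 upgrade.
# The karpenter-crd chart honors these; the CRDs packaged in the main chart
# do not, so ArgoCD shows no corresponding diff on the CRD definitions.
webhook:
  enabled: false              # attempt to disable the conversion webhook
  serviceName: my-karpenter   # attempt to point the webhook at a differently named Service (illustrative name)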

Versions:

  • Chart Version: 1.0.0
  • Kubernetes Version (kubectl version): 1.30.2
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
booleanbetrayal added the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels on Aug 22, 2024
@engedaam
Contributor

As part of ArgoCD, you should be able to disable CRD installation, no? https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-skip-crds. Also, it sounds like you updated the fullnameOverride? Does that align with what is passed as webhook.serviceName in the karpenter-crd chart?

@booleanbetrayal
Author

Hi @engedaam, and thanks for the quick feedback. That skip-crds setting applies to an ArgoCD Application, but an Application might deploy multiple charts / sub-charts. In our case, we have other Helm packages deployed in this namespace / ArgoCD Application (kube-system) that are modeled in this fashion. We wouldn't want them all to carry this setting, as this is the only problematic deployment.

Other popular charts often include a flag to toggle CRD deployment for more granular control. For example, ESO uses an installCRDs flag for this feature.
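
For reference, the ESO-style toggle looks like this (installCRDs is the external-secrets chart's flag, not Karpenter's):

# external-secrets (ESO) chart values: CRD installation toggled per release
installCRDs: false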

We should be able to update the fullnameOverride for the karpenter chart to match the "karpenter" value for conversion.webhook.clientConfig.service.name that its bundled (static) CRDs set, and hopefully everything just works at that point since the Service would be named appropriately. However, that requires rebuilding all our existing Karpenter resources to support the new naming, along with IAM trust policies, etc.

At this point, I'm curious how people running Karpenter within an ArgoCD Application are performing the v1.0.0 upgrade.

@booleanbetrayal
Author

At this point, I'm curious how people running Karpenter within an ArgoCD Application are performing the v1.0.0 upgrade.

Just to provide more context: Helm won't even overwrite resources that are managed by ArgoCD, and I'm not sure how to spoof Helm's .Release.Service stamping. As a result, AFAIK, the CRD upgrade step of the official v1 Migration Guide can't be followed for ArgoCD-managed Karpenter deployments. Even if you could spoof it so the Helm CLI did deploy those karpenter-crd CRDs, the Karpenter CRDs would immediately show up as drifting out of sync with the karpenter chart source in the ArgoCD Application display, as they would differ from the bundled variants.

@Vinaum8

Vinaum8 commented Aug 23, 2024

+1

@Vinaum8

Vinaum8 commented Aug 23, 2024

The expected behavior described in this issue would be perfect, as it would give us the option to enable or disable the installation of CRDs with the karpenter Helm chart.

@artem-nefedov
Contributor

artem-nefedov commented Aug 23, 2024

Given 75d482f is already in main, I assume official guidelines from now on will be "disable CRDs in the main chart and always use karpenter-crd if you require any sort of CRD customization".

@booleanbetrayal
Author

booleanbetrayal commented Aug 23, 2024

@artem-nefedov - How does one disable CRDs in the main chart? main is still currently bundling all the static CRDs as it was before, without any mechanism to prevent deployment in ArgoCD, as far as I can tell.

EDIT: I've verified that the 1.0.1 package release is continuing to distribute the static CRDs.

@artem-nefedov
Contributor

@artem-nefedov - How does one disable CRDs in the main chart? main is still currently bundling all the static CRDs as it was before, without any mechanism to prevent deployment in ArgoCD, as far as I can tell.

  1. With the helm CLI, you would use the --skip-crds option.
  2. With Flux2 GitOps, you would set spec.install.crds: Skip and spec.upgrade.crds: Skip in the HelmRelease (see the sketch after this list).
  3. I don't use ArgoCD, but a quick search shows it has supported skipping CRDs since January 2022: feat: add skipCrds flag for helm charts argoproj/argo-cd#8012
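
A minimal Flux HelmRelease sketch for item 2 (the chart version and HelmRepository source are illustrative assumptions):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: karpenter
  namespace: kube-system
spec:
  interval: 10m
  chart:
    spec:
      chart: karpenter
      version: 1.0.1
      sourceRef:
        kind: HelmRepository
        name: karpenter   # assumes an OCI HelmRepository pointing at oci://public.ecr.aws/karpenter
  install:
    crds: Skip   # do not install CRDs from this chart
  upgrade:
    crds: Skip   # do not upgrade CRDs from this chart either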

@booleanbetrayal
Author

@artem-nefedov - I've posted a response to that workaround elsewhere, but the gist is that an ArgoCD Application may comprise multiple Helm charts. Skipping CRDs to work around a problematic deployment from one chart could cascade into multiple issues with the maintenance of CRDs in the other charts.

@artem-nefedov
Contributor

@booleanbetrayal Maybe I'm missing something, but so far this looks like a solvable problem. Split it into two separate Applications, and optionally add a third top-level Application that installs both if you still want to have a single entrypoint.

@adamwilner

This exact thing happened to me; I spent several hours with it. Skipping CRDs at an application level is not an option. I would need to be able to include both the karpenter and karpenter-crd charts in the same ArgoCD application with only the CRDs from the karpenter chart skipped. Or, even better, a solution that only requires using the karpenter chart, with consistent service naming.

@booleanbetrayal
Author

@booleanbetrayal Maybe I'm missing something, but so far this looks like a solvable problem. Split it into two separate Applications, and optionally add a third top-level Application that installs both if you still want to have a single entrypoint.

@artem-nefedov - So if my Application is kube-system, deploying to the kube-system Namespace, and I am running Karpenter within it alongside multiple other charts (AWS-LB, ESO, etc.) that typically control their CRD deployments with Helm flags, am I supposed to re-architect my application, moving / renaming resources into separate Applications and Namespaces in all my clusters, while juggling Karpenter controller uptime so that Nodes can continue to spool, all in order to support this v1.0.0 upgrade?

You might not have much familiarity with ArgoCD, but I can tell you that this is not a light chore, would be prone to outages, is not documented as a migration vector, and would feel especially egregious in light of AWS's recent endorsement of ArgoCD deployment approaches.

@jmdeal
Contributor

jmdeal commented Aug 26, 2024

For context, we went with this design following Helm v3's recommendations for managing CRDs. These recommendations state that, in a main chart, CRDs should be placed in the crds directory. Resources in this directory cannot be templated, which is why the packaged CRDs are hardcoded with the package defaults (this only affects the conversion block). When templating is required, the recommendation is to use a separate chart, in this case karpenter-crd.

Other popular charts often include some flag to toggle CRD deployment for more granular fidelity.

The linked examples in this thread appear to be working around Helm's design for managing CRDs. Unless absolutely necessary, we shouldn't go down this path; we should stick to standard Helm practices.

Skipping CRDs at an application level is not an option.

This does not appear to be necessary. skipCrds is scoped to the source, not the application. You should be able to skip CRDs in one source and not another within a single application. I was able to verify with the following application; CRDs were not installed from the main Karpenter chart, but were installed from the aws-controllers-k8s chart:

project: default
destination:
  server: https://kubernetes.default.svc
  namespace: karpenter
sources:
  - repoURL: public.ecr.aws/aws-controllers-k8s
    targetRevision: 1.2.17
    chart: ec2-chart
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    chart: karpenter-crd
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    helm:
      parameters:
        - name: settings.clusterName
          value: redacted
        - name: serviceAccount.annotations.eks\.amazonaws\.com/role-arn
          value: redacted
      skipCrds: true
    chart: karpenter

@booleanbetrayal
Author

@jmdeal - This is a bit of an annoying workaround for users who already have karpenter deployed in an ArgoCD application originating from a private GitOps repo that contains the relevant chart binaries and YAML values. For example, here is the core of our Application definition:

project: kube-system
source:
  helm:
    valueFiles:
      - staging-values.yaml
  path: kube-system
  repoURL: git@github.com/some-org/some-private-gitops-repo
  targetRevision: main

In this instance, we would either have to expose our Application to third-party supply-chain repositories (a deal-breaker for lots of orgs that try to mitigate supply-chain attacks / reduce third-party dependency failures) or figure out a new scheme for pushing the karpenter chart + value definitions to a private repo under a new repo, tag, or path.

Even if that were feasible for most users, they would then be forced to perform a moderately risky Application migration to move a pre-1.0.0 running Karpenter controller + provisioning (NodePool, NodeClass) manifests into the appropriate new source state. It feels like a lot of users could accidentally induce some sort of Karpenter downtime under such a requirement if they fail to atomically swap resources with identical configuration.

If you all absolutely do not wish to implement a convenience flag for this and are adamant about keeping with the Helm 3 recommendations, why not just drop the static CRDs from the primary chart and make the karpenter-crd chart an explicit or implicit dependency? That would also align with the same Helm 3 recommendations (option 2), which state:

Another way to do this is to put the CRD definition in one chart, and then put any resources that use that CRD in another chart.

In this method, each chart must be installed separately. However, this workflow may be more useful for cluster operators who have admin access to a cluster

Or alternatively, just drop the conversion webhook altogether and clarify the requirements for upgrading provisioning manifests appropriately?

@thomasgouveia

If you all absolutely do not wish to implement a convenience flag for this and are adamant about keeping with the Helm 3 recommendations, why not just drop the static CRDs from the primary chart and make the karpenter-crd chart an explicit or implicit dependency? That would also align with the same Helm 3 recommendations (option 2)

+1. I also mentioned adding a dependency to the karpenter chart here. Do you have any feedback about it?

@trojaond

trojaond commented Aug 27, 2024

@booleanbetrayal Have you considered including karpenter-crd as a separate Helm chart in your private repo and referencing it from a single ArgoCD application? That way it does not affect the supply chain.

  project: {project}
  destination:
    namespace: kube-system
    server: 'https://kubernetes.default.svc'
  sources:
  - repoURL: {private_git_repository}
    targetRevision: HEAD
    path: '{path_to_helmcharts}/karpenter-crd/'
  - repoURL: {private_git_repository}
    targetRevision: HEAD
    path: '{path_to_helmcharts}/karpenter/'
    helm:
      skipCrds: true
      values: |-
          ...

This respects the Helm 3 recommendations (option 2), as there are two charts, but they are deployed from a single ArgoCD Application.

Also worth mentioning that multi-source applications are still considered a beta feature.

@booleanbetrayal
Author

booleanbetrayal commented Aug 27, 2024

@trojaond - Currently, our ArgoCD application paths are all namespace-oriented, with differing values files per environment (development, staging, production, etc.). This change breaks with those conventions and would require other scripting changes to accommodate, so even if this ArgoCD multi-source beta feature is viable (and does not "change in backwards incompatible ways"), it's not very desirable. It would be much more operator-friendly if there were just a single chart with a convenience flag, or two separate charts: one with templated CRDs and one with none.

EDIT - Additionally, I think the Option 2 statement ...

Another way to do this is to put the CRD definition in one chart, and then put any resources that use that CRD in another chart.

... implies that the CRDs are omitted from one of the charts, not included in both, and that command-line flags are not required to work around the duplication.

@Vinaum8

Vinaum8 commented Aug 27, 2024

Why not put the option to enable or disable CRD in the Helm Chart?

@artem-nefedov
Contributor

Why not put the option to enable or disable CRD in the Helm Chart?

Helm doesn't allow any templating for CRDs that are located in "crds" directory, so it's impossible to affect them based on input values. They can only be turned off by --skip-crds option (and similar things in various controllers). To make it possible to turn off CRDs via values, they must be moved from "crds" to "templates", like it's done in karpenter-crd chart, but that could bring other unwanted side effects (there are reasons why "crds" directory exists and is handled differently).
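
Roughly, the distinction looks like this: a CRD rendered from templates/ (as in karpenter-crd) can carry a templated conversion stanza, which is exactly what the crds/ directory forbids. A simplified fragment for illustration (the value keys mirror the karpenter-crd chart's webhook.* values; this is a sketch, not the actual chart source):

# Fragment of a CRD kept under templates/ -- templating like this is only
# possible outside the crds/ directory.
{{- if .Values.webhook.enabled }}
spec:
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1beta1", "v1"]
      clientConfig:
        service:
          name: {{ .Values.webhook.serviceName }}
          namespace: {{ .Values.webhook.serviceNamespace | default .Release.Namespace }}
          port: {{ .Values.webhook.port }}
{{- end }}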

@booleanbetrayal
Author

Helm doesn't allow any templating for CRDs that are located in "crds" directory, so it's impossible to affect them based on input values. They can only be turned off by --skip-crds option (and similar things in various controllers). To make it possible to turn off CRDs via values, they must be moved from "crds" to "templates", like it's done in karpenter-crd chart, but that could bring other unwanted side effects (there are reasons why "crds" directory exists and is handled differently).

@artem-nefedov - What are the side effects of templating CRDs? Lots of Helm charts template CRDs, including ArgoCD itself.

@andrewjamesbrown
Contributor

@artem-nefedov unless I'm mistaken, helm supports this via helm/helm#7138 (i.e what ArgoCD is doing)

@artem-nefedov
Contributor

artem-nefedov commented Aug 27, 2024

@artem-nefedov unless I'm mistaken, helm supports this via helm/helm#7138 (i.e what ArgoCD is doing)

@andrewjamesbrown This is only for outputting CRDs via the helm template command. It still doesn't allow any actual templating (changing the content) based on input values. And it requires a separate option.

@artem-nefedov
Contributor

@artem-nefedov - What are the side effects of templating CRDs? Lots of Helm charts template CRDs, including ArgoCD itself.

@booleanbetrayal For comprehensive reasoning (as the team behind helm sees it), read this:
https://github.com/helm/community/blob/f9e06c16d89ccea1bea77c01a6a96ae3b309f823/architecture/crds.md

@andrewjamesbrown
Contributor

@artem-nefedov my mistake... ArgoCD CRDs are in the templates/crds directory not crds
🤦

@dcherniv

dcherniv commented Sep 11, 2024

@bryantbiggs
Created additional specific issues I'm having currently: #6981 #6982 #6983
If we can resolve this, then at least there's a semblance of a path forward. Right now it's just filling up the logs with errors.

@dcherniv

dcherniv commented Sep 11, 2024

I finally resolved my issues in prod with stale nodeclaims RE: #6981

Issue #6981 is likely caused, in my case, by the fact that the conversion webhooks do NOT work with the karpenter 1.0.2 chart (#6982). If you manage to get the conversion webhook working on 1.0.2, then the below probably doesn't apply to you.

Warning: the step below may incur downtime if you go too fast! I also wouldn't advise going this route if you have a large cluster. Luckily mine was a handful of nodes, which still took me the better part of a day.
The approach was to strategically terminate, in the AWS console, the nodes whose NodeClaim didn't have a group field for some reason:

for i in `kubectl get nodeclaims | grep -v NAME | awk '{print $1}'` ; do echo $i ; kubectl get nodeclaim -o yaml $i | grep group ; done
default-arm64-6pv26
    group: karpenter.k8s.aws
default-arm64-8xdhn
    group: karpenter.k8s.aws
default-arm64-9xmjx
    group: karpenter.k8s.aws
default-arm64-fcdc7
    group: karpenter.k8s.aws
default-arm64-kndsr
    group: karpenter.k8s.aws
default-arm64-vrzlw
    group: karpenter.k8s.aws
default-bvqlp
    group: karpenter.k8s.aws
default-h74m4
[...]

For the NodeClaims that didn't have group set in the above output, I had to terminate the instances forcefully through the AWS console.
Once Karpenter provisioned replacements for these, I quickly reverted back to 0.37.3 with webhooks enabled (because 0.37.3 with webhooks disabled doesn't even work; see #6975).
0.37.3 cleaned up the stale claims with no problems. Keep a careful eye on the NodeClaims (you don't want 0.37.3 to provision new claims, because 1.0.2 won't be able to delete them and you'd be back at square one). Anyway, you've got to be quick here.
I executed the "procedure" real quick:

  kubectl delete validatingwebhookconfiguration validation.webhook.config.karpenter.sh
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.k8s.aws
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh
  kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.k8s.aws

Then I quickly synced to 1.0.2 in ArgoCD with the prune option.
Finally, I have NodeClaims on the new v1 spec and no errors in the Karpenter logs.
And don't forget to sync the IAM policy to the v1 version, because the v1 policy is not compatible with 0.37.3 (see #6983).

To go back and forth between 0.37.3 and 1.0.2, I suggest manually removing ALL of the conditionals in the IAM policy for the time being. This is outlined in step 8 here: https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure
Unfortunately it fails to mention that the policy is completely incompatible: 0.37.3 doesn't work with the policy for 1.0.2, and vice versa.

PS: Don't forget to disable the webhooks on version 1.0.2 when you are going back and forth. I suggest having two values files prepared for this case (see #6982).

Good luck!

@jonathan-innis
Contributor

jonathan-innis commented Sep 12, 2024

For a lot of folks here, it seems like the validating and mutating webhooks are hanging around even after the upgrade to v1.0, where they are dropped from the chart, and I'm wondering if we can help with this bit. Is this expected behavior for Argo? I'd expect that when a resource is removed from the Helm release, the respective resource is also pruned by whichever GitOps tool is applying it.

@booleanbetrayal
Author

booleanbetrayal commented Sep 12, 2024

@jonathan-innis - ArgoCD has an optional Prune option when syncing, which will delete any tracked resources that are missing from the chart manifests (due to templating or value conditions). Until they are pruned, they show in the ArgoCD UI as a diff with missing contents. However, in our case, I do not believe we had any visibility in our Application that any ValidatingWebhookConfigurations or MutatingWebhookConfigurations ever existed.

If some other process had updated the metadata.labels values on those resources, or they had been explicitly overwritten by the chart manifests themselves (possibly?), then ArgoCD would lose the ability to track them, as they would become detached from ArgoCD management. Therefore, they would not show up in the Application at all, let alone show any indication of a diff, and would not be cleaned up when Prune was enabled.
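
For anyone checking this on their own cluster: by default ArgoCD tracks the resources it manages via the app.kubernetes.io/instance label (annotation-based tracking is an alternative), so a webhook configuration that ArgoCD can still display and prune carries metadata roughly like this (the Application name is illustrative):

# Sketch: if this tracking label is stripped or rewritten by another
# controller, ArgoCD no longer associates the resource with the
# Application and will neither display nor prune it.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.karpenter.sh
  labels:
    app.kubernetes.io/instance: kube-system   # must match the ArgoCD Application name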

I am theorizing here that there was some label mismanagement as part of the numerous charts / versions / hooks / templating helpers that broke these resources at deployment onset, but I'm definitely not rewinding the clock on attempting to reproduce this one.

I think it really behooves the Karpenter team to incorporate ArgoCD and Flux tooling into release testing procedures (cc: @bryantbiggs)

@adrianmiron

I finally resolved my issues in prod with stale nodeclaims RE: #6981 […full rollback procedure quoted from the comment above…]

When doing a rollback to remove stale NodeClaims, do it to 0.37.2 with webhooks disabled; that worked for me.

However, what happens if during the switch your cluster wants to spawn nodes? You get 0.37.2 NodeClaims again :)) ... unless you set up different NodePools for both versions, block the 0.37 NodePools from creating nodes when moving to 1.X, and keep that setting when going back to 0.37 to avoid spawning bad nodes. In other words, it is not worth doing on 10 clusters.

@dcherniv

dcherniv commented Sep 12, 2024

I just tried to roll back to 0.37.3 from 1.0.2 on one of the non-important clusters.
Upon changing the manifests to 0.37.3 with webhook.enabled in the karpenter chart, the webhooks showed a diff (not present; ArgoCD wants to install them). [screenshot]
I synced just one of the webhooks and then disabled webhook.enabled in the values.
ArgoCD found that the webhook is not in the configuration and wants to delete it: [screenshot]
I didn't delete it, but instead bumped the chart to 1.0.2 again, all the while keeping webhook.enabled: false in the values.
ArgoCD still detected that the webhook is present but shouldn't be on chart 1.0.2. [screenshot]
Syncing with prune on in ArgoCD then deleted the webhook properly.

The above was a partial sync.

On a full sync: I just did a full rollback to 0.37.3 with webhook.enabled true.
Webhooks are enabled. The chart is on 0.37.3. [screenshot]
Disabling the webhooks in the values on the same chart 0.37.3 and refreshing in ArgoCD: boom, the webhooks disappeared entirely from the configuration, even though they are still there on the cluster.
Attaching the webhook YAML as text: validation.webhook.config.karpenter.sh.txt
So this problem happens even on chart 0.37.3 without upgrading at all. As soon as the webhook.enabled value is set to false, the webhooks disappear from the configuration but are not uninstalled.

It still wants to delete the cert, which is under the same conditional. [screenshot]

I bet the problem is universal and has nothing to do with ArgoCD. It probably happens with Helm too.

@dcherniv

When doing a rollback to remove stale NodeClaims, do it to 0.37.2 with webhooks disabled; that worked for me.

0.37.2 worked. I just tried following the v1 upgrade guide, and it mentioned that you have to be on the latest patch version, which was unfortunately broken :(

@dcherniv

dcherniv commented Sep 12, 2024

I think this has something to do with https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents maybe?
since the webhook specifies karpenter ?deployment/rs? as its owner in the owner block:

  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Namespace
    name: karpenter
    uid: a4d1f1ad-a593-4666-8677-62337f9c602f

But when we flip the webhook.enabled flag in the chart values, this causes the re-creation of the karpenter ReplicaSet (since it goes by uid, it's hard to tell who the owner is, whether it's the RS or the Deployment).
When the RS is deleted, the webhooks become orphans and are removed from consideration or something.
I'm not 100% sure how this owner reference block works, but it seems plausible in my head.

@jonathan-innis
Contributor

jonathan-innis commented Sep 16, 2024

I am theorizing here that there was some label mismanagement as part of the numerous charts / versions / hooks / templating helpers that broke these resources at deployment onset, but I'm definitely not rewinding the clock on attempting to reproduce this one.

That would be my guess as well. We need to look into the reconciler in the code that's injecting the certificate into the Validation and Mutating Webhook Configurations. I suspect that it may be doing exactly what you are saying and removing ownership. And solving that problem would reduce a lot of the pain here.

I think it really behooves the Karpenter team to incorporate ArgoCD and Flux tooling into release testing procedures

I'd be interested to see what other projects do around testing with GitOps tooling integrations. I'd imagine there's some prior art out there for integration testing, but I'd want to see which CNCF projects have it.

@iNoahNothing
Copy link

iNoahNothing commented Sep 16, 2024

FWIW I found an easier way to recover from this
1. kubectl delete the NodeClaims that were left over from earlier karpenter deploy
2. Temporarily update the nodeclaims.karpenter.sh crd to make group not a required field so that you can edit the crd without failing CEL validation
3. kubectl edit the NodeClaims and remove the finalizer so that the resource gets removed from the API server
4. Restart karpenter and let it spin up new nodes

This will temporarily orphan the old nodes that were claimed by the deleted NodeClaims but Karpenter appears to cleanly remove those for me once it was back in a healthy state. This was relatively painless and did not incur downtime

Edit: you can actually just temporarily update the nodeclaims.karpenter.sh CRD to remove the validation that blocks editing the spec and manually add the group. This made it so that we didn't have to orphan any nodes

@adrianmiron

FWIW I found an easier way to recover from this […recovery steps quoted above…]

That sounds clever, but I would not do that on my prod clusters :D

@iNoahNothing

[…iNoahNothing's recovery steps quoted above…]

That sounds clever, but I would not do that on my prod clusters :D

Curious as to why you would consider this riskier than force-deleting the nodes and having to do the delicate and well-timed dance of rolling back and rolling forward described above. Since the nodes are drained before being removed, everything was restarted and running before the nodes were reaped.

@adrianmiron

Well, I would not use either solution on prod; I will wait for an official version that does this for us without any tricks. We do not need to move to 1.0.X right now, and most of the stuff here is just about discovering issues and possible solutions so the devs know about them.

@Vinaum8

Vinaum8 commented Sep 18, 2024

Well, I would not use either solution on prod; I will wait for an official version that does this for us without any tricks. We do not need to move to 1.0.X right now, and most of the stuff here is just about discovering issues and possible solutions so the devs know about them.

me too

@jonathan-innis
Contributor

That would be my guess as well. We need to look into the reconciler in the code that's injecting the certificate into the Validation and Mutating Webhook Configurations. I suspect that it may be doing exactly what you are saying and removing ownership. And solving that problem would reduce a lot of the pain here.

So at least for this specific issue, I think we've tracked down what's going on here.

There's an open issue here that actually specifically describes this interaction between Knative and ArgoCD. What it comes down to is that ArgoCD refuses to delete objects with ownerReferences -- even if it was the original creator of the object.

Because knative injects an owner reference onto the ValidatingWebhookConfiguration and MutatingWebhookConfiguration as part of its reconciliation flow, this causes ArgoCD to ignore these objects when pruning and leave them behind on the cluster. I'm trying out what happens when we don't have these owner references by patching the knative package that we are using here: #7042 to see if that resolves the issue (I suspect it should).

Code link to where ArgoCD does this check: https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L1167

@jonathan-innis
Contributor

Trying to get this feature into upstream knative/pkg so we can add the change and patch this back to the Karpenter versions that are affected by this: knative/pkg#3095

@ShahroZafar

ShahroZafar commented Sep 24, 2024

Hi, we are planning to upgrade to v1 but before upgrading I landed on this issue.

We are using ArgoCD to manage the Karpenter controller. We have two applications:

  1. Karpenter controller
  2. Nodepools and ec2nodeclasses

We plan to go from v0.37.3 (we upgraded to this version just because it was mentioned in the upgrade guide; we had to manually update the namespace to karpenter in the CRDs to fix the "karpenter service not found" issue) to v1.0.2.

I see that the conversion webhooks are removed in the v1.0.2 Helm chart. Since the webhooks are removed (though this issue correctly points out that they are only removed from ArgoCD when you set webhook enabled to false, and not actually deleted, which I confirmed on v0.37.3 as well), how is the conversion from v1beta1 to v1 supposed to work?

If the webhook is to work, should we keep the second application out of sync so that ArgoCD doesn't revert the changes made by the conversion webhook?

@jonathan-innis
Contributor

jonathan-innis commented Oct 4, 2024

We just released the latest patch versions of the pre-v1.0 minor versions, which fix this issue so that these configuration resources aren't leaked. Please use one of the following versions prior to going to 1.0.x, since they remove the ownerReference that is causing Argo to leak the resources and causing the failure on upgrade:

  • v0.33.10
  • v0.34.11
  • v0.35.10
  • v0.36.7
  • v0.37.5

@laserpedro

laserpedro commented Oct 7, 2024

Hello, a little summary from our v1 upgrade that may help as feedback here:

env:

  • EKS 1.30
  • Karpenter deployed via ArgoCD / Karpenter node classes and nodepools deployed also with ArgoCD.

Feedback:
Upgrading from the minor patch 0.37.5 to 1.0.6 enables a clean deletion of Karpenter's validating and mutating webhooks.
However, we are encountering a failure of the webhook migrating the CRDs to v1: the only error we can see in the controller's log regarding the webhook is detailed in #6898, but in the end our CRDs are still all on v1beta1.
Also, I don't know if this is an issue specific to our network policies (an explicit deny on the creation of public IP addresses), but with the latest v1 policy (the one in the migration guide or the one in the EKS Terraform module, submodule /karpenter) we now receive a permission-denied error that the 0.37.X policy was not generating:
"You are not authorized to perform this operation. User: arn:aws:sts::XXXX:assumed-role/KarpenterController-20240626054425729500000022/1728281406041809636 is not authorized to perform: ec2:RunInstances on resource: arn:aws:ec2:me-south-1:XXXX:network-interface/* with an explicit deny in a service control policy" (our subnets have no auto-assignment of public IPv4 addresses), so something here is strange to me.

UPDATED:
When explicitly passing associatePublicIPAddress: false to the node class, the nodes can be created.
If it is not specified, the default value depends on the subnet's public IP auto-assign setting, so maybe there is an issue with the new policy preventing retrieval of the subnet setting, but we can rule this out:

 statement {
    sid       = "AllowRegionalReadActions"
    resources = ["*"]
    actions = [
      "ec2:DescribeAvailabilityZones",
      "ec2:DescribeImages",
      "ec2:DescribeInstances",
      "ec2:DescribeInstanceTypeOfferings",
      "ec2:DescribeInstanceTypes",
      "ec2:DescribeLaunchTemplates",
      "ec2:DescribeSecurityGroups",
      "ec2:DescribeSpotPriceHistory",
      "ec2:DescribeSubnets"
    ]

    condition {
      test     = "StringEquals"
      variable = "aws:RequestedRegion"
      values   = [local.region]
    }
  }

So maybe a deeper issue with the controller itself ...
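
For reference, the associatePublicIPAddress workaround above expressed on the node class (a sketch showing only the relevant field; a real EC2NodeClass also needs its usual selectors, AMI settings, and role):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default   # illustrative name
spec:
  # Explicitly disable public IPs so the SCP's explicit deny no longer
  # blocks ec2:RunInstances, instead of relying on the subnet's
  # auto-assign setting.
  associatePublicIPAddress: false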

@ravigude

ravigude commented Oct 7, 2024

Has anyone upgraded from v0.37.5 to 1.0.2?

@adrianmiron

Has anyone upgraded from v0.37.5 to 1.0.2?

Yes, it worked for me like this:

  • upgrade to 0.37.5 (with webhooks enabled; with them off you get an RBAC error... needs a fix)
  • upgrade the resources + IAM policy to the v1.0.X version
  • upgrade to v1.0.6 (do not disable the webhook in the CRDs yet, it will trigger a bunch of errors)
  • disable the webhook inside the CRDs (see the values sketch after this list)
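
A sketch of that last step expressed as karpenter-crd chart values (assuming the CRDs are managed via the karpenter-crd chart, whose webhook values appear later in this thread):

# karpenter-crd values: drop the conversion webhook from the CRDs once all
# Karpenter resources are on the v1 API.
webhook:
  enabled: false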

@jmdeal
Contributor

jmdeal commented Oct 7, 2024

Hi @laserpedro, I followed up in #6898; I don't believe what you are seeing indicates an issue in the conversion webhook. As for the network policy issue, this is unrelated to Argo. This was a change made at v1 to reduce implicit field management, since associatePublicIPAddress is now available on the EC2NodeClass (#6213). This could be a breaking change for users with SCPs and should be called out in the upgrade guide; I'll cut a docs PR to get it in.

@Vinaum8

Vinaum8 commented Oct 8, 2024

I decided to work with two Helm charts on my ArgoCD server:

  • karpenter/karpenter
  • karpenter/karpenter-crd

With these configurations in the ApplicationSets:

ignoreDifferences:
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: ec2nodeclasses.karpenter.k8s.aws
    jqPathExpressions:
    - .spec.conversion.webhook.clientConfig
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: nodeclaims.karpenter.sh
    jqPathExpressions:
    - .spec.conversion.webhook.clientConfig
  - group: apiextensions.k8s.io
    kind: CustomResourceDefinition
    name: nodepools.karpenter.sh
    jqPathExpressions:
    - .spec.conversion.webhook.clientConfig

Karpenter-crd:

webhook:
    # -- Whether to enable the webhooks.
    enabled: true
    serviceName: insert_here_correct_name_service_karpenter
    serviceNamespace: insert_here_correc_name_namespace_karpenter
    # -- The container port to use for the webhook.
    port: 8443

@icicimov

What a mess ... BTW why do we have to have the finalizer:

  finalizers:
    - karpenter.k8s.aws/termination

on the EC2NodeClass? If one wants to move from role to instanceProfile or vice-versa we need to delete it first which is not possible without deleting the ec2nodeclasses.karpenter.k8s.aws CRD first which is kinda convoluted?
