Karpenter v1.0.0 chart does not respect Helm values in packaged CRDs leading to ArgoCD upgrade difficulty #6847
Comments
As part of ArgoCD, you should be able to disable CRD installation, no? https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-skip-crds. Also, it sounds like you updated the …
Hi @engedaam and thanks for the quick feedback. Other popular charts often include some flag to toggle CRD deployment for more granular control. For example, ESO uses a dedicated value for this, and we should be able to update the chart similarly. At this point, I'm curious how people running Karpenter within an ArgoCD Application are performing the v1.0.0 upgrade.
Just to provide more context, Helm won't even overwrite resources that are managed by ArgoCD, and I'm not sure how to spoof Helm's ownership metadata.
+1
The expected behavior described in this issue would be perfect, as it would give us the option to enable or disable CRD installation with the karpenter Helm chart.
Given 75d482f is already in main, I assume the official guidance from now on will be "disable CRDs in the main chart and always use karpenter-crd if you require any sort of CRD customization".
@artem-nefedov - How does one disable CRDs in the main chart? EDIT: I've verified that the 1.0.1 package release is continuing to distribute the static CRDs.
@artem-nefedov - I've posted a response to that workaround elsewhere, but the gist is that there may be multiple Helm charts comprising an ArgoCD Application. Skipping CRDs to relieve a problematic deployment from one chart could cascade into multiple issues with the maintenance of CRDs in other charts.
@booleanbetrayal Maybe I'm missing something, but so far this looks like a solvable problem. Split it into two separate Applications, and optionally add a third top-level Application that installs both if you still want to have a single entrypoint.
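In case it helps readers unfamiliar with this ArgoCD pattern, here is a minimal sketch of the optional top-level "app of apps" Application suggested above; the repoURL, path, and names are hypothetical placeholders, and the referenced Git path is assumed to contain the two child Application manifests (karpenter-crd, and karpenter with CRDs skipped).

```yaml
# Hypothetical "app of apps" parent; repoURL, path, and names are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter-umbrella
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  source:
    repoURL: https://example.com/some-org/gitops-repo.git  # placeholder
    targetRevision: main
    path: apps/karpenter  # assumed to contain the two child Application manifests
  syncPolicy:
    automated:
      prune: true
```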
This exact thing happened to me; I spent several hours on it. Skipping CRDs at the Application level is not an option. I would need to be able to include both the karpenter and karpenter-crd charts in the same ArgoCD Application with only the CRDs from the karpenter chart skipped. Or, even better, a solution that only requires the karpenter chart, with consistent Service naming.
@artem-nefedov - So my Application would have to be split apart. You might not have much familiarity with ArgoCD, but I can tell you that this is not a light chore: it would be prone to outage risk, is not documented as a migration path, and would feel especially egregious in light of AWS's recent endorsement of ArgoCD deployment approaches.
For context, we went with this design following Helm v3's recommendations for managing CRDs. These recommendations state that, in a main chart, CRDs should be placed in the crds/ directory.
The linked examples in this thread appear to be working around Helm's design for managing CRDs. Unless absolutely necessary, we shouldn't go down this path and should instead stick to Helm standard practices.
This does not appear to be necessary:

```yaml
project: default
destination:
  server: https://kubernetes.default.svc
  namespace: karpenter
sources:
  - repoURL: public.ecr.aws/aws-controllers-k8s
    targetRevision: 1.2.17
    chart: ec2-chart
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    chart: karpenter-crd
  - repoURL: public.ecr.aws/karpenter
    targetRevision: 1.0.1
    helm:
      parameters:
        - name: settings.clusterName
          value: redacted
        - name: serviceAccount.annotations.eks\.amazonaws\.com/role-arn
          value: redacted
      skipCrds: true
    chart: karpenter
```
@jmdeal - This is a bit of an annoying workaround for users who already have single-source Applications like:

```yaml
project: kube-system
source:
  helm:
    valueFiles:
      - staging-values.yaml
  path: kube-system
  repoURL: [email protected]/some-org/some-private-gitops-repo
  targetRevision: main
```

In this instance we would either have to expose our Application to third-party supply-chain repositories (a deal-breaker for lots of orgs who attempt to mitigate supply-side attacks / reduce third-party dependency failures) or figure out a new scheme of pushing charts into that private repository. Even if that were feasible for most users, they would then be forced to perform a moderately risky Application migration to move a pre-1.0.0 running Karpenter controller + provisioning (NodePool, NodeClass) manifests into the appropriate new source state. It feels like a lot of users could accidentally induce some sort of Karpenter downtime with such a requirement if they fail to atomically swap resources with identical configuration.

If you all absolutely do not wish to implement a convenience flag for this and are adamant about keeping with the Helm 3 recommendations, why not just drop the static CRDs from the primary chart and make the karpenter-crd chart a dependency?
Or alternatively, just drop the conversion webhook altogether and clarify the upgrade requirements for the appropriate provisioning manifests?
+1. I also mentioned adding the karpenter-crd chart as a dependency elsewhere.
@booleanbetrayal Have you considered including both charts as separate sources in a single ArgoCD Application?

This respects the Helm 3 recommendations (option 2), as there are two charts, but they are deployed from a single ArgoCD Application. Also worth mentioning that multi-source Applications are still considered a beta feature.
@trojaond - Currently, our ArgoCD Application paths are all namespace-oriented, with differing values files based on environment (development, staging, production, etc). This change breaks with those conventions and would require some other scripting changes to accommodate, so even if the ArgoCD multi-source beta feature is viable (and does not "change in backwards incompatible ways"), it's not very desirable. It would be much more operator-friendly if there were just a single chart with a convenience flag, or two separate charts, one with templated CRDs and one with none. EDIT - Additionally, I think the Option 2 statement ...
... implies that the CRDs are omitted from one of the charts, not included in both, and does not require command-line flags to work around the duplication.
Why not put an option to enable or disable CRDs in the Helm chart?
Helm doesn't allow any templating for CRDs located in the crds directory, so it's impossible to affect them based on input values. They can only be turned off by skipping CRDs entirely (e.g. `--skip-crds`).
@artem-nefedov - What are the side effects of templating CRDs? Lots of Helm charts template CRDs, including ArgoCD itself.
@artem-nefedov unless I'm mistaken, Helm supports this via helm/helm#7138 (i.e. what ArgoCD is doing)
@andrewjamesbrown This is only for outputting CRDs via the CLI; it doesn't make them templatable.
@booleanbetrayal For comprehensive reasoning (as the team behind Helm sees it), read this:
@artem-nefedov my mistake... ArgoCD's CRDs are in the templates directory, not crds.
@bryantbiggs
I finally resolved my issues in prod with stale NodeClaims (RE: #6981). Issue #6981 is likely caused, in my case, by the fact that conversion webhooks do NOT work with the karpenter 1.0.2 chart. Warning: the steps below may incur downtime if you go too fast! I also wouldn't advise going this route if you have a large cluster. Luckily mine was a handful of nodes, which still took me the better part of a day.

For the NodeClaims that didn't have group set in the output above, I had to terminate the instances forcefully through the AWS console.

Then I quickly synced to 1.0.2 in ArgoCD with the prune option. To go back and forth between 0.37.3 and 1.0.2, I suggest temporarily removing ALL of the conditionals in the IAM policy. This is outlined in step 8 here: https://karpenter.sh/v1.0/upgrading/v1-migration/#upgrade-procedure PS. Don't forget to disable the webhooks on version 1.0.2 when you are going back and forth. I suggest having two values files prepared for this case (see #6982). Good luck!
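For readers wondering what "didn't have group set" means above: in the v1 API a NodeClaim's nodeClassRef carries a group field (v1beta1 used apiVersion instead), so NodeClaims that were never converted show an empty group. A minimal illustration, with example values rather than output from this thread:

```yaml
# Converted (v1) NodeClaim: nodeClassRef has a group.
spec:
  nodeClassRef:
    group: karpenter.k8s.aws  # empty/missing on NodeClaims that were never converted
    kind: EC2NodeClass
    name: default             # example NodeClass name
```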
For a lot of folks here, it seems like the validating and mutating webhook configurations are hanging around even after the upgrade to v1.0, where they are dropped from the chart, and I'm wondering if we can help with this bit. Is this expected behavior for Argo? I'd expect that when a resource is removed from the Helm release, the respective resource is also pruned by whichever GitOps tool is applying it.
@jonathan-innis - ArgoCD has an optional Prune option when syncing, which will delete any tracked resources that are missing from the chart manifests (due to templating or value conditions). Until they are pruned, they will show in the ArgoCD UI as a diff with missing contents. However, in our case, I do not believe we had any visibility in our Application that these webhook resources were still lingering. If some other process had updated them out-of-band, ArgoCD may simply have stopped tracking them. I am theorizing here that there was some label mismanagement as part of the numerous charts / versions / hooks / templating helpers that broke these resources at deployment onset, but I'm definitely not rewinding the clock to attempt to reproduce this one. I think it really behooves the Karpenter team to incorporate ArgoCD and Flux tooling into release testing procedures (cc: @bryantbiggs)
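On the label-management theory above: ArgoCD's default resource tracking is label-based, so if a resource's tracking label is changed or stripped by another controller, the Application silently loses sight of it and will neither show a diff nor prune it. A hedged illustration of the default tracking metadata (the resource name is an example):

```yaml
# Example only: ArgoCD's default label-based tracking on a managed resource.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.karpenter.sh   # example resource name
  labels:
    app.kubernetes.io/instance: karpenter # default ArgoCD tracking label (the Application name)
```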
When doing a rollback to remove stale NodeClaims, roll back to 0.37.2 with webhooks disabled; that worked for me. However, what happens if your cluster wants to spawn nodes during the switch? You get 0.37.2 NodeClaims again :)) ... unless you set up different NodePools for both versions, block the 0.37 NodePools from creating nodes when moving to 1.x, and keep that setting when going back to 0.37 to avoid spawning bad nodes. In other words, it is not worth doing on 10 clusters.
I just tried to roll back to 0.37.3 from 1.0.2 on one of the non-important clusters. That was a partial sync. On a full sync, I did a full rollback to 0.37.3 with webhook.enabled set to true, and it still wants to delete the cert, which is under the same conditional. I bet the problem is universal and has nothing to do with ArgoCD; it probably happens with plain Helm too.
0.37.2 worked. I just tried following the v1 upgrade guide, and it mentioned that you have to be on the latest patch version, which was unfortunately broken :(
I think this has something to do with https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents maybe?
But when we flip the webhook.enabled flag in the chart values, this causes re-creation of the karpenter ReplicaSet (since it goes by uid, it's hard to tell who the owner is, whether it's the ReplicaSet or the Deployment).
That would be my guess as well. We need to look into the reconciler in the code that's injecting the certificate into the Validating and Mutating Webhook Configurations. I suspect that it may be doing exactly what you are saying and removing ownership. Solving that problem would reduce a lot of the pain here.
I'd be interested to see what other projects do around testing with GitOps tooling integrations. I'd imagine that there's some prior art out there for integration testing, but I'd be interested to see which CNCF projects have it.
FWIW I found an easier way to recover from this.

Edit: you can actually just temporarily update the …
That sounds clever, but I would not do that on my prod clusters :D
Curious as to why this would be thought of as riskier than force-deleting the nodes and having to do the delicate, well-timed dance of rolling back and rolling forward described above. Since the nodes are drained before being removed, everything was restarted and running before the nodes were reaped.
Well, I would not use either solution in prod; I will wait for an official version which does this for us without any tricks. We do not need to move to 1.0.x right now, and most of the stuff here is just to discover issues and possible solutions so the devs know about them.
Me too.
So at least for this specific issue, I think we've tracked down what's going on here. There's an open issue that specifically describes this interaction between Knative and ArgoCD. What it comes down to is that ArgoCD refuses to delete objects with ownerReferences -- even if it was the original creator of the object. Because knative injects an owner reference onto the webhook configuration resources, ArgoCD won't prune them. Code link to where ArgoCD does this check: https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L1167
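To make the root cause concrete, here is roughly what the injected metadata looks like on the leaked webhook configuration; the owning object shown is illustrative (the discussion above notes it's unclear whether the ReplicaSet or the Deployment ends up as the owner), but any ownerReference at all triggers ArgoCD's skip-delete check linked above.

```yaml
# Illustrative only: an ownerReference like this is what prevents ArgoCD from pruning.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: validation.webhook.karpenter.sh           # example name
  ownerReferences:
    - apiVersion: apps/v1
      kind: Deployment                            # could also be the ReplicaSet, per the discussion above
      name: karpenter
      uid: 00000000-0000-0000-0000-000000000000   # placeholder uid
```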
Trying to get this feature into upstream |
Hi, we are planning to upgrade to v1, but before upgrading I landed on this issue. We are using ArgoCD for managing the Karpenter controller. We have two Applications.

We plan to go from v0.37.3 (we upgraded to this version just because it was mentioned in the upgrade guide; we had to manually update the namespace to karpenter in the CRDs to fix the "karpenter service not found" issue) to v1.0.2. I see that the conversion webhooks are removed in the v1.0.2 Helm chart. Since the webhooks are removed (though this issue correctly points out that they are only removed from ArgoCD when you set webhook.enabled to false, and not actually removed, which I confirmed on v0.37.3 as well), how is the conversion from v1beta1 to v1 supposed to work? If the webhook is to work, should we put the second Application out of sync so that ArgoCD doesn't revert the changes made by the conversion webhook?
We just released the latest patch versions of the pre-v1.0 releases that fix this issue so that these configuration resources aren't leaked. Please use one of the following versions prior to going to 1.0.x, since these versions remove the ownerReference that is causing Argo to leak the resources and cause the failure on upgrade:
Hello, a little summary that may help here as feedback for our v1 upgrade.

Environment: …

Feedback: …

UPDATED: …

So maybe a deeper issue with the controller itself ...
Anyone upgraded from v0.37.5 to 1.0.2?
Yes, it worked for me like this: …
Hi @laserpedro, I followed up in #6898; I don't believe what you are seeing indicates an issue in the conversion webhook. As for the network policy issue, this is unrelated to Argo. This was a change made at v1 to reduce implicit field management.
I decided to work with two Helm charts in my ArgoCD server, with these configurations in the ApplicationSets:

Karpenter-crd: …
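The configuration snippets from this comment did not survive above, so purely as a hedged sketch of such a two-chart split (chart versions, namespaces, and values are placeholders, not the commenter's actual configuration):

```yaml
# Sketch only: one Application per chart; versions and values are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter-crd
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  source:
    repoURL: public.ecr.aws/karpenter
    chart: karpenter-crd
    targetRevision: 1.0.x        # placeholder version
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: karpenter
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: kube-system
  source:
    repoURL: public.ecr.aws/karpenter
    chart: karpenter
    targetRevision: 1.0.x        # placeholder version
    helm:
      skipCrds: true             # avoid duplicating the CRDs shipped by karpenter-crd
```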
What a mess ... BTW, why do we have to have the finalizer on the EC2NodeClass? If one wants to move from …
Description

Observed Behavior:

The setting of any `webhook` values (such as `webhook.serviceName`) in the `karpenter` primary chart is not respected. Instead, the chart embeds static CRDs, which are symlinked in this repository, unlike the `karpenter-crd` chart, which respects these values.

This makes upgrading from v0.37.0 to v1.0.0 using platforms like ArgoCD difficult and error-prone, since ArgoCD is capable of managing CRDs and will likewise detect drift if they are updated manually or via a secondary chart such as `karpenter-crd`.

In our case, the `conversion.webhook.clientConfig.service.name` value for the conversion webhook (specified in the static CRDs) deployed with the `karpenter` chart did not match the name that the chart generates for the Service itself (using the `karpenter.fullName` helper), and so all webhook calls would fail. Likewise, disabling the webhook could not be done with values, and manual edits will show diff changes in ArgoCD.

Attempting to utilize both charts in ArgoCD will attempt to create duplicate resources and could cause unexpected issues. There is no way to disable the deployment of CRDs in the `karpenter` primary chart.

A possible workaround for our specific issue would be to set the `nameOverride` value to `karpenter` such that it matches the static `serviceName` buried in the CRD, but ideally we don't even need these conversion webhooks and would prefer to have them disabled entirely.
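For illustration, the conversion stanza baked into the packaged CRDs looks roughly like the following; the values shown are examples of the mismatch being described, not copied verbatim from the chart.

```yaml
# Illustrative sketch of the static conversion stanza; values are examples.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: nodepools.karpenter.sh
spec:
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        service:
          name: karpenter        # static name; does not track the karpenter.fullName helper
          namespace: kube-system # example namespace
          port: 8443             # example port
      conversionReviewVersions:
        - v1beta1
        - v1
```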
Expected Behavior:

The `karpenter` Helm package respects Helm values in its CRDs, just as the `karpenter-crd` chart does. If static CRDs are required, they can be referenced elsewhere from the GitHub repository, etc.

Alternatively (and less elegantly), a dual-chart solution with a flag supporting the disabling of CRDs in the primary chart (or just not packaging them to begin with) could also be a possible fix vector, as it would allow the other chart to install without duplicates.

Reproduction Steps (Please include YAML):

Attempt to upgrade the `karpenter` chart from `v0.37.0` to `v1.0.0` using ArgoCD. Specify override Helm values for `webhook` properties and observe that no diff is presented in the CRD definitions reflecting those overrides.

Versions:

- Kubernetes version (`kubectl version`): 1.30.2