This repository has been archived by the owner on Oct 10, 2023. It is now read-only.

kapp-controller api-service is unable to handle requests causing tanzu management-cluster create failures #4507

Open
ridaz opened this issue Mar 21, 2023 · 1 comment
Labels
area/addons area/lcm Related to Cluster Lifecycle management kind/bug PR/Issue related to a bug

Comments

ridaz commented Mar 21, 2023

Bug description

tanzu management-cluster create and delete fail with a timeout because tanzu-addons-controller-manager
crash-loops when it cannot reach kapp-controller's apiserver for data.packaging.carvel.dev/v1alpha1.

We have seen a few cases where tanzu management-cluster create and delete fail in the following way:

  1. tanzu management-cluster create/delete times out.
  2. tanzu-addons-controller-manager is not fully running (get pods shows 0/n pods running).
  3. Logs for tanzu-addons-controller-manager show errors about data.packaging.carvel.dev/v1alpha1:
I0901 23:06:15.856616       1 request.go:645] Throttling request took 1.007519759s, request: GET:https://[fd00:100:96::1]:443/apis/data.packaging.carvel.dev/v1alpha1?timeout=32s
E0901 23:06:19.163153       1 addon_controller.go:188] controllers/Addon "msg"="error retrieving GroupVersion" "error"="the server is currently unable to handle the request"  "GroupVersion"="data.packaging.carvel.dev/v1alpha1"
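
The "unable to handle the request" error for an aggregated API group usually means the matching APIService is not reporting Available. A minimal sketch for checking that, assuming the APIService uses the standard `<version>.<group>` name (`v1alpha1.data.packaging.carvel.dev`) and that kubectl points at the affected cluster. The function names are hypothetical helpers, not part of any Tanzu or Carvel tooling:

```shell
#!/bin/sh
# Sketch only: report whether an aggregated APIService is Available.
# Assumption: APIService name is <version>.<group>, e.g.
# v1alpha1.data.packaging.carvel.dev (not stated in the issue itself).

apiservice_available() {
  # Prints the status of the Available condition (True, False, or Unknown).
  kubectl get apiservice "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

check_apiservice() {
  avail="$(apiservice_available "$1")"
  if [ "$avail" = "True" ]; then
    echo "$1 is available"
  else
    echo "$1 is NOT available (status: ${avail:-unknown})"
    return 1
  fi
}
```

It could be called as `check_apiservice v1alpha1.data.packaging.carvel.dev`. A non-True result would be consistent with the discovery errors in the logs above, though it does not by itself explain why the backing service is unreachable.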

This issue looks similar to #571.

### Workaround:

  1. Restarting the kapp-controller and tkr-validator pods allowed the create/delete operations to complete.
kapp-controller-6d864cc846-kxgdc           2/2     Running   0          3m38s
tkr-conversion-webhook-manager-774f74f64c-f2ngr          1/1     Running   0                 78s
  2. This also brought tanzu-addons-controller-manager back to the Running state.
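
The restart described above can be sketched as a rollout restart, assuming the pods are owned by Deployments named `kapp-controller` and `tkr-conversion-webhook-manager` (inferred from the pod names shown, not confirmed). The helper name `restart_and_wait`, the 120s timeout, and the `KAPP_NS` variable are illustrative assumptions:

```shell
#!/bin/sh
# Sketch only: restart one or more Deployments in a namespace and wait for
# each rollout to finish. Deployment names and namespaces are assumptions
# supplied by the caller.

restart_and_wait() {
  ns="$1"; shift
  for deploy in "$@"; do
    kubectl -n "$ns" rollout restart "deployment/$deploy" || return 1
    kubectl -n "$ns" rollout status "deployment/$deploy" --timeout=120s || return 1
  done
}
```

For example, `restart_and_wait vmware-system-tkg tkr-conversion-webhook-manager` uses the namespace that appears in the webhook service URL in the logs. For kapp-controller the namespace is not given in this report, so it would have to be looked up first (for instance with `kubectl get deploy -A`) and passed as `restart_and_wait "$KAPP_NS" kapp-controller`.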

Affected product area (please put an X in all that apply)

  • ( ) APIs
  • (X) Addons
  • ( ) CLI
  • ( ) Docs
  • ( ) IAM
  • ( ) Installation
  • ( ) Plugin
  • ( ) Security
  • (X) Test and Release
  • ( ) User Experience
  • ( ) Developer Experience

Expected behavior

tanzu management-cluster create/delete succeeds eventually.

Steps to reproduce the bug

The issue is intermittent, so no specific steps are available to reproduce the issue.

Version (include the SHA if the version is not obvious)
kapp-controller: v0.41.2

Environment where the bug was observed (cloud, OS, etc)
VMC on Nitros

Relevant Debug Output (Logs, manifests, etc)

  1. kubectl get packagerepositories.packaging.carvel.dev utkg-packages-repo -n vmware-system-pkgs -o yaml | less
usefulErrorMessage: |-
I0316 20:55:53.701753   38347 request.go:601] Waited for 1.031315546s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/addons.cluster.x-k8s.io/v1alpha3
I0316 20:56:04.901721   38347 request.go:601] Waited for 1.020640753s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/netoperator.vmware.com/v1alpha1
I0316 20:56:16.368738   38347 request.go:601] Waited for 1.023073347s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1beta1
I0316 20:56:27.834792   38347 request.go:601] Waited for 1.019206685s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/infrastructure.cluster.vmware.com/v1alpha3
I0316 20:56:39.302278   38347 request.go:601] Waited for 1.005437717s due to client-side throttling, not priority and fairness, request: GET:https://172.24.0.1:443/apis/crd.projectcalico.org/v1
  kapp: Error: unable to retrieve the complete list of server APIs: data.packaging.carvel.dev/v1alpha1: the server is currently unable to handle the request (possibly related issue: https://github.com/vmware-tanzu/carvel-kapp/issues/12)
  2. addons controller
I0316 21:29:23.950689       1 logr.go:252] clusterbootstrap-resource "msg"="validate create"  "name"="v1.23.15---vmware.1-tkg.4"
W0316 21:29:29.030873       1 reflector.go:324] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
E0316 21:29:29.030907       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.TanzuKubernetesRelease: failed to list *v1alpha1.TanzuKubernetesRelease: conversion webhook for run.tanzu.vmware.com/v1alpha3, Kind=TanzuKubernetesRelease failed: Post "https://tkr-conversion-webhook-service.vmware-system-tkg.svc:443/convert?timeout=30s": dial tcp 172.24.252.7:443: connect: connection refused
  3. tkr-conversion webhook
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.217:57968: EOF
2023/03/14 20:06:50 http: TLS handshake error from 10.73.129.216:53886: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.217:41790: EOF
2023/03/14 20:08:09 http: TLS handshake error from 10.73.129.216:35878: EOF
  4. The packages CR is missing from the environment:
kubectl get packages
error: the server doesn't have a resource type "packages"
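
A missing `packages` resource type is what discovery looks like when the aggregated data.packaging.carvel.dev group cannot be served. A small sketch for confirming that from a script, assuming kubectl is pointed at the affected cluster. `has_resource` is a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: succeed if the given resource is discoverable in an API group.
# $1 = API group, $2 = plural resource name.

has_resource() {
  # With -o name, api-resources prints entries as <resource>.<group>.
  kubectl api-resources --api-group="$1" -o name 2>/dev/null \
    | grep -q "^$2\."
}
```

A call such as `has_resource data.packaging.carvel.dev packages || echo "packages not served"` would then flag the condition shown above.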
ridaz added the kind/bug and needs-triage labels on Mar 21, 2023
@github-actions

Hey @ridaz! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Tanzu Framework.

codegold79 added the area/addons and area/lcm labels and removed the needs-triage label on Mar 31, 2023