This repository has been archived by the owner on Sep 12, 2024. It is now read-only.

Fixes the wrong use of k8s client-go #58

Merged

merged 5 commits into main from fix-connect-refuse-k8s-client May 7, 2024

Conversation

Nino-K
Member

@Nino-K Nino-K commented May 3, 2024

Previously, we were using the client incorrectly by not passing the context. This caused internal network errors in the client because the API server was not yet ready, as you can see in the logs below:

2024/05/03 09:03:12 [DEBUG]   list of services from core v1----> []
W0503 09:03:12.649264     255 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: network is unreachable
2024/05/03 09:03:12 [DEBUG]   watching kubernetes services
2024/05/03 09:03:12 [DEBUG]   inside state watching
2024/05/03 09:03:12 [DEBUG]   kubernetes: error watching [error=failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: network is unreachable]
2024/05/03 09:03:12 [ERROR]   kubernetes: unexpected error watching [error=failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: network is unreachable]
W0503 09:03:14.012174     255 reflector.go:424] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:169: failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused
2024/05/03 09:03:14 [DEBUG]   kubernetes: error watching [error=failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused]
2024/05/03 09:03:14 [DEBUG]   kubernetes: got error, rolling back [error=failed to list *v1.Service: Get "https://127.0.0.1:6443/api/v1/services?limit=500&resourceVersion=0": dial tcp 127.0.0.1:6443: connect: connection refused]
2024/05/03 09:03:15 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]

However, we continued anyway and blocked on the event channel.

With the new approach, we can detect when the API server is not yet ready; in that case we return a FATAL error and the process exits. This allows the init process to retry until the API server becomes ready.
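
As a rough sketch of what the corrected usage looks like (illustrative only, with a hypothetical helper name; not the exact agent code), passing the caller's context to the List call lets connection errors such as "network is unreachable" surface immediately instead of being hidden inside the informer's reflector:

package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listServices lists all services using the given context and propagates
// any error to the caller instead of swallowing it.
func listServices(ctx context.Context, client kubernetes.Interface) ([]corev1.Service, error) {
	services, err := client.CoreV1().Services(corev1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("error listing services: %w", err)
	}
	return services.Items, nil
}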

The logs below are from the new approach:

2024/05/03 09:29:58 [INFO]    Starting Rancher Desktop Agent in [AdminInstall=true] mode
2024/05/03 09:29:58 [DEBUG]   kubernetes: loaded kubeconfig /etc/rancher/k3s/k3s.yaml
2024/05/03 09:29:58 [FATAL]   error watching services: error listing services: Get "https://127.0.0.1:6443/api/v1/services": dial tcp 127.0.0.1:6443: connect: network is unreachable
2024/05/03 09:30:03 [INFO]    Starting Rancher Desktop Agent in [AdminInstall=true] mode
2024/05/03 09:30:03 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:04 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:05 [DEBUG]   kubernetes: failed to read kubeconfig [error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory][config-path=/etc/rancher/k3s/k3s.yaml]
2024/05/03 09:30:06 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:07 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:08 [DEBUG]   checking if container engine API is running at /var/run/docker.sock
2024/05/03 09:30:08 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:09 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:10 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:11 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:12 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:13 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:14 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:15 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:16 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:17 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:18 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:19 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:20 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:21 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:22 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:23 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:24 [DEBUG]   kubernetes: failed to read kubeconfig [config-path=/etc/rancher/k3s/k3s.yaml][error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory]
2024/05/03 09:30:25 [DEBUG]   kubernetes: failed to read kubeconfig [error=could not load Kubernetes client config from /etc/rancher/k3s/k3s.yaml: stat /etc/rancher/k3s/k3s.yaml: no such file or directory][config-path=/etc/rancher/k3s/k3s.yaml]
2024/05/03 09:30:26 [DEBUG]   kubernetes: loaded kubeconfig /etc/rancher/k3s/k3s.yaml
2024/05/03 09:30:26 [ERROR]   context cancellation: context canceled
2024/05/03 09:30:26 [FATAL]   error watching services: error listing services: apiserver not ready
2024/05/03 09:30:31 [INFO]    Starting Rancher Desktop Agent in [AdminInstall=true] mode
2024/05/03 09:30:31 [DEBUG]   kubernetes: loaded kubeconfig /etc/rancher/k3s/k3s.yaml

Fixes: rancher-sandbox/rancher-desktop#6809

@Nino-K Nino-K requested a review from mook-as May 3, 2024 16:43
Previously we were using the client incorrectly by not passing the
context, which caused internal network errors in the client due to
the API server not being ready. However, we continued and blocked on
the event channel. With the new approach, we can determine when the
API server is not ready; in that case we return a FATAL error and the
process exits. This allows the init process to retry until the API
server becomes ready.

Signed-off-by: Nino Kodabande <[email protected]>
@Nino-K Nino-K force-pushed the fix-connect-refuse-k8s-client branch from 721bb53 to b1c38c4 Compare May 3, 2024 16:44
@@ -48,14 +48,17 @@ func watchServices(ctx context.Context, client *kubernetes.Clientset) (<-chan ev
informerFactory := informers.NewSharedInformerFactory(client, 1*time.Hour)
serviceInformer := informerFactory.Core().V1().Services()
sharedInformer := serviceInformer.Informer()
_, _ = sharedInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
_, _ = serviceInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
Contributor

Is there a reason we're no longer using sharedInformer here? It seems to be the same thing.

AddFunc: func(obj interface{}) {
log.Debugf("Service Informer: Add func called with: %+v", obj)
Contributor

Not that it matters, but these logs should be trace?

Member Author

AFAIK, for trace logging to work it needs to be handled from main; I don't think just changing these to trace would work 🤔
Also, I don't recall whether we handle trace logging anywhere in RD.

Contributor

Yeah, trace would mean it's basically off and needs a code change to turn on. We don't really want to log everything in release builds (even with debug turned on).

if err != nil {
return nil, nil, fmt.Errorf("error listing services: %w", err)
}
log.Debugf("coreV1 services list :%+v", services.Items)
Contributor

Same here — trace?

@@ -101,15 +104,16 @@ func watchServices(ctx context.Context, client *kubernetes.Clientset) (<-chan ev
informerFactory.WaitForCacheSync(ctx.Done())
informerFactory.Start(ctx.Done())

services, err := serviceInformer.Lister().List(labels.Everything())
services, err := client.CoreV1().Services(corev1.NamespaceAll).List(ctx, v1.ListOptions{})
Contributor

Discussed with Nino: This manages to return an error (good!), but we don't handle it (bad!) and exit the whole process. At least init restarts us, but we should avoid needing to do that.
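
For illustration, a minimal sketch of the alternative this comment points at (hypothetical helper name; it assumes the transient startup errors seen in the logs above): retry the initial listing instead of exiting, so init does not need to restart the agent:

package sketch

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sys/unix"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listServicesWithRetry keeps retrying the initial service listing while the
// API server is still coming up, instead of treating the failure as fatal.
func listServicesWithRetry(ctx context.Context, client kubernetes.Interface) (*corev1.ServiceList, error) {
	for {
		services, err := client.CoreV1().Services(corev1.NamespaceAll).List(ctx, metav1.ListOptions{})
		if err == nil {
			return services, nil
		}
		// Only retry the startup errors observed in the logs; anything else
		// is returned to the caller.
		if !errors.Is(err, unix.ECONNREFUSED) && !errors.Is(err, unix.ENETUNREACH) {
			return nil, fmt.Errorf("error listing services: %w", err)
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(time.Second):
		}
	}
}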

@Nino-K Nino-K requested a review from mook-as May 3, 2024 18:27
@Nino-K Nino-K force-pushed the fix-connect-refuse-k8s-client branch 2 times, most recently from 90febd7 to 37f265a Compare May 4, 2024 02:32
This prevents the process from exiting with a FATAL error, which would
otherwise require the init process to restart the guest agent again.
The error is now handled internally within the process.

Signed-off-by: Nino Kodabande <[email protected]>
@Nino-K Nino-K force-pushed the fix-connect-refuse-k8s-client branch from 37f265a to f8f722a Compare May 6, 2024 15:18
case isTimeout(err):
case errors.Is(err, unix.ENETUNREACH):
case errors.Is(err, unix.ECONNREFUSED):
case apierrors.IsInternalError(err):
Contributor

That's for all internal errors; I guess that's fine?

@@ -258,6 +267,10 @@ func isTimeout(err error) bool {
return false
}

func isAPINotReady(err error) bool {
return strings.Contains(err.Error(), "apiserver not ready")
Contributor

It would be good to have a comment about where this comes from (k3s, over HTTP so we can't test for a specific error object). Ideally we also check that err is of the right type here, but I guess that can be optional.
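
For reference, a possible annotated version of the helper (a sketch only, not the merged code); the provenance described in the comment is taken from the review note above:

// isAPINotReady reports whether err looks like the "apiserver not ready"
// failure. That message originates from k3s and reaches us over HTTP, so
// there is no typed error value to match against and a substring check is
// the practical option.
func isAPINotReady(err error) bool {
	if err == nil {
		return false
	}
	return strings.Contains(err.Error(), "apiserver not ready")
}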

// If it's a time out, the server may not be running yet
switch {
case isTimeout(err):
fallthrough
Contributor

This might be clearer as:

switch {
default:
  return err
case isTimeout(err):
case errors.Is(err, unix.ENETUNREACH):
case errors.Is(err, unix.ECONNREFUSED):
case isAPINotReady(err):
}
// Sleep and continue for all the expected errors
time.Sleep(time.Second)
continue

@@ -181,7 +181,7 @@ func main() {
k8sServiceListenerIP := net.ParseIP(*k8sServiceListenerAddr)

if k8sServiceListenerIP == nil || !(k8sServiceListenerIP.Equal(net.IPv4zero) ||
k8sServiceListenerIP.Equal(net.IPv4(127, 0, 0, 1))) { //nolint:gomnd // IPv4 addr localhost
k8sServiceListenerIP.Equal(net.IPv4(127, 0, 0, 1))) { //nolint:gomnd,mnd // IPv4 addr localhost
Contributor

It might be better to update the lint config to allow magic numbers for net.IPv4 instead.

@Nino-K Nino-K force-pushed the fix-connect-refuse-k8s-client branch from 5634943 to d6b729e Compare May 7, 2024 16:33
@Nino-K Nino-K force-pushed the fix-connect-refuse-k8s-client branch from d6b729e to ae6abd7 Compare May 7, 2024 16:40
@Nino-K Nino-K requested a review from mook-as May 7, 2024 16:42
@mook-as mook-as merged commit afce3ee into main May 7, 2024
3 checks passed
@mook-as mook-as deleted the fix-connect-refuse-k8s-client branch May 7, 2024 16:42

Successfully merging this pull request may close these issues.

Windows guestAgent does not pick up ports from services during the initialization