
metrics-server fails to get metrics from recently started nodes #1571

Open · IuryAlves opened this issue Sep 17, 2024 · 1 comment
Labels: kind/support, triage/accepted

@IuryAlves commented:
What happened:

The metrics-server fails with the following error:

E0916 14:23:37.254021 1 scraper.go:149] "Failed to scrape node" err="Get \"https://10.34.50.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-50-99.eu-west-1.compute.internal"
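
The failing scrape can also be reproduced outside of metrics-server by requesting the same kubelet endpoint through the API server's node proxy (a sketch, assuming kubectl access to the cluster; the node name is taken from the error above). While the node is in this state, the request should fail with a similar TLS error:

# Fetch the kubelet resource-metrics endpoint via the API server's node proxy.
# Substitute any node that is currently failing to be scraped.
kubectl get --raw "/api/v1/nodes/ip-10-34-50-99.eu-west-1.compute.internal/proxy/metrics/resource"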

What you expected to happen:

The metrics-server can get metrics from nodes successfully.

Anything else we need to know?:

This problem happens when the autoscaler (we use Karpenter) adds or removes nodes. For a brief period after a node starts, it fails to serve metrics on its /metrics/resource endpoint, which causes the HPA to emit many FailedToGetResourceMetric events.
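
One thing worth checking: the kubelet config below sets "serverTLSBootstrap": true, which makes each new node request its serving certificate through a CertificateSigningRequest; until that CSR is approved and the certificate issued, TLS handshakes on port 10250 fail with exactly this kind of "remote error: tls: internal error". A quick way to see whether the failure window correlates with pending serving CSRs and recently created nodes (a sketch, assuming kubectl access; these commands are not part of the report above):

# Kubelet serving CSRs use the kubernetes.io/kubelet-serving signer and stay Pending until approved.
kubectl get csr --sort-by=.metadata.creationTimestamp | grep kubelet-serving

# Check whether the nodes named in the scrape errors are the most recently created ones.
kubectl get nodes --sort-by=.metadata.creationTimestamp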

Environment:

  • Kubernetes distribution (GKE, EKS, Kubeadm, the hard way, etc.):
    EKS 17

  • Container Network Setup (flannel, calico, etc.):

  • Amazon VPC CNI plugin for Kubernetes
  • CoreDNS
  • KubeProxy

Note: This issue is not network related

  • Kubernetes version (use kubectl version):
Client Version: v1.30.3
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.12-eks-2f46c53
  • Metrics Server manifest
spoiler for Metrics Server manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "6"
    meta.helm.sh/release-name: metrics-server
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2023-01-31T14:48:01Z"
  generation: 6
  labels:
    app.kubernetes.io/instance: metrics-server
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: metrics-server
    app.kubernetes.io/version: 0.7.2
    application: none
    customer-level: none
    environment: dev
    helm.sh/chart: metrics-server-7.2.14
    owner: squad-platform
  name: metrics-server
  namespace: kube-system
  resourceVersion: "386279220"
  uid: ed6fd84e-fdb6-46e5-b25c-80a09135476f
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/instance: metrics-server
      app.kubernetes.io/name: metrics-server
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2023-07-17T14:25:57+02:00"
        prometheus.io/path: /metrics
        prometheus.io/port: "8443"
        prometheus.io/scrape: "true"
      creationTimestamp: null
      labels:
        app.kubernetes.io/instance: metrics-server
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: metrics-server
        app.kubernetes.io/version: 0.7.2
        application: none
        customer-level: none
        environment: dev
        helm.sh/chart: metrics-server-7.2.14
        owner: squad-platform
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchLabels:
                  app.kubernetes.io/instance: metrics-server
                  app.kubernetes.io/name: metrics-server
              topologyKey: kubernetes.io/hostname
            weight: 1
      automountServiceAccountToken: true
      containers:
      - args:
        - --secure-port=8443
        - --kubelet-preferred-address-types=InternalIP,Hostname,InternalDNS,ExternalDNS,ExternalIP
        - --metric-resolution=20s
        command:
        - metrics-server
        image: docker.io/bitnami/metrics-server:0.7.2-debian-12-r3
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: metrics-server
        ports:
        - containerPort: 8443
          name: https
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: 150m
            ephemeral-storage: 2Gi
            memory: 192Mi
          requests:
            cpu: 100m
            ephemeral-storage: 50Mi
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          privileged: false
          readOnlyRootFilesystem: true
          runAsGroup: 1001
          runAsNonRoot: true
          runAsUser: 1001
          seLinuxOptions: {}
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /opt/bitnami/metrics-server/apiserver.local.config
          name: empty-dir
          subPath: app-tmp-dir
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1001
        fsGroupChangePolicy: Always
      serviceAccount: metrics-server
      serviceAccountName: metrics-server
      terminationGracePeriodSeconds: 30
      volumes:
      - emptyDir: {}
        name: empty-dir
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-01-31T14:48:01Z"
    lastUpdateTime: "2024-09-16T18:56:36Z"
    message: ReplicaSet "metrics-server-75fb4689b7" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-09-17T06:56:46Z"
    lastUpdateTime: "2024-09-17T06:56:46Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 6
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
  • Kubelet config:
spoiler for Kubelet config:
{
  "kubeletconfig": {
    "enableServer": true,
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_RSA_WITH_AES_128_GCM_SHA256"
    ],
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/etc/kubernetes/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "172.31.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupRoot": "/",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "none",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "hairpin-veth",
    "maxPods": 58,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "protectKernelDefaults": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": {
      "RotateKubeletServerCertificate": true
    },
    "failSwapOn": true,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": {
      "cpu": "90m",
      "ephemeral-storage": "1Gi",
      "memory": "893Mi"
    },
    "systemReservedCgroup": "/system",
    "kubeReservedCgroup": "/runtime",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/kubernetes/kubelet-plugins/volume/exec/",
    "providerID": "aws:///eu-west-1a/i-02487199a514a5c47",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 2,
      "options": {
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/containerd/containerd.sock"
  }
}
  • Metrics server logs:
spoiler for Metrics Server logs:
I0917 06:56:25.141023       1 serving.go:374] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
I0917 06:56:29.056085       1 handler.go:275] Adding GroupVersion metrics.k8s.io v1beta1 to ResourceManager
I0917 06:56:29.260891       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0917 06:56:29.260914       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0917 06:56:29.260949       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0917 06:56:29.260975       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.260993       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0917 06:56:29.260998       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.261280       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::apiserver.local.config/certificates/apiserver.crt::apiserver.local.config/certificates/apiserver.key"
I0917 06:56:29.261308       1 secure_serving.go:213] Serving securely on [::]:8443
I0917 06:56:29.261348       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0917 06:56:29.541673       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0917 06:56:29.543057       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0917 06:56:29.543068       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
te error: tls: internal error" node="ip-10-34-40-218.eu-west-1.compute.internal"
E0913 07:11:41.371925       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.36.55:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-36-55.eu-west-1.compute.internal"
E0913 07:15:41.362052       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.37.167:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-37-167.eu-west-1.compute.internal"
E0913 07:19:41.367676       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.34.76:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-34-76.eu-west-1.compute.internal"
E0913 07:23:41.376918       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.43.160:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-43-160.eu-west-1.compute.internal"
E0913 07:26:41.376301       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.241:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-241.eu-west-1.compute.internal"
E0913 07:30:41.339715       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.219:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-219.eu-west-1.compute.internal"
E0913 07:33:41.354489       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.47.203:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-47-203.eu-west-1.compute.internal"
E0913 07:38:41.359856       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.45.209:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-45-209.eu-west-1.compute.internal"
E0913 07:40:41.350880       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.44.2:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-44-2.eu-west-1.compute.internal"
E0913 07:42:41.372912       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.99:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-99.eu-west-1.compute.internal"
E0913 07:45:41.374758       1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.34.41.27:10250/metrics/resource\": remote error: tls: internal error" node="ip-10-34-41-27.eu-west-1.compute.internal"
  • Status of Metrics API:
spoiler for Status of Metrics API:
Name:         v1beta1.metrics.k8s.io
Namespace:
Labels:       app.kubernetes.io/instance=metrics-server
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=metrics-server
              app.kubernetes.io/version=0.7.2
              application=none
              customer-level=none
              environment=dev
              helm.sh/chart=metrics-server-7.2.14
              owner=squad-platform
Annotations:  meta.helm.sh/release-name: metrics-server
              meta.helm.sh/release-namespace: kube-system
API Version:  apiregistration.k8s.io/v1
Kind:         APIService
Metadata:
  Creation Timestamp:  2023-01-31T14:48:01Z
  Resource Version:    386279218
  UID:                 7503894f-8f1f-4f61-9df8-a663cdd0298d
Spec:
  Group:                     metrics.k8s.io
  Group Priority Minimum:    100
  Insecure Skip TLS Verify:  true
  Service:
    Name:            metrics-server
    Namespace:       kube-system
    Port:            443
  Version:           v1beta1
  Version Priority:  100
Status:
  Conditions:
    Last Transition Time:  2024-09-17T06:56:46Z
    Message:               all checks passed
    Reason:                Passed
    Status:                True
    Type:                  Available
Events:                    <none>
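
Given that the kubelet config above enables serverTLSBootstrap and the RotateKubeletServerCertificate feature gate, the kubelet only has a serving certificate once its CSR has been signed. A node-level check during the failure window could look like this (a sketch, assuming shell access to an affected node and the default kubelet --cert-dir; the IP is taken from the scrape error above):

# The issued serving certificate appears here once the CSR is approved;
# kubelet-server-current.pem is the rotated serving certificate.
ls -l /var/lib/kubelet/pki/

# A raw TLS probe against the kubelet port fails during the handshake
# while no serving certificate is available yet.
openssl s_client -connect 10.34.50.99:10250 </dev/null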

/kind bug

@k8s-ci-robot added the kind/bug and needs-triage labels on Sep 17, 2024
@dgrisonnet (Member) commented:

/triage accepted
/kind support

@k8s-ci-robot added the triage/accepted and kind/support labels and removed the needs-triage label on Sep 19, 2024
@dgrisonnet removed the kind/bug label on Sep 19, 2024