
chore: Additional upstream metrics #1672

Open
jigisha620 wants to merge 1 commit into main from the additional-metrics branch
Conversation

@jigisha620 (Contributor) commented Sep 17, 2024

Fixes #N/A

Description
Adds the following new metrics (a hedged sketch of how one might be registered follows the list):

  1. karpenter_nodes_drained_total
  2. karpenter_ignored_pod_count
  3. karpenter_pods_current_unbound_duration_seconds
  4. karpenter_pods_bound_duration_seconds
  5. karpenter_pods_unstarted_time_seconds
  6. karpenter_cluster_state_unsynced_time_seconds
  7. karpenter_nodes_eviction_requests_total
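
For orientation, here is a minimal sketch of how one of these metrics might be declared and registered against the controller-runtime registry. The label names and bucket boundaries below are illustrative assumptions; the PR itself uses the project's own metrics helpers (e.g. metrics.DurationBuckets()).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// podBoundDurationSeconds tracks how long a pod took to go from creation to bound.
var podBoundDurationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "karpenter",
		Subsystem: "pods",
		Name:      "bound_duration_seconds",
		Help:      "The time from pod creation until the pod is bound.",
		// Illustrative buckets; the PR uses the project's DurationBuckets() helper.
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	},
	[]string{"pod_name", "pod_namespace"},
)

func init() {
	// Register with the controller-runtime registry so the metric is exposed
	// on the same endpoint as the rest of the controller metrics.
	crmetrics.Registry.MustRegister(podBoundDurationSeconds)
}
```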

How was this change tested?
Added tests and tested on local cluster

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Sep 17, 2024
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign gjtempleton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Sep 17, 2024
@k8s-ci-robot (Contributor) commented:
Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Sep 17, 2024
@coveralls commented Sep 17, 2024

Pull Request Test Coverage Report for Build 11003546560

Details

  • 28 of 28 (100.0%) changed or added relevant lines in 8 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 80.838%

Files with Coverage Reduction | New Missed Lines | %
pkg/scheduling/requirements.go | 2 | 98.01%

Totals Coverage Status
  • Change from base Build 10975568388: 0.09%
  • Covered Lines: 8450
  • Relevant Lines: 10453

💛 - Coveralls

@jigisha620 force-pushed the additional-metrics branch 5 times, most recently from 448ac20 to d8bd4a8, on September 17, 2024 21:31
@jigisha620 force-pushed the additional-metrics branch 2 times, most recently from 82fd92c to 1dbd11d, on September 18, 2024 00:40
@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the size/M label on Sep 19, 2024
@jigisha620 changed the title from "chore: Additional upstream metrics Part1" to "chore: Additional upstream metrics" on Sep 19, 2024
Outdated review threads (resolved):
  • pkg/controllers/metrics/pod/controller.go (×2)
  • pkg/controllers/node/termination/terminator/eviction.go
  • pkg/controllers/state/cluster.go (×3)
  • pkg/metrics/constants.go
  • pkg/metrics/metrics.go (×2)
@jigisha620 force-pushed the additional-metrics branch 2 times, most recently from ac04b26 to c437c76, on September 20, 2024 20:25
Outdated review thread (resolved): pkg/metrics/constants.go
@@ -1241,6 +1245,9 @@ var _ = Describe("Cluster State Sync", func() {
ExpectReconcileSucceeded(ctx, nodeClaimController, client.ObjectKeyFromObject(nodeClaim))
}
Expect(cluster.Synced(ctx)).To(BeFalse())
metric, found := FindMetricWithLabelValues("karpenter_cluster_state_unsynced_time_seconds", map[string]string{})
Member:
Is there a more rigorous check that we could put in place to make sure that this works properly? If this test passed on the previous iteration, that means that we may not have a rigorous enough test to catch regressions here.

Contributor Author (@jigisha620):
I think this would work because we also reset this metric after each test. So even if it succeeded in a previous iteration, the values would get cleaned up.
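
As a possible tightening of the assertion, the test fragment above could also check the reported value rather than only the metric's presence. This is a sketch assuming FindMetricWithLabelValues returns a *dto.Metric from the Prometheus client model, as the existing test helpers suggest:

```go
metric, found := FindMetricWithLabelValues("karpenter_cluster_state_unsynced_time_seconds", map[string]string{})
Expect(found).To(BeTrue())
// While the cluster state is known to be unsynced, the gauge should report a
// positive duration rather than merely existing.
Expect(metric.GetGauge().GetValue()).To(BeNumerically(">", 0))
```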

Outdated review threads (resolved): pkg/controllers/state/cluster.go (×3)
Subsystem: metrics.PodSubsystem,
Name: "bound_duration_seconds",
Help: "The time from pod creation until the pod is bound.",
Buckets: metrics.DurationBuckets(),
Member:
Our pod_startup_duration_seconds doesn't have any additional labels -- it's also not a histogram like it should be. At this point, we have already marked it as stable, so we can't change it from a summary metric. We can consider marking it as deprecated and replacing it with a histogram metric called karpenter_pods_start_duration_seconds.

The other option: According to the docs, the only difference between the summary metric and the histogram metric is the addition of the _bucket metric vs a metric on the basename that has the (quantile) label. That means that we could have them existing simultaneously as a histogram and summary metric -- we should look into whether this is possible to avoid the breaking change/rename.
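
If the deprecation route were taken, a rough sketch could keep the stable summary while introducing the histogram under the new name. The objectives and buckets here are illustrative assumptions, not the project's helpers; note that a summary and a histogram sharing the same base name would collide on the generated _sum/_count series if registered in one registry, which is why this sketch renames the histogram:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Existing stable metric, retained for compatibility but flagged as deprecated.
var podStartupDurationSeconds = prometheus.NewSummary(prometheus.SummaryOpts{
	Namespace:  "karpenter",
	Subsystem:  "pods",
	Name:       "startup_duration_seconds",
	Help:       "DEPRECATED: use karpenter_pods_start_duration_seconds. The time from pod creation until the pod is running.",
	Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

// Replacement histogram under a new name.
var podStartDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "karpenter",
	Subsystem: "pods",
	Name:      "start_duration_seconds",
	Help:      "The time from pod creation until the pod is running.",
	Buckets:   prometheus.ExponentialBuckets(0.1, 2, 12),
})
```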

podUnstartedTimeSeconds.With(map[string]string{
podName: pod.Name,
podNameSpace: pod.Namespace,
}).Set(time.Since(pod.CreationTimestamp.Time).Seconds())
c.pendingPods.Insert(key)
return
}
cond, ok := lo.Find(pod.Status.Conditions, func(c corev1.PodCondition) bool {
return c.Type == corev1.PodReady
})
if c.pendingPods.Has(key) && ok {
Member:
Is it correct to not emit the metric when we can't find the ready status? Or should we treat the lack of ready status here as unknown?
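
To illustrate the "treat it as unknown" alternative, a hypothetical helper (not code from this PR) could classify a missing PodReady condition explicitly instead of silently skipping the metric; it assumes the same corev1 and lo imports used in the excerpt above:

```go
// readyStatus is a hypothetical helper: it reports ConditionUnknown when the
// PodReady condition has not been populated yet, so callers can decide to keep
// emitting the metric instead of skipping the pod entirely.
func readyStatus(pod *corev1.Pod) corev1.ConditionStatus {
	cond, ok := lo.Find(pod.Status.Conditions, func(c corev1.PodCondition) bool {
		return c.Type == corev1.PodReady
	})
	if !ok {
		return corev1.ConditionUnknown
	}
	return cond.Status
}
```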

if pod.Status.Phase == phasePending {
// The podsScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
// as shown by the PodScheduled condition not being set to true, do we wish to emit the pod_current_unbound_time_seconds metric.
if ok && condScheduled.Status != corev1.ConditionTrue {
Member:
Same comment here. Is it correct to treat the lack of ready status as something that we shouldn't emit a metric on?

return c.Type == corev1.PodScheduled
})
if pod.Status.Phase == phasePending {
// The podsScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
Member:
I think we just want this metric to increase from the pod's creation timestamp. I don't think that we want to emit the difference between it being scheduled and it being bound. Ideally, we're just looking to see how long it took for Karpenter to take a pod from created to bound
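
A sketch of that suggestion, reusing the condScheduled and labels names from the excerpt; the metric variable name and the c.boundPods set (used to guarantee a single observation per pod) are assumptions, not code from the PR:

```go
// Illustrative only: once PodScheduled flips to True, record the full latency
// from pod creation to bind, rather than the gap between scheduling and binding.
if ok && condScheduled.Status == corev1.ConditionTrue && !c.boundPods.Has(key) {
	podBoundDurationSeconds.With(labels).Observe(time.Since(pod.CreationTimestamp.Time).Seconds())
	c.boundPods.Insert(key)
}
```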

@@ -133,21 +165,65 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
},
})
c.recordPodStartupMetric(pod)
c.recordPodBoundMetric(pod, labels)
Member:
We aren't ruling out pods that Karpenter doesn't think can schedule. As it stands right now, you're going to have metrics that are going to increase forever because Karpenter will never be able to schedule them. Can we build this so that we filter out pods that Karpenter thinks it has no chance at getting to?
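
For illustration, such a filter might look like the following; isProvisionable is a hypothetical predicate standing in for whatever check the project uses to decide whether a pod is one Karpenter could act on:

```go
// Hypothetical guard: only emit the bound/unstarted metrics for pods that are
// already bound or that Karpenter could plausibly schedule, so gauges for
// permanently unschedulable pods do not grow forever.
if pod.Spec.NodeName != "" || isProvisionable(pod) {
	c.recordPodStartupMetric(pod)
	c.recordPodBoundMetric(pod, labels)
}
```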

Labels: cncf-cla: yes · needs-ok-to-test · size/L
4 participants