chore: Additional upstream metrics #1672
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jigisha620. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Hi @jigisha620. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 11003546560
💛 - Coveralls
Force-pushed from 448ac20 to d8bd4a8
Force-pushed from 82fd92c to 1dbd11d
Force-pushed from 1dbd11d to 179a217
Force-pushed from ac04b26 to c437c76
@@ -1241,6 +1245,9 @@ var _ = Describe("Cluster State Sync", func() {
	ExpectReconcileSucceeded(ctx, nodeClaimController, client.ObjectKeyFromObject(nodeClaim))
}
Expect(cluster.Synced(ctx)).To(BeFalse())
metric, found := FindMetricWithLabelValues("karpenter_cluster_state_unsynced_time_seconds", map[string]string{})
Is there a more rigorous check that we could put in place to make sure that this works properly? If this test passed on the previous iteration, that means that we may not have a rigorous enough test to catch regressions here.
I think this would work because we also reset this metric after each test, so even if it succeeded in a previous iteration, the values would get cleaned up.
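The pattern under discussion can be sketched with a stdlib-only stand-in (the `gaugeRegistry` type and its methods are hypothetical illustrations of the real prometheus client and the `FindMetricWithLabelValues` test helper): a more rigorous test asserts the metric's value, not just its presence, and resets all series between iterations so one test cannot satisfy the next.

```go
package main

import "fmt"

// gaugeRegistry is a minimal, hypothetical stand-in for a Prometheus gauge
// vector, used only to illustrate the test pattern; the real code uses
// prometheus/client_golang.
type gaugeRegistry struct {
	values map[string]float64
}

func newGaugeRegistry() *gaugeRegistry {
	return &gaugeRegistry{values: map[string]float64{}}
}

// Set records a gauge value under a label key.
func (g *gaugeRegistry) Set(labels string, v float64) { g.values[labels] = v }

// Find mirrors FindMetricWithLabelValues: it returns the value and whether
// the series exists at all.
func (g *gaugeRegistry) Find(labels string) (float64, bool) {
	v, ok := g.values[labels]
	return v, ok
}

// Reset clears every series, as a suite's AfterEach would, so a value
// observed in one iteration cannot leak into the next.
func (g *gaugeRegistry) Reset() { g.values = map[string]float64{} }

func main() {
	reg := newGaugeRegistry()
	reg.Set("", 42.5)

	// A presence-only check would pass even with a stale value; asserting
	// the value itself (e.g. unsynced time > 0) is the more rigorous check.
	if v, found := reg.Find(""); found && v > 0 {
		fmt.Println("metric present with nonzero value")
	}

	reg.Reset()
	if _, found := reg.Find(""); !found {
		fmt.Println("metric cleared after reset")
	}
}
```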
Subsystem: metrics.PodSubsystem,
Name: "bound_duration_seconds",
Help: "The time from pod creation until the pod is bound.",
Buckets: metrics.DurationBuckets(),
Our pod_startup_duration_seconds doesn't have any additional labels -- it's also not a histogram like it should be. At this point, we've already marked it as stable, so we can't change it from a summary metric. We can consider marking it as deprecated and replacing it with a histogram metric called karpenter_pods_start_duration_seconds.
The other option: according to the docs, the only difference between the summary metric and the histogram metric is the addition of the _bucket metric vs a metric on the base name that carries the quantile label. That means we could have them exist simultaneously as a histogram and a summary metric -- we should look into whether this is possible, to avoid the breaking change/rename.
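The exposition-series mechanics behind that comparison can be sketched with a stdlib-only helper (`seriesFor` is a hypothetical illustration of the Prometheus data model, not client_golang API; the metric name is the one proposed in this thread). Note that both types also expose `_sum` and `_count` series, which is the part worth checking before assuming they can coexist under one base name.

```go
package main

import "fmt"

// seriesFor sketches the exposition-format series each metric type produces
// for a given base name. Per the Prometheus data model, a histogram exposes
// <name>_bucket{le=...}, <name>_sum, and <name>_count, while a summary
// exposes <name>{quantile=...}, <name>_sum, and <name>_count.
func seriesFor(metricType, name string) []string {
	switch metricType {
	case "histogram":
		return []string{name + "_bucket", name + "_sum", name + "_count"}
	case "summary":
		return []string{name, name + "_sum", name + "_count"}
	}
	return nil
}

func main() {
	fmt.Println(seriesFor("histogram", "karpenter_pods_start_duration_seconds"))
	fmt.Println(seriesFor("summary", "karpenter_pods_start_duration_seconds"))
	// Both types claim <name>_sum and <name>_count, so a histogram and a
	// summary under the same base name would collide in a single registry --
	// one reason the deprecate-and-rename path may turn out to be safer.
}
```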
podUnstartedTimeSeconds.With(map[string]string{
	podName: pod.Name,
	podNameSpace: pod.Namespace,
}).Set(time.Since(pod.CreationTimestamp.Time).Seconds())
c.pendingPods.Insert(key)
return
}
cond, ok := lo.Find(pod.Status.Conditions, func(c corev1.PodCondition) bool {
	return c.Type == corev1.PodReady
})
if c.pendingPods.Has(key) && ok {
Is it correct to not emit the metric when we can't find the ready status? Or should we treat the lack of ready status here as unknown?
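The alternative the question points at can be sketched with a stdlib-only helper (`podCondition` and `readyStatus` are hypothetical stand-ins for `corev1.PodCondition` and the `lo.Find` lookup above, not code from the PR): instead of skipping the metric when the condition is absent, treat absence as Unknown.

```go
package main

import "fmt"

// podCondition mirrors the shape of corev1.PodCondition for illustration.
type podCondition struct {
	Type   string
	Status string
}

// readyStatus looks up the PodReady condition and, per the suggestion above,
// treats a missing condition as "Unknown" rather than dropping the metric.
// Absence of the condition is itself a signal about the pod's state.
func readyStatus(conds []podCondition) string {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status
		}
	}
	return "Unknown"
}

func main() {
	fmt.Println(readyStatus([]podCondition{{Type: "Ready", Status: "True"}}))
	fmt.Println(readyStatus(nil)) // no Ready condition reported yet
}
```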
if pod.Status.Phase == phasePending {
	// The PodScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
	// as shown by the PodScheduled condition not being set to true, do we wish to emit the pod_current_unbound_time_seconds metric.
	if ok && condScheduled.Status != corev1.ConditionTrue {
Same comment here. Is it correct to treat the lack of ready status as something that we shouldn't emit a metric on?
	return c.Type == corev1.PodScheduled
})
if pod.Status.Phase == phasePending {
	// The PodScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
I think we just want this metric to increase from the pod's creation timestamp. I don't think that we want to emit the difference between it being scheduled and it being bound. Ideally, we're just looking to see how long it took for Karpenter to take a pod from created to bound
@@ -133,21 +165,65 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
	},
})
c.recordPodStartupMetric(pod)
c.recordPodBoundMetric(pod, labels)
We aren't ruling out pods that Karpenter doesn't think can schedule. As it stands right now, you're going to have metrics that are going to increase forever because Karpenter will never be able to schedule them. Can we build this so that we filter out pods that Karpenter thinks it has no chance at getting to?
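The suggested filter can be sketched with a stdlib-only predicate (the `podView` fields and `shouldTrack` are hypothetical illustrations of the idea, not the PR's implementation or Karpenter's scheduler API): only track unbound time for pods Karpenter believes it can eventually place, so pods it has no chance at never grow the metric without bound.

```go
package main

import "fmt"

// podView is a minimal, hypothetical stand-in for the fields such a filter
// would inspect.
type podView struct {
	pending       bool // pod is still in the Pending phase
	provisionable bool // Karpenter's scheduling simulation thinks it can place the pod
}

// shouldTrack emits unbound-time metrics only for pending pods that Karpenter
// believes it can eventually schedule, filtering out pods whose metric would
// otherwise increase forever.
func shouldTrack(p podView) bool {
	return p.pending && p.provisionable
}

func main() {
	fmt.Println(shouldTrack(podView{pending: true, provisionable: true}))
	fmt.Println(shouldTrack(podView{pending: true, provisionable: false}))
}
```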
Force-pushed from c437c76 to f892132
Fixes #N/A
Description
Adds new metrics.
How was this change tested?
Added tests and verified the change on a local cluster.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.