
chore: Additional upstream metrics #1672

Open
jigisha620 wants to merge 1 commit into main from the additional-metrics branch
Conversation

@jigisha620 (Contributor) commented Sep 17, 2024

Fixes #N/A

Description
Adds the following new metrics (a hedged sketch of how one might be registered follows the list):

  1. karpenter_nodes_drained_total
  2. karpenter_ignored_pod_count
  3. karpenter_pods_current_unbound_duration_seconds
  4. karpenter_pods_bound_duration_seconds
  5. karpenter_pods_unstarted_time_seconds
  6. karpenter_cluster_state_unsynced_time_seconds
  7. karpenter_nodes_eviction_requests_total
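
For orientation, here is a minimal sketch of how one of these metrics might be declared and registered against the controller-runtime registry. The label names and bucket boundaries below are illustrative assumptions; the PR itself uses the project's own metrics helpers (e.g. metrics.DurationBuckets()).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// podBoundDurationSeconds tracks how long a pod took to go from creation to bound.
var podBoundDurationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "karpenter",
		Subsystem: "pods",
		Name:      "bound_duration_seconds",
		Help:      "The time from pod creation until the pod is bound.",
		// Illustrative buckets; the PR uses the project's DurationBuckets() helper.
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	},
	[]string{"pod_name", "pod_namespace"},
)

func init() {
	// Register with the controller-runtime registry so the metric is exposed
	// on the same endpoint as the rest of the controller metrics.
	crmetrics.Registry.MustRegister(podBoundDurationSeconds)
}
```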

How was this change tested?
Added tests and tested on local cluster

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Sep 17, 2024
@k8s-ci-robot (Contributor) commented:
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jigisha620
Once this PR has been reviewed and has the lgtm label, please assign gjtempleton for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Sep 17, 2024
@k8s-ci-robot (Contributor) commented:
Hi @jigisha620. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/M label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Sep 17, 2024
@coveralls commented Sep 17, 2024

Pull Request Test Coverage Report for Build 11003546560

Details

  • 28 of 28 (100.0%) changed or added relevant lines in 8 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 80.838%

Files with Coverage Reduction | New Missed Lines | %
pkg/scheduling/requirements.go | 2 | 98.01%

Totals Coverage Status
  • Change from base Build 10975568388: 0.09%
  • Covered Lines: 8450
  • Relevant Lines: 10453

💛 - Coveralls

@jigisha620 force-pushed the additional-metrics branch 5 times, most recently from 448ac20 to d8bd4a8, on September 17, 2024 21:31
@jigisha620 force-pushed the additional-metrics branch 2 times, most recently from 82fd92c to 1dbd11d, on September 18, 2024 00:40
@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) and removed the size/M label on Sep 19, 2024
@jigisha620 changed the title from "chore: Additional upstream metrics Part1" to "chore: Additional upstream metrics" on Sep 19, 2024
Outdated review threads (resolved):
  • pkg/controllers/metrics/pod/controller.go (×2)
  • pkg/controllers/node/termination/terminator/eviction.go
  • pkg/controllers/state/cluster.go (×3)
  • pkg/metrics/constants.go
  • pkg/metrics/metrics.go (×2)
@jigisha620 force-pushed the additional-metrics branch 2 times, most recently from ac04b26 to c437c76, on September 20, 2024 20:25
Outdated review thread (resolved): pkg/metrics/constants.go
@@ -1241,6 +1245,9 @@ var _ = Describe("Cluster State Sync", func() {
ExpectReconcileSucceeded(ctx, nodeClaimController, client.ObjectKeyFromObject(nodeClaim))
}
Expect(cluster.Synced(ctx)).To(BeFalse())
metric, found := FindMetricWithLabelValues("karpenter_cluster_state_unsynced_time_seconds", map[string]string{})
Member:
Is there a more rigorous check that we could put in place to make sure that this works properly? If this test passed on the previous iteration, that means that we may not have a rigorous enough test to catch regressions here.

Contributor Author (@jigisha620):
I think this would work because we also reset this metric after each test. So even if it succeeded in a previous iteration, the values would get cleaned up.
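
As a possible tightening of the assertion, the test fragment above could also check the reported value rather than only the metric's presence. This is a sketch assuming FindMetricWithLabelValues returns a *dto.Metric from the Prometheus client model, as the existing test helpers suggest:

```go
metric, found := FindMetricWithLabelValues("karpenter_cluster_state_unsynced_time_seconds", map[string]string{})
Expect(found).To(BeTrue())
// While the cluster state is known to be unsynced, the gauge should report a
// positive duration rather than merely existing.
Expect(metric.GetGauge().GetValue()).To(BeNumerically(">", 0))
```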

Outdated review threads (resolved): pkg/controllers/state/cluster.go (×3)
Subsystem: metrics.PodSubsystem,
Name: "bound_duration_seconds",
Help: "The time from pod creation until the pod is bound.",
Buckets: metrics.DurationBuckets(),
Member:
Our pod_startup_duration_seconds doesn't have any additional labels -- it's also not a histogram like it should be. At this point, we have already marked it as stable, so we can't change it from a summary metric. We can consider marking it as deprecated and replacing it with a histogram metric called karpenter_pods_start_duration_seconds.

The other option: According to the docs, the only difference between the summary metric and the histogram metric is the addition of the _bucket metric vs a metric on the basename that has the (quantile) label. That means that we could have them existing simultaneously as a histogram and summary metric -- we should look into whether this is possible to avoid the breaking change/rename.
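
If the deprecation route were taken, a rough sketch could keep the stable summary while introducing the histogram under the new name. The objectives and buckets here are illustrative assumptions, not the project's helpers; note that a summary and a histogram sharing the same base name would collide on the generated _sum/_count series if registered in one registry, which is why this sketch renames the histogram:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Existing stable metric, retained for compatibility but flagged as deprecated.
var podStartupDurationSeconds = prometheus.NewSummary(prometheus.SummaryOpts{
	Namespace:  "karpenter",
	Subsystem:  "pods",
	Name:       "startup_duration_seconds",
	Help:       "DEPRECATED: use karpenter_pods_start_duration_seconds. The time from pod creation until the pod is running.",
	Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})

// Replacement histogram under a new name.
var podStartDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "karpenter",
	Subsystem: "pods",
	Name:      "start_duration_seconds",
	Help:      "The time from pod creation until the pod is running.",
	Buckets:   prometheus.ExponentialBuckets(0.1, 2, 12),
})
```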

podUnstartedTimeSeconds.With(map[string]string{
podName: pod.Name,
podNameSpace: pod.Namespace,
}).Set(time.Since(pod.CreationTimestamp.Time).Seconds())
c.pendingPods.Insert(key)
return
}
cond, ok := lo.Find(pod.Status.Conditions, func(c corev1.PodCondition) bool {
return c.Type == corev1.PodReady
})
if c.pendingPods.Has(key) && ok {
Member:
Is it correct to not emit the metric when we can't find the ready status? Or should we treat the lack of ready status here as unknown?
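
To illustrate the "treat it as unknown" alternative, a hypothetical helper (not code from this PR) could classify a missing PodReady condition explicitly instead of silently skipping the metric; it assumes the same corev1 and lo imports used in the excerpt above:

```go
// readyStatus is a hypothetical helper: it reports ConditionUnknown when the
// PodReady condition has not been populated yet, so callers can decide to keep
// emitting the metric instead of skipping the pod entirely.
func readyStatus(pod *corev1.Pod) corev1.ConditionStatus {
	cond, ok := lo.Find(pod.Status.Conditions, func(c corev1.PodCondition) bool {
		return c.Type == corev1.PodReady
	})
	if !ok {
		return corev1.ConditionUnknown
	}
	return cond.Status
}
```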

if pod.Status.Phase == phasePending {
// The podsScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
// as shown by the PodScheduled condition not being set to true, do we wish to emit the pod_current_unbound_time_seconds metric.
if ok && condScheduled.Status != corev1.ConditionTrue {
Member:
Same comment here. Is it correct to treat the lack of ready status as something that we shouldn't emit a metric on?

return c.Type == corev1.PodScheduled
})
if pod.Status.Phase == phasePending {
// The podsScheduled condition may be True or Unknown while a pod is in the pending state. Only when the pod is not bound,
Member:
I think we just want this metric to increase from the pod's creation timestamp. I don't think that we want to emit the difference between it being scheduled and it being bound. Ideally, we're just looking to see how long it took for Karpenter to take a pod from created to bound
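
A sketch of that suggestion, reusing the condScheduled and labels names from the excerpt; the metric variable name and the c.boundPods set (used to guarantee a single observation per pod) are assumptions, not code from the PR:

```go
// Illustrative only: once PodScheduled flips to True, record the full latency
// from pod creation to bind, rather than the gap between scheduling and binding.
if ok && condScheduled.Status == corev1.ConditionTrue && !c.boundPods.Has(key) {
	podBoundDurationSeconds.With(labels).Observe(time.Since(pod.CreationTimestamp.Time).Seconds())
	c.boundPods.Insert(key)
}
```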

@@ -133,21 +165,65 @@ func (c *Controller) Reconcile(ctx context.Context, req reconcile.Request) (reco
},
})
c.recordPodStartupMetric(pod)
c.recordPodBoundMetric(pod, labels)
Member:
We aren't ruling out pods that Karpenter doesn't think can schedule. As it stands right now, you're going to have metrics that are going to increase forever because Karpenter will never be able to schedule them. Can we build this so that we filter out pods that Karpenter thinks it has no chance at getting to?
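
For illustration, such a filter might look like the following; isProvisionable is a hypothetical predicate standing in for whatever check the project uses to decide whether a pod is one Karpenter could act on:

```go
// Hypothetical guard: only emit the bound/unstarted metrics for pods that are
// already bound or that Karpenter could plausibly schedule, so gauges for
// permanently unschedulable pods do not grow forever.
if pod.Spec.NodeName != "" || isProvisionable(pod) {
	c.recordPodStartupMetric(pod)
	c.recordPodBoundMetric(pod, labels)
}
```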

Labels: cncf-cla: yes · needs-ok-to-test · size/L
4 participants