GPU metrics not collected by aws-cloudwatch-metrics #1097

Open
claudio-vellage opened this issue Apr 24, 2024 · 0 comments · May be fixed by #1171
Labels
bug Something isn't working

Comments

@claudio-vellage

Describe the bug

I've set up aws-cloudwatch-metrics through the Helm chart linked here. I've also set image.tag=1.300037.0b583, because according to this link GPU metrics should be collected by default starting from 1.300034.0.

I've also manually updated the RBAC permissions to include services (#1095) and explicitly set enhancedContainerInsights.enabled=true (and fixed the documentation for this value here).

I still can't see the metrics in ContainerInsights, and I'm starting to believe that I have to add additional settings to the ConfigMap to explicitly enable GPU metrics collection. Can someone confirm this, or should GPU metrics collection work out of the box?

Steps to reproduce

Install aws-cloudwatch-metrics on an EKS cluster with GPU nodes (e.g. g5.xlarge). Check CloudWatch for GPU metrics.
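
The "check CloudWatch" step can also be scripted instead of clicking through the console. The following is only a rough sketch, not something taken from the chart or the agent: it assumes boto3 is installed, that AWS credentials and region are configured, that the metrics land in the `ContainerInsights` namespace, and that GPU metric names contain the substring `gpu`. The function names `gpu_metric_names` and `list_metric_pages` are made up for this illustration.

```python
from typing import Iterable, Iterator


def gpu_metric_names(pages: Iterable[dict], needle: str = "gpu") -> list[str]:
    """Return the sorted, de-duplicated metric names that contain `needle`.

    `pages` is expected to look like CloudWatch ListMetrics responses,
    i.e. dicts with a "Metrics" list whose items carry a "MetricName".
    Matching on the substring "gpu" is an assumption, not a documented rule.
    """
    names = set()
    for page in pages:
        for metric in page.get("Metrics", []):
            name = metric.get("MetricName", "")
            if needle in name.lower():
                names.add(name)
    return sorted(names)


def list_metric_pages(namespace: str = "ContainerInsights") -> Iterator[dict]:
    """Yield raw ListMetrics pages for one CloudWatch namespace."""
    # Imported lazily so the filter above can be used without boto3 or AWS access.
    import boto3

    paginator = boto3.client("cloudwatch").get_paginator("list_metrics")
    yield from paginator.paginate(Namespace=namespace)
```

With credentials for the affected account, `gpu_metric_names(list_metric_pages())` returning an empty list would match what I'm seeing in the console.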

Expected outcome

I'd expect the GPU metrics to show up in CloudWatch.

Environment

  • Chart name: aws-cloudwatch-metrics
  • Chart version: 0.0.11
  • Kubernetes version: 1.29.3-eks-adc7111
  • Using EKS (yes/no), if so version? Yes, 1.29.3-eks-adc7111

Additional Context:

I've successfully set up the metrics collection for GPU metrics on EC2 instances before, but it doesn't seem to work on EKS using this chart.

claudio-vellage added the bug label Apr 24, 2024