Why do duplicate metrics occur when a job is scheduled to this server? #174

WYmindsky · 2021-04-01T07:30:10Z

yaml: pod-gpu-exporter-daemonset.yaml
docker image: pod-gpu-metrics-exporter:1.0.0-alpha
dcgm: dcgm-exporter:1.4.6

Duplicate metrics appear when a job has been scheduled on this server for a long time.

JulesBelveze · 2021-04-30T09:32:39Z

Hey @WYmindsky I'm experiencing the same behaviour. Did you find out why this occurs?

WYmindsky · 2021-05-10T09:29:09Z

> Hey @WYmindsky I'm experiencing the same behaviour. Did you find out why this occurs?

It's still happening.

nikkon-dev · 2021-06-08T05:24:25Z

Hi,

Could you provide the logs from the dcgm-exporter itself?
It looks like there are two dcgm-exporter instances: one is aware of the k8s environment (it was able to connect to the pod API) and the other is not. The container_name, pod_namespace, and pod_name labels are gathered from the k8s infrastructure, so if a series lacks these labels, the dcgm-exporter failed to connect to k8s, and that failure should be reflected in the dcgm-exporter logs.

WBR,
Nik
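Nik's diagnosis can be checked against a scrape of the node's metrics endpoint. The following is a minimal sketch under several assumptions, not a tool from either project: it parses Prometheus text-format output with a simplified regex (not a complete implementation of the exposition format), groups series by metric name plus a GPU identifier label (assumed here to be `gpu`; the real label name depends on the exporter version), and flags groups where some series carry `container_name`, `pod_namespace` and `pod_name` while others do not. That mixed pattern is what the two-instance explanation above predicts. The function names and the sample metric name are hypothetical.

```python
import re
from collections import defaultdict

# Labels that, per this thread, are only filled in when the exporter reaches the k8s pod API.
POD_LABELS = ("container_name", "pod_namespace", "pod_name")

# Assumption: name of the label identifying a physical GPU; adjust to your exporter's output.
GPU_KEY = "gpu"

# Simplified matcher for one sample line: name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict) for each sample line in Prometheus text format."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_LINE.match(line)
        if m is None:
            continue  # ignore lines the simplified regex cannot parse
        labels = dict(LABEL_PAIR.findall(m.group("labels") or ""))
        yield m.group("name"), labels


def find_suspect_duplicates(text):
    """Return {(metric, gpu): [labels, ...]} for groups mixing pod-labelled and unlabelled series."""
    groups = defaultdict(list)
    for name, labels in parse_samples(text):
        groups[(name, labels.get(GPU_KEY))].append(labels)
    suspects = {}
    for key, series in groups.items():
        if len(series) < 2:
            continue  # a single series per GPU is not a duplicate
        has_pod = [all(labels.get(k) for k in POD_LABELS) for labels in series]
        if any(has_pod) and not all(has_pod):
            suspects[key] = series
    return suspects


if __name__ == "__main__":
    # Hypothetical scrape: GPU 0 is reported twice, once with pod labels and once without.
    sample = '''
# HELP dcgm_gpu_utilization GPU utilization (hypothetical sample)
dcgm_gpu_utilization{gpu="0",uuid="GPU-aaaa",container_name="train",pod_namespace="default",pod_name="job-1"} 97
dcgm_gpu_utilization{gpu="0",uuid="GPU-aaaa"} 97
dcgm_gpu_utilization{gpu="1",uuid="GPU-bbbb"} 0
'''
    for (metric, gpu), series in find_suspect_duplicates(sample).items():
        print(metric, "gpu", gpu, "has", len(series), "series, some without pod labels")
```

If a group is flagged, the next step suggested in the thread is to look for a second exporter process on that node and check its logs for a failed connection to the pod API.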
