This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

dcgm-exporter cannot be installed successfully on 2080Ti #176

Open
ReyRen opened this issue Apr 9, 2021 · 8 comments

Comments

@ReyRen

ReyRen commented Apr 9, 2021

root@master:~# helm install --generate-name  gpu-helm-charts/dcgm-exporter --set arguments=null
NAME: dcgm-exporter-1617960354
LAST DEPLOYED: Fri Apr  9 17:25:57 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Get the application URL by running these commands:
  export POD_NAME=$(kubectl get pods -n default -l "app.kubernetes.io/name=dcgm-exporter,app.kubernetes.io/instance=dcgm-exporter-1617960354" -o jsonpath="{.items[0].metadata.name}")
  kubectl -n default port-forward $POD_NAME 8080:9400 &
  echo "Visit http://127.0.0.1:8080/metrics to use your application"
root@master:~# kubectl logs -f  dcgm-exporter-1617960354-5jxgh
time="2021-04-09T09:26:00Z" level=info msg="Starting dcgm-exporter"
time="2021-04-09T09:26:00Z" level=info msg="DCGM successfully initialized!"
time="2021-04-09T09:26:00Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-09T09:26:00Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-09T09:26:00Z" level=info msg="Pipeline starting"
time="2021-04-09T09:26:00Z" level=info msg="Starting webserver"

It doesn't work even with the --set arguments=null option.

default       dcgm-exporter-1617960354-5jxgh                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-74ddv                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-7cwq7                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-cl525                                    0/1     CrashLoopBackOff    5          2m48s
default       dcgm-exporter-1617960354-jlx66                                    0/1     CrashLoopBackOff    5          2m48s

The helm version I used is v.2.3.1

@ReyRen
Author

ReyRen commented Apr 12, 2021

hello? :)

@dbeer
Contributor

dbeer commented Apr 12, 2021

ReyRen - you say the exporter didn't start, but I see the message "Starting webserver". Is your issue that it isn't collecting DCP metrics?

@ReyRen
Author

ReyRen commented Apr 13, 2021

@dbeer, thanks for the reply.
Yes, the issue is that dcgm-exporter cannot be installed successfully with Helm, and the reason for the CrashLoopBackOff is

root@master:~# kubectl logs -f dcgm-exporter-1618278113-kvdd7
time="2021-04-13T01:41:58Z" level=info msg="Starting dcgm-exporter"
time="2021-04-13T01:41:58Z" level=info msg="DCGM successfully initialized!"
time="2021-04-13T01:41:58Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2021-04-13T01:41:58Z" level=info msg="Kubernetes metrics collection enabled!"
time="2021-04-13T01:41:58Z" level=info msg="Starting webserver"
time="2021-04-13T01:41:58Z" level=info msg="Pipeline starting"

as far as I can tell.
So, how can I work around this?
Thanks again
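
For anyone reading along, the warnings in the log above already name every metric the exporter refused to collect. A minimal sketch, assuming the warning text keeps the exact shape shown in this log; the function name `skipped_metrics` is hypothetical and not part of dcgm-exporter:

```python
import re
from typing import List, Tuple

# Matches the warning text as it appears in the log above, e.g.
#   Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
# Assumption: other exporter versions may word this message differently.
SKIP_RE = re.compile(r"Skipping line (\d+) \('([A-Z0-9_]+)'\)")


def skipped_metrics(log_text: str) -> List[Tuple[int, str]]:
    """Return (config_line_number, metric_name) for each skipped metric."""
    return [(int(num), name) for num, name in SKIP_RE.findall(log_text)]


if __name__ == "__main__":
    # Two lines copied from the log in this comment.
    sample = (
        'time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 55 '
        "('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled\"\n"
        'time="2021-04-13T01:41:58Z" level=warning msg="Skipping line 58 '
        "('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled\"\n"
    )
    for line_no, metric in skipped_metrics(sample):
        print(line_no, metric)
```

Feeding it the output of `kubectl logs` gives the list of config lines that the exporter ignored on this GPU.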

@aii-shanker-jj

I set the following in values.yaml and was able to overcome the CrashLoopBackOff error.

extraEnv:
  - name: "DCGM_EXPORTER_INTERVAL"
    value: "5000"

@dbeer
Contributor

dbeer commented Apr 27, 2021

ReyRen - can you post the error you're seeing where Helm says the exporter can't be installed? I don't see it in your previous posts.

@nikkon-dev

Hi,

DCP metrics (DCGM_FI_PROF_*) are not supported on 2080Ti cards. You need to provide a CSV configuration file without such metrics (they are present in the default CSV config file).

WBR,
Nik
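
One way to produce such a file is to filter the default config. A minimal sketch, assuming each metric line starts with the field name followed by a comma (for example `DCGM_FI_DEV_GPU_TEMP, gauge, ...`) and that comment lines start with `#`; the function name `strip_dcp_metrics` and the sample rows are hypothetical:

```python
def strip_dcp_metrics(csv_text: str, prefix: str = "DCGM_FI_PROF_") -> str:
    """Drop every line whose first comma-separated field starts with `prefix`.

    Comment lines and blank lines are kept unchanged.
    """
    kept = []
    for line in csv_text.splitlines():
        first_field = line.split(",", 1)[0].strip()
        if first_field.startswith(prefix):
            continue  # DCP (profiling) metric: not supported on this GPU
        kept.append(line)
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Illustrative rows only; the real default config has many more entries.
    sample = (
        "# Format: field name, Prometheus type, help text\n"
        "DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature.\n"
        "DCGM_FI_PROF_DRAM_ACTIVE, gauge, DRAM activity ratio.\n"
    )
    print(strip_dcp_metrics(sample), end="")
```

How the filtered file is handed to the exporter depends on the chart and image version in use, which this thread does not show.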

@Kaka1127

@nikkon-dev
Which GPUs support the DCP metrics?

Best regards.
Kaka

@nikkon-dev

@Kaka1127,

The DCP metrics are supported on datacenter-grade GPUs (formerly the Tesla brand).

WBR,
Nik
