nvidia-container-runtime segfault #228

Open
xhejtman opened this issue Oct 30, 2023 · 2 comments
Comments

@xhejtman

Hello,

I am randomly getting the following segfault in the kernel log:

[4746547.221468] nvidia-containe[2847336]: segfault at 7f6ee8000020 ip 00007f6f24ca1cf2 sp 00007f6ef7ffed60 error 6 in libc.so.6[7f6f24c28000+195000] likely on CPU 127 (core 31, socket 1)

It is not related to a particular pod; I think it is the result of some monitoring action (liveness/readiness probes). It is hard to debug because there is no core dump file.

Version is 1.14.3.
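
Since there is no core dump, the kernel log line itself is the main piece of evidence. The following is a minimal sketch, assuming the usual x86 Linux `segfault at ... ip ... sp ... error ... in obj[base+size]` message layout, of how its fields could be decoded. The function name `decode_segfault` and the dictionary keys are hypothetical and chosen only for illustration.

```python
import re

# Assumed layout of the x86 Linux kernel segfault message; all numeric
# fields (including the error code) are treated as hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line: str) -> dict:
    """Decode one kernel segfault log line (hypothetical helper)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the mapping
        # that the kernel printed (not necessarily the offset in the file).
        "offset": ip - base,
        # x86 page-fault error code bits: bit 0 = protection violation
        # (else page not present), bit 1 = write, bit 2 = user mode.
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }


if __name__ == "__main__":
    line = (
        "[4746547.221468] nvidia-containe[2847336]: segfault at 7f6ee8000020 "
        "ip 00007f6f24ca1cf2 sp 00007f6ef7ffed60 error 6 in "
        "libc.so.6[7f6f24c28000+195000] likely on CPU 127 (core 31, socket 1)"
    )
    info = decode_segfault(line)
    print(info["object"], hex(info["offset"]), info["access"], info["mode"], info["cause"])
```

For the line above, error 6 decodes to a user-mode write to a page that was not present, with the faulting instruction inside libc. The offset is relative to the mapping the kernel printed; resolving it to a symbol would additionally need the mapping's file offset and the exact libc build, neither of which appears in the log.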

@elezar
Member

elezar commented Oct 30, 2023

@xhejtman could you describe your environment?

Is this happening sporadically, or can it be reproduced?

@xhejtman
Author

Sporadically, but quite often. It can happen that two probes run in parallel, and sometimes it segfaults.

The cluster runs Kubernetes 1.26.9 on RKE2 with GPU Operator 23.6.1, but today I also tried the NVIDIA toolkit from the latest gpu-operator.

I can test or provide logs if needed.
