fix: avoid nvidia-smi query if the process will get killed #943

Open · wants to merge 1 commit into base: main

Conversation

@PabloNA97 (Author) commented:

Avoid the GPU query when running in a SLURM environment without a GPU. If the node doesn't have NVIDIA drivers, this query fails and the process gets killed.

fixes #854

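For context, here is a minimal sketch of the kind of guard this PR adds. The helper names and structure below are illustrative assumptions, not the actual openfe/utils/system_probe.py code: skip the nvidia-smi call when the process is inside a SLURM job that exposes no GPU allocation variables.

```python
import os
import subprocess
from typing import Optional

# Illustrative sketch only; names are hypothetical, not openfe's API.
_SLURM_GPU_VARS = ("SLURM_JOB_GPUS", "SLURM_GPUS")


def _slurm_job_without_gpu() -> bool:
    """True when running inside a SLURM job with no GPU allocation variables set."""
    in_slurm = "SLURM_JOB_ID" in os.environ
    has_gpu = any(var in os.environ for var in _SLURM_GPU_VARS)
    return in_slurm and not has_gpu


def probe_gpu_names() -> Optional[str]:
    """Return nvidia-smi output, or None if the query is skipped."""
    if _slurm_job_without_gpu():
        # No GPU requested on this SLURM node: calling nvidia-smi here is
        # what fails and gets the process killed (#854).
        return None
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        text=True,
    )
```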
@IAlibay (Member) left a comment:


Thanks @PabloNA97 - I'm going to put a block on this because we need to discuss this a bit on our end. My initial take is that I'm not sure that this is the solution to the deeper issue of "why are we even running this if there's no GPU".

I'm also not sure I understand why SLURM behaves differently here; a special case like this shouldn't be necessary.

Ideally we should only run this if a user has requested a GPU.

Comment on lines +305 to +306
- 'SLURM_JOB_GPUS'
- 'SLURM_GPUS'
Member commented:

I'm afraid these might be very specific to some deployments of SLURM (at least I know in the clusters I've managed in the past we didn't have these env variables set).
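For anyone checking their own cluster, a tiny illustrative snippet (not part of openfe) that prints whichever SLURM GPU-related variables a given deployment actually sets inside a job:

```python
import os

# Print SLURM variables mentioning GPUs; which ones exist varies by deployment.
slurm_gpu_vars = {k: v for k, v in os.environ.items()
                  if k.startswith("SLURM") and "GPU" in k}
print(slurm_gpu_vars or "no SLURM GPU variables set in this job")
```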

@PabloNA97 (Author) commented on Sep 27, 2024:

Thanks for the feedback, appreciate it! :D
I applied these changes on my end to test with CPU. But if using CPU for production is a bad idea in itself, then it isn't worth merging into the main branch.

@PabloNA97 (Author) commented:

Not sure this is mentioned in the openfe docs - maybe it would be worth including a small note about it if it's not.


codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 12 lines in your changes missing coverage. Please review.

Project coverage is 91.53%. Comparing base (015f34e) to head (f43b56c).

Files with missing lines | Patch % | Lines
openfe/utils/system_probe.py | 33.33% | 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #943      +/-   ##
==========================================
- Coverage   94.59%   91.53%   -3.06%     
==========================================
  Files         134      134              
  Lines        9934     9952      +18     
==========================================
- Hits         9397     9110     -287     
- Misses        537      842     +305     
Flag | Coverage Δ
fast-tests | 91.53% <33.33%> (?)
slow-tests | ?

Flags with carried forward coverage won't be shown.


@mikemhenry (Contributor) commented:

I've made a comment on the issue, but just to add to this PR: the heuristic of "this is slurm so we don't want to run nvidia-smi" is a bad one, since many people are using this software on slurm systems and don't have this issue. I'd rather we just improve our try/except on running the command than try to decide whether we should run it.
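A minimal sketch of that try/except alternative, using a hypothetical helper name rather than the actual system_probe.py implementation: always attempt the query, but treat a missing binary or failing driver as "no GPU information" instead of letting the exception propagate and kill the process.

```python
import subprocess
from typing import Optional


def safe_nvidia_smi_query() -> Optional[str]:
    """Run nvidia-smi, returning None instead of raising when it fails."""
    try:
        return subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # No usable driver/binary on this node, or the call was killed:
        # report no GPU info rather than crashing the whole run.
        return None
```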

@PabloNA97 (Author) commented:

> I've made a comment on the issue, but just to add to this PR: the heuristic of "this is slurm so we don't want to run nvidia-smi" is a bad one, since many people are using this software on slurm systems and don't have this issue. I'd rather we just improve our try/except on running the command than try to decide whether we should run it.

The behaviour of the original code only changes when you (1) run on SLURM and (2) no GPU is requested. In that case I thought it didn't make sense to run nvidia-smi. But as @IAlibay explained, maybe it isn't worth handling this case after all.

Labels: none yet
Projects: none yet
Development: Successfully merging this pull request may close these issues: CalledProcessError: 9
3 participants