fix: avoid nvidia-smi query if the process will get killed #943

Open · wants to merge 1 commit into base: main

Conversation

@PabloNA97 (Author) commented:

Avoid the GPU query when running in a SLURM environment without a GPU. If the node doesn't have NVIDIA drivers, this query fails and the process gets killed.

fixes #854

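For context, here is a minimal sketch of the kind of guard this PR adds. The helper names and structure below are illustrative assumptions, not the actual openfe/utils/system_probe.py code: skip the nvidia-smi call when the process is inside a SLURM job that exposes no GPU allocation variables.

```python
import os
import subprocess
from typing import Optional

# Illustrative sketch only; names are hypothetical, not openfe's API.
_SLURM_GPU_VARS = ("SLURM_JOB_GPUS", "SLURM_GPUS")


def _slurm_job_without_gpu() -> bool:
    """True when running inside a SLURM job with no GPU allocation variables set."""
    in_slurm = "SLURM_JOB_ID" in os.environ
    has_gpu = any(var in os.environ for var in _SLURM_GPU_VARS)
    return in_slurm and not has_gpu


def probe_gpu_names() -> Optional[str]:
    """Return nvidia-smi output, or None if the query is skipped."""
    if _slurm_job_without_gpu():
        # No GPU requested on this SLURM node: calling nvidia-smi here is
        # what fails and gets the process killed (#854).
        return None
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        text=True,
    )
```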
@IAlibay (Member) left a comment:


Thanks @PabloNA97 - I'm going to put a block on this because we need to discuss this a bit on our end. My initial take is that I'm not sure that this is the solution to the deeper issue of "why are we even running this if there's no GPU".

I'm also not sure I understand why SLURM behaves differently here; a special case like this shouldn't be necessary.

Ideally we should only run this if a user has requested a GPU.

Comment on lines +305 to +306
- 'SLURM_JOB_GPUS'
- 'SLURM_GPUS'
Member commented:

I'm afraid these might be very specific to some deployments of SLURM (at least I know in the clusters I've managed in the past we didn't have these env variables set).
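For anyone checking their own cluster, a tiny illustrative snippet (not part of openfe) that prints whichever SLURM GPU-related variables a given deployment actually sets inside a job:

```python
import os

# Print SLURM variables mentioning GPUs; which ones exist varies by deployment.
slurm_gpu_vars = {k: v for k, v in os.environ.items()
                  if k.startswith("SLURM") and "GPU" in k}
print(slurm_gpu_vars or "no SLURM GPU variables set in this job")
```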

@PabloNA97 (Author) commented on Sep 27, 2024:

Thanks for the feedback, appreciate it! :D
I applied these changes on my end to test with CPU. But if using CPU for production is a bad idea in itself, then it isn't worth merging into the main branch.

@PabloNA97 (Author) commented:

Not sure this is mentioned in the openfe docs - maybe it would be worth including a small note about it if it's not.


codecov bot commented Sep 27, 2024

Codecov Report

Attention: Patch coverage is 33.33333% with 12 lines in your changes missing coverage. Please review.

Project coverage is 91.53%. Comparing base (015f34e) to head (f43b56c).

Files with missing lines | Patch % | Lines
openfe/utils/system_probe.py | 33.33% | 12 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #943      +/-   ##
==========================================
- Coverage   94.59%   91.53%   -3.06%     
==========================================
  Files         134      134              
  Lines        9934     9952      +18     
==========================================
- Hits         9397     9110     -287     
- Misses        537      842     +305     
Flag | Coverage Δ
fast-tests | 91.53% <33.33%> (?)
slow-tests | ?

Flags with carried forward coverage won't be shown.


@mikemhenry (Contributor) commented:

I've made a comment on the issue, but just to add to this PR: the heuristic of "this is slurm so we don't want to run nvidia-smi" is a bad one, since many people are using this software on slurm systems and don't have this issue. I'd rather we just improve our try/except on running the command than try to decide whether we should run it.
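A minimal sketch of that try/except alternative, using a hypothetical helper name rather than the actual system_probe.py implementation: always attempt the query, but treat a missing binary or failing driver as "no GPU information" instead of letting the exception propagate and kill the process.

```python
import subprocess
from typing import Optional


def safe_nvidia_smi_query() -> Optional[str]:
    """Run nvidia-smi, returning None instead of raising when it fails."""
    try:
        return subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # No usable driver/binary on this node, or the call was killed:
        # report no GPU info rather than crashing the whole run.
        return None
```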

@PabloNA97 (Author) commented:

> I've made a comment on the issue, but just to add to this PR: the heuristic of "this is slurm so we don't want to run nvidia-smi" is a bad one, since many people are using this software on slurm systems and don't have this issue. I'd rather we just improve our try/except on running the command than try to decide whether we should run it.

The behaviour of the original code only changes when you (1) run on SLURM and (2) no GPU is requested. In that case I thought it didn't make sense to run nvidia-smi. But as @IAlibay explained, maybe it isn't worth handling this case after all.

Labels: none yet
Projects: none yet
Development: Successfully merging this pull request may close these issues: CalledProcessError: 9
3 participants