fix: avoid nvidia-smi query if the process will get killed #943

Open
wants to merge 1 commit into base: main
55 changes: 55 additions & 0 deletions openfe/utils/system_probe.py
@@ -273,6 +273,55 @@
    return socket.gethostname()


def _slurm_environment() -> bool:
    """
    Check if the current environment is managed by SLURM.
    """
    slurm_job_id = os.environ.get("SLURM_JOB_ID")

    if slurm_job_id:
        return True
    else:
        return False
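
Codecov flags the new branches as untested. A minimal pytest sketch could exercise both return paths; the test-module path is hypothetical, and monkeypatch is pytest's built-in fixture for manipulating the environment:

# Hypothetical test module (e.g. openfe/tests/utils/test_system_probe.py),
# using pytest's built-in monkeypatch fixture to control the environment.
from openfe.utils import system_probe


def test_slurm_environment_detected(monkeypatch):
    # SLURM sets SLURM_JOB_ID inside any job it manages.
    monkeypatch.setenv("SLURM_JOB_ID", "12345")
    assert system_probe._slurm_environment() is True


def test_slurm_environment_absent(monkeypatch):
    # Outside SLURM the variable is missing, so the helper returns False.
    monkeypatch.delenv("SLURM_JOB_ID", raising=False)
    assert system_probe._slurm_environment() is False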


def _check_slurm_gpu_info() -> bool:
    """
    Check if GPU information is available in the SLURM environment.

    Returns
    -------
    bool
        True if GPU information is available in the SLURM environment,
        False otherwise.

    Notes
    -----
    This function checks whether GPU information is available in the SLURM
    environment by inspecting environment variables.

    The function returns True if any of the following environment variables
    are present:
    - 'SLURM_JOB_GPUS'
    - 'SLURM_GPUS'
    - 'CUDA_VISIBLE_DEVICES'

    Otherwise, it returns False.
    """
    slurm_job_gpus = os.environ.get("SLURM_JOB_GPUS")
    slurm_gpus = os.environ.get("SLURM_GPUS")
    cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")

    logging.debug(f"SLURM_JOB_GPUS: {slurm_job_gpus}")
    logging.debug(f"SLURM_GPUS: {slurm_gpus}")
    logging.debug(f"CUDA_VISIBLE_DEVICES: {cuda_visible_devices}")

    if slurm_job_gpus or slurm_gpus or cuda_visible_devices:
        return True
    else:
        return False


Comment on lines +305 to +306

Member: I'm afraid these might be very specific to some deployments of SLURM (at least I know in the clusters I've managed in the past we didn't have these env variables set).

Author (@PabloNA97, Sep 27, 2024): Thanks for the feedback, appreciate it! :D I applied these changes on my end to test it with CPU. But if using CPU for production is a bad idea in itself, then it is not worth merging into the main branch.

Author (@PabloNA97): Not sure this is mentioned in the openfe docs - maybe it would be worth including a small comment about it if it's not.
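
One possibly more portable alternative, responding to the concern above, is to ask the SLURM controller directly instead of relying on deployment-specific environment variables. This is a hypothetical sketch, not code from this PR, and the exact TRES field names still vary across SLURM versions and site configurations:

import os
import subprocess


def _slurm_gpu_from_scontrol() -> bool:
    # Hypothetical helper: query the controller for this job's allocation
    # rather than trusting env variables that not every site exports.
    job_id = os.environ.get("SLURM_JOB_ID")
    if not job_id:
        return False
    try:
        out = subprocess.check_output(
            ["scontrol", "show", "job", job_id], text=True, timeout=5
        )
    except (OSError, subprocess.SubprocessError):
        # scontrol missing, timed out, or errored: assume no GPU info.
        return False
    # GPU allocations usually appear in TRES fields, e.g.
    # "AllocTRES=cpu=4,mem=16G,gres/gpu=1" or "TresPerNode=gres:gpu:1".
    return "gres/gpu" in out or "gres:gpu" in out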

def _get_gpu_info() -> dict[str, dict[str, str]]:
    """
    Get GPU information using the 'nvidia-smi' command-line utility.

@@ -336,6 +385,12 @@
        "utilization.memory,memory.total,driver_version,"
    )

    if _slurm_environment() and not _check_slurm_gpu_info():
        logging.debug(
            "SLURM environment detected, but GPU information is not available."
        )
        return {}

    try:
        nvidia_smi_output = subprocess.check_output(
            ["nvidia-smi", GPU_QUERY, "--format=csv"]
Codecov / codecov/patch: the added lines in openfe/utils/system_probe.py were not covered by tests.
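
Given the PR's motivation (skipping an nvidia-smi query that would get the process killed), a complementary guard is to bound the query itself so the probe degrades gracefully even when SLURM detection misses. This is a sketch under the assumption that the failure mode is a hanging or crashing nvidia-smi call, not the approach taken in this PR:

import logging
import subprocess


def _safe_nvidia_smi(gpu_query: str, timeout: float = 10.0) -> str | None:
    # Hypothetical wrapper around the call in _get_gpu_info(): returns the
    # raw CSV output, or None if nvidia-smi is absent, hangs, or fails.
    try:
        return subprocess.check_output(
            ["nvidia-smi", gpu_query, "--format=csv"],
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        logging.debug("nvidia-smi not found; assuming no NVIDIA GPU.")
    except subprocess.TimeoutExpired:
        logging.debug("nvidia-smi timed out after %s s.", timeout)
    except subprocess.CalledProcessError as exc:
        logging.debug("nvidia-smi exited with code %s.", exc.returncode)
    return None

Either way, the caller receives an empty result instead of a stalled or killed probe.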