Is your feature request related to a problem? Please describe.
HTCondor changed its behavior: when GPUs are available on the host, it now sets them up in the machine unless explicitly told not to. This is part of its changes to encourage explicit settings and to distinguish them from values left undefined.
Not setting a resource is different from setting it to 0.
Factory operators still expect no GPUs in the machine unless they explicitly ask for them by setting GLIDEIN_Resource_Slots.
There are multiple ways to tell HTCondor not to consider GPUs:
set CUDA_VISIBLE_DEVICES=none, so that no devices are visible
NOTE: an empty CUDA_VISIBLE_DEVICES means all devices
set request_gpus to 0
set STARTD_DETECT_GPUS to False
set Machine_resource_gpus=0 (has the side effect of turning off detection)
set gpus=0 in the SLOT_TYPE definition
-- In this case, detection still happens (detected_gpus shows the quantity and is advertised, but you are not allowed to use it)
After discussing with TJ in a meeting on 10/9, it seems that the last 2 are the preferred solutions.
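As a rough illustration of the two preferred options, the sketch below writes each one to its own HTCondor config fragment so they can be compared side by side. The file names, the temporary directory, and the use of slot type number 1 are assumptions made for the example, not taken from the glidein scripts. HTCondor macro names are case-insensitive, so MACHINE_RESOURCE_GPUS is the same knob written above as Machine_resource_gpus.

```shell
#!/bin/sh
# Sketch only: hypothetical file names and slot type number.
outdir=$(mktemp -d)

# Option A: declare zero GPUs for the whole machine.
cat > "$outdir/gpus_off_machine.conf" <<'EOF'
# No GPUs resource on this machine; per the issue, this also turns off detection.
MACHINE_RESOURCE_GPUS = 0
EOF

# Option B: give the slot type zero GPUs.
cat > "$outdir/gpus_off_slot.conf" <<'EOF'
# Per the issue, GPUs are still detected and advertised, but the slot gets none.
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, gpus=0
SLOT_TYPE_1_PARTITIONABLE = True
EOF

echo "fragments written to $outdir"
```

Option A is the one requested below for the generated glidein configuration; option B is shown only for comparison.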
Describe the solution you'd like
When GLIDEIN_Resource_Slots is not defined or does not include GPUs, set Machine_resource_gpus=0 in the configuration of the slots.
This should go in the condor config generated for the glidein (in condor_startup.sh).
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Info (please complete the following information):
Stakeholders and components can be a comma-separated list or on multiple lines.
If you add a new stakeholder or component, not on the sample list, add it on a line on its own.
Priority: high
Stakeholders: CMS, FactoryOps, OSG
Components: glidein
Additional context
NA
Some clarifications.
Not setting GPUs and setting them to 0 (GLIDEIN_Resource_Slots is not defined, or does not include GPUs, or has GPUs=0) should all behave the same way: the slot has no GPU (via Machine_resource_gpus=0). The GPU is not physically disabled or otherwise altered; it is just ignored by HTCondor and not usable by the jobs.
The HTCondor configuration is created in condor_startup.sh and that script is already parsing the attribute GLIDEIN_Resource_Slots when present.
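A minimal sketch of how that check might look is below. It is not the existing code in condor_startup.sh: the function names `gpus_requested` and `append_gpu_config` are hypothetical, and it assumes GLIDEIN_Resource_Slots is a semicolon-separated list of entries of the form name[,number[,...]], with the sample values chosen only for illustration. It also assumes that a GPUs entry with no number, or a non-numeric one, still counts as a request.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from condor_startup.sh.

# Succeeds (exit 0) if the GLIDEIN_Resource_Slots value in $1 asks for GPUs.
# Assumed format: "name[,number[,...]]" entries separated by ';'.
gpus_requested() {
    printf '%s\n' "$1" | tr ';' '\n' | awk -F',' '
        {
            name = $1; count = $2
            gsub(/[ \t]/, "", name); gsub(/[ \t]/, "", count)
            # A GPUs entry counts unless its number is an explicit zero.
            if (tolower(name) == "gpus" && count !~ /^0+$/) found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Appends MACHINE_RESOURCE_GPUS = 0 to the config file in $2 when the
# GLIDEIN_Resource_Slots value in $1 is empty, lacks GPUs, or has GPUs,0.
append_gpu_config() {
    if ! gpus_requested "$1"; then
        echo "MACHINE_RESOURCE_GPUS = 0" >> "$2"
    fi
}
```

For example, `append_gpu_config "$GLIDEIN_Resource_Slots" "$CONDOR_CONFIG"` would leave the config file unchanged for a value such as `GPUs,2` and append the line for an empty value, for `ioslot,4`, or for `GPUs,0`.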
GLIDEIN_Resource_Slots is documented in https://glideinwms.fnal.gov/doc.v3_6/factory/custom_vars.html
Here are some examples: