Set GPUs explicitly to 0 when not explicitly requested #444

Open
mambelli opened this issue Oct 14, 2024 · 1 comment · May be fixed by #446
Labels: cms (CMS stakeholder), factoryops (Factory Operations stakeholder), FEATURE (for features), glidein (affected component), High (high priority), osg (OSG stakeholder)

Comments

@mambelli
Contributor

Is your feature request related to a problem? Please describe.
HTCondor changed its behavior: when GPUs are available on the host, it now sets them up in the machine unless explicitly told not to. This is part of its changes to encourage explicit settings and to distinguish them from values left undefined.
Not setting a resource is different from setting it to 0.
Factory operators still expect the machine to have no GPUs unless they ask for them explicitly by setting GLIDEIN_Resource_Slots.

There are multiple ways to tell HTCondor not to consider GPUs:

  • set CUDA_VISIBLE_DEVICES=none, so that no devices are visible
    NOTE: an empty CUDA_VISIBLE_DEVICES means all devices
  • set request_gpus to 0
  • set STARTD_DETECT_GPUS to False
  • set MACHINE_RESOURCE_GPUS=0 (this has the side effect of turning off detection)
  • set gpus=0 in the SLOT_TYPE definition
    -- In this case, detection still happens (detected_gpus shows the quantity and is advertised, but you are not allowed to use the GPUs)

After discussing this with TJ in a meeting on 10/9, it seems that the last two are the preferred solutions.

Describe the solution you'd like
When GLIDEIN_Resource_Slots is not defined or does not include GPUs, set MACHINE_RESOURCE_GPUS=0 in the configuration of the slots.
This setting should go in the HTCondor configuration generated for the glidein (in condor_startup.sh).
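
A minimal sketch of the decision described above, written in Python for readability only; the real logic would live in the bash script condor_startup.sh. The function name `gpu_override_lines` is hypothetical. The sketch assumes that entries in GLIDEIN_Resource_Slots are separated by `;`, that the first comma-separated field of each entry is the resource name, and that the name is matched case-insensitively.

```python
def gpu_override_lines(resource_slots):
    """Return extra HTCondor config lines to append for the glidein.

    resource_slots is the raw GLIDEIN_Resource_Slots string,
    or None when the attribute is not defined.
    """
    names = []
    if resource_slots:
        for entry in resource_slots.split(";"):
            # Assumption: the first comma-separated field is the resource name.
            name = entry.split(",")[0].strip().lower()
            if name:
                names.append(name)

    if "gpus" in names:
        # GPUs were requested explicitly: add nothing here and leave
        # the entry to the existing resource-slot handling.
        return []

    # GPUs were not requested: tell HTCondor explicitly to use none.
    return ["MACHINE_RESOURCE_GPUS = 0"]
```

With this sketch, an undefined attribute or a value such as `ioslot,2,disk=1GB;monitor` yields the `MACHINE_RESOURCE_GPUS = 0` line, while `GPUs,1,type=main` yields no extra line.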

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Info (please complete the following information):
Stakeholders and components can be a comma-separated list or on multiple lines.
If you add a new stakeholder or component, not on the sample list, add it on a line on its own.

  • Priority: high
  • Stakeholders: CMS, FactoryOps, OSG
  • Components: glidein

Additional context
NA

@github-actions bot added the cms, factoryops, FEATURE, glidein, High, and osg labels on Oct 14, 2024
@mambelli
Contributor Author

Some clarifications.
Not setting GPUs and setting them to 0 (GLIDEIN_Resource_Slots is not defined, does not include GPUs, or has GPUs=0) should all behave the same way: no GPU in the slot (via MACHINE_RESOURCE_GPUS=0). The GPU is not physically disabled or otherwise altered; it is just ignored by HTCondor and not usable by the jobs.
The HTCondor configuration is created in condor_startup.sh, and that script already parses the GLIDEIN_Resource_Slots attribute when present.
GLIDEIN_Resource_Slots is documented in https://glideinwms.fnal.gov/doc.v3_6/factory/custom_vars.html
Here are some examples:

<attr name="GLIDEIN_Resource_Slots" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="GPUs,1,type=main"/>
<attr name="GLIDEIN_Resource_Slots" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="True" type="string" value="ioslot,2,disk=1GB;monitor;GPUs,3,,main"/>
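
To make the rule from the clarification concrete, here is a small Python sketch that reads values like the two above and decides whether the GPUs should be zeroed. It is illustrative only; the actual parsing is done in bash by condor_startup.sh. The function names are hypothetical. The field layout (entries separated by `;`, first field the resource name, second field an optional integer count) is inferred from the examples, and the case-insensitive name match is an assumption.

```python
def parse_resource_slots(value):
    """Split a GLIDEIN_Resource_Slots string into (name, count) pairs.

    count is None when the second field is missing, empty, or not a
    plain integer (layout inferred from the examples in this issue).
    """
    resources = []
    for entry in value.split(";"):
        fields = [f.strip() for f in entry.split(",")]
        name = fields[0]
        if not name:
            continue  # skip empty entries, e.g. from an empty string
        count = None
        if len(fields) > 1 and fields[1].isdigit():
            count = int(fields[1])
        resources.append((name, count))
    return resources


def gpus_must_be_zeroed(value):
    """True when GPUs are undefined, absent, or explicitly set to 0."""
    for name, count in parse_resource_slots(value or ""):
        if name.lower() == "gpus":
            return count == 0
    return True
```

Under these assumptions, both example values request GPUs (counts 1 and 3), so `gpus_must_be_zeroed` returns `False` for them. It returns `True` for an undefined attribute, for a value without a GPUs entry, and for `GPUs,0`.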
