GPU Support
Starting with WMCore release 1.5.3.pre4, support for GPU workflows has been added to Request Manager for the ReReco and TaskChain workflow types. Implementation at the job description level, such that the request GPU parameters can be consumed, is planned for the coming weeks and will be made available in the next stable WMAgent release (likely in the 1.5.4 series).
This wiki is meant to provide a short overview of how GPU workflows are supported in WMCore. Most of the discussion and specification happened in GH issue #10388, which links to an even more important discussion in the CMSSW repository, cmssw#33057.
Note that these new GPU parameters have to be defined during the workflow creation process and cannot be changed at a later stage. Further discussion needs to happen to see what the use cases are and whether further flexibility needs to be provided, either during workflow assignment or workflow resubmission (ACDC).
Two new optional request parameters have been created to support GPUs in central production workflows (an illustrative example follows this list). They are:
- `RequiresGPU`: this parameter defines whether the workflow requires GPU resources or not. It is expected to be a Python string and the allowed values are:
  - `forbidden`: the workflow will not request GPU resources, thus cmsRun will not use any GPUs during the data processing (default).
  - `optional`: the use of GPU resources is optional. If there are any GPUs available in the worker node, the cmsRun process shall leverage them, otherwise only CPUs will be used.
  - `required`: GPU resources must be provisioned and the cmsRun process expects them to be available in the worker node.
- `GPUParams`: this is a set of key/value pairs defining the GPU hardware requirements, which should be used during resource provisioning and job matchmaking in the grid. It is expected to be a JSON-encoded Python dictionary (with a default value of `None`, JSON-encoded). This parameter can have up to 6 key/value pairs, where the mandatory arguments are:
  - `GPUMemoryMB`: the minimum amount of GPU memory, in MB, that must be available in the worker node. It is expected to be a Python integer greater than 0.
  - `CUDACapabilities`: a list of CUDA Compute Capabilities that must be available in the worker node. It is expected to be a list of Python strings, where each CUDA Capability is up to 100 characters and matches the regular expression `r"^\d+\.\d+(\.\d+)?$"`.
  - `CUDARuntime`: defines which CUDA Runtime (API?) version must be available in the worker node. It is expected to be a Python string, up to 100 characters and matching the regular expression `r"^\d+\.\d+(\.\d+)?$"`.
while these 3 extra optional arguments within `GPUParams` are supported as well:
- `GPUName`: the GPU name that must be available in the worker node. It is expected to be a Python string, up to 100 characters.
- `CUDADriverVersion`: defines which CUDA Driver Version must be available in the worker node. It is expected to be a Python string, up to 100 characters and matching the regular expression `r"^\d+\.\d+(\.\d+)?$"`.
- `CUDARuntimeVersion`: defines which CUDA Runtime Version must be available in the worker node. It is expected to be a Python string, up to 100 characters and matching the regular expression `r"^\d+\.\d+(\.\d+)?$"`.
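The snippet below is a minimal sketch of how these two request parameters could be filled in, assuming the parameter names described above; all values (GPU model, memory, CUDA versions) are purely illustrative and the exact request construction depends on the client used:

```python
import json

# Illustrative GPU settings (hypothetical values, key names as described above)
gpu_params = {
    # mandatory keys
    "GPUMemoryMB": 8000,                 # minimum GPU memory, in MB
    "CUDACapabilities": ["7.5", "8.0"],  # acceptable CUDA Compute Capabilities
    "CUDARuntime": "11.2",               # required CUDA Runtime version
    # optional keys
    "GPUName": "Tesla T4",
    "CUDADriverVersion": "460.32.03",
    "CUDARuntimeVersion": "11.2.152",
}

request_args = {
    "RequiresGPU": "required",            # one of: forbidden, optional, required
    "GPUParams": json.dumps(gpu_params),  # JSON-encoded dictionary
}
```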
There are two levels of parameter validation, which are enforced by Request Manager, and workflow creation will fail in case any of these input parameter validations is not successful.
The top-level validation is performed between `RequiresGPU` and `GPUParams`. Their relationship is such that, if `RequiresGPU` is set to either `optional` or `required`, then the `GPUParams` parameter must be provided at the request level. When `RequiresGPU` is set to `forbidden`, then `GPUParams` should not be provided and its default value will be assigned.
The GPU requirement parameters are provided through the `GPUParams` request parameter. It goes through data type and regular expression checks, where each key/value pair is validated according to its specification in the section above. The validation enforces the 3 mandatory arguments to be present within `GPUParams`, checking their data type and whether their content matches the regular expression; the same applies to the optional key/value pairs, in case they are provided. If any of these checks fail, workflow creation fails in Request Manager as well.
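As a rough illustration of the two validation levels described above, here is a hedged sketch; the helper name `validate_gpu_settings` is hypothetical and this is not the actual WMCore implementation:

```python
import json
import re

CUDA_VERSION_REGEX = re.compile(r"^\d+\.\d+(\.\d+)?$")
MANDATORY_KEYS = ("GPUMemoryMB", "CUDACapabilities", "CUDARuntime")


def validate_gpu_settings(requires_gpu, gpu_params_json):
    """Illustrative two-level validation of the GPU request parameters."""
    # Top-level validation: relationship between RequiresGPU and GPUParams
    if requires_gpu not in ("forbidden", "optional", "required"):
        raise ValueError("Invalid RequiresGPU value: %s" % requires_gpu)
    gpu_params = json.loads(gpu_params_json)
    if requires_gpu in ("optional", "required") and not gpu_params:
        raise ValueError("GPUParams must be provided when RequiresGPU=%s" % requires_gpu)
    if requires_gpu == "forbidden":
        return  # GPUParams keeps its default value (JSON-encoded None)

    # Second-level validation: mandatory keys, data types and regular expressions
    for key in MANDATORY_KEYS:
        if key not in gpu_params:
            raise ValueError("Missing mandatory GPUParams key: %s" % key)
    if not isinstance(gpu_params["GPUMemoryMB"], int) or gpu_params["GPUMemoryMB"] <= 0:
        raise ValueError("GPUMemoryMB must be a positive integer")
    for capability in gpu_params["CUDACapabilities"]:
        if not CUDA_VERSION_REGEX.match(capability):
            raise ValueError("Invalid CUDA capability: %s" % capability)
    if not CUDA_VERSION_REGEX.match(gpu_params["CUDARuntime"]):
        raise ValueError("Invalid CUDARuntime: %s" % gpu_params["CUDARuntime"])
```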
ReReco workflows are specified as a flat Python dictionary, so these two new parameters can be provided at the top level of the request specification, and such GPU settings will be applied to the data processing task (and skims, if defined in the workflow). A short sketch follows below.
Any of the other WMCore-specific tasks, such as Merge, LogCollect, Cleanup and Harvesting, shall remain with the default GPU parameter values (thus keeping the default `forbidden` value).
For informational purposes, this support has been implemented through PR #10799.
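Below is a hedged sketch of how the GPU settings could appear in a flat ReReco request dictionary; only the GPU-related keys are shown and the remaining request arguments are illustrative and incomplete:

```python
import json

# Fragment of a ReReco request specification (illustrative, not a complete request)
rereco_request = {
    "RequestType": "ReReco",
    # ... other standard ReReco request arguments go here ...
    "RequiresGPU": "required",
    "GPUParams": json.dumps({
        "GPUMemoryMB": 8000,
        "CUDACapabilities": ["7.5"],
        "CUDARuntime": "11.2",
    }),
}
```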
TaskChain workflows are specified in a nested Python dictionary model, such that specific tasks can have their own settings. With that said, these two new GPU parameters can now be defined at both the top level and the task level (or, to avoid confusion, only at the task level!). If GPU parameters are not defined for a given task, it will use the default values, thus `RequiresGPU=forbidden`. A short sketch follows below.
Any of the other WMCore-specific tasks, such as Merge, LogCollect, Cleanup and Harvesting, shall remain with the default GPU parameter values (thus keeping the default `forbidden` value).
For informational purposes, this support has been implemented through PR #10805.
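A minimal sketch of GPU parameters defined at the task level of a TaskChain request is shown below; the task names and all values are hypothetical, and the dictionaries are heavily trimmed:

```python
import json

# Fragment of a TaskChain request specification (illustrative only)
taskchain_request = {
    "RequestType": "TaskChain",
    "TaskChain": 2,
    "Task1": {
        "TaskName": "GenSim",        # hypothetical task name
        "RequiresGPU": "forbidden",  # this task runs on CPUs only (default behaviour)
    },
    "Task2": {
        "TaskName": "DigiReco",      # hypothetical task name
        "RequiresGPU": "required",
        "GPUParams": json.dumps({
            "GPUMemoryMB": 8000,
            "CUDACapabilities": ["7.5", "8.0"],
            "CUDARuntime": "11.2",
        }),
    },
}
```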
Requirements need to be clarified with the stakeholders
This section describes how the workflow-level GPU configuration gets reflected in the job classad description.
Note that, as agreed with the Submission Infrastructure team, we decided not to support the `optional` value at this moment; if a workflow is marked as `RequiresGPU=optional`, WMAgent will map that to GPUs not being required.
We have implemented 5 new job parameters (a minimal mapping sketch follows at the end of this section); they are:
- `RequestGPUs`: corresponds to the number of GPUs requested by the job. For this initial phase, we are hard-coding it to `1` whenever GPUs are required, otherwise it has the value `0`.
- `RequiresGPU`: will be a string with value `1` in case GPUs are required by that specific job. Otherwise it will have a `0` value, and GPUs won't be required.
- `GPUMemoryMB`: a string with the GPU memory (in MB) required by that job. It defaults to `undefined`.
- `CUDACapability`: a comma-separated string with the CUDA capabilities required by that job. It defaults to `undefined`.
- `CUDARuntime`: a string with the CUDA runtime required by that job. It defaults to `undefined`.
Once a new stable WMAgent branch is released (likely in the 1.5.4 series), these new classads will be present in every single central production job.
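For illustration only, here is a hedged sketch of the mapping between the request-level GPU settings and the job-level parameters listed above; the function name `gpu_classads_from_request` and the string `"undefined"` placeholder are assumptions and this is not the actual WMAgent code:

```python
import json


def gpu_classads_from_request(requires_gpu, gpu_params_json):
    """Illustrative mapping from request-level GPU settings to job-level parameters."""
    # As agreed with the Submission Infrastructure team, "optional" is currently
    # mapped to GPUs not being required.
    gpu_required = requires_gpu == "required"
    gpu_params = json.loads(gpu_params_json) if gpu_required else {}

    # "undefined" stands in for the ClassAd undefined value in this sketch
    return {
        "RequestGPUs": 1 if gpu_required else 0,      # hard-coded to a single GPU for now
        "RequiresGPU": "1" if gpu_required else "0",
        "GPUMemoryMB": str(gpu_params["GPUMemoryMB"]) if gpu_required else "undefined",
        "CUDACapability": ",".join(gpu_params["CUDACapabilities"]) if gpu_required else "undefined",
        "CUDARuntime": gpu_params["CUDARuntime"] if gpu_required else "undefined",
    }
```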