-
Notifications
You must be signed in to change notification settings - Fork 28
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into gitlabci_updates
- Loading branch information
Showing
3 changed files
with
124 additions
and
14 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
# Multi-Instance GPU (MIG) mode | ||
|
||
MIG mode can be enabled and configured on Polaris by passing a valid configuration file to `qsub`: | ||
> qsub ... -l mig_config=/home/ME/path/to/mig_config.json ... | ||
You can find a concise explanation of MIG concepts and terms at https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#concepts | ||
|
||
## Configuration | ||
|
||
Please study the following example of a valid configuration file: | ||
> { | ||
> "group1": { | ||
> "gpus": [0,1], | ||
> "mig_enabled": true, | ||
> "instances": {"7g.40gb": ["4c.7g.40gb", "3c.7g.40gb"] } | ||
> }, | ||
> "group2": { | ||
> "gpus": [2,3], | ||
> "mig_enabled": true, | ||
> "instances": {"3g.20gb": ["2c.3g.20gb", "1c.3g.20gb"], "2g.10gb": ["2g.10gb"], "1g.5gb": ["1g.5gb"], "1g.5gb": ["1g.5gb"]} | ||
> } | ||
> } | ||
### Notes | ||
- Group names are arbitrary, but must be unique | ||
- `"gpus"` must be an array of integers. if only one physical gpu is being configured in a group, it must still be contained within an array(ex. `"gpus": [0],`) | ||
- Only groups with `mig_enabled` set to `true` will be configured | ||
- `instances` denote the MIG gpu instances and the nested compute instances you wish to be configured | ||
- syntax is `{"gpu instance 1": ["cpu instance 1", "cpu instance 2"], ...}` | ||
- valid gpu instances are `1g.5gb`, `1g.10gb`, `2g.10gb`, `3g.20gb`, `4g.20gb`, and `7g.40gb`. the first number denotes the number of slots used out of 7 total, and the second number denotes memory in GB | ||
- the default cpu instance for any gpu instance has the same identifier as the gpu instance(in which case it will be the only one configurable) | ||
- other cpu instances can be configured with the identifier syntax `Xc.Y`, where `X` is the number of slots available in that gpu instance, and `Y` is the gpu instance identifier string | ||
- some gpu instances cannot be configured adjacently, despite there being sufficient slots/memory remaining(ex. `3g.20gb` and `4g.20gb`). Please see NVIDIA MIG documentation for further details | ||
- Currently, MIG configuration is only available in the debug, debug-scaling, and preemptable queues. submissions to other queues will result in any MIG config files passed being silently ignored | ||
- Files which do not match the above syntax will be silently rejected, and any invalid configurations in properly formatted files will be silently ignored. Please test any changes to your configuration in an interactive job session before use | ||
- A basic validator script is available at `/soft/pbs/mig_conf_validate.sh`. It will check for simple errors in your config, and print the expected configuration. For example: | ||
> ascovel@polaris-login-02:~> /soft/pbs/mig_conf_validate.sh -h | ||
> usage: mig_conf_validate.sh -c CONFIG_FILE | ||
> ascovel@polaris-login-02:~> /soft/pbs/mig_conf_validate.sh -c ./polaris-mig/mig_config.json | ||
> expected MIG configuration: | ||
> GPU GPU_INST COMPUTE_INST | ||
> ------------------------------- | ||
> 0 7g.40gb 4c.7g.40gb | ||
> 0 7g.40gb 3c.7g.40gb | ||
> 1 7g.40gb 4c.7g.40gb | ||
> 1 7g.40gb 3c.7g.40gb | ||
> 2 2g.10gb 2g.10gb | ||
> 2 4g.20gb 2c.4g.20gb | ||
> 2 4g.20gb 2c.4g.20gb | ||
> 3 2g.10gb 2g.10gb | ||
> 3 4g.20gb 2c.4g.20gb | ||
> 3 4g.20gb 2c.4g.20gb | ||
> ascovel@polaris-login-02:~> | ||
## Example use of MIG compute instances | ||
|
||
The following example demonstrates the use of MIG compute instances via the `CUDA_VISIBLE_DEVICES` environment variable: | ||
> ascovel@polaris-login-02:~/polaris-mig> qsub -l mig_config=/home/ascovel/polaris-mig/mig_config.json -l select=1 -l walltime=60:00 -l filesystems=home:grand:swift -A Operations -q R639752 -k doe -I | ||
> qsub: waiting for job 640002.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov to start | ||
> qsub: job 640002.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov ready | ||
> | ||
> ascovel@x3209c0s19b0n0:~> cat ./polaris-mig/mig_config.json | ||
> { | ||
> "group1": { | ||
> "gpus": [0,1], | ||
> "mig_enabled": true, | ||
> "instances": {"7g.40gb": ["4c.7g.40gb", "3c.7g.40gb"] } | ||
> }, | ||
> "group2": { | ||
> "gpus": [2,3], | ||
> "mig_enabled": true, | ||
> "instances": {"4g.20gb": ["2c.4g.20gb", "2c.4g.20gb"], "2g.10gb": ["2g.10gb"] } | ||
> } | ||
> } | ||
> ascovel@x3209c0s19b0n0:~> nvidia-smi -L | grep -Po -e "MIG[0-9a-f\-]+" | ||
> MIG-63aa1884-acb8-5880-a586-173f6506966c | ||
> MIG-b86283ae-9953-514f-81df-99be7e0553a5 | ||
> MIG-79065f64-bdbb-53ff-89e3-9d35f270b208 | ||
> MIG-6dd56a9d-e362-567e-95b1-108afbcfc674 | ||
> MIG-76459138-79df-5d00-a11f-b0a2a747bd9e | ||
> MIG-4d5c9fb3-b0e3-50e8-a60c-233104222611 | ||
> MIG-bdfeeb2d-7a50-5e39-b3c5-767838a0b7a3 | ||
> MIG-87a2c2f3-d008-51be-b64b-6adb56deb679 | ||
> MIG-3d4cdd8c-fc36-5ce9-9676-a6e46d4a6c86 | ||
> MIG-773e8e18-f62a-5250-af1e-9343c9286ce1 | ||
> ascovel@x3209c0s19b0n0:~> for mig in $( nvidia-smi -L | grep -Po -e "MIG[0-9a-f\-]+" ) ; do CUDA_VISIBLE_DEVICES=${mig} ./saxpy & done 2>/dev/null | ||
> ascovel@x3209c0s19b0n0:~> nvidia-smi | tail -n 16 | ||
> +-----------------------------------------------------------------------------+ | ||
> | Processes: | | ||
> | GPU GI CI PID Type Process name GPU Memory | | ||
> | ID ID Usage | | ||
> |=============================================================================| | ||
> | 0 0 0 17480 C ./saxpy 8413MiB | | ||
> | 0 0 1 17481 C ./saxpy 8363MiB | | ||
> | 1 0 0 17482 C ./saxpy 8413MiB | | ||
> | 1 0 1 17483 C ./saxpy 8363MiB | | ||
> | 2 1 0 17484 C ./saxpy 8313MiB | | ||
> | 2 1 1 17485 C ./saxpy 8313MiB | | ||
> | 2 5 0 17486 C ./saxpy 8313MiB | | ||
> | 3 1 0 17487 C ./saxpy 8313MiB | | ||
> | 3 1 1 17488 C ./saxpy 8313MiB | | ||
> | 3 5 0 17489 C ./saxpy 8313MiB | | ||
> +-----------------------------------------------------------------------------+ | ||
> ascovel@x3209c0s19b0n0:~> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters