Merge branch 'main' into gitlabci_updates
felker authored Oct 18, 2023
2 parents 0c0e1f9 + e39df7f commit 184984b
Showing 3 changed files with 124 additions and 14 deletions.
33 changes: 19 additions & 14 deletions docs/polaris/workflows/libensemble.md
@@ -1,6 +1,13 @@
# libEnsemble

libEnsemble is a Python toolkit for running dynamic ensembles of calculations. Users provide generator and simulator functions to express their ensembles, where the generator can steer the ensemble based on previous results. A library of example functions is available which can be modified as needed. These functions can submit external executables at any scale and in a portable way. System details are detected, and dynamic resource management is provided. libEnsemble can be used in a consistent manner on laptops, clusters, and supercomputers with minimal required dependencies.
[libEnsemble](https://libensemble.readthedocs.io/en/main/) is a Python toolkit for running dynamic ensembles of calculations.

Users provide generator and simulator functions to express their ensembles, where the generator can steer the ensemble based on previous results. These functions can portably submit external executables at any scale.

System details are detected, and dynamic resource management is provided. This includes automatically detecting, assigning, and reassigning
GPUs for ensemble members.

libEnsemble can be used in a consistent manner on laptops, clusters, and supercomputers with minimal required dependencies.
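
To make the generator/simulator model concrete, here is a minimal calling-script sketch using example functions shipped with libEnsemble; the particular choice of the `uniform_random_sample` generator, the `six_hump_camel` simulator, and the bounds below are illustrative assumptions, not Polaris-specific requirements:

```python
import numpy as np
from libensemble.libE import libE
from libensemble.gen_funcs.sampling import uniform_random_sample
from libensemble.sim_funcs.six_hump_camel import six_hump_camel
from libensemble.tools import parse_args, add_unique_random_streams

# Command-line handling (e.g. --comms local --nworkers 4) and worker setup
nworkers, is_manager, libE_specs, _ = parse_args()

sim_specs = {"sim_f": six_hump_camel,          # simulator: evaluates each point
             "in": ["x"],
             "out": [("f", float)]}

gen_specs = {"gen_f": uniform_random_sample,   # generator: proposes new points
             "out": [("x", float, (2,))],
             "user": {"gen_batch_size": 50,
                      "lb": np.array([-3.0, -2.0]),
                      "ub": np.array([3.0, 2.0])}}

persis_info = add_unique_random_streams({}, nworkers + 1)
exit_criteria = {"sim_max": 100}               # stop after 100 simulations

H, persis_info, flag = libE(sim_specs, gen_specs, exit_criteria,
                            persis_info, libE_specs=libE_specs)
if is_manager:
    print("Completed", len(H), "evaluations")
```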

## Getting libEnsemble on Polaris

@@ -22,7 +29,7 @@ See the docs for more details on using [python on Polaris](https://docs.alcf.anl
. /path/to-venv/bin/activate
```

Where /path/to-venv can be anywhere you have write access.
Where ``/path/to-venv`` can be anywhere you have write access.
For future uses just load the conda module and run the activate line.

You can also ensure you are using the latest version of libEnsemble:
@@ -35,25 +42,23 @@ See the docs for more details on using [python on Polaris](https://docs.alcf.anl

## libEnsemble examples

For a very simple example of using libEnsemble see the [Simple Sine tutorial](https://libensemble.readthedocs.io/en/main/tutorials/local_sine_tutorial.html)
For a very simple example of using libEnsemble, see the [Simple Introduction tutorial](https://libensemble.readthedocs.io/en/main/tutorials/local_sine_tutorial.html).

For an example that runs a small ensemble using a C application (offloading work to the GPU), see
[the GPU app tutorial](https://libensemble.readthedocs.io/en/main/tutorials/forces_gpu_tutorial.html). The required files for the this tutorial can be found in [this directory](https://github.com/Libensemble/libensemble/tree/develop/libensemble/tests/scaling_tests/forces). Also, see the
[video demo](https://youtu.be/Ff0dYYLQzoU).

Note that when initializing the MPIExecutor on Polaris (**run_libe_forces.py** in the example), you currently need to use the following options to pick up the correct MPI runner:

exctr = MPIExecutor(custom_info={'mpi_runner':'mpich', 'runner_name':'mpiexec'})
[the GPU app tutorial](https://libensemble.readthedocs.io/en/main/tutorials/forces_gpu_tutorial.html).
The required files for this tutorial can be found
in [this directory](https://github.com/Libensemble/libensemble/tree/main/libensemble/tests/scaling_tests/forces).
A [video demo](https://youtu.be/Ff0dYYLQzoU) is also available.
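
For orientation, a hedged sketch of a simulator function in the spirit of the forces tutorial is shown below; the `forces` app name, argument handling, and `energy` output field are modeled on that tutorial rather than copied from it, so consult the linked files for the authoritative version:

```python
import numpy as np
from libensemble.executors.executor import Executor

def run_forces(H, persis_info, sim_specs, libE_info):
    """Launch one forces run per simulation, in the style of the GPU app tutorial."""
    particles = str(int(H["x"][0][0]))   # generator-supplied input

    # The executor is created and the application registered in the calling
    # script (run_libe_forces.py in the tutorial), e.g.:
    #   exctr = MPIExecutor()
    #   exctr.register_app(full_path="/path/to/forces.x", app_name="forces")
    exctr = Executor.executor

    task = exctr.submit(app_name="forces", app_args=particles,
                        num_procs=1, num_gpus=1)   # libEnsemble assigns the GPU
    task.wait()                                    # block until the run completes

    # Parse the application's output here (the tutorial reads forces.stat from
    # task.workdir); a placeholder value is returned in this sketch.
    output = np.zeros(1, dtype=sim_specs["out"])
    output["energy"] = 0.0
    return output, persis_info
```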

## Job Submission

libEnsemble runs on the compute nodes on Polaris using either
``multi-processing`` or ``mpi4py``. The user can set the number of workers for
libEnsemble runs on the compute nodes on Polaris using either Python's
``multiprocessing`` or ``mpi4py``. The user can set the number of workers for
maximum concurrency. libEnsemble will detect the nodes available
from the PBS environment and use these for running simulations. Polaris supports
running multiple concurrent simulations on each node if desired,
running multiple concurrent simulations on each node if desired.

A simple example batch script for a libEnsemble use case that runs four workers on one node:
A simple example batch script for a libEnsemble use case that runs five workers on one node:

```shell
#!/bin/bash -l
@@ -65,7 +70,7 @@ A simple example batch script for a libEnsemble use case that runs four workers

export MPICH_GPU_SUPPORT_ENABLED=1
cd $PBS_O_WORKDIR
python run_libe_forces.py --comms local --nworkers 4
python run_libe_forces.py --comms local --nworkers 5
```

The script can be run with:
104 changes: 104 additions & 0 deletions docs/polaris/workflows/mig-compute.md
@@ -0,0 +1,104 @@
# Multi-Instance GPU (MIG) mode

MIG mode can be enabled and configured on Polaris by passing a valid configuration file to `qsub`:
> qsub ... -l mig_config=/home/ME/path/to/mig_config.json ...

You can find a concise explanation of MIG concepts and terms in the [NVIDIA MIG user guide](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#concepts).

## Configuration

Please study the following example of a valid configuration file:
> {
> "group1": {
> "gpus": [0,1],
> "mig_enabled": true,
> "instances": {"7g.40gb": ["4c.7g.40gb", "3c.7g.40gb"] }
> },
> "group2": {
> "gpus": [2,3],
> "mig_enabled": true,
> "instances": {"3g.20gb": ["2c.3g.20gb", "1c.3g.20gb"], "2g.10gb": ["2g.10gb"], "1g.5gb": ["1g.5gb"], "1g.5gb": ["1g.5gb"]}
> }
> }

### Notes
- Group names are arbitrary, but must be unique
- `"gpus"` must be an array of integers. if only one physical gpu is being configured in a group, it must still be contained within an array(ex. `"gpus": [0],`)
- Only groups with `mig_enabled` set to `true` will be configured
- `instances` denote the MIG gpu instances and the nested compute instances you wish to be configured
- syntax is `{"gpu instance 1": ["cpu instance 1", "cpu instance 2"], ...}`
- valid gpu instances are `1g.5gb`, `1g.10gb`, `2g.10gb`, `3g.20gb`, `4g.20gb`, and `7g.40gb`. the first number denotes the number of slots used out of 7 total, and the second number denotes memory in GB
- the default cpu instance for any gpu instance has the same identifier as the gpu instance(in which case it will be the only one configurable)
- other cpu instances can be configured with the identifier syntax `Xc.Y`, where `X` is the number of slots available in that gpu instance, and `Y` is the gpu instance identifier string
- some gpu instances cannot be configured adjacently, despite there being sufficient slots/memory remaining(ex. `3g.20gb` and `4g.20gb`). Please see NVIDIA MIG documentation for further details
- Currently, MIG configuration is only available in the debug, debug-scaling, and preemptable queues. submissions to other queues will result in any MIG config files passed being silently ignored
- Files which do not match the above syntax will be silently rejected, and any invalid configurations in properly formatted files will be silently ignored. Please test any changes to your configuration in an interactive job session before use
- A basic validator script is available at `/soft/pbs/mig_conf_validate.sh`. It will check for simple errors in your config, and print the expected configuration. For example:
> ascovel@polaris-login-02:~> /soft/pbs/mig_conf_validate.sh -h
> usage: mig_conf_validate.sh -c CONFIG_FILE
> ascovel@polaris-login-02:~> /soft/pbs/mig_conf_validate.sh -c ./polaris-mig/mig_config.json
> expected MIG configuration:
> GPU GPU_INST COMPUTE_INST
> -------------------------------
> 0 7g.40gb 4c.7g.40gb
> 0 7g.40gb 3c.7g.40gb
> 1 7g.40gb 4c.7g.40gb
> 1 7g.40gb 3c.7g.40gb
> 2 2g.10gb 2g.10gb
> 2 4g.20gb 2c.4g.20gb
> 2 4g.20gb 2c.4g.20gb
> 3 2g.10gb 2g.10gb
> 3 4g.20gb 2c.4g.20gb
> 3 4g.20gb 2c.4g.20gb
> ascovel@polaris-login-02:~>

## Example use of MIG compute instances

The following example demonstrates the use of MIG compute instances via the `CUDA_VISIBLE_DEVICES` environment variable:
> ascovel@polaris-login-02:~/polaris-mig> qsub -l mig_config=/home/ascovel/polaris-mig/mig_config.json -l select=1 -l walltime=60:00 -l filesystems=home:grand:swift -A Operations -q R639752 -k doe -I
> qsub: waiting for job 640002.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov to start
> qsub: job 640002.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov ready
>
> ascovel@x3209c0s19b0n0:~> cat ./polaris-mig/mig_config.json
> {
> "group1": {
> "gpus": [0,1],
> "mig_enabled": true,
> "instances": {"7g.40gb": ["4c.7g.40gb", "3c.7g.40gb"] }
> },
> "group2": {
> "gpus": [2,3],
> "mig_enabled": true,
> "instances": {"4g.20gb": ["2c.4g.20gb", "2c.4g.20gb"], "2g.10gb": ["2g.10gb"] }
> }
> }
> ascovel@x3209c0s19b0n0:~> nvidia-smi -L | grep -Po -e "MIG[0-9a-f\-]+"
> MIG-63aa1884-acb8-5880-a586-173f6506966c
> MIG-b86283ae-9953-514f-81df-99be7e0553a5
> MIG-79065f64-bdbb-53ff-89e3-9d35f270b208
> MIG-6dd56a9d-e362-567e-95b1-108afbcfc674
> MIG-76459138-79df-5d00-a11f-b0a2a747bd9e
> MIG-4d5c9fb3-b0e3-50e8-a60c-233104222611
> MIG-bdfeeb2d-7a50-5e39-b3c5-767838a0b7a3
> MIG-87a2c2f3-d008-51be-b64b-6adb56deb679
> MIG-3d4cdd8c-fc36-5ce9-9676-a6e46d4a6c86
> MIG-773e8e18-f62a-5250-af1e-9343c9286ce1
> ascovel@x3209c0s19b0n0:~> for mig in $( nvidia-smi -L | grep -Po -e "MIG[0-9a-f\-]+" ) ; do CUDA_VISIBLE_DEVICES=${mig} ./saxpy & done 2>/dev/null
> ascovel@x3209c0s19b0n0:~> nvidia-smi | tail -n 16
> +-----------------------------------------------------------------------------+
> | Processes: |
> | GPU GI CI PID Type Process name GPU Memory |
> | ID ID Usage |
> |=============================================================================|
> | 0 0 0 17480 C ./saxpy 8413MiB |
> | 0 0 1 17481 C ./saxpy 8363MiB |
> | 1 0 0 17482 C ./saxpy 8413MiB |
> | 1 0 1 17483 C ./saxpy 8363MiB |
> | 2 1 0 17484 C ./saxpy 8313MiB |
> | 2 1 1 17485 C ./saxpy 8313MiB |
> | 2 5 0 17486 C ./saxpy 8313MiB |
> | 3 1 0 17487 C ./saxpy 8313MiB |
> | 3 1 1 17488 C ./saxpy 8313MiB |
> | 3 5 0 17489 C ./saxpy 8313MiB |
> +-----------------------------------------------------------------------------+
> ascovel@x3209c0s19b0n0:~>
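
The same pattern can be driven from Python if preferred; the following sketch (an illustration only, reusing the `saxpy` binary from the transcript above) enumerates the MIG compute instances reported by `nvidia-smi -L` and pins one process to each:

```python
import os
import re
import subprocess

# Enumerate MIG compute-instance UUIDs as reported by nvidia-smi -L
# (assumes the job was submitted with a MIG config, as in the transcript above).
listing = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout
mig_uuids = re.findall(r"MIG-[0-9a-f-]+", listing)

# Launch one ./saxpy process per MIG compute instance, each restricted to its
# instance via CUDA_VISIBLE_DEVICES (./saxpy is the example binary used above).
procs = []
for uuid in mig_uuids:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
    procs.append(subprocess.Popen(["./saxpy"], env=env))

for p in procs:
    p.wait()
```
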
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -86,6 +86,7 @@ nav:
- Visualization:
- Paraview: polaris/visualization/paraview.md
- Workflows:
- Multi-Instance GPU (MIG) mode: polaris/workflows/mig-compute.md
- Balsam: polaris/workflows/balsam.md
- Parsl: polaris/workflows/parsl.md
- libEnsemble: polaris/workflows/libensemble.md
