# WATcloud Maintenance Manual

import { Callout } from 'nextra/components'

This manual outlines the maintenance procedures for various components of WATcloud.

## SLURM

To get a general overview of the health of the SLURM cluster, you can run:

```bash copy
sinfo --long
```

Example output:

```
Thu Apr 18 17:16:26 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
```

In the output above, `tr-slurm1` is in the `drained` state, which means it is not available for running jobs.
`thor-slurm1` is in the `mixed` state, which means some jobs are running on it.
All other nodes are in the `idle` state, which means there are no jobs running on them.
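
On a cluster with many nodes, it can help to filter `sinfo` by state. A quick sketch that lists only nodes with the `DRAIN` flag set (this matches both `draining` and `drained` nodes):

```bash
# Show only nodes that are drained or draining
sinfo --states=drain --long
```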

To get a detailed overview of the health of the SLURM cluster, you can run:

```bash copy
scontrol show node [NODE_NAME]
```

The optional `NODE_NAME` argument can be used to restrict the output to a specific node.

Example output:

```text {11,20}
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20]
```

In the output above, we can see that `tr-slurm1` is in the `drained` state (a.k.a. `IDLE+DRAIN`) with the reason `Performing maintenance on baremetal`.
The `Reason` field is an arbitrary user-specified string that can be set when performing actions on nodes.
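
To review the `Reason` for every unavailable node at a glance, `sinfo` can also list reasons directly:

```bash
# List the reason, user, timestamp, and node list for down/drained nodes
sinfo -R
```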

### Performing maintenance on a node

#### Putting a node into maintenance mode

To put a node into maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"
```

This will drain the node `tr-slurm1` (prevent new jobs from running on it) and set the reason to `Performing maintenance on baremetal`.
If there are no jobs running on the node, the node state becomes `drained` (a.k.a. `IDLE+DRAIN` in `scontrol`).
If there are jobs running on the node, the node state becomes `draining` (a.k.a. `MIXED+DRAIN` in `scontrol`).
In this case, SLURM will wait for the jobs to finish before transitioning the node to the `drained` state.
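
While a node is draining, you can check which jobs SLURM is still waiting on. For example, using `squeue`'s node filter:

```bash
# List jobs still running on the draining node
squeue --nodelist=tr-slurm1 --states=RUNNING
```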

Example output from when a node is in the `draining` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 draining tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

After jobs finish running on the node, the node will transition to the `drained` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
   NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
   OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
   RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
   CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=compute
   BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
   LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
   CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
   Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

Once the node is in the `drained` state, you can perform maintenance on it.
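
If the maintenance is scripted, you may want to wait until the node has fully drained before proceeding. A minimal sketch, assuming a single-partition cluster so `sinfo` prints one state line per node (the node name and polling interval are illustrative):

```bash
# Poll the node's compact state (%t) until it reports "drain" (fully drained);
# a node that is still draining reports "drng" instead
while [ "$(sinfo -h -n tr-slurm1 -o '%t')" != "drain" ]; do
  echo "Waiting for tr-slurm1 to finish draining..."
  sleep 30
done
```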

#### Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=resume
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=resume
```

This will resume the node `tr-slurm1` (allow new jobs to run on it) and clear the reason.
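
To confirm that the node is accepting jobs again, you can check its state, which should return to `idle` or `mixed`:

```bash
# Verify that the node has left the drain state
sinfo --long -n tr-slurm1
```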

<Callout type="warning">This page is under construction.</Callout>