# WATcloud Maintenance Manual

import { Callout } from 'nextra/components'

This manual outlines the maintenance procedures for various components of WATcloud.

## SLURM

To get a general overview of the health of the SLURM cluster, you can run:

```bash copy
sinfo --long
```

Example output:

```text
Thu Apr 18 17:16:26 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
```

In the output above, `tr-slurm1` is in the `drained` state, which means it is not available for running jobs.
`thor-slurm1` is in the `mixed` state, which means some jobs are running on it.
All other nodes are in the `idle` state, which means no jobs are running on them.
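
To list only nodes in a particular state (for example, all drained nodes), `sinfo` accepts a state filter (see `man sinfo` for the full list of state names):

```bash copy
sinfo --states=drained
```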

To get a detailed overview of the health of the SLURM cluster, you can run:

```bash copy
scontrol show node [NODE_NAME]
```

The optional `NODE_NAME` argument can be used to restrict the output to a specific node.

Example output:

```text {11,20}
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20]
```

In the output above, we can see that `tr-slurm1` is in the `drained` state (a.k.a. `IDLE+DRAIN`) with the reason `Performing maintenance on baremetal`.
The `Reason` field is an arbitrary user-specified string that can be set when performing actions on nodes.
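
Relatedly, to list only the nodes that currently have a reason set (i.e. nodes that are down, drained, draining, or failing), together with the user who set the reason and the timestamp, you can run:

```bash copy
sinfo -R
```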

### Performing maintenance on a node

#### Putting a node into maintenance mode

To put a node into maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"
```

This will drain the node `tr-slurm1` (prevent new jobs from running on it) and set the reason to `Performing maintenance on baremetal`.
If there are no jobs running on the node, the node state becomes `drained` (a.k.a. `IDLE+DRAIN` in `scontrol`).
If there are jobs running on the node, the node state becomes `draining` (a.k.a. `MIXED+DRAIN` in `scontrol`).
In this case, SLURM will wait for the jobs to finish before transitioning the node to the `drained` state.
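
To see which jobs SLURM is still waiting on, you can list the jobs currently allocated to the draining node:

```bash copy
squeue --nodelist="tr-slurm1"
```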

Example output when a node is in the `draining` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 draining tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

After jobs finish running on the node, the node will transition to the `drained` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

Once the node is in the `drained` state, you can perform maintenance on it.
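
If you'd rather not poll `sinfo` by hand, a minimal wait loop (a sketch; the node name and polling interval are illustrative) might look like:

```bash copy
# Poll the node's compact state (%t) until it reports "drain" (drained).
# A node that is still draining reports "drng" instead.
while [ "$(sinfo -h -n tr-slurm1 -o '%t')" != "drain" ]; do
  echo "Waiting for tr-slurm1 to finish draining..."
  sleep 30
done
echo "tr-slurm1 is drained. Safe to start maintenance."
```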

#### Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=resume
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=resume
```

This will resume the node `tr-slurm1` (allow new jobs to run on it) and clear the reason.
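
To verify, you can inspect the node's state directly. Once resumed, the `State` field should no longer contain `DRAIN` and the maintenance reason should be cleared:

```bash copy
scontrol show node "tr-slurm1" | grep -E "State=|Reason="
```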

<Callout type="warning">This page is under construction.</Callout>
