# WATcloud Maintenance Manual

import { Callout } from 'nextra/components'

This manual outlines the maintenance procedures for various components of WATcloud.

## SLURM

To get a general overview of the health of the SLURM cluster, you can run:

```bash copy
sinfo --long
```

Example output:

```text
Thu Apr 18 17:16:26 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1
```

In the output above, `tr-slurm1` is in the `drained` state, which means it is not available for running jobs.
`thor-slurm1` is in the `mixed` state, which means some jobs are running on it.
All other nodes are in the `idle` state, which means no jobs are running on them.
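
To list only nodes in a particular state (for example, all drained nodes), `sinfo` accepts a state filter (see `man sinfo` for the full list of state names):

```bash copy
sinfo --states=drained
```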

To get a detailed overview of the health of the SLURM cluster, you can run:

```bash copy
scontrol show node [NODE_NAME]
```

The optional `NODE_NAME` argument can be used to restrict the output to a specific node.

Example output:

```text {11,20}
> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29723 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-16T19:16:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:06:20]
```

In the output above, we can see that `tr-slurm1` is in the `drained` state (a.k.a. `IDLE+DRAIN`) with the reason `Performing maintenance on baremetal`.
The `Reason` field is an arbitrary user-specified string that can be set when performing actions on nodes.
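
Relatedly, to list only the nodes that currently have a reason set (i.e. nodes that are down, drained, draining, or failing), together with the user who set the reason and the timestamp, you can run:

```bash copy
sinfo -R
```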

### Performing maintenance on a node

#### Putting a node into maintenance mode

To put a node into maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=drain reason="Performing maintenance for reason X"
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=drain reason="Performing maintenance on baremetal"
```

This will drain the node `tr-slurm1` (prevent new jobs from running on it) and set the reason to `Performing maintenance on baremetal`.
If there are no jobs running on the node, the node state becomes `drained` (a.k.a. `IDLE+DRAIN` in `scontrol`).
If there are jobs running on the node, the node state becomes `draining` (a.k.a. `MIXED+DRAIN` in `scontrol`).
In this case, SLURM will wait for the jobs to finish before transitioning the node to the `drained` state.
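
To see which jobs SLURM is still waiting on, you can list the jobs currently allocated to the draining node:

```bash copy
squeue --nodelist="tr-slurm1"
```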

Example output when a node is in the `draining` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:17:35 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 draining tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=512 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=MIXED+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:15:30 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=cpu=1,mem=512M,gres/tmpdisk=300
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

After jobs finish running on the node, the node will transition to the `drained` state:

```text {4,18,27}
> sinfo --long
Thu Apr 18 17:22:07 2024
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE RESERVATION NODELIST
compute* up 1-00:00:00 1-infinite no NO all 1 drained tr-slurm1
compute* up 1-00:00:00 1-infinite no NO all 1 mixed thor-slurm1
compute* up 1-00:00:00 1-infinite no NO all 3 idle trpro-slurm[1-2],wato2-slurm1

> scontrol show node tr-slurm1
NodeName=tr-slurm1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=58 CPUTot=60 CPULoad=0.00
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:grid_p40:1(S:0),shard:grid_p40:8K(S:0),tmpdisk:100K
NodeAddr=tr-slurm1.ts.watonomous.ca NodeHostName=tr-slurm1 Version=23.11.4
OS=Linux 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024
RealMemory=39140 AllocMem=0 FreeMem=29688 Sockets=60 Boards=1
CoreSpecCount=2 CPUSpecList=58-59 MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=compute
BootTime=2024-03-17T03:32:45 SlurmdStartTime=2024-04-13T20:55:32
LastBusyTime=2024-04-18T17:21:13 ResumeAfterTime=None
CfgTRES=cpu=58,mem=39140M,billing=58,gres/gpu=1,gres/shard=8192,gres/tmpdisk=102400
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
Reason=Performing maintenance on baremetal [root@2024-04-18T17:16:01]
```

Once the node is in the `drained` state, you can perform maintenance on it.
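
If you'd rather not poll `sinfo` by hand, a minimal wait loop (a sketch; the node name and polling interval are illustrative) might look like:

```bash copy
# Poll the node's compact state (%t) until it reports "drain" (drained).
# A node that is still draining reports "drng" instead.
while [ "$(sinfo -h -n tr-slurm1 -o '%t')" != "drain" ]; do
  echo "Waiting for tr-slurm1 to finish draining..."
  sleep 30
done
echo "tr-slurm1 is drained. Safe to start maintenance."
```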

#### Taking a node out of maintenance mode

To take a node out of maintenance mode, you can run:

```bash
scontrol update nodename="<NODE_NAME>" state=resume
```

For example:

```bash copy
scontrol update nodename="tr-slurm1" state=resume
```

This will resume the node `tr-slurm1` (allow new jobs to run on it) and clear the reason.
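
To verify, you can inspect the node's state directly. Once resumed, the `State` field should no longer contain `DRAIN` and the maintenance reason should be cleared:

```bash copy
scontrol show node "tr-slurm1" | grep -E "State=|Reason="
```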

<Callout type="warning">This page is under construction.</Callout>
