diff --git a/pages/docs/compute-cluster/_meta.json b/pages/docs/compute-cluster/_meta.json
index 52a7055..523dd7a 100644
--- a/pages/docs/compute-cluster/_meta.json
+++ b/pages/docs/compute-cluster/_meta.json
@@ -3,8 +3,8 @@
     "display": "hidden"
   },
   "getting-access": "Getting Access",
-  "ssh": "SSH",
   "machine-usage-guide": "Machine Usage Guide",
+  "ssh": "SSH",
   "slurm": "SLURM",
   "firewall": "Firewall",
   "quotas": "Quotas",
diff --git a/pages/docs/compute-cluster/getting-access.mdx b/pages/docs/compute-cluster/getting-access.mdx
index 0475f7e..5d14536 100644
--- a/pages/docs/compute-cluster/getting-access.mdx
+++ b/pages/docs/compute-cluster/getting-access.mdx
@@ -38,6 +38,8 @@ After your request is approved, it usually takes about 15 minutes for your acces
 into the provisioning pipeline and can work with the WATcloud team to resolve any technical issues that may arise.
 
 Once your access is provisioned, you will receive a welcome email.
+
+In the meantime, please familiarize yourself with the [Machine Usage Guide](./machine-usage-guide).
diff --git a/pages/docs/compute-cluster/machine-usage-guide.mdx b/pages/docs/compute-cluster/machine-usage-guide.mdx
index 5344bf0..83223fb 100644
--- a/pages/docs/compute-cluster/machine-usage-guide.mdx
+++ b/pages/docs/compute-cluster/machine-usage-guide.mdx
@@ -1,6 +1,25 @@
 # Machine Usage Guide
 
-This page contains information about how to use the machines in the cluster.
+This document provides an overview of the machines in the WATcloud compute cluster, including their hardware, networking, operating system, services, and software.
+It also includes guidelines for using the machines, troubleshooting instructions for common issues, and information about maintenance and outages.
+
+## Types of Machines
+
+There are two main types of machines in the cluster: [general-use machines](/machines#general-use-machines) and [SLURM compute nodes](/machines#slurm-compute-nodes).
+
+General-use machines are meant for interactive use and are shared among all users in the cluster.
+You can access them via [SSH](./ssh).
+
+SLURM compute nodes are meant for running resource-intensive jobs.
+They are managed by a popular HPC[^hpc] job scheduler called SLURM[^slurm].
+Instructions for submitting jobs to the SLURM cluster can be found [here](./slurm).
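+
+For example, typical usage of the two machine types looks roughly like this (the hostname, username, and resource values below are placeholders; see the linked pages for the actual machine names and options):
+
+```bash
+# Interactive work on a general-use machine (via SSH)
+ssh <your-username>@<general-use-machine-hostname>
+
+# A short interactive session on a SLURM compute node, requested from a login node
+srun --cpus-per-task=1 --mem=1G --time=00:30:00 --pty bash
+```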
+
+[^hpc]: [High-performance computing](https://en.wikipedia.org/wiki/High-performance_computing) (HPC) is the use of supercomputers and parallel processing
+  techniques to solve complex computational problems. Examples of HPC clusters include [Cedar](https://docs.computecanada.ca/wiki/Cedar) and
+  [Graham](https://docs.computecanada.ca/wiki/Graham).
+
+[^slurm]: Simple Linux Utility for Resource Management (SLURM) is an open-source job scheduler that allocates resources to jobs on a cluster of computers.
+  It is widely used in HPC environments. Learn more about SLURM [here](https://slurm.schedmd.com/quickstart.html).
 
 ## Hardware
 
@@ -19,8 +38,8 @@ and a cluster network (over 40Gbps or 10Gbps Ethernet). The IP address range for
 
 ## Operating System
 
-All general-use machines are virtual machines (VMs)[^hypervisor]. This setup allows us to easily manage the machines remotely
-and reduce the complexity of the bare-metal OSes.
+All general-use machines and SLURM compute nodes are virtual machines (VMs)[^hypervisor].
+This setup allows us to easily manage the machines remotely and reduce the complexity of the bare-metal OSes.
 
 [^hypervisor]: We use [Proxmox](https://www.proxmox.com/en/) as our hypervisor.
 
@@ -46,12 +65,15 @@ on all general-use machines at the `/mnt/wato-drive*` directories. You can use t
 
 ### `/mnt/scratch` Directory
 
-On some high-performance machines, we have an NVMe/SATA SSD-backed local storage pool that is mounted at the `/mnt/scratch` directory.
-These local storage pools are meant for temporary storage for long-running jobs that require fast and reliable filesystem access,
+Every general-use machine has an SSD-backed local storage pool that is mounted at the `/mnt/scratch` directory.
+These storage pools are meant for temporary storage for long-running jobs that require fast and reliable filesystem access,
 such as storing training data and model checkpoints for ML workloads.
 
 The space on `/mnt/scratch` is limited. Please make sure to clean up your files after you are done with them.
 
+An equivalent of `/mnt/scratch` is available on the SLURM compute nodes as well.
+It can be requested by following the instructions [here](./slurm#grestmpdisk).
+
 ### Docker
 
 Every general-use machine has Docker Rootless[^docker-rootless] installed. There is a per-user storage quota to ensure that everyone has
@@ -77,23 +99,21 @@ Currently, it's enabled for the WATonomous organization. If you require this fun
 
 ## Software
 
-We try to keep the general-use machines lean. We generally refrain from installing software that make sense for rootless installation or running in
+We try to keep the machines lean. We generally refrain from installing software that makes sense for rootless installation or for running in
 containerized environments.
 
-Examples of software that we install on the general-use machines include:
+Examples of software that we install:
 - Docker (rootless)
 - NVIDIA Container Toolkit
 - zsh
 - various CLI tools (e.g. `vifm`, `iperf`, `moreutils`, `jq`, `ncdu`)
 
-Examples of software that we do not install on the general-use machines include:
+Examples of software that we do not install:
 - conda (use [miniconda](https://docs.conda.io/en/latest/miniconda.html) instead)
 - ROS (use [Docker](https://hub.docker.com/_/ros) instead)
-- CUDA (use [Docker](https://hub.docker.com/r/nvidia/cuda) instead)
-- PyTorch (use [Docker](https://hub.docker.com/r/pytorch/pytorch) instead)
-- TensorFlow (use [Docker](https://hub.docker.com/r/tensorflow/tensorflow) instead)
+- CUDA (use [Docker](https://hub.docker.com/r/nvidia/cuda) instead, or [CVMFS](./slurm#cvmfs) on the SLURM compute nodes)
 
-If there is a piece of software that you think should be installed on the general-use machines, please reach out to a WATcloud team member.
+If there is a piece of software that you think should be installed on the machines, please reach out to a WATcloud team member.
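+
+As an example of the container-based approach above, CUDA and PyTorch workloads can run through Rootless Docker with the NVIDIA Container Toolkit (the image tag below is only an example):
+
+```bash
+# Check that a containerized PyTorch build can see the host GPUs
+docker run --rm --gpus all pytorch/pytorch:latest \
+  python -c "import torch; print(torch.cuda.is_available())"
+```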
 
 ### `OMP_NUM_THREADS`
 
@@ -138,13 +158,6 @@ a dashboard to monitor the health of the machines in the cluster.
 - `/mnt/wato-drive*` are large storage pools, but they are not infinite and can fill up quickly with today's large datasets.
   Please remove unneeded files from these directories.
 - Please clean up your Docker images and containers regularly. `docker system prune --all{:bash}` is your friend.
-- Resource allocation
-  - Currently, we don't have any resource allocation policies in place. However, if we notice that the cluster's resources are frequently exhausted,
-    we may need to implement a resource allocation system common in HPC[^hpc] clusters.
-
-[^hpc]: [High-performance computing](https://en.wikipedia.org/wiki/High-performance_computing) (HPC) is the use of supercomputers and parallel processing
-techniques to solve complex computational problems. Examples of HPC clusters include [Cedar](https://docs.computecanada.ca/wiki/Cedar) and
-[Graham](https://docs.computecanada.ca/wiki/Graham).
 
 ## Troubleshooting
diff --git a/pages/docs/compute-cluster/slurm.mdx b/pages/docs/compute-cluster/slurm.mdx
index 8e7d018..19aa1a7 100644
--- a/pages/docs/compute-cluster/slurm.mdx
+++ b/pages/docs/compute-cluster/slurm.mdx
@@ -5,9 +5,6 @@ It is commonly used in HPC (High Performance Computing) environments. WATcloud u
 
 [^slurm]: https://slurm.schedmd.com/
 
-
-## Quick Start
-
 import { Callout } from 'nextra/components'
 
@@ -15,6 +12,21 @@ WATcloud SLURM is currently in beta. If you encounter any issues, Please review
 or [let us know](/docs/compute-cluster/support-resources).
 
+## Terminology
+
+Before we dive into the details, let's define some common terms used in SLURM:
+
+- **Login node**: A node that users log into to submit jobs to the SLURM cluster. This is where you will interact with the SLURM cluster.
+- **Compute node**: A node that runs jobs submitted to the SLURM cluster. This is where your job will run. Compute nodes are not directly accessible by users.
+- **Job**: A unit of work submitted to the SLURM cluster. A job can be interactive or batch.
+- **Interactive job**: A job that runs interactively on a compute node. This is useful for debugging or running short tasks.
+- **Batch job**: A job that runs non-interactively on a compute node. This is useful for long-running tasks such as simulations or ML training.
+- **Job array**: A collection of jobs with similar parameters. This is useful for running parameter sweeps or other tasks that require running the same job multiple times with potentially different inputs.
+- **Resource**: A physical or logical entity that can be allocated to a job. Examples include CPUs, memory, GPUs, and temporary disk space.
+- **GRES (Generic Resource)**: A SLURM feature that allows for arbitrary resources to be allocated to jobs. Examples include GPUs and temporary disk space.
+
+## Quick Start
+
 To submit jobs to the SLURM cluster, you will need to log into one of the SLURM login nodes.
 During the beta, they are labelled `SL` in the [machine list](/machines).
 After the beta, all general-use machines will be SLURM login nodes.
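+
+As a quick illustration, a minimal batch job might look like this (the resource values below are placeholders; see the rest of this page for the options supported on the cluster):
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=hello
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=1G
+#SBATCH --time=00:05:00
+
+echo "Hello from $(hostname)"
+```
+
+Save the script as `hello.sh{:bash}`, submit it with `sbatch hello.sh{:bash}`, and check on it with `squeue --me{:bash}`.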