Merge branch 'main' into jini-ramprakash-patch-1
jini-ramprakash authored Oct 11, 2023
2 parents 92d1f67 + 4c3a15a commit ec1c20c
Showing 6 changed files with 89 additions and 18 deletions.
Binary file added docs/ai-testbed/cerebras/files/Trust_ctl.png
Binary file added docs/ai-testbed/cerebras/files/grafana_ctl.png
1 change: 1 addition & 0 deletions docs/ai-testbed/cerebras/job-queuing-and-submission.md
@@ -13,6 +13,7 @@ NAME AGE DURATION PHASE SYSTEMS USER LABEL
wsjob-thjj8zticwsylhppkbmjqe 13s 1s RUNNING cer-cs2-01 username name=unet_pt https://grafana.cerebras1.lab.alcf.anl.gov/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-thjj8zticwsylhppkbmjqe&from=1691705374000&to=now
(venv_pt) $
```
To view the Grafana dashboard for a job, follow the instructions at [Grafana WsJob Dashboard for Cerebras jobs](./miscellaneous.md#grafana-wsjob-dashboard-for-cerebras-jobs).

Jobs can be canceled as shown:

42 changes: 42 additions & 0 deletions docs/ai-testbed/cerebras/miscellaneous.md
@@ -5,6 +5,48 @@
Cerebras documentation for porting code to run on a Cerebras CS-2 system:<br>
[Ways to port your model](https://docs.cerebras.net/en/latest/wsc/port/index.html)

## Grafana WsJob Dashboard for Cerebras jobs
A Grafana dashboard supports visualizing, querying, and exploring the CS-2 system’s metrics, and provides access to system logs and traces.
See the Cerebras documentation for the [Job Information Dashboard](https://docs.cerebras.net/en/latest/wsc/getting-started/grafana.html#wsjob-dashboard).

Here is a summary (tested to work on Ubuntu and macOS):<br>

On your work machine with a web browser, e.g. your laptop,<br>
edit `/etc/hosts` using your editor of choice:
```console
sudo nano /etc/hosts
```
Add this line
```console
127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov
```
Save, and exit the editor
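
As an alternative to editing the file by hand, the same entry can be appended from the shell. The sketch below is only an illustration: `add_grafana_host` is a hypothetical helper name, not part of any Cerebras or ALCF tooling, and it assumes the target file is writable by the current user.

```shell
# Hypothetical helper: append the Grafana host mapping to a hosts file
# only if that exact line is not already present.
# The file defaults to /etc/hosts; pass another path to try it on a copy.
add_grafana_host() {
    hosts_file="${1:-/etc/hosts}"
    entry="127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov"
    # -q: quiet, -x: match the whole line, -F: treat the entry as a fixed string
    if grep -qxF "$entry" "$hosts_file" 2>/dev/null; then
        echo "entry already present in $hosts_file"
    else
        printf '%s\n' "$entry" >> "$hosts_file"
        echo "entry added to $hosts_file"
    fi
}
```

Because `/etc/hosts` is owned by root, the function would have to run in a root shell to modify it; running it against a scratch copy first (for example `add_grafana_host ./hosts.test`) is a harmless way to check the behavior.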

Download the Grafana certificate present on the Cerebras node at `/opt/cerebras/certs/grafana_tls.crt` to your local machine. To add this certificate to your browser keychain:

1. In Chrome, go to Settings->Privacy and security->Security->Manage device certificates.
2. Select System under "System Keychains" on the left-hand side of your screen. Also select the "Certificate" tab.
3. Drag and drop the downloaded certificate. Once it is added, it is visible as "lab.alcf.anl.gov".
![Grafana certificate listed in the System keychain](files/grafana_ctl.png)
4. Select the certificate, and ensure that the "Trust" section is set to "Always Trust".
![Certificate Trust section set to Always Trust](files/Trust_ctl.png)


On your work machine with a web browser, e.g. your laptop,<br>
tunnel the Grafana HTTPS port on the Cerebras Grafana host through to localhost:
```
ssh -L 8443:grafana.cerebras1.lab.alcf.anl.gov:443 [email protected]
```

Point a browser at Grafana (tested with Firefox and Chrome/Brave).<br>
Open the browser to a job's Grafana URL as shown by `csctl get jobs`, adding `:8443` to the hostname, e.g.<br>
```console
https://grafana.cerebras1.lab.alcf.anl.gov:8443/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-49b7uuojdelvtrcxu3cwbw&from=1684859330000&to=now
```
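
Adding the port by hand is easy to get wrong, so the rewrite can also be scripted. The following is a minimal sketch, assuming the tunnel listens on local port 8443 as in the `ssh -L` command above; `tunnel_url` is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper: rewrite a job URL printed by `csctl get jobs`
# so that it goes through the local SSH tunnel on port 8443.
tunnel_url() {
    # Only the scheme and hostname are matched; path and query string pass through unchanged.
    printf '%s\n' "$1" |
        sed 's#^https://grafana\.cerebras1\.lab\.alcf\.anl\.gov/#https://grafana.cerebras1.lab.alcf.anl.gov:8443/#'
}
```

Quote the URL when calling the function (for example `tunnel_url 'https://grafana.cerebras1.lab.alcf.anl.gov/d/...'`), because the `&` characters in the query string would otherwise be interpreted by the shell.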

Log in to the dashboard with user `admin` and password `prom-operator`.


<!---
## Determining the CS-2 version
@@ -6,9 +6,10 @@ Some math libraries targeting CPUs are made available as part of the `nvhpc` mod

- BLAS & LAPACK can be found in the `$NVIDIA_PATH/compilers/lib` directory.
- ScaLAPACK can be found in the `$NVIDIA_PATH/comm_libs` directory.

- GNU Scientific Library, [GSL-2.7](https://www.gnu.org/software/gsl/), is available; see `module help gsl`.
- AMD Optimizing CPU Libraries, [AOCL v4.0](https://www.amd.com/en/developer/aocl.html), are available; see `module help aocl`.
- Other Cray-based math libraries such as LibSci and FFTW are available via `module load cray-libsci` and `module load cray-fftw`.
[//]: # (ToDo: Need to test if these libraries are usable by gfortran, otherwise we need something compatible; AOCL might be solution)
[//]: # (ToDo: Add some pointers for AOCL when Abhishek gets it installed)

## NVIDIA Math Libraries for GPUs

59 changes: 43 additions & 16 deletions docs/polaris/running-jobs.md
@@ -2,32 +2,59 @@
# Running Jobs on Polaris

## <a name="Polaris-Queues"></a>Queues

***SLINGSHOT 11 Upgrade: The upgrade will take place in three phases, with each phase taking place during one of the normally scheduled maintenance periods. During this time, there will be an additional queue, `ss11`, containing the compute nodes that have been upgraded to Slingshot 11. The `prod` queue will contain the remaining Slingshot 10 nodes. The number of nodes in the `prod` queue will dwindle with each maintenance until all compute nodes have been upgraded to Slingshot 11. Once all compute nodes have been upgraded, the `prod` queue will once again have 496 nodes and the `ss11` queue will be removed.***

***ATTENTION: From October 16th through November 13th, the Polaris nodes will be upgraded in 'chunks' to Slingshot 11. This will affect the prod queue sizes. Please read about the changes to the queues below.***

*******

There are five production queues you can target in your qsub (`-q <queue name>`):

| Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
|---------------|----------|----------|----------|----------|-----------------------------------------------------------------------------|
| debug         | 1        | 2        | 5 min    | 1 hr     | max 8 nodes in use by this queue at any given time                          |
| debug-scaling | 1 | 10 | 5 min | 1 hr | max 1 job running/accruing/queued **per-user** |
| prod | 10 | 496 | 5 min | 24 hrs | Routing queue; See below |
| preemptable | 1 | 10 | 5 min | 72 hrs | max 20 jobs running/accruing/queued **per-project**; see note below |
| demand | 1 | 56 | 5 min | 1 hr | ***By request only***; max 100 jobs running/accruing/queued **per-project** |
| Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
|--------------------------------|----------|----------------------------|----------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| debug | 1 | 2 | 5 min | 1 hr | max 8 nodes in use by this queue ay any given time |
| debug-scaling | 1 | 10 | 5 min | 1 hr | max 1 job running/accruing/queued **per-user** |
| prod                           | 10       | 216-496 *see table below*  | 5 min    | 24 hrs   | Routing queue; See below                                                                                                                                                                                                                              |
| ss11 (available Oct 16-Nov 13) | 1        | 112-280 *see table below*  | 5 min    | 24 hrs   | Temporary Slingshot 11 queue for newly upgraded compute nodes; max 1 job running, and 1 job queued **per user**; ***This queue will no longer be available after Nov 13th, at which time all nodes will be upgraded and returned to the prod queue*** |
| preemptable | 1 | 10 | 5 min | 72 hrs | max 20 jobs running/accruing/queued **per-project**; see note below |
| demand | 1 | 56 | 5 min | 1 hr | ***By request only***; max 100 jobs running/accruing/queued **per-project** |

*******

***The `demand` and `preemptable` queues will be upgraded to Slingshot 11 on October 16th.***

***The `debug` and `debug-scaling` queues will remain at Slingshot 10 until Nov. 13th, at which time they will be upgraded to Slingshot 11.***

***The prod queue and Slingshot 11 (`ss11`) queue sizes will have the following max node counts during the upgrade period:***

| Number of nodes in:  | prod queue (Slingshot 10) | prod queue (Slingshot 11) | ss11 queue (Slingshot 11)  |
|----------------------|---------------------------|---------------------------|----------------------------|
| Now through Oct 16th | 496 | 0 | 0 |
| Oct 16th - Oct 30th | 384 | 0 | 112 |
| Oct 30th - Nov 13th | 216 | 0 | 280 |
| Nov 13th and onward | 0 | 496 | N/A |

***PBS "`insufficient resource`" ERROR: If you do not account for this change in maximum job size in your job submissions, your jobs could sit in the queue for up to four weeks with a comment of “`insufficient resources`”. They would run only once the maintenance ends on Nov 13th.***
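
To avoid submitting a job that cannot start, the node counts in the table above can be checked before calling `qsub`. The sketch below simply encodes that table; the function names are hypothetical, the dates are assumed to be in 2023, and a maintenance day itself is assumed to already count toward the new queue sizes.

```shell
# Hypothetical helpers encoding the upgrade-period table above.
# Usage: prod_max_nodes YYYY-MM-DD   (dates assumed to be in 2023)
prod_max_nodes() {
    day=$(printf '%s' "$1" | tr -d '-')   # 2023-10-20 -> 20231020
    if [ "$day" -lt 20231016 ]; then
        echo 496    # before the first phase: all nodes on Slingshot 10
    elif [ "$day" -lt 20231030 ]; then
        echo 384    # Oct 16th - Oct 30th
    elif [ "$day" -lt 20231113 ]; then
        echo 216    # Oct 30th - Nov 13th
    else
        echo 496    # upgrade complete: all nodes back in prod
    fi
}

# Usage: ss11_max_nodes YYYY-MM-DD   (prints 0 when the queue does not exist)
ss11_max_nodes() {
    day=$(printf '%s' "$1" | tr -d '-')
    if [ "$day" -lt 20231016 ] || [ "$day" -ge 20231113 ]; then
        echo 0
    elif [ "$day" -lt 20231030 ]; then
        echo 112
    else
        echo 280
    fi
}
```

For example, a 300-node job planned for Nov 1st would exceed both `prod_max_nodes 2023-11-01` and `ss11_max_nodes 2023-11-01`, so according to the table it would wait until the Nov 13th maintenance completes.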

******

**Note:** Jobs in the demand queue take priority over jobs in the preemptable queue.
This means jobs in the preemptable queue may be preempted (killed without any warning) if there are jobs in the demand queue.
Please use the following command to view details of a queue: ```qstat -Qf <queuename>```

`prod` is a routing queue and routes your job to one of the following six execution queues:

| Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
|-----------------|----------|----------|----------|----------|----------------------------------------|
| small | 10 | 24 | 5 min | 3 hrs ||
| medium | 25 | 99 | 5 min | 6 hrs ||
| large | 100 | 496 | 5 min | 24 hrs ||
| backfill-small | 10 | 24 | 5 min | 3 hrs | low priority, negative project balance |
| backfill-medium | 25 | 99 | 5 min | 6 hrs | low priority, negative project balance |
| backfill-large | 100 | 496 | 5 min | 24 hrs | low priority, negative project balance |
| Queue Name | Node Min | Node Max | Time Min | Time Max | Notes |
|-----------------|----------|----------------------------|----------|----------|----------------------------------------|
| small | 10 | 24 | 5 min | 3 hrs ||
| medium | 25 | 99 | 5 min | 6 hrs ||
| large           | 100      | 216-496 *see table above*  | 5 min    | 24 hrs   ||
| backfill-small | 10 | 24 | 5 min | 3 hrs | low priority, negative project balance |
| backfill-medium | 25 | 99 | 5 min | 6 hrs | low priority, negative project balance |
| backfill-large  | 100      | 216-496 *see table above*  | 5 min    | 24 hrs   | low priority, negative project balance |

- **Note 1:** You cannot submit to these queues directly, you can only submit to the routing queue "prod".
- **Note 1:** You cannot submit to these queues directly, you can only submit to the routing queue "`prod`".
- **Note 2:** All of these queues have a limit of ten (10) jobs running/accruing **per-project**
- **Note 3:** All of these queues have a limit of one hundred (100) jobs queued (not accruing score) **per-project**
- **Note 4:** As of January 2023, it is recommended to submit jobs with a maximum node count of 476-486 nodes given current rates of downed nodes (larger jobs may sit in the queue indefinitely).
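
The node and walltime ranges in the execution queue table can be read as a simple lookup. The sketch below is an illustration of that table only, not the PBS routing logic itself: `prod_execution_queue` is a hypothetical name, the 496-node upper bound ignores the temporary upgrade-period caps, and the `backfill-*` variants (used for projects with a negative balance) are not modeled.

```shell
# Hypothetical helper: report which execution queue a job submitted to
# `prod` would fit into, based only on the table above.
# Usage: prod_execution_queue NODES WALLTIME_MINUTES
prod_execution_queue() {
    nodes=$1
    minutes=$2
    if [ "$minutes" -lt 5 ]; then
        echo none                                   # below the 5 min minimum
    elif [ "$nodes" -ge 10 ] && [ "$nodes" -le 24 ] && [ "$minutes" -le 180 ]; then
        echo small                                  # 10-24 nodes, up to 3 hrs
    elif [ "$nodes" -ge 25 ] && [ "$nodes" -le 99 ] && [ "$minutes" -le 360 ]; then
        echo medium                                 # 25-99 nodes, up to 6 hrs
    elif [ "$nodes" -ge 100 ] && [ "$nodes" -le 496 ] && [ "$minutes" -le 1440 ]; then
        echo large                                  # 100-496 nodes, up to 24 hrs
    else
        echo none                                   # fits no row of the table
    fi
}
```

For instance, `prod_execution_queue 16 60` prints `small`, while `prod_execution_queue 10 300` prints `none` because a 10-node job may not request more than 3 hours according to the table.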