Commit

Merge branch 'main' into libensemble-1.0
felker authored Oct 18, 2023
2 parents 5623e3d + 20d8fa8 commit 7ab17b3
Showing 83 changed files with 497 additions and 3,132 deletions.
Binary file added docs/ai-testbed/cerebras/files/Trust_ctl.png
Binary file added docs/ai-testbed/cerebras/files/grafana_ctl.png
1 change: 1 addition & 0 deletions docs/ai-testbed/cerebras/job-queuing-and-submission.md
Original file line number Diff line number Diff line change
@@ -13,6 +13,7 @@ NAME AGE DURATION PHASE SYSTEMS USER LABEL
wsjob-thjj8zticwsylhppkbmjqe 13s 1s RUNNING cer-cs2-01 username name=unet_pt https://grafana.cerebras1.lab.alcf.anl.gov/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-thjj8zticwsylhppkbmjqe&from=1691705374000&to=now
(venv_pt) $
```
To view the Grafana dashboard for a job, follow the instructions at [Grafana WsJob Dashboard for Cerebras jobs](./miscellaneous.md#grafana-wsjob-dashboard-for-cerebras-jobs).

Jobs can be canceled as shown:

42 changes: 42 additions & 0 deletions docs/ai-testbed/cerebras/miscellaneous.md
@@ -5,6 +5,48 @@
Cerebras documentation for porting code to run on a Cerebras CS-2 system:<br>
[Ways to port your model](https://docs.cerebras.net/en/latest/wsc/port/index.html)

## Grafana WsJob Dashboard for Cerebras jobs
A Grafana dashboard supports visualizing, querying, and exploring the CS-2 system's metrics, and provides access to system logs and traces.
See the Cerebras documentation for the [Job Information Dashboard](https://docs.cerebras.net/en/latest/wsc/getting-started/grafana.html#wsjob-dashboard).

Here is a summary of the steps (tested on Ubuntu and macOS).<br>

On your work machine with a web browser (e.g., your laptop),<br>
edit `/etc/hosts` using your editor of choice:
```console
sudo nano /etc/hosts
```
Add this line:
```console
127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov
```
Save, and exit the editor.
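The edit above can also be scripted. Here is a hedged sketch that dry-runs the change on a copy of `/etc/hosts` (the file name `hosts.copy` is arbitrary) and appends the entry only if it is not already present; once satisfied, apply the same guarded append to the real file with `sudo`:

```shell
# Dry-run sketch on a copy of /etc/hosts
ENTRY='127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov'
cp /etc/hosts hosts.copy
# Append only if the exact entry is missing (keeps the edit idempotent)
grep -qF "$ENTRY" hosts.copy || printf '%s\n' "$ENTRY" >> hosts.copy
tail -n 1 hosts.copy
```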

Download the Grafana certificate from the Cerebras node at `/opt/cerebras/certs/grafana_tls.crt` to your local machine. To add this certificate to your browser keychain:

1. In Chrome, go to Settings->Privacy and security->Security->Manage device certificates.
2. Select "System" under "System Keychains" on the left-hand side of the screen, and select the "Certificates" tab.
3. Drag and drop the downloaded certificate. Once it is added, it is visible as "lab.alcf.anl.gov".
![Grafana certificate added to the system keychain](files/grafana_ctl.png)
4. Select the certificate, and ensure that the "Trust" section is set to "Always Trust".
![Certificate trust settings showing Always Trust](files/Trust_ctl.png)


On your work machine with a web browser (e.g., your laptop),<br>
tunnel the Grafana HTTPS port on the Cerebras Grafana host through to localhost:
```console
ssh -L 8443:grafana.cerebras1.lab.alcf.anl.gov:443 [email protected]
```

Point a browser at Grafana (tested with Firefox and Chrome/Brave).<br>
Open the browser to a job's Grafana URL as shown by `csctl get jobs`, adding `:8443` to the hostname, e.g.<br>
```console
https://grafana.cerebras1.lab.alcf.anl.gov:8443/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-49b7uuojdelvtrcxu3cwbw&from=1684859330000&to=now
```
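Inserting `:8443` by hand is error-prone; as a sketch, the rewrite is a pure text substitution on the URL printed by `csctl get jobs` (the job hash below is just the example from above):

```shell
# Insert the tunneled port after the Grafana hostname
url='https://grafana.cerebras1.lab.alcf.anl.gov/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-49b7uuojdelvtrcxu3cwbw&from=1684859330000&to=now'
echo "$url" | sed 's|lab.alcf.anl.gov/|lab.alcf.anl.gov:8443/|'
```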

Log in to the dashboard with user `admin` and password `prom-operator`.


<!---
## Determining the CS-2 version
2 changes: 1 addition & 1 deletion docs/ai-testbed/cerebras/tunneling-and-forwarding-ports.md
@@ -2,4 +2,4 @@

<!--[TODO a Cerebras-specific example.-->
See ALCF's [Jupyter Instructions](https://github.com/argonne-lcf/ThetaGPU-Docs/blob/master/doc_staging/jupyter.md), and
[Tunneling and forwarding ports](../sambanova_gen2/tunneling-and-forwarding-ports.md). The Cerebras login nodes are direct login; tunneling and port forwarding do not involve jump hosts.
[Tunneling and forwarding ports](../sambanova/tunneling-and-forwarding-ports.md). The Cerebras login nodes are direct login; tunneling and port forwarding do not involve jump hosts.
8 changes: 4 additions & 4 deletions docs/ai-testbed/getting-started.md
@@ -13,22 +13,22 @@ The AI accelerators complement the ALCF's current and next-generation supercomputers
The platforms are equipped with architectural features that support AI and data-centric workloads, making them well suited for research tasks involving the growing deluge of scientific data produced by powerful tools, such as supercomputers, light sources, telescopes, particle accelerators, and sensors. In addition, the testbed will allow researchers to explore novel workflows that combine AI methods with simulation and experimental science to accelerate the pace of discovery.

## How to Get Access
Researchers interested in using the AI Testbed’s `Cerebras CS-2`, `SambaNova DataScale SN30` and `Graphcore Bow Pod64` platforms can now submit project proposals via the [ALCF’s Director’s Discretionary program](https://www.alcf.anl.gov/science/directors-discretionary-allocation-program). Access to additional testbed resources, including `Groq`, and `Habana` accelerators, will be announced at a later date.
Researchers interested in using the AI Testbed’s `Cerebras CS-2`, `SambaNova DataScale SN30`, `Graphcore Bow Pod64` and `GroqRack` platforms can now submit project proposals via the [ALCF’s Director’s Discretionary program](https://www.alcf.anl.gov/science/directors-discretionary-allocation-program). Access to additional testbed resources, including `Habana` accelerators, will be announced at a later date.

Submit your proposal requests at: [Allocation Request Page](https://accounts.alcf.anl.gov/allocationRequests){:target="_blank"}

## Getting Started
1. Request a Director's Discretionary project on SambaNova/Cerebras/Graphcore.
1. Request a Director's Discretionary project on SambaNova/Cerebras/Graphcore/Groq.

2. Apply for an ALCF account after the project request is approved. Choose the SambaNova/Cerebras/Graphcore project that your PI has created at ALCF. If you have an active ALCF account, request to [join the project](https://accounts.alcf.anl.gov/joinProject){:target="_blank"} after your project is approved.
2. Apply for an ALCF account after the project request is approved. Choose the SambaNova/Cerebras/Graphcore/Groq project that your PI has created at ALCF. If you have an active ALCF account, request to [join the project](https://accounts.alcf.anl.gov/joinProject){:target="_blank"} after your project is approved.

3. Transfer data to ALCF using Globus after your account has been created.

a. The endpoint for your data in ALCF is ``` alcf#ai_testbed_projects ``` with the path to your project being ``` /<project name> ```.

b. The endpoint for your home directory on the AI Testbeds in ALCF is ``` alcf#ai_testbed_home ```.

4. Add/invite team members to your ALCF project on SambaNova/Cerebras/Graphcore.
4. Add/invite team members to your ALCF project on SambaNova/Cerebras/Graphcore/Groq.

## How to Contribute to Documentation
The documentation is based on [MkDocs](https://www.mkdocs.org/){:target="_blank"} and source files are
41 changes: 41 additions & 0 deletions docs/ai-testbed/groq/getting-started.md
@@ -0,0 +1,41 @@
# Getting Started

## Allocations

If you do not already have an allocation, you will need to request one here:
[Discretionary Allocation Request (New & Renewal)](https://accounts.alcf.anl.gov/#/allocationRequests)

## Accounts

If you do not have an ALCF account (but have an allocation), request one here: [ALCF Account and Project Management](https://accounts.alcf.anl.gov/#/home)

## Setup

Connection to a GroqRack node is a two-step process.

The first step is to ssh from a local machine to a login node.
The second, optional step is to ssh from a login node to a GroqRack node. Jobs may also be started and tracked from login nodes.

![GroqRack System View](files/groqrack_system_diagram.png "GroqRack System View")

### Log in to a login node

Connect to a groq login node, editing this command line to use your ALCF user id. You will be prompted for a password; use the 8-digit code provided by MobilePASS+.
```bash
ssh [email protected]
```
This randomly selects one of the login nodes, namely `groq-login-01.ai.alcf.anl.gov` or `groq-login-02.ai.alcf.anl.gov`. You can alternatively ssh directly to a specific login node.
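If you connect often, an `~/.ssh/config` entry saves retyping; this fragment is a sketch (the `groq` alias is arbitrary, and `ALCFUserID` is a placeholder for your ALCF user id):

```console
Host groq
    HostName groq.ai.alcf.anl.gov
    User ALCFUserID
```

With this in place, `ssh groq` is equivalent to the full command above.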


### Log in to a GroqRack node

Once you are on a login node, optionally ssh to one of the GroqRack nodes, which are numbered 1-9.

```bash
ssh groq-r01-gn-01.ai.alcf.anl.gov
# or
ssh groq-r01-gn-09.ai.alcf.anl.gov
# or any node with hostname of form groq-r01-gn-0[1-9].ai.alcf.anl.gov
```
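The node names differ only in their final digit, so a hypothetical helper can build a valid hostname from a node number (this is plain shell, not a Groq tool):

```shell
# Build the hostname for GroqRack node N (1-9)
node=5   # pick any node number from 1 to 9
printf 'groq-r01-gn-0%d.ai.alcf.anl.gov\n' "$node"
```

The printed name can then be passed to `ssh` as in the examples above.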


18 changes: 18 additions & 0 deletions docs/ai-testbed/groq/job-queuing-and-submission.md
@@ -0,0 +1,18 @@
# Job Queueing and Submission

Groq jobs in the AI Testbed's GroqRack are managed by the PBS job scheduler.<br>
Overview: [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System)<br>
For additional information, see
[https://docs.alcf.anl.gov/running-jobs/job-and-queue-scheduling/](https://docs.alcf.anl.gov/running-jobs/job-and-queue-scheduling/)<br>
Man pages are available. These are the key commands:
```console
# qsub - to submit a batch job using a script
man qsub
# qstat - to display queue information
man qstat
# qdel - to delete (cancel) a job:
man qdel
# qhold - to hold a job
man qhold
```
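As a minimal sketch of how these commands fit together, the script below uses standard PBS directives (the job name, walltime, and script name are placeholder choices, not site defaults):

```shell
# Write a minimal PBS batch script; the #PBS lines are scheduler directives
cat > hello_job.sh <<'EOF'
#!/bin/bash
#PBS -N hello_groq
#PBS -l walltime=00:10:00
cd "$PBS_O_WORKDIR"
echo "Hello from $(hostname)"
EOF
# Typical lifecycle (run these on a login or GroqRack node):
# qsub hello_job.sh   # submit; prints the job id
# qstat               # check queue status
# qdel <job-id>       # cancel if needed
```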

116 changes: 116 additions & 0 deletions docs/ai-testbed/groq/running-a-model-or-program.md
@@ -0,0 +1,116 @@
# Running a Model/Program

Jobs are launched from any GroqRack node, or from login nodes. <br>
For long-running jobs, or if you expect to lose your internet connection for any reason, we suggest logging into a specific node and using either **screen** or **tmux** to create a persistent command-line session. For details use:

```bash
man screen
# or
man tmux
```
or online man pages: [screen](https://manpages.ubuntu.com/manpages/jammy/en/man1/screen.1.html), [tmux](https://manpages.ubuntu.com/manpages/jammy/en/man1/tmux.1.html)

## Running jobs on Groq nodes

### GroqFlow

GroqFlow is the simplest way to port inference applications to Groq. The GroqFlow GitHub repository includes many sample applications.</br>
See [GroqFlow](https://github.com/groq/groqflow/tree/main).

### Clone the GroqFlow GitHub repo

Clone the GroqFlow GitHub repository and change the current directory to the clone:
```bash
cd ~/
git clone https://github.com/groq/groqflow.git
cd groqflow
```

### GroqFlow conda environments

Create a GroqFlow conda environment and activate it, following the instructions in the [Virtual Environments](virtual-environments.md) section.<br>
Note: similar install instructions are in `~/groqflow/docs/install.md` and the [GroqFlow™ Installation Guide](https://github.com/groq/groqflow/blob/main/docs/install.md).<br>
The conda environment should be reinstalled whenever new GroqFlow code is pulled from GitHub; with a GroqFlow conda environment activated, redo just the `pip install` steps.

### Running a groqflow sample
Each groqflow sample directory in the `~/groqflow/proof_points` tree has a README.md describing the sample and how to run it.

#### Optionally activate your GroqFlow conda environment
```console
conda activate groqflow
```

#### Run a sample using PBS
See [Job Queueing and Submission](job-queuing-and-submission.md) for more information about the PBS job scheduler.

Create a script `run_minilmv2.sh` with the following contents. It assumes that conda was installed in the default location; the conda initialize section can also be copied from your `~/.bashrc` if the conda installer was allowed to add it there.
```bash
#!/bin/bash
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$("${HOME}/miniconda3/bin/conda" 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "${HOME}/miniconda3/etc/profile.d/conda.sh" ]; then
        . "${HOME}/miniconda3/etc/profile.d/conda.sh"
    else
        export PATH="${HOME}/miniconda3/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate groqflow
cd ~/groqflow/proof_points/natural_language_processing/minilm
pip install -r requirements.txt
python minilmv2.py
```

Then run the script as a batch job with PBS:
```bash
qsub run_minilmv2.sh
```

If your `~/.bashrc` initializes conda, an alternative to copying the conda initialization section into your execution scripts is to comment out this section in your `~/.bashrc`:
```bash
# If not running interactively, don't do anything
case $- in
*i*) ;;
*) return;;
esac
```
so that it becomes:
```bash
## If not running interactively, don't do anything
#case $- in
# *i*) ;;
# *) return;;
#esac
```
Then the execution script becomes:
```bash
#!/bin/bash
conda activate groqflow
cd ~/groqflow/proof_points/natural_language_processing/minilm
pip install -r requirements.txt
python minilmv2.py
```

Job status can be tracked with `qstat`:
```console
$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3084.groq-r01-co* run_minilmv2 user 0 R workq
$
```


Output goes by default to two files whose names end with the job id: `run_minilmv2.sh.o<jobid>` holds the job's standard output, and `run_minilmv2.sh.e<jobid>` its standard error.
```console
$ ls -l run_minilmv2.sh.*
-rw------- 1 user users   448 Oct 16 18:40 run_minilmv2.sh.e3082
-rw------- 1 user users 50473 Oct 16 18:42 run_minilmv2.sh.o3082
```

18 changes: 18 additions & 0 deletions docs/ai-testbed/groq/system-overview.md
@@ -0,0 +1,18 @@
# System Overview

The ALCF Groq system consists of a single `GroqRack™ compute cluster` that provides an extensible accelerator network of nine `GroqNode™` servers [ groq-r01-gn-01 through groq-r01-gn-09 ] with a rotational multi-node network topology. Each GroqNode contains eight GroqCard™ accelerators with integrated chip-to-chip connections in a dragonfly multi-chip topology.

The `GroqCard™ accelerator` is a dual-width, full-height, three-quarter-length PCI-Express Gen4 x16 adapter that includes a single `GroqChip™ processor` with 230 MB of on-chip memory. Based on the proprietary Tensor Streaming Processor (TSP) architecture, the GroqChip processor is a low-latency, high-throughput single-core SIMD compute engine capable of 750 TOPS (INT8) and 188 TFLOPS (FP16) at 900 MHz, with advanced vector and matrix mathematical acceleration units. The GroqChip processor is deterministic, providing predictable and repeatable performance.

The `GroqWare™ suite SDK` uses an API-based programming model and enables users to develop, compile, and run models on the GroqCard accelerator in a host server system. The SDK uses an ONNX/MLIR-enabled DAG compiler and consists of the Groq Compiler, the Groq API, and utility tools such as the GroqView™ profiler and groq-runtime.


<!--- The GroqRack 42U compute cluster has ---> <!--9 GroqNode servers, and --> <!--- 9 compute nodes (GroqNodes) named sequentially from groq-r01-gn-01 to groq-r01-gn-09.---> <!--and 1 redudant node (groq-r01-gn-09)--> <!---Each GroqNode has 2 AMD EPYCTM 7313 processors, a total of 1TB of DRAM, and 8 GroqCard accelerators, with integrated chip-to-chip connections. --->


For more information refer to the following links:

[GroqRack spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqRack%E2%84%A2-Compute-Cluster-Product-Brief-v1.0.pdf)<br>
[GroqNode spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqNode%E2%84%A2-Server-GN1-B8C-Product-Brief-v1.5.pdf)<br>
[GroqCard spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqCard%E2%84%A2-Accelerator-Product-Brief-v1.5-.pdf)<br>
[GroqChip spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqChip%E2%84%A2-Processor-Product-Brief-v1.5.pdf)<br>
([via](https://groq.com/docs/))
41 changes: 41 additions & 0 deletions docs/ai-testbed/groq/virtual-environments.md
@@ -0,0 +1,41 @@
# Virtual Environments

## Install conda
If conda is not already installed:
```bash
rm -f Miniconda3-latest-Linux-x86_64.sh*
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# answer y/yes to all prompts
# exit ssh session, then start a new ssh session
exit
```
## GroqFlow conda environment setup
### Create and activate a groqflow conda environment
Create a groqflow conda environment and activate it:
```bash
export PYTHON_VERSION=3.10.12
conda create -n groqflow python=$PYTHON_VERSION
conda activate groqflow
```

### Install groqflow into the groqflow conda environment
Execute the following commands to install groqflow into the activated groqflow conda environment:
```bash
# Alter this if you have cloned groqflow to some other location.
cd ~/groqflow
pip install --upgrade pip
pip install -e .
pushd .
cd demo_helpers
pip install -e .
popd
```

To use groqflow,
```bash
conda activate groqflow
```
Note: Always use a personal conda environment when installing packages on groq nodes; otherwise they can get installed into `~/.local` and can cause problems when your shared home directory is used on other systems. If you encounter mysterious package dependency/version issues, check your `~/.local/lib` and `~/.local/bin` for mistakenly installed packages.

Note: The conda environment should be reinstalled whenever new GroqFlow code is pulled from GitHub; with a GroqFlow conda environment activated, redo just the `pip install` steps.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.