Merge branch 'main' into libensemble-1.0
Showing 83 changed files with 497 additions and 3,132 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,6 +5,48 @@ | |
Cerebras documentation for porting code to run on a Cerebras CS-2 system:<br> | ||
[Ways to port your model](https://docs.cerebras.net/en/latest/wsc/port/index.html) | ||
|
||
## Grafana WsJob Dashboard for Cerebras jobs | ||
A Grafana dashboard supports visualizing, querying, and exploring the CS-2 system's metrics, and provides access to system logs and traces. | ||
See the Cerebras documentation for the [Job Information Dashboard](https://docs.cerebras.net/en/latest/wsc/getting-started/grafana.html#wsjob-dashboard) | ||
|
||
Here is a summary (tested on Ubuntu and macOS):<br> | ||
|
||
On your work machine with a web browser, e.g. your laptop,<br> | ||
edit `/etc/hosts` using your editor of choice: | ||
```console | ||
sudo nano /etc/hosts | ||
``` | ||
Add this line | ||
```console | ||
127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov | ||
``` | ||
Save and exit the editor. | ||
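
The manual edit above can also be scripted. The following is a minimal sketch, not part of the Cerebras or ALCF tooling: `add_hosts_entry` is a hypothetical helper name, and the demonstration writes to a scratch file instead of the real `/etc/hosts`. It appends the mapping only if the hostname is not already present, so it is safe to run more than once.

```bash
#!/bin/bash
# Append "IP HOSTNAME" to a hosts-style file unless the hostname is already listed.
# add_hosts_entry is a hypothetical helper name, not part of any tool.
add_hosts_entry() {
    local file="$1" ip="$2" host="$3"
    # -F treats the hostname as a literal string; -w requires a whole-word match.
    if grep -qwF -- "$host" "$file" 2>/dev/null; then
        echo "already present: $host"
    else
        printf '%s %s\n' "$ip" "$host" >> "$file"
        echo "added: $host"
    fi
}

# Demonstrate on a scratch file rather than the real /etc/hosts.
scratch="$(mktemp)"
add_hosts_entry "$scratch" 127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov
add_hosts_entry "$scratch" 127.0.0.1 grafana.cerebras1.lab.alcf.anl.gov
cat "$scratch"
```

To apply it to the real file, the function would need to be run with root privileges and given `/etc/hosts` as its first argument.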
|
||
Download the Grafana certificate from the Cerebras node at `/opt/cerebras/certs/grafana_tls.crt` to your local machine. To add this certificate to your browser keychain: | ||
|
||
1. In Chrome, go to Settings->Privacy and security->Security->Manage device certificates | ||
2. Select System under "System Keychains" on the left-hand side of your screen. Also select the "Certificate" tab. | ||
3. Drag and drop the downloaded certificate. Once it is added, it is visible as "lab.alcf.anl.gov" | ||
![Cerebras Wafer-Scale Cluster connection diagram](files/grafana_ctl.png) | ||
4. Select the certificate, and ensure that the "Trust" section is set to "Always Trust" | ||
![Cerebras Wafer-Scale Cluster connection diagram](files/Trust_ctl.png) | ||
|
||
|
||
On your work machine with a web browser, e.g. your laptop,<br> | ||
tunnel the Grafana HTTPS port on the Cerebras Grafana host through to localhost: | ||
``` | ||
ssh -L 8443:grafana.cerebras1.lab.alcf.anl.gov:443 [email protected] | ||
``` | ||
|
||
Point a browser at Grafana. (Tested with Firefox and Chrome/Brave.)<br> | ||
Open the browser to a job's Grafana URL as shown by `csctl get jobs`, adding `:8443` to the hostname, e.g.<br> | ||
```console | ||
https://grafana.cerebras1.lab.alcf.anl.gov:8443/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-49b7uuojdelvtrcxu3cwbw&from=1684859330000&to=now | ||
``` | ||
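
Adding `:8443` to the hostname by hand is easy to get wrong in a long URL. As a rough sketch, the rewrite can be done with shell parameter expansion; `add_tunnel_port` is a hypothetical helper name, and it assumes the URL starts with `https://`. The example URL below is shortened for illustration and is not a real job URL.

```bash
#!/bin/bash
# Rewrite an https URL so that its host carries the local tunnel port.
# add_tunnel_port is a hypothetical helper; the port defaults to 8443.
add_tunnel_port() {
    local url="$1" port="${2:-8443}"
    local rest="${url#https://}"     # drop the scheme
    local host="${rest%%/*}"         # everything before the first slash
    local path="${rest#"$host"}"     # whatever follows the host (may be empty)
    host="${host%%:*}"               # discard any port already present
    printf 'https://%s:%s%s\n' "$host" "$port" "$path"
}

add_tunnel_port 'https://grafana.cerebras1.lab.alcf.anl.gov/d/example?orgId=1'
# prints https://grafana.cerebras1.lab.alcf.anl.gov:8443/d/example?orgId=1
```

Quote the URL when calling the function, since `&` and `?` are otherwise interpreted by the shell.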
|
||
Log in to the dashboard with user `admin` and password `prom-operator`. | ||
|
||
|
||
<!--- | ||
## Determining the CS-2 version | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Getting Started | ||
|
||
## Allocations | ||
|
||
If you do not already have an allocation, you will need to request one here: | ||
[Discretionary Allocation Request (New & Renewal)](https://accounts.alcf.anl.gov/#/allocationRequests) | ||
|
||
## Accounts | ||
|
||
If you do not have an ALCF account (but have an allocation), request one here: [ALCF Account and Project Management](https://accounts.alcf.anl.gov/#/home) | ||
|
||
## Setup | ||
|
||
Connection to a GroqRack node is a two-step process. | ||
|
||
The first step is to ssh from a local machine to a login node. | ||
The second, optional step is to ssh from a login node to a GroqRack node. Jobs may also be started and tracked from login nodes. | ||
|
||
![GroqRack System View](files/groqrack_system_diagram.png "GroqRack System View") | ||
|
||
### Log in to a login node | ||
|
||
Connect to a groq login node, editing this command line to use your ALCF user id. You will be prompted for a password; use the 8-digit code provided by MobilePASS+. | ||
```bash | ||
ssh [email protected] | ||
``` | ||
This randomly selects one of the login nodes, namely `groq-login-01.ai.alcf.anl.gov` or `groq-login-02.ai.alcf.anl.gov`. You can alternatively ssh to the specific login nodes directly. | ||
|
||
|
||
### Log in to a GroqRack node | ||
|
||
Once you are on a login node, optionally ssh to one of the GroqRack nodes, which are numbered 1-9. | ||
|
||
```bash | ||
ssh groq-r01-gn-01.ai.alcf.anl.gov | ||
# or | ||
ssh groq-r01-gn-09.ai.alcf.anl.gov | ||
# or any node with hostname of form groq-r01-gn-0[1-9].ai.alcf.anl.gov | ||
``` | ||
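
The hostname pattern above can be expanded into the full list of nodes, which is handy for loops or for picking a node. This is a small sketch; `groq_nodes` is a hypothetical helper name, and the range 1-9 is taken from the text above.

```bash
#!/bin/bash
# Print the nine GroqRack node hostnames, one per line.
# groq_nodes is a hypothetical helper name.
groq_nodes() {
    local i
    for i in 1 2 3 4 5 6 7 8 9; do
        # %02d zero-pads the node number to two digits (01 .. 09).
        printf 'groq-r01-gn-%02d.ai.alcf.anl.gov\n' "$i"
    done
}

groq_nodes
```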
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Job Queueing and Submission | ||
|
||
Groq jobs in the AI Testbed's GroqRack are managed by the PBS job scheduler.<br> | ||
Overview: [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System)<br> | ||
For additional information, see | ||
[https://docs.alcf.anl.gov/running-jobs/job-and-queue-scheduling/](https://docs.alcf.anl.gov/running-jobs/job-and-queue-scheduling/)<br> | ||
Man pages are available. These are the key commands: | ||
```console | ||
# qsub - to submit a batch job using a script | ||
man qsub | ||
# qstat - to display queue information | ||
man qstat | ||
# qdel - to delete (cancel) a job: | ||
man qdel | ||
# qhold - to hold a job | ||
man qhold | ||
``` | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
# Running a Model/Program | ||
|
||
Jobs can be launched from any GroqRack node or from a login node.<br> | ||
For long-running jobs, if you expect that your internet connection might drop, we suggest logging into a specific node and using either **screen** or **tmux** to create persistent command-line sessions. For details, use: | ||
|
||
```bash | ||
man screen | ||
# or | ||
man tmux | ||
``` | ||
or online man pages: [screen](https://manpages.ubuntu.com/manpages/jammy/en/man1/screen.1.html), [tmux](https://manpages.ubuntu.com/manpages/jammy/en/man1/tmux.1.html) | ||
|
||
## Running jobs on Groq nodes | ||
|
||
### GroqFlow | ||
|
||
GroqFlow is the simplest way to port applications running inference to Groq. The groqflow GitHub repo includes many sample applications.<br> | ||
See [GroqFlow](https://github.com/groq/groqflow/tree/main). | ||
|
||
### Clone the GroqFlow github repo | ||
|
||
Clone the groqflow github repo and change current directory to the clone: | ||
```bash | ||
cd ~/ | ||
git clone https://github.com/groq/groqflow.git | ||
cd groqflow | ||
``` | ||
|
||
### GroqFlow conda environments | ||
|
||
Create a groqflow conda environment, and activate it. | ||
Follow the instructions in the [Virtual Environments](virtual-environments.md) section.<br> | ||
Note: Similar install instructions are in `~/groqflow/docs/install.md` or the [GroqFlow™ Installation Guide](https://github.com/groq/groqflow/blob/main/docs/install.md)<br> | ||
The conda environment should be reinstalled whenever new groqflow code is pulled from the groqflow GitHub repo; with a groqflow conda environment activated, redo just the pip install steps. | ||
|
||
### Running a groqflow sample | ||
Each groqflow sample directory in the `~/groqflow/proof_points` tree has a README.md describing the sample and how to run it. | ||
|
||
#### Optionally activate your GroqFlow conda environment | ||
```console | ||
conda activate groqflow | ||
``` | ||
|
||
#### Run a sample using PBS | ||
See [Job Queueing and Submission](job-queuing-and-submission.md) for more information about the PBS job scheduler. | ||
|
||
Create a script `run_minilmv2.sh` with the following contents. It assumes that conda was installed in the default location. The conda initialize section can also be copied from your .bashrc if the conda installer was allowed to add it. | ||
```bash | ||
#!/bin/bash | ||
# >>> conda initialize >>> | ||
# !! Contents within this block are managed by 'conda init' !! | ||
__conda_setup="$(${HOME}'/miniconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" | ||
if [ $? -eq 0 ]; then | ||
    eval "$__conda_setup" | ||
else | ||
    if [ -f "${HOME}/miniconda3/etc/profile.d/conda.sh" ]; then | ||
        . "${HOME}/miniconda3/etc/profile.d/conda.sh" | ||
    else | ||
        export PATH="${HOME}/miniconda3/bin:$PATH" | ||
    fi | ||
fi | ||
unset __conda_setup | ||
# <<< conda initialize <<< | ||
conda activate groqflow | ||
cd ~/groqflow/proof_points/natural_language_processing/minilm | ||
pip install -r requirements.txt | ||
python minilmv2.py | ||
``` | ||
|
||
Then run the script as a batch job with PBS: | ||
```bash | ||
qsub run_minilmv2.sh | ||
``` | ||
|
||
If your `~/.bashrc` initializes conda, an alternative to copying the conda initialization script into your execution scripts is to comment out this section in your `~/.bashrc`: | ||
```bash | ||
# If not running interactively, don't do anything | ||
case $- in | ||
*i*) ;; | ||
*) return;; | ||
esac | ||
``` | ||
to | ||
```bash | ||
## If not running interactively, don't do anything | ||
#case $- in | ||
# *i*) ;; | ||
# *) return;; | ||
#esac | ||
``` | ||
Then the execution script becomes: | ||
```bash | ||
#!/bin/bash | ||
conda activate groqflow | ||
cd ~/groqflow/proof_points/natural_language_processing/minilm | ||
pip install -r requirements.txt | ||
python minilmv2.py | ||
``` | ||
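
For context on the `~/.bashrc` guard discussed above: it inspects the special parameter `$-`, which holds the current shell's option flags and contains `i` only in interactive shells. A batch job's shell is non-interactive, so the guard returns before the conda setup further down in `~/.bashrc` is reached. The sketch below illustrates the same test; `shell_mode` is a hypothetical helper name that accepts a flag string so the check can be tried with sample values.

```bash
#!/bin/bash
# Report whether a set of shell option flags denotes an interactive shell.
# shell_mode is a hypothetical helper name.
shell_mode() {
    # Use the flags passed as $1, or the current shell's flags ($-) if none are given.
    case "${1:-$-}" in
        *i*) echo "interactive" ;;
        *)   echo "non-interactive" ;;
    esac
}

shell_mode himBH   # flags typical of an interactive bash session
shell_mode hB      # flags typical of a script or batch job
```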
|
||
Job status can be tracked with qstat: | ||
```console | ||
$ qstat | ||
Job id Name User Time Use S Queue | ||
---------------- ---------------- ---------------- -------- - ----- | ||
3084.groq-r01-co* run_minilmv2 user 0 R workq | ||
$ | ||
``` | ||
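
When tracking a job from a script, the state column can be pulled out of `qstat` output. This is a sketch under the assumption that the output uses the default column layout shown above, where the state (`S`) is the second-to-last column; `job_state` is a hypothetical helper name, and the sample text fed to it is copied from the listing above and is not live scheduler output.

```bash
#!/bin/bash
# Print the state letter for one job, reading qstat-style text from stdin.
# job_state is a hypothetical helper name; $1 is the numeric job id.
job_state() {
    # Match rows whose first field begins with "<id>." and print the
    # second-to-last field, which is the S (state) column in this layout.
    awk -v id="$1" 'index($1, id ".") == 1 { print $(NF-1) }'
}

# Sample input taken from the listing above.
job_state 3084 <<'EOF'
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
3084.groq-r01-co* run_minilmv2     user                     0 R workq
EOF
# prints R
```

In practice the input would come from a pipe, for example `qstat | job_state 3084`.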
|
||
|
||
By default, output goes to two files with names like the following, where the numeric suffix is the job id. The `.o` file holds the job's standard output and the `.e` file holds its standard error. | ||
```console | ||
$ ls -l run_minilmv2.sh.* | ||
-rw------- 1 user users 448 Oct 16 18:40 run_minilmv2.sh.e3082 | ||
-rw------- 1 user users 50473 Oct 16 18:42 run_minilmv2.sh.o3082 | ||
``` | ||
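
Given a job identifier, the two default output file names can be derived from the naming pattern shown above. This is a minimal sketch; `pbs_output_files` is a hypothetical helper name, and `3082.server` is a placeholder identifier in which `server` stands in for the scheduler host name.

```bash
#!/bin/bash
# Print the default stdout and stderr file names for a PBS job.
# pbs_output_files is a hypothetical helper name.
pbs_output_files() {
    local script="$1"
    local jobid="${2%%.*}"   # keep only the part before the first dot
    printf '%s.o%s\n%s.e%s\n' "$script" "$jobid" "$script" "$jobid"
}

pbs_output_files run_minilmv2.sh 3082.server
# prints:
#   run_minilmv2.sh.o3082
#   run_minilmv2.sh.e3082
```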
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
|
||
The ALCF Groq system consists of a single `GroqRack™ compute cluster` that provides an extensible accelerator network of 9 `GroqNode™` nodes [ groq-r01-gn-01 through groq-r01-gn-09 ] with a rotational multi-node network topology. Each GroqNode contains 8 GroqCard™ accelerators with integrated chip-to-chip connections in a dragonfly multi-chip topology. | ||
|
||
The `GroqCard™ accelerator` is a dual-width, full-height, three-quarter length PCI-Express Gen4 x16 adapter that includes a single `GroqChip™ processor` with 230 MB of on-chip memory. Based on the proprietary Tensor Streaming Processor (TSP) architecture, the GroqChip processor is a low-latency, high-throughput single-core SIMD compute engine capable of 750 TOPS (INT8) and 188 TFLOPS (FP16) at 900 MHz, and it includes advanced vector and matrix mathematical acceleration units. The GroqChip processor is deterministic, providing predictable and repeatable performance. | ||
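
As a quick piece of arithmetic on the figures quoted above, the rack-wide totals follow from simple multiplication. The sketch below only multiplies the per-node and per-chip numbers stated in this section. The memory total is a plain sum across chips and should not be read as a single addressable pool.

```bash
#!/bin/bash
# Rack-wide totals derived from the per-node and per-chip figures above.
nodes=9               # GroqNodes in the rack
cards_per_node=8      # GroqCard accelerators per node, one GroqChip each
mem_per_chip_mb=230   # on-chip memory per GroqChip, in MB

total_cards=$((nodes * cards_per_node))
total_mem_mb=$((total_cards * mem_per_chip_mb))

echo "cards: $total_cards"                     # prints cards: 72
echo "on-chip memory: ${total_mem_mb} MB"      # prints on-chip memory: 16560 MB
```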
|
||
The `GroqWare suite SDK` uses an API-based programming model and enables users to develop, compile, and run models on the GroqCard accelerator in a host server system. The SDK uses an ONNX/MLIR-enabled DAG compiler and consists of the Groq Compiler, the Groq API, and utility tools such as the GroqView™ profiler and groq-runtime. | ||
|
||
|
||
<!--- The GroqRack 42U compute cluster has ---> <!--9 GroqNode servers, and --> <!--- 9 compute nodes (GroqNodes) named sequentially from groq-r01-gn-01 to groq-r01-gn-09.---> <!--and 1 redudant node (groq-r01-gn-09)--> <!---Each GroqNode has 2 AMD EPYCTM 7313 processors, a total of 1TB of DRAM, and 8 GroqCard accelerators, with integrated chip-to-chip connections. ---> | ||
|
||
|
||
For more information refer to the following links: | ||
|
||
[GroqRack spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqRack%E2%84%A2-Compute-Cluster-Product-Brief-v1.0.pdf)<br> | ||
[GroqNode spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqNode%E2%84%A2-Server-GN1-B8C-Product-Brief-v1.5.pdf)<br> | ||
[GroqCard spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqCard%E2%84%A2-Accelerator-Product-Brief-v1.5-.pdf)<br> | ||
[GroqChip spec sheet](https://groq.com/wp-content/uploads/2022/10/GroqChip%E2%84%A2-Processor-Product-Brief-v1.5.pdf)<br> | ||
([via](https://groq.com/docs/)) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Virtual Environments | ||
|
||
## Install conda | ||
If conda is not already installed: | ||
```bash | ||
rm -f Miniconda3-latest-Linux-x86_64.sh* | ||
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh | ||
bash Miniconda3-latest-Linux-x86_64.sh | ||
# answer y/yes to all prompts | ||
# exit ssh session, then start a new ssh session | ||
exit | ||
``` | ||
## GroqFlow conda environment setup | ||
### Create and activate a groqflow conda environment | ||
Create a groqflow conda environment and activate it: | ||
```bash | ||
export PYTHON_VERSION=3.10.12 | ||
conda create -n groqflow python=$PYTHON_VERSION | ||
conda activate groqflow | ||
``` | ||
|
||
### Install groqflow into the groqflow conda environment | ||
Execute the following commands to install groqflow into the activated groqflow conda environment | ||
```bash | ||
# Alter this if you have cloned groqflow to some other location. | ||
cd ~/groqflow | ||
pip install --upgrade pip | ||
pip install -e . | ||
pushd . | ||
cd demo_helpers | ||
pip install -e . | ||
popd | ||
``` | ||
|
||
To use groqflow, | ||
```bash | ||
conda activate groqflow | ||
``` | ||
Note: Always use a personal conda environment when installing packages on groq nodes; otherwise they can get installed into `~/.local` and can cause problems when your shared home directory is used on other systems. If you encounter mysterious package dependency/version issues, check your `~/.local/lib` and `~/.local/bin` for mistakenly installed packages. | ||
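
To check for packages that ended up in `~/.local`, the per-user site-packages directories can be listed directly. This is a sketch that assumes the standard Linux layout `~/.local/lib/pythonX.Y/site-packages`; `list_user_site_packages` is a hypothetical helper name, and it accepts an alternative base directory as its first argument.

```bash
#!/bin/bash
# List the contents of each per-user site-packages directory.
# list_user_site_packages is a hypothetical helper name.
list_user_site_packages() {
    local base="${1:-$HOME/.local/lib}"
    local dir
    for dir in "$base"/python*/site-packages; do
        # Skip the literal pattern left behind when the glob matches nothing.
        [ -d "$dir" ] || continue
        echo "== $dir"
        ls -1 "$dir"
    done
}

list_user_site_packages
```

No output means no per-user site-packages directory was found under the base directory.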
|
||
Note: The conda environment should be reinstalled whenever new groqflow code is pulled from the groqflow GitHub repo; with a groqflow conda environment activated, redo just the pip install steps. |
File renamed without changes.