Commit
update
fwang2 committed Aug 21, 2024
1 parent db92fe4 commit dc7542d
Showing 127 changed files with 74 additions and 270 deletions.
Empty file modified .gitignore
100644 → 100755
Empty file.
Empty file modified Gemfile
100644 → 100755
Empty file.
Empty file modified README.md
100644 → 100755
Empty file.
Empty file modified _config.yml
100644 → 100755
Empty file.
Empty file modified _includes/head-custom-google-analytics.html
100644 → 100755
Empty file.
Empty file modified _includes/head-custom.html
100644 → 100755
Empty file.
Empty file modified _layouts/default.html
100644 → 100755
Empty file.
Empty file modified _sass/cayman.scss
100644 → 100755
Empty file.
Empty file modified _sass/jekyll-theme-cayman.scss
100644 → 100755
Empty file.
Empty file modified _sass/normalize.scss
100644 → 100755
Empty file.
Empty file modified _sass/rouge-github.scss
100644 → 100755
Empty file.
Empty file modified _sass/variables.scss
100644 → 100755
Empty file.
Empty file modified _site/assets/css/style.css
100644 → 100755
Empty file.
Empty file modified _site/assets/css/style.css.map
100644 → 100755
Empty file.
Empty file modified _site/images/frontier.jpg
100644 → 100755
Empty file modified _site/images/ornl.jpg
100644 → 100755
Empty file modified _site/index.html
100644 → 100755
Empty file.
Empty file modified _site/jekyll-theme-cayman.gemspec
100644 → 100755
Empty file.
Empty file modified _site/sc22-FL
100644 → 100755
Empty file.
Empty file modified _site/script/bootstrap
100644 → 100755
Empty file.
Empty file modified _site/script/cibuild
100644 → 100755
Empty file.
Empty file modified _site/script/release
100644 → 100755
Empty file.
Empty file modified _site/script/server
100644 → 100755
Empty file.
Empty file modified _site/script/validate-html
100644 → 100755
Empty file.
Empty file modified andes/Andes.md
100644 → 100755
Empty file.
Empty file modified assets/css/style.scss
100644 → 100755
Empty file.
Empty file modified dl-notebooks/NLP-rnn.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/NLP-tweet.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/NLP-word-embeddings.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/NLP-word2vec.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-PCA.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-SVM.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-activation.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-autograd.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-basics.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-bayesian.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-chain-rule.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-cnn.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-gd_1.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-gd_2.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-gd_3.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-gradient-descent.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-kmeans.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-linear-regression.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-logistic.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-neural-network.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-normal-equation.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-propgation.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-regularization.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-softmax.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/ml-transfer-learning.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/pytorch-Autograd.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/pytorch-Tensors.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/pytorch-basics.ipynb
100644 → 100755
Empty file.
Empty file modified dl-notebooks/pytorch-vision.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/DL_Pytorch.pptx
100644 → 100755
Empty file.
Empty file modified dl-ppts/classification/knn.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/clustering/DBSCAN.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/clustering/hierarchical_clustering.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/clustering/kmeans.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/README.md
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/batch.lsf
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/setup_tfconfig.py
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/start-jupyter.sh
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/stop-jupyter.sh
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/summit-multi-worker-example.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/jupyterlab-on-summit/summit-single-worker-example.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/pca/SVD.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/regression/Linear-Regression.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/regression/logistic-pytorch.ipynb
100644 → 100755
Empty file.
Empty file modified dl-ppts/regression/regression-pytorch.ipynb
100644 → 100755
Empty file.
Empty file modified figures/3neurons.png
100644 → 100755
Empty file modified figures/3neurons2.png
100644 → 100755
Empty file modified figures/bp-compute-graph.png
100644 → 100755
Empty file modified figures/bp-steps.png
100644 → 100755
Empty file modified figures/chain-rule1.png
100644 → 100755
Empty file modified figures/chain-rule2.png
100644 → 100755
Empty file modified figures/conjugate-transpose.png
100644 → 100755
Empty file modified figures/cross-entropy.png
100644 → 100755
Empty file modified figures/cross-product.png
100644 → 100755
Empty file modified figures/deep_wide1.png
100644 → 100755
Empty file modified figures/deep_wide2.png
100644 → 100755
Empty file modified figures/example1.png
100644 → 100755
Empty file modified figures/gd-linear_ab.png
100644 → 100755
Empty file modified figures/linear-2x-2features.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x-5-degree1.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x-5-degree10.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x-5.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x-cost-func.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x-costfunc-mode-fit.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2x.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-2xplus5.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-cost-contour.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-cost-gd.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear-regression-gd.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear.pdf
100644 → 100755
Empty file.
Empty file modified figures/linear1.png
100644 → 100755
Empty file modified figures/linear2.png
100644 → 100755
Empty file modified figures/logistic_vs_linear.png
100644 → 100755
Empty file modified figures/manual_derive.png
100644 → 100755
Empty file modified figures/neuron.png
100644 → 100755
Empty file modified figures/nn_3.drawio
100644 → 100755
Empty file.
Empty file modified figures/sigmoid.pdf
100644 → 100755
Empty file.
Empty file modified frontier/Frontier-cmake.md
100644 → 100755
Empty file.
54 changes: 0 additions & 54 deletions frontier/crusher-deepspeed.md

This file was deleted.

Empty file modified frontier/crusher-pytorch.md
100644 → 100755
Empty file.
148 changes: 74 additions & 74 deletions frontier/frontier_pytorch.md
@@ -1,112 +1,112 @@


- [Build from source](#build-from-source)
- [prep Frontier modules](#prep-frontier-modules)
- [module output on Frontier](#module-output-on-frontier)
- [setup miniconda3](#setup-miniconda3)
- [build pytorch](#build-pytorch)
- [Build options: see `setup.py`](#build-options-see-setuppy)
- [regenerate CMAKE build files](#regenerate-cmake-build-files)
- [Kineto and roctracer.h problem](#kineto-and-roctracerh-problem)
- [Build DeepSpeed](#build-deepspeed)
- [Install GPTNeoX](#install-gptneox)
## On rocm/pytorch compatibility

# Build from source
![alt text](frontier_rocmTorch_table.png)

## prep Frontier modules

## Basic environment setup

```
module load PrgEnv-gnu
module load gcc/10.3.0
module load rocm/5.1.0
module load craype-x86-trento
ROCM_VER=6.0.0
module load PrgEnv-gnu/8.5.0
module load rocm/6.0.0
module load gcc-native/12.3
module load craype-accel-amd-gfx90a
module load cmake
module load miniforge3/23.11.0-0
module unload darshan-runtime
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_SOURCE_DIR=/opt/rocm-5.1.0
export CRAY_CPU_TARGET=x86_64 # just to remove warning noise
export ROCM_HOME=/opt/rocm-${ROCM_VER}
export CC=cc
export CXX=CC
```
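Before starting a long build, the exported variables can be sanity-checked with a short script. The following is only a sketch: the helper name `check_rocm_env` and the choice of checks are illustrative assumptions, not part of the original notes. It assumes `ROCM_HOME`, `HCC_AMDGPU_TARGET`, and `PYTORCH_ROCM_ARCH` are exported as shown above, and takes the ROCm version as an argument because `ROCM_VER` is assigned but not exported.

```python
import os


def check_rocm_env(env, rocm_ver, arch="gfx90a"):
    """Return a list of human-readable problems found in `env`.

    `env` is any mapping of variable names to values (e.g. os.environ).
    The expected values mirror the setup block above; adjust as needed.
    """
    problems = []

    # ROCM_HOME should point at the versioned install, not a generic path.
    expected_home = f"/opt/rocm-{rocm_ver}"
    if env.get("ROCM_HOME") != expected_home:
        problems.append(
            f"ROCM_HOME is {env.get('ROCM_HOME')!r}, expected {expected_home!r}"
        )

    # Both architecture variables should name the same GPU target.
    for var in ("HCC_AMDGPU_TARGET", "PYTORCH_ROCM_ARCH"):
        if env.get(var) != arch:
            problems.append(f"{var} is {env.get(var)!r}, expected {arch!r}")

    return problems


if __name__ == "__main__":
    found = check_rocm_env(os.environ, rocm_ver="6.0.0")
    for line in found or ["environment looks consistent"]:
        print(line)
```

Returning a list of problems rather than raising on the first one makes it easier to fix everything in a single pass before resubmitting a job.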
## module output on Frontier

```
Currently Loaded Modules:
1) libfabric/1.15.2.0 4) cray-dsmml/0.2.2 7) gcc/10.3.0 10) DefApps/default 13) craype-accel-amd-gfx90a
2) craype-network-ofi 5) cray-libsci/22.12.1.1 8) darshan-runtime/3.4.0 11) cray-mpich/8.1.23 14) craype-x86-trento
3) craype/2.7.19 6) PrgEnv-gnu/8.3.3 9) hsi/default 12) rocm/5.1.0
```

Note: One of the two modules, `craype-x86-trento` or `craype-accel-amd-gfx90a`, fixed a linking problem. My guess is the former.
## conda and mpi4py setup

## setup miniconda3

```
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh -b -p miniconda
conda create -n pytorch python=3.8
conda activate pytorch
pip install -r requirements.txt
```
conda create -p /sw/aaims/frontier/rocm600-pt230 \
python=3.10 numpy=1 -c conda-forge
## build pytorch
source activate /sw/aaims/frontier/rocm600-pt230
pip install pyyaml typing_extensions ninja packaging
```
git clone --recursive -b IFU-master-2022-11-22 https://github.com/ROCmSoftwarePlatform/pytorch
python tools/amd_build/build_amd.py
USE_KINETO=0 USE_ROCM=1 MAX_JOBS=4 python setup.py bdist_wheel 2>&1 | tee output
# Optional: install mpi4py
MPICC="cc -shared" pip install --no-cache-dir --no-binary=mpi4py mpi4py
```

### Build options: see `setup.py`
## Build pytorch

```
USE_KINETO=0 # disable the profiler, which asks for roctracer.h
```
### regenerate CMAKE build files
TORCH_VER=release/2.3-frontier
git clone --recursive -b ${TORCH_VER} \
https://github.com/michael-sandoval/pytorch
This will trigger a rebuild for the changed configuration.
cd pytorch
```
cd pytorch/build
rm CMakeCache.txt
```
To remove the previous build as well:
# redundant
#git submodule init
#git submodule update
```
python setup.py clean
```
### Kineto and roctracer.h problem
# If using GCC12 to build torch:
export CFLAGS=" -Wno-error=maybe-uninitialized -Wno-error=uninitialized -Wno-error=restrict -Wno-error=nonnull"
export CXXFLAGS=" -Wno-error=maybe-uninitialized -Wno-error=uninitialized -Wno-error=restrict -Wno-error=nonnull"
export BUILD_TEST=OFF
Kineto requires roctracer, which fails with ROCm 5.1.0.
```
if (NOT ROCM_SOURCE_DIR)
set(ROCM_SOURCE_DIR "$ENV{ROCM_SOURCE_DIR}")
message(INFO " ROCM_SOURCE_DIR = ${ROCM_SOURCE_DIR}")
endif()
```
# Generate HIP files
python3 tools/amd_build/build_amd.py
For reasons unknown at this point, `ROCM_SOURCE_DIR` is still set to `/opt/rocm` instead of `/opt/rocm-5.1.0`, even though the environment variable is set.
So the easy workaround is:
# Set the PyTorch build version
export PYTORCH_BUILD_VERSION="2.3.0"
export PYTORCH_BUILD_NUMBER=1
```
set(ROCM_SOURCE_DIR /opt/rocm-5.1.0)
# Point libkineto away from "/opt/rocm" to "/opt/rocm-x.y.z"
cp ./frontier_fixes/third_party/kineto/libkineto/CMakeLists.txt \
./third_party/kineto/libkineto/CMakeLists.txt
# Fix "rocm-core/rocm_version.h" reference post HIP conversion (ONLY WHEN USING ROCm < 6.0.0)
# cp ./frontier_fixes/aten/src/ATen/hip/tunable/TunableGemm.h ./aten/src/ATen/hip/tunable/TunableGemm.h
# Build PyTorch
USE_ROCM=1 USE_CUDA=OFF USE_NVCC=OFF BUILD_CAFFE2_OPS=0 ROCM_SOURCE_DIR="/opt/rocm-${ROCM_VER}" MAX_JOBS=8 python setup.py bdist_wheel
```

## Build DeepSpeed
After compilation, I have:

```
git clone https://github.com/microsoft/DeepSpeed
DS_BUILD_FUSED_LAMB=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1 DS_BUILD_UTILS=1 python setup.py bdist_wheel
python setup.py install
dist/torch-2.3.0-cp310-cp310-linux_x86_64.whl
```

## Install GPTNeoX
## Verify

```
pip install shortuuid # missed from
git clone https://github.com/EleutherAI/gpt-neox.git
pip install -r requirements/requirements.txt
pip install -r requirements/requirements-wandb.txt
pip install -r requirements/requirements-tensorboard.txt
pip install dist/*.whl
```

```
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/lustre/orion/stf218/proj-shared/fwang2/pytorch/torch/__init__.py", line 546, in <module>
raise ImportError(textwrap.dedent('''
ImportError: Failed to load PyTorch C extensions:
It appears that PyTorch has loaded the `torch/_C` folder
of the PyTorch repository rather than the C extensions which
are expected in the `torch._C` namespace. This can occur when
using the `install` workflow. e.g.
$ python setup.py install && python -c "import torch"
This error can generally be solved using the `develop` workflow
$ python setup.py develop && python -c "import torch" # This should succeed
or by running Python from a different directory.
```
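The traceback above shows `torch` being picked up from the source checkout (`.../pytorch/torch/__init__.py`), and the error message itself suggests running Python from a different directory. A small guard along these lines can catch that before the import is attempted. This is a sketch only: the helper name `shadows_installed_package` is hypothetical, and it merely checks for a same-named folder in the working directory rather than inspecting the full import path.

```python
import os


def shadows_installed_package(package, cwd=None):
    """Return True if `cwd` contains a folder named like `package`.

    With the working directory on the import path (as in an interactive
    session), such a folder is found before the installed wheel, which is
    the situation the ImportError above describes.
    """
    cwd = os.getcwd() if cwd is None else cwd
    return os.path.isdir(os.path.join(cwd, package))


if __name__ == "__main__":
    if shadows_installed_package("torch"):
        print("A local 'torch' folder is present; change directory before importing.")
    else:
        print("No local 'torch' folder found in the working directory.")
```

Running the check from the PyTorch source tree should report the local folder; running it from any other directory (for example `$HOME`) should not.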

Binary file added frontier/frontier_rocmTorch_table.png
71 changes: 0 additions & 71 deletions frontier/rocm551-py310/conda_env.sh

This file was deleted.

71 changes: 0 additions & 71 deletions frontier/rocm551-py310/conda_env_old_pkgs.sh

This file was deleted.

Binary file not shown.
Empty file modified images/frontier.jpg
100644 → 100755
Empty file modified images/ornl.jpg
100644 → 100755
Empty file modified metrics.ipynb
100644 → 100755
Empty file.
Empty file modified nlp-model-scaling.ipynb
100644 → 100755
Empty file.
Empty file modified preface.py
100644 → 100755
Empty file.
Empty file modified sfmono.py
100644 → 100755
Empty file.
Empty file modified summit/JupyterOnSummit.md
100644 → 100755
Empty file.
Empty file modified summit/Summit-deepspeed.md
100644 → 100755
Empty file.
Empty file modified summit/Summit.md
100644 → 100755
Empty file.
Empty file modified summit/olcf-jupyterhub.md
100644 → 100755
Empty file.
Empty file modified tools/data_loader.ipynb
100644 → 100755
Empty file.
Empty file modified tools/linear.ipynb
100644 → 100755
Empty file.
Empty file modified tools/matplotlib.ipynb
100644 → 100755
Empty file.
Empty file modified tools/numpy.ipynb
100644 → 100755
Empty file.
Empty file modified tools/pandas.ipynb
100644 → 100755
Empty file.
Empty file modified units.py
100644 → 100755
Empty file.
