# Copper

Copper is a cooperative caching layer for scalable parallel data movement on exascale supercomputers, developed at the Argonne Leadership Computing Facility.

## Introduction

Copper is a **read-only** cooperative caching layer designed to enable scalable data loading across massive numbers of compute nodes. It avoids the I/O bottleneck in the storage network and instead uses the compute network for data movement.

The current intended use of Copper is to improve the performance of Python imports (dynamic shared-library loading) on Aurora. However, Copper can be used to improve the performance of any kind of redundant data loading on a supercomputer.

Copper is recommended for any application (preferably Python, with I/O under 500 MB) that needs to scale beyond 2,000 nodes.

![Copper Workflow](copper.gif "Copper Workflow Architecture")

## How to use Copper on Aurora

In your job script or from an interactive session, run:

```bash
module load copper
launch_copper.sh
```
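
As a quick sanity check (an informal suggestion based on the mount point described below, not an official verification step), you can confirm that the Copper mount point now exists on the compute node:

```bash
# The Copper mount should appear under /tmp/${USER}/copper
# once launch_copper.sh has completed.
ls /tmp/${USER}/copper
```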

Then run your `mpiexec` command as you normally would.

If you want your I/O to go through Copper, add ```/tmp/${USER}/copper/``` to the beginning of your paths. With this prefix, only the root compute node performs I/O directly against the Lustre file system. If ```/tmp/${USER}/copper/``` is not added to the beginning of your paths, all compute nodes perform I/O directly against the Lustre file system.

For example, if you have a local conda environment located at ```/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env```, you need to prefix it with the Copper mount point, giving ```/tmp/${USER}/copper/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env```. The same should be done for any type of path, like PYTHONPATH, CONDAPATH, and your input file paths, as sketched below.
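
A minimal sketch of the prefixing pattern (the ```CONDA_ENV_PATH``` variable name is illustrative, not part of Copper):

```bash
# Hypothetical helper variable holding the original Lustre path.
CONDA_ENV_PATH=/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env

# Prepend the Copper mount point so reads are served through the cache.
export PYTHONPATH=/tmp/${USER}/copper${CONDA_ENV_PATH}:${PYTHONPATH}
```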

Python example:

```bash
time mpirun --np ${NRANKS} --ppn ${RANKS_PER_NODE} --cpu-bind=list:4:9:14:19:20:25:56:61:66:71:74:79 --genvall \
    --genv=PYTHONPATH=/tmp/${USER}/copper/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env \
    python3 -c "import numpy; print(numpy.__file__)"
```

Non-Python example:

```bash
time mpiexec -np $ranks -ppn 12 --cpu-bind list:4:9:14:19:20:25:56:61:66:71:74:79 --no-vni -genvall \
    /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/run/aurora/wrapper.sh \
    /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/build_ws1024/bin/thundersvm-train \
    -s 0 -t 2 -g 1 -c 10 -o 1 /tmp/${USER}/copper/lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/data/sc-40-data/real-sim_M100000_K25000_S0.836
```

Finally, you can optionally run ```stop_copper.sh``` at the end of your job script to shut Copper down.
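
Putting the pieces together, here is a minimal job-script sketch. The PBS directives, project name, and node counts are illustrative assumptions, not values from this README; the mpirun line is taken from the Python example above.

```bash
#!/bin/bash
#PBS -l select=4                 # illustrative node count
#PBS -l walltime=00:30:00        # illustrative walltime
#PBS -A YourProject              # placeholder project allocation

module load copper
launch_copper.sh                 # start the cooperative cache for this job

# Route reads through Copper by prefixing paths with /tmp/${USER}/copper.
time mpirun --np ${NRANKS} --ppn ${RANKS_PER_NODE} --genvall \
    --genv=PYTHONPATH=/tmp/${USER}/copper/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env \
    python3 -c "import numpy; print(numpy.__file__)"

stop_copper.sh                   # optional: shut Copper down at job end
```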

## Copper Options

```bash
-l log_level                 [Allowed values: 6 (no logging), 5 (less logging), 4, 3, 2, 1 (most verbose)] [Default: 6]
-t log_type                  [Allowed values: file or file_and_stdout] [Default: file]
-T trees                     [Allowed values: any number] [Default: 1]
-M max_cacheable_byte_size   [Allowed values: any number of bytes] [Default: 10 MB]
-s sleeptime                 [Allowed values: any number of seconds] [Default: 20] (60 seconds recommended for 4k nodes)
-b physcpubind               [Allowed values: "CORE NUMBER-CORE NUMBER"] [Default: "48-51"]
```

For example, you can change the default values with:

```bash
launch_copper.sh -l 2 -t file_and_stdout -T 2 -s 40
```

## Notes

* Copper currently does not support write operations (see the sketch below).
* Only the following file-system operations are supported: init, open, read, readdir, readlink, getattr, ioctl, destroy.
* Copper works only from compute nodes; it requires a minimum of 2 nodes and scales up to any number of nodes (Aurora's maximum is 10,624 nodes).
* The recommended number of trees is 1 or 2.
* The recommended max cacheable byte size is 10 MB to 100 MB.
* More examples are available at https://github.com/argonne-lcf/copper/tree/main/examples/example3 and https://alcf-copper-docs.readthedocs.io/en/latest/.
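
As an illustration of the read-only restriction above (a hypothetical snippet with placeholder file paths):

```bash
# Reads through the Copper mount are served cooperatively by the cache.
cat /tmp/${USER}/copper/lus/flare/projects/myproject/input.txt    # placeholder path: works

# Writes through the mount fail, since Copper supports no write operations.
echo "x" > /tmp/${USER}/copper/lus/flare/projects/myproject/out.txt   # expected to fail
```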