diff --git a/docs/aurora/data-management/copper/copper.gif b/docs/aurora/data-management/copper/copper.gif
new file mode 100644
index 000000000..a8c18ef05
Binary files /dev/null and b/docs/aurora/data-management/copper/copper.gif differ
diff --git a/docs/aurora/data-management/copper/copper.md b/docs/aurora/data-management/copper/copper.md
new file mode 100644
index 000000000..713a37967
--- /dev/null
+++ b/docs/aurora/data-management/copper/copper.md
@@ -0,0 +1,85 @@
+# Copper
+
+Copper is a cooperative caching layer for scalable parallel data movement on exascale supercomputers, developed at the Argonne Leadership Computing Facility.
+
+## Introduction
+
+Copper is a **read-only** cooperative caching layer designed to enable scalable data loading on very large numbers of compute nodes. It avoids the I/O bottleneck in the storage network and instead uses the compute network for data movement.
+
+The current intended use of Copper is to improve the performance of Python imports (dynamic shared-library loading) on Aurora. However, Copper can be used to improve the performance of any kind of redundant data loading on a supercomputer.
+
+Copper is recommended for any application (preferably Python, with I/O volumes below ~500 MB) that needs to scale beyond 2k nodes.
+
+![Copper Workflow](copper.gif "Copper Workflow Architecture")
+
+
+
+## How to use Copper on Aurora
+
+In your job script, or from an interactive session, load and start Copper:
+
+```bash
+module load copper
+launch_copper.sh
+```
+
+Then run your `mpiexec` command as you normally would.
+
+If you want your I/O to go through Copper, prepend ```/tmp/${USER}/copper/``` to your paths. With this prefix, only the root compute node performs I/O directly against the Lustre file system.
+If ```/tmp/${USER}/copper/``` is not prepended to your paths, all compute nodes perform I/O directly against the Lustre file system.
+
+For example, if you have a local conda environment located at ```/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env```, you need to prepend the Copper mount point so the path becomes ```/tmp/${USER}/copper/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env```.
+The same should be done for any type of path, such as PYTHONPATH, conda paths, and your input file paths.
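+The pieces above can be combined into a single job script. The sketch below is illustrative rather than prescriptive: the project name, queue, node count, and conda-environment path are placeholders, and rank counts should be adjusted for your application. The sections that follow show the individual `mpirun`/`mpiexec` invocations in isolation.
+
+```bash
+#!/bin/bash
+#PBS -l select=2
+#PBS -l walltime=00:30:00
+#PBS -A <your_project>        # placeholder project name
+#PBS -q <your_queue>          # placeholder queue
+
+cd ${PBS_O_WORKDIR}
+
+# Start the Copper caching layer on the compute nodes of this job
+module load copper
+launch_copper.sh
+
+# Prepend the Copper mount point so only the root node reads directly from Lustre
+CONDA_ENV=/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env   # illustrative path
+export PYTHONPATH=/tmp/${USER}/copper${CONDA_ENV}:${PYTHONPATH}
+
+NNODES=$(wc -l < ${PBS_NODEFILE})
+RANKS_PER_NODE=12
+mpiexec -np $(( NNODES * RANKS_PER_NODE )) -ppn ${RANKS_PER_NODE} -genvall \
+    python3 -c "import numpy; print(numpy.__file__)"
+
+# Optionally stop Copper at the end of the job
+stop_copper.sh
+```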
+
+
+
+Python Example
+
+```bash
+time mpirun --np ${NRANKS} --ppn ${RANKS_PER_NODE} --cpu-bind=list:4:9:14:19:20:25:56:61:66:71:74:79 --genvall \
+    --genv=PYTHONPATH=/tmp/${USER}/copper/lus/flare/projects/Aurora_deployment/kaushik/copper/oct24/copper/run/copper_conda_env \
+    python3 -c "import numpy; print(numpy.__file__)"
+
+```
+
+Non-Python example
+
+```bash
+time mpiexec -np $ranks -ppn 12 --cpu-bind list:4:9:14:19:20:25:56:61:66:71:74:79 --no-vni -genvall \
+    /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/run/aurora/wrapper.sh \
+    /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/build_ws1024/bin/thundersvm-train \
+    -s 0 -t 2 -g 1 -c 10 -o 1 /tmp/${USER}/copper/lus/flare/projects/CSC250STDM10_CNDA/kaushik/thunder/svm_mpi/data/sc-40-data/real-sim_M100000_K25000_S0.836
+```
+
+Finally, you can optionally run ```stop_copper.sh``` when you are done.
+
+
+## Copper Options
+
+```bash
+
+ -l log_level                 [Allowed values : 6[no logging], 5[less logging], 4, 3, 2, 1[more logging]] [Default : 6]
+ -t log_type                  [Allowed values : file or file_and_stdout] [Default : file]
+ -T trees                     [Allowed values : any number] [Default : 1]
+ -M max_cacheable_byte_size   [Allowed values : any number in bytes] [Default : 10MB]
+ -s sleeptime                 [Allowed values : any number] [Default : 20 seconds] Recommended to use 60 seconds for 4k nodes
+ -b physcpubind               [Allowed values : "CORE NUMBER-CORE NUMBER"] [Default : "48-51"]
+
+```
+
+For example, you can change the default values to
+
+```bash
+launch_copper.sh -l 2 -t stdout_and_file -T 2 -s 40
+```
+
+## Notes
+
+* Copper currently does not support write operations.
+* Only the following file system operations are supported: init, open, read, readdir, readlink, getattr, ioctl, destroy.
+* Copper requires a minimum of 2 nodes and scales up to any number of nodes (Aurora max: 10624 nodes).
+* The recommended number of trees is 1 or 2.
+* The recommended max cacheable byte size is 10 MB to 100 MB.
+* Copper is to be used only from the compute nodes.
+* More examples are available at https://github.com/argonne-lcf/copper/tree/main/examples/example3 and https://alcf-copper-docs.readthedocs.io/en/latest/.
+
diff --git a/docs/aurora/data-management/daos/core-nic-binding.png b/docs/aurora/data-management/daos/core-nic-binding.png
new file mode 100644
index 000000000..1a1cf8721
Binary files /dev/null and b/docs/aurora/data-management/daos/core-nic-binding.png differ
diff --git a/docs/aurora/data-management/daos/daos-overview.md b/docs/aurora/data-management/daos/daos-overview.md
index f068fa30f..1ad120042 100644
--- a/docs/aurora/data-management/daos/daos-overview.md
+++ b/docs/aurora/data-management/daos/daos-overview.md
@@ -1,38 +1,59 @@
 # DAOS Architecture
-DAOS is a high performance storage system for storing checkpoints and analysis files.
-DAOS is fully integrated with the wider Aurora compute fabric as can be seen in the overall storage architecture below.
+DAOS is a major file system on Aurora, with 230 PB of capacity delivering up to >30 TB/s across 1024 DAOS server storage nodes.
+DAOS is an open-source, software-defined object store designed for massively distributed Non-Volatile Memory (NVM) and NVMe SSDs.
+DAOS presents a unified storage model with a native key-array-value storage interface supporting POSIX, MPI-IO, DFS, and HDF5.
+Users can use DAOS for their I/O and checkpointing on Aurora.
+DAOS is fully integrated with the wider Aurora compute fabric, as can be seen in the overall storage architecture below.
![Aurora Storage Architecture](aurora-storage-architecture.png "Aurora Storage Architecture")
+![Aurora Interconnect](dragonfly.png "Aurora Slingshot Dragonfly")
+
+
 # DAOS Overview
-Users should submit a request as noted below to have a DAOS pool created for their project.
-Once created, users may create and manage containers within the pool as they wish.
+
+The first step in using DAOS is to get a DAOS pool allocated for your project.
+Submit a request as described below to have a DAOS pool created for your project.
+
+## DAOS Pool Allocation
+
+A DAOS pool is a physically allocated, dedicated storage space for your project.
+
+Email support@alcf.anl.gov to request a DAOS pool, providing the following information:
+
+* Project name
+* ALCF usernames
+* Total space requested (typically 100 TB or more)
+* Justification
+* Preferred pool name
+
 ### Note
 This is an initial test DAOS configuration and, as such, any data on the DAOS system will eventually be deleted when the configuration is changed into a larger system.
 Warning will be given before the system is wiped to allow time for users to move any important data off.
-
-## Pool Allocation
-Email support@alcf.anl.gov to request a pool and provide which primary project you're on.
-The pool will be set to allow anyone in the project unix group to access the pool.
-Please request the capacity of allocation you would like.

 ## Modules

-Please load the `daos/base` module when using DAOS. This should be done when logging into the UAN or when using DAOS from a compute job script:
+Please load the `daos` module when using DAOS. This should be done on the login node (UAN) or on the compute node (in your job script):

 ```bash
+module use /soft/modulefiles
 module load daos/base
 ```

 ## Pool
+
+A pool is a dedicated storage space allocated to your project. Once your pool has been allocated, you can start using it.
+
 Confirm you are able to query the pool via:
+
 ```bash
-daos pool query <pool_name>
+daos pool query <pool_name>
 ```

 Example output:
 ```bash
-daos pool query software
+daos pool query hacc
 Pool 050b20a3-3fcc-499b-a6cf-07d4b80b04fd, ntarget=640, disabled=0, leader=2, version=131
 Pool space info:
 - Target(VOS) count:640
@@ -45,154 +66,303 @@ Total size: 6.0 TB
 Rebuild done, 4 objs, 0 recs
 ```
-
-## Container
-The container is the basic unit of storage.
-A POSIX container can contain hundreds of millions of files, you can use it to store all of your data.
+
+
+## DAOS Container
+
+The container is the basic unit of storage. A POSIX container can hold hundreds of millions of files, and you can use it to store all of your data.
 You only need a small set of containers; perhaps just one per major unit of project work is sufficient.
+There are three modes in which DAOS containers can be used, each described in the sections below:
+1. POSIX container, POSIX mode
+2. POSIX container, MPI-IO mode
+3. DFS container, through the DAOS APIs
+
+
+### Create a posix container
+
+
 ```bash
-mkcont --type POSIX --pool --user $USER --group
+$ DAOS_POOL=datascience
+$ DAOS_CONT=LLM-GPT-1T
+$ daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:1
+  Container UUID : 59747044-016b-41be-bb2b-22693333a380
+  Container Label: LLM-GPT-1T
+  Container Type : POSIX
+
+Successfully created container 59747044-016b-41be-bb2b-22693333a380
+
 ```
-Example output:
+If you prefer higher data protection and recovery, you can use `--properties rd_fac:2`; if you do not need data protection and recovery, you can omit `--properties rd_fac:1`.
+We recommend using at least `--properties rd_fac:1`.
+
+![data model ](datamodel.png "DAOS data model")
+
+
+## DAOS sanity checks
+
+If any of the following commands results in an error, DAOS is currently down.
+The 'Out of group or member list' error is an exception and can be safely ignored; this error message will be fixed in the next DAOS release.
+
 ```bash
-> mkcont --type=POSIX --pool iotest --user harms --group users random
-> Container UUID : 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
-> Container Label: random
-> Container Type : POSIX
->
-> Successfully created container 9a6989d3-3835-4521-b9c6-ba1b10f3ec9c
-> 0
+module use /soft/modulefiles
+module load daos/base
+env | grep DRPC
+ps -ef | grep daos
+clush --hostfile ${PBS_NODEFILE} 'ps -ef | grep agent | grep -v grep' | dshbak -c # to check on all compute nodes
+export DAOS_POOL=Your_allocated_pool_name
+daos pool query ${DAOS_POOL}
+daos cont list ${DAOS_POOL}
+daos container get-prop $DAOS_POOL_NAME $DAOS_CONT_NAME
+
 ```
-Alternatively, the `daos` command can be used to create a container directly.

-### Mount a container
+* Look for messages like `Rebuild busy` and `state degraded` in the `daos pool query` output.
+* Look for messages like `Health (status) : UNCLEAN` in the `get-prop` output.
+
+
+```bash
+daos pool autotest $DAOS_POOL_NAME
+daos container check --pool=$DAOS_POOL_NAME --cont=$DAOS_CONT_NAME
+```
+
+
+### Mount a posix container
 Currently, you must manually mount your container prior to use on any node you are working on.
 In the future, we hope to automate some of this via additional `qsub` options.

-#### UAN
-Create a directory to mount the POSIX container on and then mount the container via `dfuse`.
+#### To mount a posix container on a login node
+
+
 ```bash
-dfuse --pool= --cont= -m $HOME/daos//
+
+mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT}
+start-dfuse.sh -m /tmp/${DAOS_POOL}/${DAOS_CONT} --pool ${DAOS_POOL} --cont ${DAOS_CONT} # To mount
+mount | grep dfuse # To confirm it is mounted
+
+# Mode 1
+ls /tmp/${DAOS_POOL}/${DAOS_CONT}
+cd /tmp/${DAOS_POOL}/${DAOS_CONT}
+cp ~/temp.txt /tmp/${DAOS_POOL}/${DAOS_CONT}/
+cat /tmp/${DAOS_POOL}/${DAOS_CONT}/temp.txt
+
+fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT} # To unmount
+
 ```
+
+#### To mount a posix container on Compute Nodes

-> mkdir -p $HOME/daos/iotest/random
-> dfuse --pool=iotest --cont=random -m $HOME/daos/iotest/random
-> mount | grep iotest
-> dfuse on /home/harms/daos/iotest/random type fuse.daos (rw,nosuid,nodev,noatime,user_id=4211,group_id=100,default_permissions)

 #### Compute Node
-From a compute node, you need to mount the container on all compute nodes.
+You need to mount the container on all compute nodes.
-We provide some scripts to help perform this from within your job script.
-More examples are available in `/soft/daos/examples`.
-The following example uses two support scripts, `launch-dfuse.sh` and `clean-dfuse.sh`, to startup dfuse on each compute node and then shut it down at job end, respectively. -Job Script Example: ```bash -#!/bin/bash -#PBS -A -#PBS -lselect=1 -#PBS -lwalltime=30:00 -#PBS -k doe -# -# Test case for MPI-IO code example +launch-dfuse.sh ${DAOS_POOL_NAME}:${DAOS_CONT_NAME} # launched using pdsh on all compute nodes mounted at: /tmp// +mount | grep dfuse # To confirm if its mounted -# ranks per node -rpn=4 +ls /tmp/${DAOS_POOL}/${DAOS_CONT}/ -# threads per rank -threads=1 +clean-dfuse.sh ${DAOS_POOL_NAME}:${DAOS_CONT_NAME} # To unmount on all nodes +``` +DAOS Data mover instruction is provided at [here](../moving_data_to_aurora/daos_datamover.md). -# nodes per job -nnodes=$(cat $PBS_NODEFILE | wc -l) +## Job Submission -# Verify the pool and container are set -if [ -z "$DAOS_POOL" ]; -then - echo "You must set DAOS_POOL" - exit 1 -fi +The `-ldaos=default` switch will ensure that DAOS is accessible on the compute nodes. -if [ -z "$DAOS_CONT" ]; -then - echo "You must set DAOS_CONT" - exit 1 -fi +Job submission without requesting DAOS: +```bash +qsub -l select=1 -l walltime=01:00:00 -A Aurora_deployment -k doe -l filesystems=flare -q lustre_scaling ./pbs_script1.sh or - I +``` -# load daos/base module (if not loaded) -module load daos/base +Job submission with DAOS: +```bash +qsub -l select=1 -l walltime=01:00:00 -A Aurora_deployment -k doe -l filesystems=flare -q lustre_scaling -l daos=default ./pbs_script1.sh or - I +``` -# print your module list (useful for debugging) -module list -# print your environment (useful for debugging) -#env +## NIC and Core Binding -# turn on output of what is executed -set -x +Each Aurora compute node has 8 NICs and each DAOS server node has 2 NICs. +Each NIC is capable of driving 20-25 GB/s unidirection for data transfer. +Every read and write goes over the NIC and hence NIC binding is the key to achieve good performance. -# -# clean previous mounts (just in case) -# -clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} +For 12 PPN, the following binding is recommended. -# launch dfuse on all compute nodes -# will be launched using pdsh -# arguments: -# pool:container -# may list multiple pool:container arguments -# will be mounted at: -# /tmp/\/\ -launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} +```bash +CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79 +``` +![Sample NIC to Core binding](core-nic-binding.png "Sample NIC to Core binding") + + + +## Interception library for posix containers + +The interception library (IL) is a next step in improving DAOS performance. This provides kernel-bypass for I/O data, leading to improved performance. +The libioil IL will intercept basic read and write POSIX calls while all metadata calls still go through dFuse. The libpil4dfs IL should be used for both data and metadata calls to go through dFuse. +The IL can provide a large performance improvement for bulk I/O as it bypasses the kernel and commuNICates with DAOS directly in userspace. +It will also take advantage of the multiple NICs on the node based on how many MPI processes are running on the node and which CPU socket they are on. + + + +![Interception library](interception.png "Interception library") + + + +```bash +Interception library for POSIX mode + +mpiexec # no interception +mpiexec --env LD_PRELOAD=/usr/lib64/libioil.so # only data is intercepted +mpiexec --env LD_PRELOAD=/usr/lib64/libpil4dfs.so # preferred - both metadata and data is intercepted. 
This provides close to DFS mode performance. -# change to submission directory -cd $PBS_O_WORKDIR -# run your job(s) -# these test cases assume 'testfile' is in the CWD -cd /tmp/${DAOS_POOL}/${DAOS_CONT} - -echo "write" - -mpiexec -np $((rpn*nnodes)) \ --ppn $rpn \ --d $threads \ ---cpu-bind numa \ ---no-vni \ # enables DAOS access --genvall \ -/soft/daos/examples/src/posix-write - -echo "read" -mpiexec -np $((rpn*nnodes)) \ --ppn $rpn \ --d $threads \ ---cpu-bind numa \ ---no-vni \ # enables DAOS access --genvall \ -/soft/daos/examples/src/posix-read - -# cleanup dfuse mounts -clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} - -exit 0 ``` -## Job Submission -The above job script expects two environment variables which you set to the relevant pool and container. -The `-ldaos=default` switch will ensure that DAOS is available on the compute node. +## Sample job script + +Currently, ``--no-vni`` is required in the ``mpiexec`` command to use DAOS. + +```bash + +#!/bin/bash -x +#PBS -l select=512 +#PBS -l walltime=01:00:00 +#PBS -A Aurora_deployment +#PBS -q lustre_scaling +#PBS -k doe +#PBS -ldaos=default + +# qsub -l select=512:ncpus=208 -l walltime=01:00:00 -A Aurora_deployment -l filesystems=flare -q lustre_scaling -ldaos=default ./pbs_script.sh or - I + + +# please do not miss -ldaos=default in your qsub :'( + +export TZ='/usr/share/zoneinfo/US/Central' +date +module use /soft/modulefiles +module load daos +env | grep DRPC #optional +ps -ef|grep daos #optional +clush --hostfile ${PBS_NODEFILE} 'ps -ef|grep agent|grep -v grep' | dshbak -c #optional +DAOS_POOL=datascience +DAOS_CONT=thundersvm_exp1 +daos pool query ${DAOS_POOL} #optional +daos cont list ${DAOS_POOL} #optional +daos container destroy ${DAOS_POOL} ${DAOS_CONT} #optional +daos container create --type POSIX ${DAOS_POOL} ${DAOS_CONT} --properties rd_fac:1 +daos container query ${DAOS_POOL} ${DAOS_CONT} #optional +daos container get-prop ${DAOS_POOL} ${DAOS_CONT} #optional +daos container list ${DAOS_POOL} #optional +launch-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} # To mount on a compute node + +# mkdir -p /tmp/${DAOS_POOL}/${DAOS_CONT} # To mount on a login node +# start-dfuse.sh -m /tmp/${DAOS_POOL}/${DAOS_CONT} --pool ${DAOS_POOL} --cont ${DAOS_CONT} # To mount on a login node + +mount|grep dfuse #optional +ls /tmp/${DAOS_POOL}/${DAOS_CONT} #optional + +# cp /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 /tmp/${DAOS_POOL}/${DAOS_CONT} #one time +# daos filesystem copy --src /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 --dst daos://tmp/${DAOS_POOL}/${DAOS_CONT} # check https://docs.daos.io/v2.4/testing/datamover/ + + +cd $PBS_O_WORKDIR +echo Jobid: $PBS_JOBID +echo Running on nodes `cat $PBS_NODEFILE` +NNODES=`wc -l < $PBS_NODEFILE` +RANKS_PER_NODE=12 # Number of MPI ranks per node +NRANKS=$(( NNODES * RANKS_PER_NODE )) +echo "NUM_OF_NODES=${NNODES} TOTAL_NUM_RANKS=${NRANKS} RANKS_PER_NODE=${RANKS_PER_NODE}" +CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79 + +export THUN_WS_PROB_SIZE=1024 +export ZE_FLAT_DEVICE_HIERARCHY=COMPOSITE +export AFFINITY_ORDERING=compact +export RANKS_PER_TILE=1 +export PLATFORM_NUM_GPU=6 +export PLATFORM_NUM_GPU_TILES=2 + + +date +LD_PRELOAD=/usr/lib64/libpil4dfs.so mpiexec -np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind ${CPU_BINDING1} \ + --no-vni -genvall thunder/svm_mpi/run/aurora/wrapper.sh thunder/svm_mpi/build_ws1024/bin/thundersvm-train \ + -s 0 -t 2 -g 1 -c 10 -o 1 
/tmp/datascience/thunder_1/real-sim_M100000_K25000_S0.836 +date + +clean-dfuse.sh ${DAOS_POOL}:${DAOS_CONT} #to unmount on compute node +# fusermount3 -u /tmp/${DAOS_POOL}/${DAOS_CONT} #to unmount on login node ``` -qsub -v DAOS_POOL=,DAOS_CONT= -ldaos=default ./job-script.sh + +## MPI-IO Mode + +Mode 2 + +The ROMIO MPI-IO layer provides multiple I/O backends including a custom DAOS backend. +MPI-IO can be used with dFuse and the interception library when using the `ufs` backend but the `daos` backend will provide optimal performance. +In order to use this, one can prefix the file names with `daos:` which will tell MPI-IO to use the DAOS backend. + + + +```bash + +export ROMIO_PRINT_HINTS=1 + +echo "cb_nodes 128" >> ${PBS_O_WORKDIR}/romio_hints + +mpiexec --env ROMIO_HINTS = romio_hints_file program daos:/mpi_io_file.data + +or + +mpiexec --env MPICH_MPIIO_HINTS = path_to_your_file*:cb_config_list=#*:2# + :romio_cb_read=enable + :romio_cb_write=enable + :cb_nodes=32 + program daos:/mpi_io_file.data + + ``` -## oneScratch -The current DAOS system is configured with 20 server nodes. -The remaining balance of server nodes is still reserved for internal testing. +## DFS Mode + +Mode 3 + + +DFS is the user level API for DAOS. +This API is very similar to POSIX but still has many differences that would require code changes to utilize DFS directly. +The DFS API can provide the best overall performance for any scenario other than workloads which benefit from caching. + -### Hardware +Reference code for using DAOS through DFS mode and DAOS APIs +Full code at ``` /soft/daos/examples/src ``` + +```bash +#include +#include +#include +#include +#include +int main(int argc, char **argv) +{ + dfs_t *dfs; + d_iov_t global; + ret = MPI_Init(&argc, &argv); + ret = MPI_Comm_rank(MPI_COMM_WORLD, &rank); + ret = dfs_init(); + ret = dfs_connect(getenv("DAOS_POOL"), NULL, getenv("DAOS_CONT"), O_RDWR, NULL, &dfs); + ret = dfs_open(dfs, NULL, filename, S_IFREG|S_IRUSR|S_IWUSR, O_CREAT|O_WRONLY, obj_class, chunk_size, NULL, &obj); + ret = dfs_write(dfs, obj, &sgl, off, NULL); + ret = dfs_read(dfs, obj, &sgl, off, &read, NULL); + ret = dfs_disconnect(dfs); + ret = daos_fini(); + ret = MPI_Finalize(); +} + + +``` + +## DAOS Hardware Each DAOS server nodes is based on the Intel Coyote Pass platform. * (2) Xeon 5320 CPU (Ice Lake) * (16) 32GB DDR4 DIMMs @@ -202,77 +372,87 @@ Each DAOS server nodes is based on the Intel Coyote Pass platform. ![DAOS Node](daos-node.png "DAOS CYP Node") -### Performance -The peak performance of the oneScratch storage is approximately 800 GB/s to 1000 GB/s. -Obtaining this performance should be possible from a job running in the available user partition but there are many considerations to understand achieving good performance. - -#### Single Node -First is to consider the throughput you can obtain from a single compute node in the test case. -* dfuse is a single process and will therefore attach to a single NIC, limiting throughput to ~20 GB/s per compute node. -* dfuse offers caching and can thus show performance greater than theoretical due to cache effects based on the workload running. -* MPI-IO, Intercept Library, or other interfaces that use libdfs will bond to a NIC per-process. - * DAOS will bond to NICs in a round-robin fashion to NICs which are located on the same socket. - * For Aurora, DAOS processes running on socket 0 will only use 4 NICs assuming at least four processes are used but will not use more until the second socket is used. 
- * IF running with a lower process count such as 24, the processes should be distributed between socket 0 and socket 1 for best I/O performance. - -#### Dragonfly Groups -The next element is to consider how many dragonfly groups the job is running within. -Each dragonfly groups has 2 links to each I/O group and the current DAOS servers are distributed amoung the full 8 I/O groups. -* If a single compute group is used, that limits performance to 8 groups 2 links/group * 25 GB/s/link = 400 GB/s -* Thus it requires at least 2 compute groups to reach max performance. -* However, Slingshot support dynamic routing allowing traffic to use non-minimal routes via other compute groups which will result in performance greater that the theoretical peak of the number of compute groups being used. - * Dynamic routing performance will be sensitive to other workloads running on the system and not be consistent. - -![Aurora Interconnect](daos-ss-dragonfly.png "Aurora Slingshot Dragonfly") - -#### Object Class -The object class selected for your container will influence the performance potential of I/O. -An object class which is [SG]X is distributed across all of the targets in the system. -* All pools in the test system are enabled to use 100% of the targets. -* SX/GX will provide best performance for large data which distributes on all server targets but will lower IOp performance for metadata as each target must be communicated with. -* S1/G1 will provide good performance for small data with need for high IOps as it places data only on 1 target. - - -# Porting I/O to DAOS -There is no need to specifically modify code to use DAOS, however, DAOS can be used in several different modes so some consideration should be given on where to start. -The diagram below provides a suggested path to testing an application beginning at the green downward arrow. - -![DAOS Porting Stragegy](daos-porting-strategy.png "DAOS Porting Strategy") - -## dFuse -The first suggested step to test with dFuse. -dFuse utilizes the linux FUSE layer to provide a POSIX compatible API layer for any application that uses POSIX directly or indirectly though an I/O library. -The dFuse component provides a simple method for using DAOS with no modifications. -Once the DAOS container is mounted via dFuse, applications and utilities can access data as before. -dFuse will scale as more compute nodes are added, but is not efficient on a per-node basis so will not provide ideal results at small scale. -dFuse doesn't provide ideal metadata performance either but it does have the advatange of utilizing the Linux page cache, so workloads that benefit from caching may see better performance than other methods. - -## Interception Library -The interception library (IL) is a next step in improving DAOS performance. -The IL will intercept basic read and write POSIX calls while all metadata calls still go through dFuse. -The IL can provide a large performance improvement for bulk I/O as it bypasses the kernel and communicates with DAOS directly in userspace. -It will also take advantage of the multiple NICs on the node based on who many MPI processes are running on the node and which CPU socket they are on. - -## MPI-IO -The ROMIO MPI-IO layer provides multiple I/O backends including a custom DAOS backend. -MPI-IO can be used with dFuse and the interception library when using the `ufs` backend but the `daos` backend will provide optimal performance. 
-In order to use this, one can prefix the file names with `daos:` which will tell MPI-IO to use the DAOS backend. -## HDF DAOS VOL -The HDF5 library can be used with POSIX or MPI-IO layers utilizing dFuse, IL or MPI-IO with DAOS. -The first suggestion would be to start with dFuse and then move to MPI-IO. -Once the performance of these methods has been evaluated, using the custom DAOS VOL can be attempted. -The DAOS VOL will provide a performance improvement under certain types of HDF workloads. -Using the VOL has other complications/benefits which should be considered as well. -The VOL maps a single HDF file into a single container. -This means a workload that tries to use multiple HDF files per checkpoint, will create one DAOS container for each one. -This is not ideal and will likely lead to performance issues. -The HDF code should be such that a single HDF file is used per checkpoint/analysis file/etc. -An entire campaign might generate thousands of containers which might be some overhead on an individual to manage so many containers. -As such, it might be beneficial to convert the code to write each checkpoint/time step into a HDF Group and then a single HDF file can be used for the entire campaign. -This solution is more DAOS specific, as it will be functionally compatible on any system, however a traditinoal PFS may lose the entire contents of the file if a failure occurs during write while DAOS will be resilent to those failures and rollback to a previous good version. - -## DFS -DFS is the user level API for DAOS. -This API is very similar to POSIX but still has many differences that would require code changes to utilize DFS directly. -The DFS API can provide the best overall performance for any scenario other than workloads which benefit from caching. 
+
+## Darshan profiler for DAOS
+
+Currently, you need to install your own local darshan-daos profiler.
+To profile, you must use DFS mode (3) or POSIX mode with the interception library.
+
+```bash
+module use /soft/modulefiles
+module load daos
+module list
+git clone https://github.com/darshan-hpc/darshan.git
+cd darshan
+git checkout snyder/dev-daos-module-3.4
+./prepare.sh
+mkdir /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-install
+
+./configure --prefix=/home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-install \
+            --with-log-path=/home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs \
+            --with-jobid-env=PBS_JOBID \
+            CC=mpicc --enable-daos-mod
+
+make && make install
+
+chmod 755 ~/soft/profilers/darshan-daos/darshan/darshan-install/darshan-mk-log-dirs.pl
+mkdir /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs
+cd /home/kaushikvelusamy/soft/profilers/darshan-daos/darshan-logs
+~/soft/profilers/darshan-daos/darshan/darshan-install/darshan-mk-log-dirs.pl
+~/soft/profilers/darshan-daos/darshan-install/bin/darshan-config --log-path
+
+```
+
+Preload Darshan first, then the DAOS interception library:
+
+```
+mpiexec --env LD_PRELOAD=~/soft/profilers/darshan-daos/darshan-install/lib/libdarshan.so:/usr/lib64/libpil4dfs.so \
+   -np 32 -ppn 16 --no-vni -genvall \
+   ior -a DFS --dfs.pool=datascience_ops --dfs.cont=ior_test1 \
+   -i 5 -t 16M -b 2048M -w -r -C -e -c -v -o /ior_2.dat
+```
+
+
+Install darshan-util on your laptop (for post-processing the logs):
+
+
+```bash
+
+conda info --envs
+conda activate env-non-mac-darshan-temp
+source /Users/kvelusamy/Desktop/tools/spack/share/spack/setup-env.sh
+
+spack install darshan darshan-util
+export DYLD_FALLBACK_LIBRARY_PATH=/Users/kvelusamy/Desktop/tools/spack/opt/spack/darwin-ventura-m1/apple-clang-14.0.3/darshan-util-3.4.4-od752jyfljrrey3d4gjeypdcppho42k2/lib/:$DYLD_FALLBACK_LIBRARY_PATH
+
+darshan-parser ~/Downloads/kaushikv_ior_id917110-44437_10-23-55830-632270104473632905_1.darshan
+python3 -m darshan summary ~/Downloads/kaushikv_ior_id917110-44437_10-23-55830-632270104473632905_1.darshan # coming soon
+
+```
+
+## Cluster Size
+
+The DAOS cluster size is the number of available DAOS servers. While we are working toward making the full 1024 DAOS servers available to users, the number of DAOS nodes that are up at any given time may vary. Please check with support, or run an IOR test, to estimate how many DAOS servers are currently available.
+
+
+![expected Bandwidth](expectedBW.png "Expected number of daos servers and its approximate expected bandwidth")
+
+
+## Best practices
+
+```bash
+Check qsub -l daos=default
+DAOS sanity checks mentioned above
+Did you load the DAOS module? module load daos
+Do you have your DAOS pool allocated? daos pool query datascience
+Is the DAOS client running on all your nodes? ps -ef | grep daos
+Is your container mounted on all nodes? mount | grep dfuse
+Can you ls in your container? ls /tmp/${DAOS_POOL}/${DAOS_CONT}
+Did your I/O actually fail?
+What is the health property of your container? daos container get-prop $DAOS_POOL $CONT
+Is your space full? Min and max: daos pool query datascience
+Does your query show failed targets or rebuild in process?
daos pool query datascience +daos pool autotest +Daos container check + +``` diff --git a/docs/aurora/data-management/daos/daos-porting-strategy.pdf b/docs/aurora/data-management/daos/daos-porting-strategy.pdf deleted file mode 100644 index 24f4fd451..000000000 Binary files a/docs/aurora/data-management/daos/daos-porting-strategy.pdf and /dev/null differ diff --git a/docs/aurora/data-management/daos/daos-porting-strategy.png b/docs/aurora/data-management/daos/daos-porting-strategy.png deleted file mode 100644 index 9a8edd5cc..000000000 Binary files a/docs/aurora/data-management/daos/daos-porting-strategy.png and /dev/null differ diff --git a/docs/aurora/data-management/daos/daos-ss-dragonfly.png b/docs/aurora/data-management/daos/daos-ss-dragonfly.png deleted file mode 100644 index 3e3c64c3c..000000000 Binary files a/docs/aurora/data-management/daos/daos-ss-dragonfly.png and /dev/null differ diff --git a/docs/aurora/data-management/daos/datamodel.png b/docs/aurora/data-management/daos/datamodel.png new file mode 100644 index 000000000..195b0459c Binary files /dev/null and b/docs/aurora/data-management/daos/datamodel.png differ diff --git a/docs/aurora/data-management/daos/dragonfly.png b/docs/aurora/data-management/daos/dragonfly.png new file mode 100644 index 000000000..30214a4d2 Binary files /dev/null and b/docs/aurora/data-management/daos/dragonfly.png differ diff --git a/docs/aurora/data-management/daos/expectedBW.png b/docs/aurora/data-management/daos/expectedBW.png new file mode 100644 index 000000000..e75659c25 Binary files /dev/null and b/docs/aurora/data-management/daos/expectedBW.png differ diff --git a/docs/aurora/data-management/daos/interception.png b/docs/aurora/data-management/daos/interception.png new file mode 100644 index 000000000..d54933a4b Binary files /dev/null and b/docs/aurora/data-management/daos/interception.png differ diff --git a/docs/aurora/data-management/lustre/flare.md b/docs/aurora/data-management/lustre/flare.md new file mode 100644 index 000000000..73541cea3 --- /dev/null +++ b/docs/aurora/data-management/lustre/flare.md @@ -0,0 +1,6 @@ +# Flare Filesystem + +**Flare** is a 91 PB Lustre Filesystem with 160 OSTs, 40 MDTs and 48 Gateway nodes mounted at ```/lus/flare/projects/``` with a peak theoritical performance of 650GB/s. You should launch jobs only from this flare space. + +Home is 12 PB **Gecko** Lustre Filesystem with 32 OSTs and 12 MDTs. + diff --git a/docs/aurora/data-management/lustre/gecko.md b/docs/aurora/data-management/lustre/gecko.md deleted file mode 100644 index c54f65478..000000000 --- a/docs/aurora/data-management/lustre/gecko.md +++ /dev/null @@ -1,54 +0,0 @@ -# Gecko Filesystem - -## Data Transfer - -Currently, scp and SFTP are the only ways to transfer data to/from Aurora. - -### Transferring files from non-ALCF systems - -As an expedient for initiating ssh sessions to Aurora login nodes via the bastion indirect nodes, and to enable scp from remote (non ALCF) hosts to Aurora login nodes, follow these steps: - -1. Create SSH keys on the laptop/desktop/remote machine. See "Creating SSH Keys" section on [this page](https://help.cels.anl.gov/docs/linux/ssh/): -2. Add the lines listed below to your ~/.ssh/config file on the remote host. That is, you should do this on your laptop/desktop, from which you are initiating ssh login sessions to Aurora via bastion, and on other non-ALCF host systems from which you want to copy files to Aurora login nodes using scp. 
- -``` -$ cat ~/.ssh/config - -Host *.aurora.alcf.anl.gov aurora.alcf.anl.gov - ProxyCommand ssh @bastion.alcf.anl.gov -q -W %h:%p -``` - -3. Copy the public key (*.pub) from ~/.ssh folder on the remote machine to ~/.ssh/authorized_keys file on Aurora (login.aurora.alcf.anl.gov) - -When you use an SSH proxy, it takes the authentication mechanism from the local host and applies it to the farthest-remote host, while prompting you for the “middle host” separately. So, when you run the ssh @login.aurora.alcf.anl.gov command on your laptop/desktop, you'll be prompted for two ALCF authentication codes - first the Mobilepass+ or Cryptocard passcode for the bastion, and then the SSH passphrase for Aurora. Likewise, when you run scp from a remote host to copy files to Aurora login nodes, you'll be prompted for two ALCF authentication codes codes - first the Mobilepass+ or Cryptocard passcode and then the SSH passphrase. - - -### Transferring files from other ALCF systems - -With the bastion pass-through nodes currently used to access both Sunspot and Aurora, users will find it helpful to modify their `.ssh/config` files on Aurora appropriately to facilitate transfers to Aurora from other ALCF systems. These changes are similar to what Sunspot users may have already implemented. From an Aurora login-node, this readily enables one to transfer files from Sunspot's `gila` filesystem or one of the production filesystems at ALCF (`home` and `eagle`) mounted on an ALCF system's login node. With the use of `ProxyJump` below, entering the MobilePass+ or Cryptocard passcode twice will be needed (once for bastion and once for the other resource). A simple example shows the `.ssh/config` entries for Polaris and the `scp` command for transferring from Polaris: - -``` -$ cat .ssh/config -knight@aurora-uan-0009:~> cat .ssh/config -Host bastion.alcf.anl.gov - User knight - -Host polaris.alcf.anl.gov - ProxyJump bastion.alcf.anl.gov - DynamicForward 3142 - user knight -``` - -``` -knight@aurora-uan-0009:~> scp knight@polaris.alcf.anl.gov:/eagle/catalyst/proj-shared/knight/test.txt ./ ---------------------------------------------------------------------------- - Notice to Users -... -[Password: ---------------------------------------------------------------------------- - Notice to Users -... -[Password: -knight@aurora-uan-0009:~> cat test.txt -from_polaris eagle -``` diff --git a/docs/aurora/data-management/moving_data_to_aurora/daos_datamover.md b/docs/aurora/data-management/moving_data_to_aurora/daos_datamover.md new file mode 100644 index 000000000..6f757904b --- /dev/null +++ b/docs/aurora/data-management/moving_data_to_aurora/daos_datamover.md @@ -0,0 +1,31 @@ +# To move data to your daos posix container + +## using CP + +```bash +cp /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 /tmp/${DAOS_POOL}/${DAOS_CONT} +``` + +## using daos filesystem copy + +```bash +daos filesystem copy --src /lus/flare/projects/CSC250STDM10_CNDA/kaushik/thundersvm/input_data/real-sim_M100000_K25000_S0.836 --dst daos://tmp/${DAOS_POOL}/${DAOS_CONT} +``` +You may have to replace the DAOS_POOL and DAOS_CONT label with its UUIDs. UUIDs can be copied from + +```bash +daos pool query ${DAOS_POOL} +daos container query $DAOS_POOL_NAME $DAOS_CONT_NAME +``` + +## using mpifileutils distributed CP (DCP) + +You can also use other mpifileutils binaraies. 
+ +```bash +mpifileutils/bin> ls +dbcast dbz2 dchmod dcmp dcp dcp1 ddup dfilemaker1 dfind dreln drm dstripe dsync dtar dwalk +``` + +Ref: https://docs.daos.io/v2.4/testing/datamover/ + diff --git a/docs/aurora/data-management/moving_data_to_aurora/globus.md b/docs/aurora/data-management/moving_data_to_aurora/globus.md new file mode 100644 index 000000000..8277298f5 --- /dev/null +++ b/docs/aurora/data-management/moving_data_to_aurora/globus.md @@ -0,0 +1,13 @@ +### Transfering files through Globus + +Currently only Globus personal is supported + +```bash +/soft/tools/proxychains/bin/proxychains4 -f /soft/tools/proxychains/etc/proxychains.conf /soft/tools/globusconnect/globusconnect -setup --no-gui +``` +and follow the instruction to setup personal endpoint. + +```bash +/soft/tools/proxychains/bin/proxychains4 -f /soft/tools/proxychains/etc/proxychains.conf /soft/tools/globusconnect/globusconnect -start & +``` +You can also add -restrict-paths /lus/flare/projects/YOURPROJECT to access folders outside of your home diff --git a/docs/aurora/data-management/moving_data_to_aurora/scp.md b/docs/aurora/data-management/moving_data_to_aurora/scp.md new file mode 100644 index 000000000..a9c6fa241 --- /dev/null +++ b/docs/aurora/data-management/moving_data_to_aurora/scp.md @@ -0,0 +1,37 @@ + + +## Data Transfer + +Currently, scp and SFTP are the only ways to transfer data to/from Aurora. + +### Transferring files to Aurora + +With the bastion pass-through nodes currently used to access both Sunspot and Aurora, users will find it helpful to modify their `.ssh/config` files on Aurora appropriately to facilitate transfers to Aurora from other ALCF systems. These changes are similar to what Sunspot users may have already implemented. From an Aurora login-node, this readily enables one to transfer files from Sunspot's `gila` filesystem or one of the production filesystems at ALCF (`home` and `eagle`) mounted on an ALCF system's login node. With the use of `ProxyJump` below, entering the MobilePass+ or Cryptocard passcode twice will be needed (once for bastion and once for the other resource). A simple example shows the `.ssh/config` entries for Polaris and the `scp` command for transferring from Polaris: + +``` +$ cat .ssh/config +knight@aurora-uan-0009:~> cat .ssh/config +Host bastion.alcf.anl.gov + User knight + +Host polaris.alcf.anl.gov + ProxyJump bastion.alcf.anl.gov + DynamicForward 3142 + user knight +``` + +``` +knight@aurora-uan-0009:~> scp knight@polaris.alcf.anl.gov:/eagle/catalyst/proj-shared/knight/test.txt ./ +--------------------------------------------------------------------------- + Notice to Users +... +[Password: +--------------------------------------------------------------------------- + Notice to Users +... +[Password: +knight@aurora-uan-0009:~> cat test.txt +from_polaris eagle +``` + + diff --git a/docs/aurora/data-science/frameworks/oneCCL.md b/docs/aurora/data-science/frameworks/oneCCL.md new file mode 100644 index 000000000..070294778 --- /dev/null +++ b/docs/aurora/data-science/frameworks/oneCCL.md @@ -0,0 +1,332 @@ +# oneCCL + +oneAPI Collective Communications Library (oneCCL) provides an efficient implementation of communication patterns used in deep learning. +oneCCL is governed by the UXL Foundation and is an implementation of the oneAPI specification. + +oneCCL can be used through + +1. native C++ SYCL mode +2. Horovod +3. 
PyTorch Distributed Data Parallel (DDP) + + +## Aurora oneCCL environment + +```bash +kaushikvelusamy@aurora-uan-0012:~> module load frameworks +(/opt/aurora/24.180.0/frameworks/aurora_nre_models_frameworks-2024.2.1_u1) kaushikvelusamy@aurora-uan-0012:~> echo $CCL_ROOT +/opt/aurora/24.180.0/CNDA/oneapi/ccl/2021.13.1_20240808.145507 +``` + + +**OneCCL mandatory environment variables** + +```bash +module load frameworks +echo $CCL_ROOT +export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH +export CPATH=$CCL_ROOT/include:$CPATH +export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH + +export CCL_PROCESS_LAUNCHER=pmix +export CCL_ATL_TRANSPORT=mpi +export CCL_ALLREDUCE=topo +export CCL_ALLREDUCE_SCALEOUT=rabenseifner # currently best allreduce algorithm at large scale +export CCL_BCAST=double_tree # currently best bcast algorithm at large scale + +export CCL_KVS_MODE=mpi +export CCL_CONFIGURATION_PATH="" +export CCL_CONFIGURATION=cpu_gpu_dpcpp +export CCL_KVS_CONNECTION_TIMEOUT=600 + +export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=1024 +export CCL_KVS_USE_MPI_RANKS=1 +``` + +**OneCCL optional environment variables** + +```bash +ulimit -c unlimited +export FI_MR_ZE_CACHE_MONITOR_ENABLED=0 +export FI_MR_CACHE_MONITOR=disabled +export FI_CXI_RX_MATCH_MODE=hybrid +export FI_CXI_OFLOW_BUF_SIZE=8388608 +export FI_CXI_DEFAULT_CQ_SIZE=1048576 +export FI_CXI_CQ_FILL_PERCENT=30 +export MPI_PROVIDER=$FI_PROVIDER +unset MPIR_CVAR_CH4_COLL_SELECTION_TUNING_JSON_FILE +unset MPIR_CVAR_COLL_SELECTION_TUNING_JSON_FILE +export INTELGT_AUTO_ATTACH_DISABLE=1 +export PALS_PING_PERIOD=240 +export PALS_RPC_TIMEOUT=240 +export MPIR_CVAR_GATHERV_INTER_SSEND_MIN_PROCS=-1 #to solve the sync send issue in Horovod seg fault +export CCL_ATL_SYNC_COLL=1 #to avoid potential hang at large scale +export CCL_OP_SYNC=1 #to avoid potential hang at large scale +``` + + +**Algorithm selection** + +```bash +export CCL_COLLECTIVENAME=topo +export CCL_COLLECTIVENAME_SCALEOUT=ALGORITHM_NAME +``` +More info on Algorithm selection: https://oneapi-src.github.io/oneCCL/env-variables.html + +```bash +export CCL_ALLREDUCE=topo +export CCL_ALLREDUCE_SCALEOUT=rabenseifner +``` + + +## native C++ SYCL mode + +You can compile examples from the oneCCL gitrepository and use the library from the system default instead of local builds. +More information at : https://www.intel.com/content/www/us/en/docs/oneccl/benchmark-user-guide/2021-12/overview.html + +To build the C++ benchmark examples + +```bash + +cd oneccl +mkdir build +cd build +module load cmake +cmake .. 
-DCMAKE_C_COMPILER=icx-cc -DCMAKE_CXX_COMPILER=icpx -DCOMPUTE_BACKEND=dpcpp -DCMAKE_INSTALL_PREFIX=/lus/flare/projects/Aurora_deployment/kaushik/all_reduce_frameworks/gitrepos/oneCCL/build/ +make -j install + +rm -rf _install/bin/* _install/lib/*mpi* _install/lib/*fabric* _install/opt/ + +``` + +To run from a jobscript + +```bash +#!/bin/bash -x +# qsub -l nodes=2:ncpus=208 -q workq -l walltime=02:00:00 -l filesystems=lustre_scaling -A Aurora_deployment ./pbs_job_ +#PBS -A Aurora_deployment +#PBS -k doe + +module load frameworks +cd $PBS_O_WORKDIR +echo Jobid: $PBS_JOBID +echo Running on nodes `cat $PBS_NODEFILE` +NNODES=`wc -l < $PBS_NODEFILE` +RANKS_PER_NODE=12 # Number of MPI ranks per node +NRANKS=$(( NNODES * RANKS_PER_NODE )) +echo "NUM_OF_NODES=${NNODES} TOTAL_NUM_RANKS=${NRANKS} RANKS_PER_NODE=${RANKS_PER_NODE}" + +CPU_BINDING1=list:4:9:14:19:20:25:56:61:66:71:74:79 +EXT_ENV="--env FI_CXI_DEFAULT_CQ_SIZE=1048576" +APP1=/lus/flare/projects/Aurora_deployment/kaushik/all_reduce_frameworks/gitrepos/oneCCL/build/_install/examples/benchmark/benchmark + + +echo $CCL_ROOT +export LD_LIBRARY_PATH=$CCL_ROOT/lib:$LD_LIBRARY_PATH +export CPATH=$CCL_ROOT/include:$CPATH +export LIBRARY_PATH=$CCL_ROOT/lib:$LIBRARY_PATH + +export CCL_PROCESS_LAUNCHER=pmix +export CCL_ATL_TRANSPORT=mpi +export CCL_ALLREDUCE=topo +export CCL_ALLREDUCE_SCALEOUT=rabenseifner + +export CCL_KVS_MODE=mpi +export CCL_CONFIGURATION_PATH="" +export CCL_CONFIGURATION=cpu_gpu_dpcpp +export CCL_KVS_CONNECTION_TIMEOUT=600 + +which python + +mkdir -p ./out_${PBS_JOBID}/c_oneccl_gpu +for NNODES in 4 8 16 32 64 +do +RANKS_PER_NODE=12 # Number of MPI ranks per node +NRANKS=$(( NNODES * RANKS_PER_NODE )) + + for BUF_SIZE in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456 + do + date + mpiexec ${EXT_ENV} --env CCL_LOG_LEVEL=info --env CCL_PROCESS_LAUNCHER=pmix --env CCL_ATL_TRANSPORT=mpi \ + --np ${NRANKS} -ppn ${RANKS_PER_NODE} --cpu-bind $CPU_BINDING1 $APP1 \ + --elem_counts ${BUF_SIZE},${BUF_SIZE},${BUF_SIZE} \ + --coll allreduce -j off -i 1 -w 0 --backend sycl --sycl_dev_type gpu > ./out_${PBS_JOBID}/c_oneccl_gpu/${PBS_JOBID}_${NNODES}_${NRANKS}_${RANKS_PER_NODE}_${BUF_SIZE}_sycl_ccl_gpu_out_w1.txt + date + echo ${BUF_SIZE} + + done +done + +# For CPU only, change benchmark options to : --backend host --sycl_dev_type host + +``` +For more information on oneCCL benchmark, please refer to: https://www.intel.com/content/www/us/en/docs/oneccl/benchmark-user-guide/2021-12/overview.html + + + +## Horovod + +Tensorflow horovod example + + +```bash + +import datetime +from time import perf_counter_ns +import sys + +import tensorflow as tf +import horovod.tensorflow as hvd +import intel_extension_for_tensorflow as itex +print(itex.__version__) +hvd.init() + +hvd_local_rank = hvd.local_rank() +hvd_size = hvd.size() +print("hvd_local_rank = %d hvd_size = %d" % (hvd_local_rank, hvd_size)) + +xpus = tf.config.experimental.list_physical_devices('XPU') +logical_gpus = tf.config.experimental.set_visible_devices(xpus[hvd.local_rank()], 'XPU') +print(xpus) +tf.debugging.set_log_device_placement(True) + + +dim_size=int(int(sys.argv[1])/4) +elapsed1=[] + +for _ in range(5): + with tf.device(f"XPU:{hvd_local_rank%12}"): + x = tf.ones([1, dim_size],dtype=tf.float32) + # print(x) + t5 = perf_counter_ns() + y = hvd.allreduce(x, average=False) + t6 = perf_counter_ns() + elapsed1.append(t6 - t5) + +if hvd.rank() == 0: + for e in elapsed1: 
+ print(e) + +``` + +Pytorch horovod example + +```bash +from time import perf_counter_ns +import sys +import intel_extension_for_pytorch # Added Extra +import torch.nn.parallel +import horovod.torch as hvd +hvd.init() +hvd_local_rank = hvd.local_rank() +hvd_size = hvd.size() +# print("hvd_local_rank = %d hvd_size = %d" % (hvd_local_rank, hvd_size)) + +def get_default_device(): + if torch.xpu.is_available(): + return torch.device(f"xpu:{hvd_local_rank%12}") + else: + return torch.device('cpu') + +device = get_default_device() + +dim_size=int(int(sys.argv[1])/4) +elapsed1=[] + +for _ in range(50): + x = torch.ones([1, dim_size],dtype=torch.float32).to(device, non_blocking=True) + # print(x) + t5 = perf_counter_ns() + y = hvd.allreduce(x, average=False) + t6 = perf_counter_ns() + elapsed1.append(t6 - t5) + +if hvd.rank() == 0: + for e in elapsed1: + print(e) + +``` + +## Pytorch DDP + +```bash +import datetime +from time import perf_counter_ns +import sys +import os +import socket +from mpi4py import MPI +import intel_extension_for_pytorch # Added Extra +import torch.nn.parallel +import torch.distributed as dist +import oneccl_bindings_for_pytorch + + +MPI.COMM_WORLD.Barrier() + +os.environ['RANK'] = str(os.environ.get('PMI_RANK', 0)) +os.environ['WORLD_SIZE'] = str(os.environ.get('PMI_SIZE', 1)) +mpi_world_size = MPI.COMM_WORLD.Get_size() +mpi_my_rank = MPI.COMM_WORLD.Get_rank() + +if mpi_my_rank == 0: + master_addr = socket.gethostname() + sock = socket.socket() + sock.bind(('',0)) + # master_port = sock.getsockname()[1] + master_port = 2345 +else: + master_addr = None + master_port = None + +master_addr = MPI.COMM_WORLD.bcast(master_addr, root=0) +master_port = MPI.COMM_WORLD.bcast(master_port, root=0) +os.environ["MASTER_ADDR"] = master_addr +os.environ["MASTER_PORT"] = str(master_port) + +MPI.COMM_WORLD.Barrier() +dist.init_process_group(backend = "ccl", init_method = 'env://', world_size = mpi_world_size, rank = mpi_my_rank, timeout = datetime.timedelta(seconds=3600)) +MPI.COMM_WORLD.Barrier() + + +dist_my_rank = dist.get_rank() +dist_world_size = dist.get_world_size() + +def get_default_device(): + if torch.xpu.is_available(): + return torch.device(f"xpu:{dist_my_rank%12}") + else: + return torch.device('cpu') + +device = get_default_device() + +dim_size=int(int(sys.argv[1])/4) +MPI.COMM_WORLD.Barrier() + +elapsed1=[] + +for _ in range(50): + x = torch.ones([1, dim_size],dtype=torch.float32).to(device, non_blocking=True) + # print(x) + t5 = perf_counter_ns() + dist.all_reduce(x, op=dist.ReduceOp.SUM) # Added Extra op + MPI.COMM_WORLD.Barrier() + t6 = perf_counter_ns() + elapsed1.append(t6 - t5) + +if mpi_my_rank == 0: + for e in elapsed1: + print(e) + +``` + +References + +1. https://oneapi-src.github.io/oneCCL/env-variables.html +2. https://github.com/oneapi-src/oneCCL +3. https://github.com/intel/torch-ccl +4. https://github.com/argonne-lcf/dl_scaling +5. https://www.intel.com/content/www/us/en/docs/oneccl/benchmark-user-guide/2021-12/overview.html + + + diff --git a/docs/aurora/getting-started-on-aurora.md b/docs/aurora/getting-started-on-aurora.md index 49c613e89..4956355a0 100644 --- a/docs/aurora/getting-started-on-aurora.md +++ b/docs/aurora/getting-started-on-aurora.md @@ -98,6 +98,32 @@ round robin to the aurora login nodes. ssh @login.aurora.alcf.anl.gov ``` +### As an expedient for initiating ssh sessions to Aurora login nodes via the bastion indirect nodes. + +Note: Here remote machine means your laptop/desktop. +1. Create SSH keys on the laptop/desktop/remote machine. 
See "Creating SSH Keys" section on [this page](https://help.cels.anl.gov/docs/linux/ssh/): +2. Add the lines listed below to your ~/.ssh/config file on the remote host. That is, you should do this on your laptop/desktop, from which you are initiating ssh login sessions to Aurora via bastion, and on other non-ALCF host systems from which you want to login to Aurora. + +``` +$ cat ~/.ssh/config +Host *.aurora.alcf.anl.gov aurora.alcf.anl.gov + ProxyCommand ssh @bastion.alcf.anl.gov -q -W %h:%p + User + ControlMaster auto + ControlPath ~/.ssh/master-%r@%h:%p +``` + +3. Transfering your remote public key to bastion and aurora. + +``` +Copy the public key (*.pub) from ~/.ssh/*.pub folder on the remote machine (your laptop) and append it to ~/.ssh/authorized_keys file on bastion (bastion.alcf.anl.gov) +Copy the public key (*.pub) from ~/.ssh/*.pub folder on the remote machine (your laptop) and append it to ~/.ssh/authorized_keys file on Aurora UAN. (login.aurora.alcf.anl.gov) + +If you are trying to scp from other ALCF system (example Polaris) to Aurora , you need to do the above step replacing the remote machine (your laptop) with Polaris. +``` + +When you use an SSH proxy, it takes the authentication mechanism from the local host and applies it to the farthest-remote host, while prompting you for the “middle host” separately. So, when you run the ssh @login.aurora.alcf.anl.gov command on your laptop/desktop, you'll be prompted for two ALCF authentication codes - first the Mobilepass+ or Cryptocard passcode for the bastion, and then the SSH passphrase for Aurora. Likewise, when you run scp from a remote host to copy files to Aurora login nodes, you'll be prompted for two ALCF authentication codes codes - first the Mobilepass+ or Cryptocard passcode and then the SSH passphrase. + ## Proxies for outbound connections: Git, ssh, etc... The Aurora login nodes don't currently have outbound network connectivity enabled by default. Setting the following environment variables will provide access to the proxy host. This is necessary, for example, to clone remote git repos. @@ -122,7 +148,7 @@ Host my.awesome.machine.edu $ ssh me@my.awesome.machine.edu ``` -Additional guidance on scp and transfering files to Aurora is available and [here](./data-management/lustre/gecko.md). +Additional guidance on scp and transfering files to Aurora is available and [here](./data-management/lustre/flare.md). ## Working with Git repos diff --git a/docs/aurora/running-jobs-aurora.md b/docs/aurora/running-jobs-aurora.md index 2e86b48e9..5109d7b77 100644 --- a/docs/aurora/running-jobs-aurora.md +++ b/docs/aurora/running-jobs-aurora.md @@ -10,6 +10,9 @@ There is a single routing queue in place called `EarlyAppAccess` which submits t - `lustre_scaling` (execution queue) : 10 running jobs per-user; max walltime : 6 hours; max nodecount : 9090 (subject to change) ### Submitting a job + +Note: Jobs should be submitted only from your allocated project directory and not from your home directory. + For example, a one-node interactive job can be requested for 30 minutes with the following command, where `[your_ProjectName]` is replaced with an appropriate project name. 
```bash diff --git a/mkdocs.yml b/mkdocs.yml index a7642560a..cdbb60bcf 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -175,8 +175,13 @@ nav: - CMake: aurora/build-tools/cmake-aurora.md - Running Jobs: aurora/running-jobs-aurora.md - Data Management: + - Copper: aurora/data-management/copper/copper.md - DAOS: aurora/data-management/daos/daos-overview.md - - Lustre - Gecko: aurora/data-management/lustre/gecko.md + - Lustre Flare: aurora/data-management/lustre/flare.md + - Moving_data_to_Aurora: + - DAOS_datamover: aurora/data-management/moving_data_to_aurora/daos_datamover.md + - Globus: aurora/data-management/moving_data_to_aurora/globus.md + - SCP: aurora/data-management/moving_data_to_aurora/scp.md - Applications and Libraries: - Libraries: - Cabana: aurora/applications-and-libraries/libraries/cabana-aurora.md @@ -197,6 +202,7 @@ nav: - PyTorch: aurora/data-science/frameworks/pytorch.md - TensorFlow: aurora/data-science/frameworks/tensorflow.md - LibTorch: aurora/data-science/frameworks/libtorch.md + - OneCCL: aurora/data-science/frameworks/oneCCL.md - Libraries: - OpenVINO: aurora/data-science/libraries/openvino.md - Programming Models: