This is the official implementation of "Proteus: Simulating the Performance of Distributed DNN Training". [arXiv]
Proteus is the first standalone simulator to model the performance of complex parallelization strategies by simulating their execution. Proteus first represents a parallelization strategy with a unified representation named Strategy Tree. It then compiles the Strategy Tree into a distributed execution graph and simulates complex runtime behaviors (computation-communication overlap and bandwidth sharing) with a Hierarchical Topo-Aware Executor (HTAE). Proteus is evaluated across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves
First, compile NCCL in external/nccl by running the following commands:
cd external/nccl
make -j src.build
Then, install the Python dependencies and add proteus to PYTHONPATH:
pip install graphviz toposort
export PYTHONPATH=$PYTHONPATH:/path/to/proteus
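To confirm the setup before running anything, a quick import check can help. The sketch below is a minimal example, not part of Proteus: it assumes the package is importable under the name proteus once the path above is on PYTHONPATH, and the helper find_missing is a hypothetical name introduced here.

```python
import importlib.util

# graphviz and toposort come from the pip command above.
# "proteus" as the importable package name is an assumption about the repo layout.
REQUIRED_MODULES = ("graphviz", "toposort", "proteus")


def find_missing(modules=REQUIRED_MODULES):
    """Return the names in `modules` that Python cannot locate on its search path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing()
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If proteus is reported as not importable, the PYTHONPATH entry is the first thing to check.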
The cluster configuration is defined with a device topo file and a cluster json file. The device topo file specifies the topology of a single node, and the cluster json file specifies the cluster info. We provide some example topo files and cluster json files in examples/clusters/. The device topo file is generated by running nccl-tests with the NCCL_TOPO_DUMP_FILE environment variable set.
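As a rough illustration of the topo dump step, the sketch below launches an nccl-tests binary with NCCL_TOPO_DUMP_FILE set, so that NCCL writes the topology it detects to the given path. The helper names topo_dump_env and dump_topology, the binary path, and the output file name are assumptions for this example; only the environment variable and the nccl-tests -g option (GPUs per thread) come from NCCL and nccl-tests themselves.

```python
import os
import subprocess


def topo_dump_env(dump_path, base_env=None):
    """Copy an environment mapping and set NCCL_TOPO_DUMP_FILE to `dump_path`."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_TOPO_DUMP_FILE"] = str(dump_path)
    return env


def dump_topology(binary, dump_path, num_gpus):
    """Run one nccl-tests binary so that NCCL dumps the node topology to `dump_path`."""
    cmd = [binary, "-g", str(num_gpus)]
    # check=True raises CalledProcessError if the test binary exits with an error.
    subprocess.run(cmd, env=topo_dump_env(dump_path), check=True)
```

On a node where nccl-tests has been built, a call such as dump_topology("./build/all_reduce_perf", "topo.xml", num_gpus=8) would be one way to use it; the path and GPU count here are placeholders to adjust for the actual machine.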
We provide some examples in examples/. Try Proteus with:
cd examples
mkdir -p log
python alexnet.py -model alexnet -bs 256 -cluster clusters/dgx1_v100_2ib/n1_g1.json -ps dp --profile-iters 50
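To try the same example against every cluster file in a directory, the command line above can be generated programmatically. This is a minimal sketch meant to be run from the examples directory: the helper names build_command and commands_for are hypothetical, the flags are copied from the command above, and it only prints commands rather than running them. Whether each cluster file in the directory works with these particular arguments is not something the sketch verifies.

```python
import shlex
from pathlib import Path


def build_command(cluster_json, bs=256, ps="dp", profile_iters=50):
    """Build the alexnet.py command line shown above for one cluster json file."""
    return [
        "python", "alexnet.py",
        "-model", "alexnet",
        "-bs", str(bs),
        "-cluster", str(cluster_json),
        "-ps", ps,
        "--profile-iters", str(profile_iters),
    ]


def commands_for(cluster_dir):
    """Return one command per *.json file in `cluster_dir`, in sorted order."""
    return [build_command(path) for path in sorted(Path(cluster_dir).glob("*.json"))]


# Print the commands so they can be reviewed before being run by hand.
for cmd in commands_for("clusters/dgx1_v100_2ib"):
    print(shlex.join(cmd))
```

Printing first keeps the sweep easy to inspect; the lines can then be pasted into a shell or passed to subprocess.run as needed.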
@article{duan2023proteus,
title={Proteus: Simulating the Performance of Distributed DNN Training},
author={Duan, Jiangfei and Li, Xiuhong and Xu, Ping and Zhang, Xingcheng and Yan, Shengen and Liang, Yun and Lin, Dahua},
journal={arXiv preprint arXiv:2306.02267},
year={2023}
}