How to use multi-GPU training with a PBS system #567
Hi,
I want to train a model using multiple GPUs on our computer cluster, which uses the PBS job management system. I referred to #458 and commented out the _setup_distr_env(self) function in mace/tools/slurm_distributed.py. However, it does not seem to work. What should I do to make it work? Thanks in advance!
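For context, below is a minimal sketch of how torchrun is often launched under PBS for multi-node, multi-GPU jobs. The resource directives, port number, and per-node rank wiring are assumptions about a generic PBS Pro site, not MACE-specific or verified guidance.

```bash
#!/bin/bash -x
#PBS -N mace-multinode
#PBS -l select=2:ngpus=2:mpiprocs=1    # hypothetical resource request; adapt to your site
cd $PBS_O_WORKDIR

# Use the first host in the PBS node list as the rendezvous master.
export MASTER_ADDR=$(head -n 1 "$PBS_NODEFILE")
export MASTER_PORT=29500               # any free TCP port your administrator allows
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l)

# NODE_RANK must be set to this node's index (0 .. NNODES-1); how it is provided
# depends on how you start one copy of this launcher per node (e.g. via pbsdsh).
torchrun --nnodes="$NNODES" \
         --nproc_per_node=2 \
         --node_rank="$NODE_RANK" \
         --rdzv_backend=c10d \
         --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
         /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train
```

With the c10d rendezvous backend, every node contacts MASTER_ADDR:MASTER_PORT, which is why that port must be reachable between the compute nodes.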
Can you link the error that you have observed? Are you training on a single node or multiple nodes?
Thanks for your prompt reply. I am first training on a single node. The error message is as follows:

2024-08-25 22:27:52.631 INFO: Using gradient clipping with tolerance=10.000
[rank1]:[E825 22:53:11.225026392 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
What GPUs are you using? You need to ask your system administrator to provide ports that are accessible for NCCL to communicate between GPUs. Did you make sure to provide the right environment variables?
The training fails after epoch 0.
I am training on A100 GPUs.
My training script is like:

#!/bin/bash -x
#PBS -N mace
# define variables
cd $PBS_O_WORKDIR
export CUDA_HOME=/public/software/compiler/cuda-11.3
torchrun --standalone --nnodes=1 --nproc_per_node=2 /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train
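As a side note for debugging hangs like this, generic NCCL/PyTorch environment variables can be exported in the job script before the torchrun line. The interface name below is a placeholder, and none of these settings come from the MACE documentation.

```bash
export NCCL_DEBUG=INFO              # verbose NCCL logging, shows where the collective hangs
export NCCL_SOCKET_IFNAME=eth0      # placeholder: pin NCCL to the interface your nodes actually use
export NCCL_P2P_DISABLE=1           # optionally rule out peer-to-peer transport issues between GPUs
export TORCH_NCCL_BLOCKING_WAIT=1   # surface a clearer error instead of a watchdog termination
```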
Can you share the top of your log so I can look at how the GPUs were set up?
Also, what is your torch version?
The log file has been attached.
Thank you. Looking at the log, I don't think this has anything to do with MACE. Something happened to your GPUs that prevented communication. If you start again, does it crash again after epoch 0 or at a different epoch?
The torch version is '2.4.0+cu121'.
I started again and again, and it always fails after epoch 0.
Can you try to downgrade to torch 2.3?
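For reference, one common way to pin an older PyTorch build inside the virtual environment is shown below; the exact patch version and CUDA tag are assumptions and should be chosen to match your cluster's CUDA setup.

```bash
pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121
```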
OK, I will give it a try. Thanks very much!
Can you see if it crashes before or after writing the checkpoint to the disk?
It crashes after writing the checkpoint to the disk.
It is probably happening when the master GPU reaches the second barrier and the two need to sync. For some reason, at this point your second GPU is idle and cannot respond anymore, leading to a timeout.
Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?
I modified train.py and printed the rank. This time it crashes after epoch 2. The error message is as follows:

2024-08-26 08:13:24.098 INFO: Using gradient clipping with tolerance=10.000
2024-08-26 08:39:40.359 INFO: Epoch 1: loss=2.1196, MAE_E_per_atom=34.8 meV, MAE_F=38.3 meV / A
2024-08-26 08:52:28.594 INFO: Epoch 2: loss=1.3910, MAE_E_per_atom=28.8 meV, MAE_F=33.6 meV / A
[rank0]:[E826 08:53:23.114722578 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
[rank0]:[E826 08:53:23.338327558 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
W0826 08:53:24.845000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 403057 closing signal SIGTERM
So what should I do next to avoid this problem?
To me this is not a MACE problem, but a problem with your system. Sorry, I cannot help. You should request help from your system administrator.
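Before contacting the administrator, it may help to collect basic GPU and interconnect diagnostics on the node; these are generic NVIDIA tools, not part of MACE.

```bash
nvidia-smi              # GPU visibility, utilization, and error state
nvidia-smi topo -m      # how the two GPUs are connected (NVLink, PCIe, ...)
dmesg | grep -i 'xid'   # kernel-level GPU (Xid) errors around the time of the hang
```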