
How to use multi-GPU training with a PBS system #567

Closed
jinfeng-data opened this issue Aug 25, 2024 · 21 comments

@jinfeng-data

Hi,

I want to train a model on multiple GPUs on our compute cluster, which uses the PBS job scheduling system. I referred to #458 and commented out the _setup_distr_env(self) function in mace/tools/slurm_distributed.py, but it does not seem to work. What should I do to make it work? Thanks in advance!

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Can you link the error that you have observed? Are you training on a single node or on multiple nodes?
What you need to do is set all of the required environment variables outlined in the code. I recommend you speak to your system administrator in case you need more detailed help.
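
For reference, here is a minimal sketch (not the actual MACE code) of what a PBS-oriented replacement for _setup_distr_env could look like; reading $PBS_NODEFILE and the choice of port are assumptions that should be checked against your cluster:

import os

def _setup_distr_env(self):
    # Hypothetical PBS variant: populate the variables torch.distributed expects.
    # Assumes $PBS_NODEFILE lists the allocated nodes, as on most PBS systems.
    with open(os.environ["PBS_NODEFILE"]) as f:
        nodes = [line.strip() for line in f if line.strip()]
    os.environ["MASTER_ADDR"] = nodes[0]  # first allocated node hosts the rendezvous
    os.environ["MASTER_PORT"] = "33333"   # any free port reachable between the nodes
    # WORLD_SIZE, RANK and LOCAL_RANK are set by torchrun; only set them here
    # if you launch one process per GPU yourself (e.g. via pbsdsh or mpirun).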

@jinfeng-data
Author

jinfeng-data commented Aug 25, 2024

Thanks for your prompt reply. I am training on a single node first. The error message is as follows:

2024-08-25 22:27:52.631 INFO: Using gradient clipping with tolerance=10.000
2024-08-25 22:27:52.631 INFO: Started training
2024-08-25 22:28:16.122 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
2024-08-25 22:41:15.636 INFO: Epoch 0: loss=2.2208, MAE_E_per_atom=35.8 meV, MAE_F=41.9 meV / A
[rank0]:[E825 22:53:11.599226728 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
[rank1]:[E825 22:53:11.599481332 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
[rank1]:[E825 22:53:11.679520115 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.679530396 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203585637 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203651114 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E825 22:53:11.203671433 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E825 22:53:11.206360040 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank1]:[E825 22:53:11.206422589 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E825 22:53:11.206441064 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E825 22:53:11.224901535 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b83bb9ebf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b83878628d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b8387869313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b838786b6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2b836cefabf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2b83610d7e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b8361aed34d in /lib64/libc.so.6)

[rank1]:[E825 22:53:11.225026392 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b56b9703f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b568557a8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b5685581313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b56855836fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2b566ac12bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2b565edefe25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b565f80534d in /lib64/libc.so.6)
W0825 22:53:12.892000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 190700 closing signal SIGTERM
E0825 22:53:13.172000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 190699) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-25_22:53:12
host : gpu9
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 190699)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 190699

Permission denied, please try again.
Received disconnect from 10.11.100.1 port 22:2: Too many authentication failures for root
Authentication failed.

@ilyes319
Contributor

What GPUs are you using? You need to ask your system administrator to provide ports that are accessible for NCCL to communicate between GPUs. Did you make sure to provide the right environment variables?
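
For debugging the communication setup, these standard NCCL environment variables can be exported in the job script (the interface name below is only an example and must match your cluster's network):

export NCCL_DEBUG=INFO            # print NCCL setup and error details to the log
export NCCL_SOCKET_IFNAME=eth0    # assumed network interface; adjust to your cluster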

@jinfeng-data
Author

The training fails after Epoch 0.

@jinfeng-data
Author

I am training on A100.

@jinfeng-data
Author

My training script is as follows:

#!/bin/bash -x

#PBS -N mace
#PBS -l nodes=1:ppn=2:gpus=2
#PBS -j oe
#PBS -q gpu_a100

#define variables

cd $PBS_O_WORKDIR

export CUDA_HOME=/public/software/compiler/cuda-11.3
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

torchrun --standalone --nnodes=1 --nproc_per_node=2 /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train \
    --name="haha" \
    --foundation_model="./2023-12-03-mace-128-L1_epoch-199.model" \
    --train_file="mace_trainingset.xyz" \
    --valid_fraction=0.05 \
    --E0s="isolated" \
    --forces_weight=1000 \
    --energy_weight=100 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --valid_batch_size=2 \
    --max_num_epochs=200 \
    --start_swa=150 \
    --scheduler_patience=5 \
    --patience=15 \
    --eval_interval=1 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --swa \
    --swa_forces_weight=10 \
    --error_table='PerAtomMAE' \
    --default_dtype="float64" \
    --device=cuda \
    --seed=123 \
    --restart_latest \
    --distributed \
    --save_cpu

@ilyes319
Contributor

Can you share the top of your log so I can look at how the GPUs were set up?

@ilyes319
Contributor

Also, what is your torch version?

@jinfeng-data
Author

The log file has been attached.
haha_run-123.log

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Thank you. Looking at the log, I don't think this has anything to do with MACE. Something happened to your GPUs that prevented communication. If you start again, does it crash after epoch 0 again, or at a different epoch?

@jinfeng-data
Author

The torch version is '2.4.0+cu121'.

@jinfeng-data
Author

I have restarted several times, and it always fails after epoch 0.

@ilyes319
Contributor

Can you try downgrading to torch 2.3?
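
For example, something along these lines in the same virtual environment (the cu121 wheel index is an assumption based on your current '2.4.0+cu121' build):

pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121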

@jinfeng-data
Author

OK, I will give it a try. Thanks very much!

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Can you see if it crashes before or after writing the checkpoint to the disk?

@jinfeng-data
Author

It crashes after writing the checkpoint to the disk.

@ilyes319
Contributor

It is probably happening when the master GPU reaches the second barrier and the two need to sync. For some reason, at this point your second GPU is idle and cannot respond anymore, leading to a timeout.
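
As a minimal standalone illustration of that failure mode (not MACE code): if one rank never reaches a collective, the other blocks until the NCCL watchdog timeout (600000 ms in your log) aborts the process group, exactly like the ALLREDUCE timeout you see.

# illustration.py -- launch with: torchrun --standalone --nproc_per_node=2 illustration.py
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 waits here for the other rank ...
else:
    pass            # ... but rank 1 stays idle and never joins, so rank 0 times out

dist.destroy_process_group()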

@ilyes319
Contributor

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?
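
Something along these lines should be enough (this is just a sketch; the variables around that line may differ between versions):

# hypothetical debug print to add near line 289 of mace/tools/train.py
import torch.distributed as dist

if dist.is_initialized():
    print(f"My rank is: {dist.get_rank()}", flush=True)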

@jinfeng-data
Author

I modified train.py to print the rank. This time it crashes after epoch 2. The error message is as follows:

2024-08-26 08:13:24.098 INFO: Using gradient clipping with tolerance=10.000
2024-08-26 08:13:24.098 INFO: Started training
2024-08-26 08:13:46.902 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
My rank is: 1
My rank is: 0
2024-08-26 08:26:49.998 INFO: Epoch 0: loss=2.7713, MAE_E_per_atom=39.7 meV, MAE_F=44.4 meV / A
My rank is:My rank is: 10

2024-08-26 08:39:40.359 INFO: Epoch 1: loss=2.1196, MAE_E_per_atom=34.8 meV, MAE_F=38.3 meV / A
My rank is:My rank is: 10

2024-08-26 08:52:28.594 INFO: Epoch 2: loss=1.3910, MAE_E_per_atom=28.8 meV, MAE_F=33.6 meV / A
My rank is:My rank is: 10

[rank0]:[E826 08:53:23.114722578 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
[rank1]:[E826 08:53:23.114941895 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
[rank1]:[E826 08:53:23.162548244 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.162556584 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309172826 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309213729 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E826 08:53:23.309228728 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E826 08:53:23.312406692 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.312441895 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E826 08:53:23.312452746 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E826 08:53:23.336914054 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2adc2badaf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2adbf79518d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2adbf7958313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2adbf795a6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2adbdcfe9bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2adbd11c6e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2adbd1bdc34d in /lib64/libc.so.6)

[rank0]:[E826 08:53:23.338327558 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2aeff73d6f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2aefc324d8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2aefc3254313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2aefc32566fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2aefa88e5bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2aef9cac2e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2aef9d4d834d in /lib64/libc.so.6)

W0826 08:53:24.845000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 403057 closing signal SIGTERM
E0826 08:53:25.070000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 403058) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-26_08:53:24
host : gpu9
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 403058)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 403058

@jinfeng-data
Author

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?

So what should I do next to avoid this problem?

@ilyes319
Contributor

To me, this is not a MACE problem but a problem with your system. Sorry, I cannot help. You should request help from your system administrator.
