
How to use multi-GPU training with a PBS system #567

Closed
jinfeng-data opened this issue Aug 25, 2024 · 21 comments

@jinfeng-data

Hi,

I want to train a model on multiple GPUs on our compute cluster, which uses the PBS job scheduling system. I referred to #458 and commented out the _setup_distr_env(self) function in mace/tools/slurm_distributed.py, but it does not seem to work. What should I do to make it work? Thanks in advance!

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Can you link the error that you have observed? Are you training on a single node or on multiple nodes?
What you need to do is set all of the required environment variables outlined in the code. I recommend you speak to your system administrator in case you need more detailed help.
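
For reference, here is a minimal sketch (not the actual MACE code) of what a PBS-oriented replacement for _setup_distr_env could look like; reading $PBS_NODEFILE and the choice of port are assumptions that should be checked against your cluster:

import os

def _setup_distr_env(self):
    # Hypothetical PBS variant: populate the variables torch.distributed expects.
    # Assumes $PBS_NODEFILE lists the allocated nodes, as on most PBS systems.
    with open(os.environ["PBS_NODEFILE"]) as f:
        nodes = [line.strip() for line in f if line.strip()]
    os.environ["MASTER_ADDR"] = nodes[0]  # first allocated node hosts the rendezvous
    os.environ["MASTER_PORT"] = "33333"   # any free port reachable between the nodes
    # WORLD_SIZE, RANK and LOCAL_RANK are set by torchrun; only set them here
    # if you launch one process per GPU yourself (e.g. via pbsdsh or mpirun).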

@jinfeng-data
Author

jinfeng-data commented Aug 25, 2024

Thanks for your prompt reply. I am training on a single node first. The error message is as follows:

2024-08-25 22:27:52.631 INFO: Using gradient clipping with tolerance=10.000
2024-08-25 22:27:52.631 INFO: Started training
2024-08-25 22:28:16.122 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
2024-08-25 22:41:15.636 INFO: Epoch 0: loss=2.2208, MAE_E_per_atom=35.8 meV, MAE_F=41.9 meV / A
[rank0]:[E825 22:53:11.599226728 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
[rank1]:[E825 22:53:11.599481332 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
[rank1]:[E825 22:53:11.679520115 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.679530396 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203585637 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank0]:[E825 22:53:11.203651114 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E825 22:53:11.203671433 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E825 22:53:11.206360040 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 3965, last enqueued NCCL work: 10659, last completed NCCL work: 3964.
[rank1]:[E825 22:53:11.206422589 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E825 22:53:11.206441064 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E825 22:53:11.224901535 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933807 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b83bb9ebf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b83878628d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b8387869313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b838786b6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2b836cefabf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2b83610d7e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b8361aed34d in /lib64/libc.so.6)

[rank1]:[E825 22:53:11.225026392 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3965, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933855 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2b56b9703f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2b568557a8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2b5685581313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2b56855836fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2b566ac12bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2b565edefe25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2b565f80534d in /lib64/libc.so.6)
W0825 22:53:12.892000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 190700 closing signal SIGTERM
E0825 22:53:13.172000 47471442153280 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 0 (pid: 190699) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-25_22:53:12
host : gpu9
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 190699)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 190699

Permission denied, please try again.
Received disconnect from 10.11.100.1 port 22:2: Too many authentication failures for root
Authentication failed.

@ilyes319
Contributor

What GPUs are you using? You need to ask your system administrator to provide ports that are accessible for NCCL to communicate between GPUs. Did you make sure to provide the right environment variables?
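
For debugging the communication setup, these standard NCCL environment variables can be exported in the job script (the interface name below is only an example and must match your cluster's network):

export NCCL_DEBUG=INFO            # print NCCL setup and error details to the log
export NCCL_SOCKET_IFNAME=eth0    # assumed network interface; adjust to your cluster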

@jinfeng-data
Author

The training fails after Epoch 0.

@jinfeng-data
Author

I am training on A100.

@jinfeng-data
Author

My training script is as follows:

#!/bin/bash -x

#PBS -N mace
#PBS -l nodes=1:ppn=2:gpus=2
#PBS -j oe
#PBS -q gpu_a100

#define variables

cd $PBS_O_WORKDIR

export CUDA_HOME=/public/software/compiler/cuda-11.3
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64

torchrun --standalone --nnodes=1 --nproc_per_node=2 /public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train \
    --name="haha" \
    --foundation_model="./2023-12-03-mace-128-L1_epoch-199.model" \
    --train_file="mace_trainingset.xyz" \
    --valid_fraction=0.05 \
    --E0s="isolated" \
    --forces_weight=1000 \
    --energy_weight=100 \
    --lr=0.01 \
    --scaling="rms_forces_scaling" \
    --batch_size=2 \
    --valid_batch_size=2 \
    --max_num_epochs=200 \
    --start_swa=150 \
    --scheduler_patience=5 \
    --patience=15 \
    --eval_interval=1 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --swa \
    --swa_forces_weight=10 \
    --error_table='PerAtomMAE' \
    --default_dtype="float64" \
    --device=cuda \
    --seed=123 \
    --restart_latest \
    --distributed \
    --save_cpu

@ilyes319
Contributor

Can you share the top of your log so I can look at how the GPUs were set up?

@ilyes319
Contributor

Also, what is your torch version?

@jinfeng-data
Author

The log file has been attached.
haha_run-123.log

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Thank you. Looking at the log, I don't think this has anything to do with MACE. Something happened to your GPUs that prevented communication. If you start again, does it crash after epoch 0 again, or at a different epoch?

@jinfeng-data
Author

The torch version is '2.4.0+cu121'.

@jinfeng-data
Author

I have restarted several times, and it always fails after epoch 0.

@ilyes319
Contributor

Can you try downgrading to torch 2.3?
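
For example, something along these lines in the same virtual environment (the cu121 wheel index is an assumption based on your current '2.4.0+cu121' build):

pip install "torch==2.3.1" --index-url https://download.pytorch.org/whl/cu121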

@jinfeng-data
Author

OK, I will give it a try. Thanks very much!

@ilyes319
Contributor

ilyes319 commented Aug 25, 2024

Can you see if it crashes before or after writing the checkpoint to the disk?

@jinfeng-data
Author

It crashes after writing the checkpoint to the disk.

@ilyes319
Contributor

It is probably happening when the master GPU reaches the second barrier and the two need to sync. For some reason, at this point your second GPU is idle and cannot respond anymore, leading to a timeout.
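
As a minimal standalone illustration of that failure mode (not MACE code): if one rank never reaches a collective, the other blocks until the NCCL watchdog timeout (600000 ms in your log) aborts the process group, exactly like the ALLREDUCE timeout you see.

# illustration.py -- launch with: torchrun --standalone --nproc_per_node=2 illustration.py
import os
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

if dist.get_rank() == 0:
    dist.barrier()  # rank 0 waits here for the other rank ...
else:
    pass            # ... but rank 1 stays idle and never joins, so rank 0 times out

dist.destroy_process_group()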

@ilyes319
Contributor

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?
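
Something along these lines should be enough (this is just a sketch; the variables around that line may differ between versions):

# hypothetical debug print to add near line 289 of mace/tools/train.py
import torch.distributed as dist

if dist.is_initialized():
    print(f"My rank is: {dist.get_rank()}", flush=True)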

@jinfeng-data
Author

I modified train.py to print the rank. This time it crashes after epoch 2. The error message is as follows:

2024-08-26 08:13:24.098 INFO: Using gradient clipping with tolerance=10.000
2024-08-26 08:13:24.098 INFO: Started training
2024-08-26 08:13:46.902 INFO: Epoch None: loss=720.6777, MAE_E_per_atom=3672.0 meV, MAE_F=227.3 meV / A
My rank is: 1
My rank is: 0
2024-08-26 08:26:49.998 INFO: Epoch 0: loss=2.7713, MAE_E_per_atom=39.7 meV, MAE_F=44.4 meV / A
My rank is:My rank is: 10

2024-08-26 08:39:40.359 INFO: Epoch 1: loss=2.1196, MAE_E_per_atom=34.8 meV, MAE_F=38.3 meV / A
My rank is:My rank is: 10

2024-08-26 08:52:28.594 INFO: Epoch 2: loss=1.3910, MAE_E_per_atom=28.8 meV, MAE_F=33.6 meV / A
My rank is:My rank is: 10

[rank0]:[E826 08:53:23.114722578 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
[rank1]:[E826 08:53:23.114941895 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
[rank1]:[E826 08:53:23.162548244 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.162556584 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309172826 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 10260, last enqueued NCCL work: 17107, last completed NCCL work: 10259.
[rank1]:[E826 08:53:23.309213729 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E826 08:53:23.309228728 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E826 08:53:23.312406692 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 10260, last enqueued NCCL work: 17106, last completed NCCL work: 10259.
[rank0]:[E826 08:53:23.312441895 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E826 08:53:23.312452746 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E826 08:53:23.336914054 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 934056 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2adc2badaf86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2adbf79518d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2adbf7958313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2adbf795a6fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2adbdcfe9bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2adbd11c6e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2adbd1bdc34d in /lib64/libc.so.6)

[rank0]:[E826 08:53:23.338327558 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=10260, OpType=ALLREDUCE, NumelIn=207488, NumelOut=207488, Timeout(ms)=600000) ran for 933936 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x2aeff73d6f86 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x2aefc324d8d2 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x2aefc3254313 in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x2aefc32566fc in /public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x2aefa88e5bf4 in /public/home/xiaohe/miniconda3/bin/../lib/libstdc++.so.6)
frame #5: + 0x7e25 (0x2aef9cac2e25 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x2aef9d4d834d in /lib64/libc.so.6)

W0826 08:53:24.845000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 403057 closing signal SIGTERM
E0826 08:53:25.070000 47874619043648 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 403058) of binary: /public/home/xiaohe/jinfeng/soft/mace-venv/bin/python
Traceback (most recent call last):
File "/public/home/xiaohe/jinfeng/soft/mace-venv/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/public/home/xiaohe/jinfeng/soft/mace-venv/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/public/home/xiaohe/jinfeng/soft/mace-venv/bin/mace_run_train FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-08-26_08:53:24
host : gpu9
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 403058)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 403058

@jinfeng-data
Author

Can you go into this file https://github.com/ACEsuit/mace/blob/main/mace/tools/train.py and print the rank at line 289?

So what should I do next to avoid this problem?

@ilyes319
Contributor

To me, this is not a MACE problem but a problem with your system. Sorry, I cannot help. You should request help from your system administrator.
