
Linux LoRA multi-GPU training fails with a ~20K-image dataset #312

Open
571129857 opened this issue Dec 18, 2023 · 1 comment

Comments

@571129857

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[17:12:07] WARNING Sending process 3399 closing signal SIGTERM api.py:698
WARNING Sending process 3400 closing signal SIGTERM api.py:698
WARNING Sending process 3407 closing signal SIGTERM api.py:698
WARNING Sending process 3408 closing signal SIGTERM api.py:698
WARNING Sending process 3409 closing signal SIGTERM api.py:698
[17:12:08] ERROR failed (exitcode: -6) local_rank: 2 (pid: 3402) of binary: /home/admin/.conda/envs/lora_script/bin/python

@571129857
Author

The error above is raised as soon as the first checkpoint is saved. With a dataset of only a few thousand images, training runs normally and the model saves without issue.
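The log shows rank 2's NCCL watchdog firing after the default 30-minute collective timeout (Timeout(ms)=1800000), which is consistent with one rank stalling on a slow checkpoint save while the other ranks wait in ALLREDUCE. A common workaround is to raise the watchdog timeout when initializing the process group. The sketch below uses standard PyTorch/NCCL knobs (`torch.distributed.init_process_group`'s `timeout` argument and the `NCCL_DEBUG` / `NCCL_ASYNC_ERROR_HANDLING` environment variables); how to wire it into this particular training script is an assumption.

```python
import os
from datetime import timedelta


def nccl_env(debug: bool = True) -> dict:
    """Env vars that make NCCL surface the failing collective promptly
    instead of hanging silently until the watchdog kills the job.
    (Assumption: these help diagnose, not necessarily fix, this hang.)"""
    env = {"NCCL_ASYNC_ERROR_HANDLING": "1"}
    if debug:
        env["NCCL_DEBUG"] = "INFO"
    return env


def watchdog_timeout(minutes: int = 120) -> timedelta:
    """Timeout to pass to init_process_group. The default is 30 minutes,
    i.e. the 1800000 ms seen in the error log above."""
    return timedelta(minutes=minutes)


def init_distributed(timeout_minutes: int = 120) -> None:
    # Imported lazily so the helpers above work without torch installed.
    import torch.distributed as dist

    os.environ.update(nccl_env())
    dist.init_process_group(
        backend="nccl",
        timeout=watchdog_timeout(timeout_minutes),
    )
```

If the hang only appears at checkpoint time with the larger dataset, it is also worth saving from rank 0 only (guarded by `dist.get_rank() == 0`) so the other ranks never block on checkpoint I/O inside a collective.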
