
Linux LoRA multi-GPU training fails with a ~20K-image dataset #312

Open
571129857 opened this issue Dec 18, 2023 · 1 comment

Comments

@571129857

[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=12, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1805633 milliseconds before timing out.
[17:12:07] WARNING Sending process 3399 closing signal SIGTERM api.py:698
WARNING Sending process 3400 closing signal SIGTERM api.py:698
WARNING Sending process 3407 closing signal SIGTERM api.py:698
WARNING Sending process 3408 closing signal SIGTERM api.py:698
WARNING Sending process 3409 closing signal SIGTERM api.py:698
[17:12:08] ERROR failed (exitcode: -6) local_rank: 2 (pid: 3402) of binary: /home/admin/.conda/envs/lora_script/bin/python

@571129857
Author

The error above is raised as soon as the first checkpoint is saved. With a dataset of only a few thousand images, training runs normally and the model saves without issue.
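The log shows rank 2's NCCL watchdog firing after the default 30-minute collective timeout (Timeout(ms)=1800000), which is consistent with one rank stalling on a slow checkpoint save while the other ranks wait in ALLREDUCE. A common workaround is to raise the watchdog timeout when initializing the process group. The sketch below uses standard PyTorch/NCCL knobs (`torch.distributed.init_process_group`'s `timeout` argument and the `NCCL_DEBUG` / `NCCL_ASYNC_ERROR_HANDLING` environment variables); how to wire it into this particular training script is an assumption.

```python
import os
from datetime import timedelta


def nccl_env(debug: bool = True) -> dict:
    """Env vars that make NCCL surface the failing collective promptly
    instead of hanging silently until the watchdog kills the job.
    (Assumption: these help diagnose, not necessarily fix, this hang.)"""
    env = {"NCCL_ASYNC_ERROR_HANDLING": "1"}
    if debug:
        env["NCCL_DEBUG"] = "INFO"
    return env


def watchdog_timeout(minutes: int = 120) -> timedelta:
    """Timeout to pass to init_process_group. The default is 30 minutes,
    i.e. the 1800000 ms seen in the error log above."""
    return timedelta(minutes=minutes)


def init_distributed(timeout_minutes: int = 120) -> None:
    # Imported lazily so the helpers above work without torch installed.
    import torch.distributed as dist

    os.environ.update(nccl_env())
    dist.init_process_group(
        backend="nccl",
        timeout=watchdog_timeout(timeout_minutes),
    )
```

If the hang only appears at checkpoint time with the larger dataset, it is also worth saving from rank 0 only (guarded by `dist.get_rank() == 0`) so the other ranks never block on checkpoint I/O inside a collective.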
