Multi-node multi-GPU training error: ss1.ss_family == ss2.ss_family. 2 vs 10 #924

Open
sph116 opened this issue Sep 6, 2024 · 0 comments

sph116 commented Sep 6, 2024

Launch command for rank 0:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py --deepspeed deepspeed_zero2
Launch command for rank 1:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py --deepspeed deepspeed_zero2
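
The two commands above differ only in `NODE_RANK` and in the config file name. As a quick sanity check before digging into the error, the launcher variables can be compared mechanically. The following is a minimal sketch, not part of xtuner: `parse_env_prefix` and `check_launch` are hypothetical helpers that only inspect the command strings and make no claim about how xtuner consumes the variables.

```python
import shlex


def parse_env_prefix(command: str) -> dict[str, str]:
    """Collect the leading NAME=value assignments of a shell command line."""
    env: dict[str, str] = {}
    for token in shlex.split(command):
        name, sep, value = token.partition("=")
        if not sep or not name.isidentifier():
            break  # the first non-assignment token is the program name
        env[name] = value
    return env


def check_launch(commands: list[str]) -> list[str]:
    """Return human-readable problems; an empty list means the commands agree."""
    envs = [parse_env_prefix(c) for c in commands]
    problems = []
    # These variables are expected to be identical on every node.
    for key in ("NNODES", "PORT", "ADDR"):
        values = {e.get(key) for e in envs}
        if len(values) != 1:
            problems.append(f"{key} differs across nodes: {sorted(map(str, values))}")
    # Each node needs its own rank, covering 0 .. number of commands - 1.
    ranks = sorted(int(e["NODE_RANK"]) for e in envs if "NODE_RANK" in e)
    if ranks != list(range(len(commands))):
        problems.append(f"NODE_RANK values {ranks} do not cover 0..{len(commands) - 1}")
    return problems


if __name__ == "__main__":
    rank0 = ("NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 "
             "xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py "
             "--deepspeed deepspeed_zero2")
    rank1 = ("NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 "
             "xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py "
             "--deepspeed deepspeed_zero2")
    print(check_launch([rank0, rank1]))
```

For the commands in this report the check finds nothing to complain about, which suggests the launcher variables themselves are not the source of the failure.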

rank 1 and rank 0 can communicate with each other, and single-GPU training succeeds on both machines.

Error log:

[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank0]: main()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank0]: runner.train()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank0]: self._train_loop = self.build_train_loop(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank0]: loop = LOOPS.build(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank0]: dataloader = runner.build_dataloader(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank0]: dataset = DATASETS.build(dataset_cfg)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 305, in process_hf_dataset
[rank0]: group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank0]: return _new_group_with_tag(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
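
For reading the last line of the log: `ss_family` is the address-family field of the C `sockaddr_storage` struct, and on Linux the value 2 is `AF_INET` (IPv4) while 10 is `AF_INET6` (IPv6). The failed check therefore looks like an IPv4 address being compared with an IPv6 one while the gloo process group is created. The sketch below, which assumes both nodes run Linux, decodes the two numbers and lists the address families a hostname resolves to on the local machine. `describe_family` and `resolved_families` are hypothetical helpers, not part of PyTorch, gloo, or xtuner.

```python
import socket

# Linux values of the address-family constants (assumption: both nodes run Linux).
LINUX_ADDRESS_FAMILIES = {2: "AF_INET (IPv4)", 10: "AF_INET6 (IPv6)"}


def describe_family(value: int) -> str:
    """Name a Linux ss_family value such as the ones in the gloo message."""
    return LINUX_ADDRESS_FAMILIES.get(value, f"unknown ({value})")


def resolved_families(host: str) -> set[str]:
    """Address families `host` resolves to here; empty if it does not resolve."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return set()
    # info[0] is normally a socket.AddressFamily enum; fall back to str for raw ints.
    return {getattr(info[0], "name", str(info[0])) for info in infos}


if __name__ == "__main__":
    print(describe_family(2), "vs", describe_family(10))
    # Run this on every node: a hostname that resolves to IPv4 on one node and
    # to IPv6 on another would be consistent with the "2 vs 10" mismatch.
    hostname = socket.gethostname()
    print(hostname, resolved_families(hostname))
```

If the nodes do report different families, one thing that may be worth trying is pinning gloo to a specific network interface through the `GLOO_SOCKET_IFNAME` environment variable documented by PyTorch. Whether that resolves this particular failure is not confirmed in this thread.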

sph116 changed the title from "Multi-node multi-GPU training error: ss1.ss_family == ss2.ss_family. 2 vs 10" to "nccl multi-node multi-GPU training error: ss1.ss_family == ss2.ss_family. 2 vs 10" on Sep 6, 2024
sph116 changed the title from "nccl multi-node multi-GPU training error: ss1.ss_family == ss2.ss_family. 2 vs 10" to "Multi-node multi-GPU training error: ss1.ss_family == ss2.ss_family. 2 vs 10" on Sep 6, 2024