Launch command for rank 0:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py --deepspeed deepspeed_zero2
Launch command for rank 1:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py --deepspeed deepspeed_zero2
Rank 1 and rank 0 can communicate with each other, and training succeeds on each node in single-GPU mode.
Error log:
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in <module>
[rank0]: main()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
[rank0]: runner.train()
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
[rank0]: self._train_loop = self.build_train_loop(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
[rank0]: loop = LOOPS.build(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
[rank0]: dataloader = runner.build_dataloader(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
[rank0]: dataset = DATASETS.build(dataset_cfg)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
[rank0]: return self.build_func(cfg, *args, **kwargs, registry=self)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
[rank0]: obj = obj_cls(**args) # type: ignore
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 305, in process_hf_dataset
[rank0]: group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
[rank0]: return _new_group_with_tag(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
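For reference, the `2 vs 10` at the end of the message looks like a pair of Linux socket address-family numbers: 2 is `AF_INET` (IPv4) and 10 is `AF_INET6` (IPv6). The sketch below decodes that pair and lists which families a host name resolves to. The helper names `describe_mismatch` and `resolved_families` are made up for illustration and are not part of xtuner, PyTorch, or gloo. It also assumes both nodes run Linux, since the numbers differ on other platforms.

```python
import re
import socket

# Linux address-family numbers, hard-coded on purpose:
# socket.AF_INET6 has a different numeric value on other platforms.
LINUX_FAMILIES = {2: "AF_INET (IPv4)", 10: "AF_INET6 (IPv6)"}


def describe_mismatch(message: str) -> str:
    """Translate the '<a> vs <b>' pair in the gloo error into family names."""
    match = re.search(r"(\d+) vs (\d+)", message)
    if match is None:
        raise ValueError("no '<a> vs <b>' pair found in message")
    names = [
        LINUX_FAMILIES.get(int(n), f"unknown family {n}")
        for n in match.groups()
    ]
    return " vs ".join(names)


def resolved_families(host: str) -> set:
    """Return the names of the address families that `host` resolves to."""
    infos = socket.getaddrinfo(host, None)
    return {socket.AddressFamily(info[0]).name for info in infos}


if __name__ == "__main__":
    print(describe_mismatch("ss1.ss_family == ss2.ss_family. 2 vs 10"))
    # Run on each node; differing results may point at the mismatch.
    print(resolved_families(socket.gethostname()))
```

Running this on both nodes and comparing the second line of output may show whether one machine resolves its own host name to an IPv6 address while the other uses IPv4. If it does, one thing that might be worth trying is pinning gloo to the interface that carries 172.18.12.59 through the `GLOO_SOCKET_IFNAME` environment variable on each node. I have not verified that this fixes the error here.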