run_pt.sh fails on AMD #371

Open
liuyang6055 opened this issue May 8, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@liuyang6055

Hello, when the training environment is AMD ROCm, running run_pt.sh fails with the following error:
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

Does this mean the model cannot be trained on the ROCm platform?

Thanks.

Contents of run_pt.sh:
HIP_VISIBLE_DEVICES=0 python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-qwen-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

Full error output:
2024-05-08 08:32:59.501 | INFO | __main__:main:381 - Script args: ScriptArguments(use_peft=True, target_modules='all', lora_rank=8, lora_dropout=0.05, lora_alpha=16.0, modules_to_save=None, peft_path=None, qlora=False)
2024-05-08 08:32:59.501 | INFO | __main__:main:382 - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
2024-05-08 08:33:00.792 | INFO | __main__:main:492 - train files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:00.792 | INFO | __main__:main:502 - eval files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:01.847 | INFO | __main__:main:534 - Raw datasets: DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 3876
})
validation: Dataset({
features: ['text'],
num_rows: 3876
})
})
2024-05-08 08:33:02.298 | DEBUG | __main__:main:597 - Num train_samples: 1230
2024-05-08 08:33:02.298 | DEBUG | __main__:main:598 - Tokenized training example:
2024-05-08 08:33:02.300 | DEBUG | __main__:main:599 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
2024-05-08 08:33:02.301 | DEBUG | __main__:main:611 - Num eval_samples: 10
2024-05-08 08:33:02.301 | DEBUG | __main__:main:612 - Tokenized eval example:
2024-05-08 08:33:02.303 | DEBUG | __main__:main:613 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 780, in
main()
File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 660, in main
model = model_class.from_pretrained(
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3609, in from_pretrained
max_memory = get_balanced_memory(
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 910, in get_balanced_memory
max_memory = get_max_memory(max_memory)
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in get_max_memory
max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in
max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
File "/home/lyrccla/pytorch/torch/cuda/memory.py", line 663, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
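
For reference, the traceback shows the failure inside accelerate's get_max_memory(), which calls torch.cuda.mem_get_info() for every visible device when --device_map auto is used. A minimal diagnostic sketch (not part of run_pt.sh) that runs the same call outside the training script:

HIP_VISIBLE_DEVICES=0 python -c "
import torch
# A ROCm wheel reports a HIP runtime version here; a CUDA-only wheel reports None.
print('hip runtime :', torch.version.hip)
print('gpu visible :', torch.cuda.is_available())
# This is the call that raises 'HIP error: invalid argument' in the traceback above.
print('mem_get_info:', torch.cuda.mem_get_info(0))
"

If this one-liner fails the same way, the problem is in the torch/ROCm install rather than in pretraining.py.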

@liuyang6055 liuyang6055 added the bug Something isn't working label May 8, 2024
@shibing624
Owner

It looks like torch isn't compatible with the AMD GPU; I haven't tested on AMD GPUs.

You could try the free T4 on Google Colab.

@ZiHAO-LI-cmd

Installing the ROCm build of torch makes it run successfully:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
But it errors when DeepSpeed is used:

 24%|██▍       | 50/206 [00:16<00:34,  4.48it/s]

100%|██████████| 1/1 [00:00<00:00, 44.95it/s]

/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/utils/checkpoint.py:434: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(

 25%|██▍       | 51/206 [00:16<00:43,  3.60it/s]
 25%|██▌       | 52/206 [00:16<00:38,  4.01it/s]
 26%|██▌       | 53/206 [00:16<00:35,  4.35it/s]
 26%|██▌       | 54/206 [00:17<00:34,  4.37it/s]
 27%|██▋       | 55/206 [00:17<00:34,  4.41it/s]
 27%|██▋       | 56/206 [00:17<00:33,  4.44it/s]
 28%|██▊       | 57/206 [00:17<00:31,  4.70it/s]
 28%|██▊       | 58/206 [00:17<00:30,  4.91it/s]
 29%|██▊       | 59/206 [00:18<00:30,  4.76it/s]

(identical tracebacks were raised by each of the four worker processes)

Traceback (most recent call last):
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 779, in <module>
    main()
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 740, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 3324, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2160, in step
    self._take_model_step(lr_kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2066, in _take_model_step
    self.optimizer.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1829, in step
    self._update_scale(self.overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.

 29%|██▊       | 59/206 [00:18<00:46,  3.17it/s]
[2024-07-21 17:05:32,038] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 87859) of binary: /scratch/project_462000506/members/zihao/train_AMD_env/bin/python
Traceback (most recent call last):
  File "/scratch/project_462000506/members/zihao/train_AMD_env/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
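
For context, the exception is raised by DeepSpeed's fp16 dynamic loss scaler (deepspeed/runtime/fp16/loss_scaler.py), which aborts once the loss scale can no longer be reduced. Since run_pt.sh trains with --bf16, one thing worth checking is whether the DeepSpeed config passed to torchrun also selects bf16 rather than fp16. A minimal sketch of such a config (the file name and exact settings are assumptions, not the repo's shipped config):

cat > ds_config_bf16.json <<'EOF'
{
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "zero_optimization": { "stage": 2 },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF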

@shibing624
Owner

You can skip DeepSpeed: run on a single GPU first, then use torchrun for multi-GPU.
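
A sketch of that sequence, reusing the flags from run_pt.sh above (the placeholder <run_pt.sh arguments> stands for the full argument list shown earlier, and the GPU count of 4 is an assumption):

# single GPU first
HIP_VISIBLE_DEVICES=0 python pretraining.py <run_pt.sh arguments>
# then multi-GPU with torchrun (no DeepSpeed)
torchrun --nproc_per_node 4 pretraining.py <run_pt.sh arguments>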

@ZiHAO-LI-cmd

You can skip DeepSpeed: run on a single GPU first, then use torchrun for multi-GPU.

Yes, running multi-GPU directly with torchrun works fine; the error only appears when DeepSpeed is added.
