run_pt.sh fails on AMD #371
Labels: bug
Comments
Looks like torch isn't compatible with AMD; I haven't tested on AMD GPUs. You could try the free T4 on Google Colab.
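For reference, a quick way to check whether the installed torch wheel actually targets ROCm, and to install one if it does not (a sketch only: the rocm6.0 index below is just an example and should be matched to the ROCm release on the machine):

# Print the torch version, the HIP/ROCm version it was built against (None on CUDA-only builds), and GPU visibility
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
# Install a ROCm build of torch from the PyTorch wheel index (adjust rocm6.0 to your local ROCm version)
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/rocm6.0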
Installing the ROCm build of torch lets it run successfully:
 24%|██▍ | 50/206 [00:16<00:34, 4.48it/s]
100%|██████████| 1/1 [00:00<00:00, 44.95it/s]
/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/utils/checkpoint.py:434: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
 25%|██▍ | 51/206 [00:16<00:43, 3.60it/s]
 25%|██▌ | 52/206 [00:16<00:38, 4.01it/s]
 26%|██▌ | 53/206 [00:16<00:35, 4.35it/s]
 26%|██▌ | 54/206 [00:17<00:34, 4.37it/s]
 27%|██▋ | 55/206 [00:17<00:34, 4.41it/s]
 27%|██▋ | 56/206 [00:17<00:33, 4.44it/s]
 28%|██▊ | 57/206 [00:17<00:31, 4.70it/s]
 28%|██▊ | 58/206 [00:17<00:30, 4.91it/s]
 29%|██▊ | 59/206 [00:18<00:30, 4.76it/s]
(the same traceback was raised concurrently on every rank)
Traceback (most recent call last):
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 779, in <module>
    main()
  File "/pfs/lustrep3/scratch/project_462000506/members/zihao/train/MedicalGPT/pretraining.py", line 740, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/transformers/trainer.py", line 3324, in training_step
    self.accelerator.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/accelerator.py", line 2143, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
    self.engine.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2160, in step
    self._take_model_step(lr_kwargs)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2066, in _take_model_step
    self.optimizer.step()
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1829, in step
    self._update_scale(self.overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2067, in _update_scale
    self.loss_scaler.update_scale(has_overflow)
  File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale
    raise Exception(
Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
 29%|██▊ | 59/206 [00:18<00:46, 3.17it/s]
[2024-07-21 17:05:32,038] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 87859) of binary: /scratch/project_462000506/members/zihao/train_AMD_env/bin/python
Traceback (most recent call last):
File "/scratch/project_462000506/members/zihao/train_AMD_env/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/project_462000506/members/zihao/train_AMD_env/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
You can skip deepspeed: run on a single GPU first, then use torchrun for multi-GPU.
Yes, running multi-GPU directly with torchrun works fine; the error only appears once deepspeed is added.
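One thing worth noting: the "Current loss scale already at minimum" exception is raised by DeepSpeed's fp16 dynamic loss scaler, so it usually means the DeepSpeed config has fp16 enabled even though run_pt.sh trains in bf16. Below is a minimal config sketch that keeps ZeRO but switches the engine to bf16; the file name ds_config_bf16.json and the launch line are illustrative only, and pretraining.py is assumed to accept the standard HF Trainer --deepspeed argument.

# Write a DeepSpeed config that enables bf16 and disables the fp16 loss scaler
cat > ds_config_bf16.json <<'EOF'
{
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "zero_optimization": { "stage": 2 },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
EOF
# Launch with it (adjust --nproc_per_node to your GPU count; keep the remaining run_pt.sh arguments)
torchrun --nproc_per_node 4 pretraining.py --deepspeed ds_config_bf16.json \
    --bf16 --torch_dtype bfloat16 ...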
Hello, when the training environment is AMD ROCm, running run_pt.sh fails with the following error:
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
Does this mean the model cannot run on the ROCm platform?
Thank you.
Contents of run_pt.sh:
HIP_VISIBLE_DEVICES=0 python pretraining.py \
    --model_type auto \
    --model_name_or_path Qwen/Qwen1.5-0.5B-Chat \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 13 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 10 \
    --block_size 512 \
    --group_by_length True \
    --output_dir outputs-pt-qwen-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --bf16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache
Full error output:
2024-05-08 08:32:59.501 | INFO | main:main:381 - Script args: ScriptArguments(use_peft=True, target_modules='all', lora_rank=8, lora_dropout=0.05, lora_alpha=16.0, modules_to_save=None, peft_path=None, qlora=False)
2024-05-08 08:32:59.501 | INFO | main:main:382 - Process rank: 0, device: cuda:0, n_gpu: 1 distributed training: True, 16-bits training: False
/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
  warnings.warn(
2024-05-08 08:33:00.792 | INFO | main:main:492 - train files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:00.792 | INFO | main:main:502 - eval files: ['./data/pretrain/fever.txt', './data/pretrain/en_article_tail500.txt', './data/pretrain/tianlongbabu.txt']
2024-05-08 08:33:01.847 | INFO | main:main:534 - Raw datasets: DatasetDict({
train: Dataset({
features: ['text'],
num_rows: 3876
})
validation: Dataset({
features: ['text'],
num_rows: 3876
})
})
2024-05-08 08:33:02.298 | DEBUG | main:main:597 - Num train_samples: 1230
2024-05-08 08:33:02.298 | DEBUG | main:main:598 - Tokenized training example:
2024-05-08 08:33:02.300 | DEBUG | main:main:599 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
2024-05-08 08:33:02.301 | DEBUG | main:main:611 - Num eval_samples: 10
2024-05-08 08:33:02.301 | DEBUG | main:main:612 - Tokenized eval example:
2024-05-08 08:33:02.303 | DEBUG | main:main:613 - 第一章论
传染病是指由病原微生物,如朊粒、病毒、衣原体、立克次体、支原体(mycoplasma)细菌真菌、螺旋体和寄生虫,如原虫、蠕虫、医学昆虫感染人体后产生的有传染性、在一定条件下可造成流行的疾病。感染性疾病是指由病原体感染所致的疾病,包括传染病和非传染性感染性疾病。
传染病学是一门研究各种传染病在人体内外发生、发展、传播、诊断、治疗和预防规律的学科。重点研究各种传染病的发病机制、临床表现、诊断和治疗方法,同时兼顾流行病学和预防措施的研究,做到防治结合。
传染病学与其他学科有密切联系,其基础学科和相关学科包括病原生物学、分子生物学、免疫学、人体寄生虫学、流行病学、病理学、药理学和诊断学等。掌握这些学科的基本知识、基本理论和基本技能对学好传染病学起着非常重要的作用。
在人类历史长河中,传染病不仅威胁着人类的健康和生命,而且影响着人类文明的进程,甚至改写过人类历史。人类在与传染病较量过程中,取得了许多重大战果,19世纪以来,病原微生物的不断发现及其分子生物学的兴起,
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
  File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 780, in <module>
    main()
  File "/home/lyrccla/MGPT/MedicalGPT/pretraining.py", line 660, in main
    model = model_class.from_pretrained(
  File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3609, in from_pretrained
    max_memory = get_balanced_memory(
  File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 910, in get_balanced_memory
    max_memory = get_max_memory(max_memory)
  File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in get_max_memory
    max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
  File "/home/lyrccla/anaconda3/envs/py39/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 781, in <dictcomp>
    max_memory = {i: torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())}
  File "/home/lyrccla/pytorch/torch/cuda/memory.py", line 663, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
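A quick sanity check, since the failure above happens inside torch.cuda.mem_get_info() when --device_map auto asks accelerate to balance memory across devices (a sketch only; HIP_VISIBLE_DEVICES=0 mirrors the run_pt.sh setting):

# Verify that the ROCm torch build can enumerate the GPU and query its free/total memory;
# mem_get_info is exactly the call that raises "HIP error: invalid argument" above.
HIP_VISIBLE_DEVICES=0 python -c "import torch; print(torch.version.hip, torch.cuda.device_count()); print(torch.cuda.mem_get_info(0))"

If that one-liner already fails, the problem is the torch/ROCm install rather than this repo; if it succeeds, dropping --device_map auto from run_pt.sh is a reasonable next experiment.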