ChatGLM full-parameter continued pretraining: loss immediately drops to 0, val_loss = nan #125

Closed
gloryyoung opened this issue Jul 27, 2023 · 13 comments
Labels: good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@gloryyoung

gloryyoung commented Jul 27, 2023

Bug description

Using the dataset bundled with the repo (天龙八部), full-parameter pretraining of ChatGLM-6B makes the loss drop to 0 very quickly, with eval_loss = NaN.
[screenshot: training log]

CUDA_VISIBLE_DEVICES=0,1,2,3 python pretraining.py \
    --model_type chatglm \
    --model_name_or_path ./chatglm-6b \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --num_train_epochs 1 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v2 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

Also, when I test the resulting continued-pretrained model with gradio, it errors out as well:
[screenshot: gradio error]

Something must have gone wrong during continued pretraining.
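Not from the original report, but since pretraining.py runs through the Hugging Face Trainer (the tracebacks later in this thread go through transformers/trainer.py), here is a minimal sketch of a callback that aborts the run as soon as the logged loss hits 0 or NaN, so a bad run fails fast instead of writing a broken checkpoint; the callback name is my own:

import math
from transformers import TrainerCallback

class StopOnBadLoss(TrainerCallback):
    """Stop training when the logged loss collapses to 0 or becomes NaN."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and (loss == 0.0 or math.isnan(loss)):
            control.should_training_stop = True
        return control

# Usage sketch (assumes access to the Trainer instance built in pretraining.py):
# trainer.add_callback(StopOnBadLoss())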

@gloryyoung added the bug ("Something isn't working") label on Jul 27, 2023
@shibing624
Owner

Reproduced; I'll look into how to fix it tomorrow.

shibing624 added a commit that referenced this issue Jul 28, 2023
@shibing624
Owner

shibing624 commented Jul 28, 2023

I tried full-parameter continued pretraining and full-parameter SFT with both chatglm-6b and chatglm2-6b, and both now work. Fixes:

  1. With torch_dtype=float16 the loss goes to 0. The explanation is that float16 does not have enough precision; use float32, or bfloat16 if the GPU supports it. For LLaMA models, setting float32 is enough to run successfully.
  2. Alternative route: since I have plenty of GPU memory, I manually set torch_dtype=float32 and then hit "expected scalar type Half but found Float"; see mymusise/ChatGLM-Tuning#179. ChatGLM's code path effectively behaves as if `torch_dtype=float16` had been passed, so it needs an explicit cast to float32; adding that cast fixed the problem (see the sketch below).
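
A minimal sketch of the kind of cast meant in point 2; the model path and the explicit .float() call are my illustration of the idea, not the exact change made in pretraining.py:

import torch
from transformers import AutoModel

# Request fp32 weights up front; ChatGLM's custom modeling code may still hand
# back half-precision modules, so cast explicitly afterwards to avoid
# "expected scalar type Half but found Float" during full-parameter training.
model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",          # or a local ./chatglm-6b checkpoint
    torch_dtype=torch.float32,
    trust_remote_code=True,
)
model = model.float()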

@gloryyoung
Author

I tried full-parameter continued pretraining and full-parameter SFT with both chatglm-6b and chatglm2-6b, and both now work. Fixes: [quoting the comment above]

Thanks a lot! I'll give it a try.

@NaCloudAI

NaCloudAI commented Jul 28, 2023

A question: after switching to float32, even 4× A100 80GB can't train chatglm2. Is that expected?

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 pretraining.py \
    --model_type chatglm \
    --model_name_or_path THUDM/chatglm2-6b \
    --train_file_dir ./data/pretrain \
    --validation_file_dir ./data/pretrain \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft False \
    --seed 42 \
    --max_train_samples 10000 \
    --max_eval_samples 10 \
    --num_train_epochs 0.5 \
    --learning_rate 2e-4 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --block_size 1024 \
    --output_dir outputs-pt-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True


tting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 1; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 2; 79.19 GiB total capacity; 75.50 GiB already allocated; 245.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/workspace/MedicalGPT/pretraining.py", line 663, in <module>
    main()
  File "/workspace/MedicalGPT/pretraining.py", line 635, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1888, in _inner_training_loop
    self.optimizer.step()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 142, in step
    self.optimizer.step(closure)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/optimization.py", line 457, in step
    state["exp_avg_sq"] = torch.zeros_like(p)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 428.00 MiB (GPU 0; 79.19 GiB total capacity; 75.50 GiB already allocated; 253.56 MiB free; 77.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|                                                                                                                                                                                                           | 0/8 [00:15<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2079) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
pretraining.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2080)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2081)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2082)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-28_04:38:59
  host      : 80b04983b729
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2079)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
root@80b04983b729:/workspace/MedicalGPT# 

@shibing624
Owner

You can train it with --use_peft True.
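
For context, a rough back-of-the-envelope estimate (the 6.2B parameter count is an assumption, and activations and temporary buffers come on top) of why a full fp32 AdamW replica per DDP rank cannot fit on an 80 GB card:

# Per-GPU memory for full-parameter fp32 AdamW under torchrun/DDP,
# where every rank holds a complete replica of weights + optimizer state.
params = 6.2e9                     # assumed parameter count for chatglm2-6b
bytes_per_param = 4 + 4 + 4 + 4    # fp32 weights + grads + Adam exp_avg + exp_avg_sq
print(f"{params * bytes_per_param / 2**30:.0f} GiB per GPU")  # ~92 GiB, already over 80 GiB

This is why LoRA (--use_peft True), or dropping DDP as discussed further down, gets around the OOM.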

@shibing624
Owner

shibing624 commented Jul 28, 2023

For a model as small as chatglm-6b, there is no need to insist on full-parameter training. In practice LoRA training is not worse than full-parameter training; with the trainable subset tuned appropriately it can even do better, and it also reduces overfitting when the sample size is small.
[screenshot: Xnip2023-07-28_13-20-12]

For reference, the official ChatGLM-6B fine-tuning comparison: https://github.com/THUDM/ChatGLM-6B/tree/main/ptuning
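
A minimal sketch of attaching LoRA to ChatGLM with peft, roughly what --use_peft True amounts to; the target_modules value is an assumed module name, not the repo's exact expansion of --target_modules all:

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", torch_dtype=torch.bfloat16, trust_remote_code=True
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=16, lora_dropout=0.05,  # same values as the commands above
    target_modules=["query_key_value"],     # assumed attention projection name
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable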

@shibing624 added the help wanted ("Extra attention is needed") and good first issue ("Good for newcomers") labels and removed the bug label on Jul 28, 2023
@shibing624 pinned this issue on Jul 28, 2023
@gloryyoung
Author

gloryyoung commented Jul 28, 2023

A question: after switching to float32, even 4× A100 80GB can't train chatglm2. Is that expected? [quoted command and OOM traceback omitted; see the full comment above]

I can run it on four 3090s with dtype bfloat16; the model is the first-generation ChatGLM-6B.
You could try dropping torchrun: with torchrun you get data parallelism, so every card has to load a full copy of the model.
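
A minimal sketch of the single-process alternative: launch with plain python pretraining.py (no torchrun) and let device_map="auto" shard the model's layers across the visible GPUs instead of replicating it on every rank; requires accelerate to be installed, and the model path is an example:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b",
    torch_dtype=torch.bfloat16,
    device_map="auto",        # spread layers across CUDA_VISIBLE_DEVICES
    trust_remote_code=True,
)
print(model.hf_device_map)    # shows which layers landed on which GPU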

@zhr0313

zhr0313 commented Aug 27, 2023

For a multi-node multi-GPU setup, how can the code be changed so it does not use data parallelism? Our two-node, two-GPU A100 setup can't run the data-parallel version:
node_rank=$1
echo ${node_rank}
master_addr="10.111.112.223"

torchrun --nproc_per_node 8 --nnodes 2 --master_addr ${master_addr} --master_port 14545 --node_rank ${node_rank} run_supervised_finetuning.py ...

@TomasAndersonFang

For a model as small as chatglm-6b, there is no need to insist on full-parameter training... [quoting the comment above]

A question: can four A100 40GB GPUs do full-parameter fine-tuning of llama-7b in bf16? I found that fp16 reproduces the problem in this issue, but after switching to bf16 I get OOM, even with batch_size set to 1.

@tszslovewanpu

@gloryyoung
I also removed torchrun; without it any number of cards works, but with it even two cards fail.

@xingenju

@tszslovewanpu @gloryyoung Could you share your config, the one without torchrun?

@tszslovewanpu

@xingenju
CUDA_VISIBLE_DEVICES=0,1 python supervised_finetuning.py \
    --model_type your_model \
    --model_name_or_path PATH \
    --train_file_dir DIR \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --use_peft True \
    --fp16 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 6 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 4 \
    --output_dir DIR \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache

@wangrx33

I tried full-parameter continued pretraining and full-parameter SFT with both chatglm-6b and chatglm2-6b, and both now work. Fixes: [quoting the comment above]

I hit this problem doing SFT on llama2 as well: after a few dozen steps the loss drops to 0, and torch_dtype=float32 doesn't help. The GPUs are 8× A100.
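
A quick way to confirm what dtype the weights actually ended up in after loading, since the fix above only works if the cast really takes effect; a hypothetical check, where model is whatever was loaded for SFT:

from collections import Counter

# Count parameters by dtype; expect all torch.float32 (or torch.bfloat16),
# with no torch.float16 left over, before starting full-parameter training.
print(Counter(p.dtype for p in model.parameters()))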
