Question about hyperparameter settings for META training #2

Open
pooplar opened this issue Feb 14, 2023 · 5 comments

Comments


pooplar commented Feb 14, 2023

I downloaded the META project from the GitHub link given in the abstract of your paper and trained it on four 2080Ti GPUs (12 GB each) with the default parameters in your config file, BASE_LR: 0.04 and IMS_PER_BATCH: 64. Training failed with "FloatingPointError: Loss became infinite or NaN at iteration=510!"; from what I have read, this usually means the learning rate is too large. Following the parameter settings in the experiments section of your paper, I then set BASE_LR: 0.0003 and IMS_PER_BATCH: 64. Training finished with this setting, but rank-1 is only 39.16 and mAP is 14.83, well below the performance reported in the paper. Is there something wrong with my training procedure? Could you share your hyperparameter settings?
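For anyone reproducing this: below is a minimal sketch, assuming this fork keeps the yacs-style config of upstream fastreid (fastreid.config.get_cfg with the SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH keys), of overriding the two settings discussed here from Python instead of editing the YAML. The config-file path is a placeholder, not a real file in the repo.

```python
# Minimal sketch (an assumption, not the authors' documented workflow): override the
# two hyperparameters discussed in this thread, assuming the project exposes the same
# yacs-style config as upstream fastreid. The config path below is a placeholder.
from fastreid.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file("projects/META/configs/example.yml")  # placeholder path
cfg.merge_from_list([
    "SOLVER.IMS_PER_BATCH", 64,   # total batch size across all 4 GPUs
    "SOLVER.BASE_LR", 0.0003,     # the value that allowed training to finish here
])
print(cfg.SOLVER.BASE_LR, cfg.SOLVER.IMS_PER_BATCH)
```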


luluaa commented Feb 15, 2023

Hello! I also ran into the problem you described when using the default BASE_LR: 0.04, IMS_PER_BATCH: 64. But after setting BASE_LR to 0.0003 and IMS_PER_BATCH to 64, I hit a different error at epoch/iter 2/1999. Do you or the author know what the problem is, or have you encountered it before?

-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/XXX/anaconda3/envs/meta/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "./fastreid/engine/launch.py", line 103, in _distributed_worker
main_func(*args)
File "/data2/XXX/Prj/META/META/projects/META/train_net.py", line 46, in main
return trainer.train()
File "./fastreid/engine/defaults.py", line 384, in train
super().train(self.start_epoch, self.max_epoch, self.iters_per_epoch)
File "./fastreid/engine/train_loop.py", line 146, in train
self.run_step()
File "./fastreid/engine/defaults.py", line 393, in run_step
self._trainer.run_step(self.cfg)
File "./fastreid/engine/train_loop.py", line 346, in run_step
loss_dict = self.model(data,self.iter)
File "/home/XXX/anaconda3/envs/meta/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "./fastreid/engine/apex/apex/parallel/distributed.py", line 564, in forward
result = self.module(*inputs, **kwargs)
File "/home/XXX/anaconda3/envs/meta/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "./fastreid/modeling/meta_arch/baseline.py", line 180, in forward
losses = self.losses(x_Expert2_output, F_final_output, x_agg_output, targets_agg, targets_expert,iters)
File "./fastreid/modeling/meta_arch/baseline.py", line 303, in losses
self._cfg.MODEL.LOSSES.CE.ALPHA,
File "./fastreid/modeling/losses/cross_entroy_loss.py", line 50, in cross_entropy_loss
non_zero_cnt = max(loss.nonzero(as_tuple=False).size(0), 1)
RuntimeError: copy_if failed to synchronize: cudaErrorAssert: device-side assert triggered
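A device-side assert raised from the cross-entropy loss is usually CUDA's way of reporting a class label outside the valid range [0, num_classes). Below is a minimal diagnostic sketch in plain PyTorch (not this project's code) that surfaces the mismatch on the host side, together with the CUDA_LAUNCH_BLOCKING flag that makes the failing kernel report at its real call site:

```python
import os
import torch
import torch.nn.functional as F

# Set before any CUDA work so the failing kernel is reported at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def checked_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Raise a readable error instead of a device-side assert when a label is out of range."""
    num_classes = logits.size(1)
    bad = (targets < 0) | (targets >= num_classes)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} label(s) out of range: max label = {int(targets.max())}, "
            f"but the classifier only has {num_classes} classes"
        )
    return F.cross_entropy(logits, targets)
```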


luluaa commented Feb 15, 2023


The gradient overflow warning also shows up frequently during training:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0

pooplar (author) commented Feb 16, 2023

Hello, I haven't hit that second error. I trained with four GPUs, and although training completed successfully, the performance still lags behind the numbers reported in the paper. I emailed the author of the paper but got no reply.
I also saw the gradient overflow warnings, but training still finished successfully:
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
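These "Gradient overflow. Skipping step …" lines are the normal behaviour of apex's dynamic loss scaling under mixed precision: when an overflow is detected the optimizer step is skipped and the loss scale is halved, which is exactly the 32768 → 16384 → 8192 → 4096 sequence above. A rough sketch of the mechanism (an illustration, not apex's actual implementation):

```python
class DynamicLossScaler:
    """Toy model of dynamic loss scaling: halve on overflow, grow after a streak of good steps."""

    def __init__(self, init_scale: float = 65536.0, growth_interval: int = 2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should run, False if it must be skipped."""
        if found_overflow:
            self.scale /= 2.0
            self._good_steps = 0
            print(f"Gradient overflow. Skipping step, reducing loss scale to {self.scale}")
            return False
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2.0  # periodically try a larger scale again
        return True
```

As long as the scale settles after the first few iterations and the loss keeps decreasing, these messages are harmless, which matches the observation that training still completes.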


Xiaofu233 commented Jun 20, 2023


I have run into the same problem (the device-side assert from cross_entropy_loss at epoch 2). Have you managed to solve it? Thanks a lot.


luluaa commented Jun 29, 2023

To be honest, I've mostly forgotten the details, but if memory serves, the number of classes in config.yml needs to be increased by 1.
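That fix is consistent with the traceback above: the cross-entropy kernel asserts when a target ID is greater than or equal to the configured number of classes, so a class count that is one too small fails in exactly this way. A small self-contained sketch of the check (the values and the way the class count is read are illustrative, not the project's actual names):

```python
def required_num_classes(labels) -> int:
    """Classifier head size needed for 0-indexed integer ID labels."""
    return int(max(labels)) + 1

# Illustrative stand-ins: in practice `labels` would be every person-ID the training
# dataloader can emit, and `configured` the class count taken from config.yml.
labels = [0, 1, 2, 1000]
configured = 1000
needed = required_num_classes(labels)
if configured < needed:
    print(f"configured {configured} classes, but the labels need {needed}: "
          f"increase the class count in config.yml by {needed - configured}.")
```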
