RuntimeError is encountered when training cifar10_rnn_gate_rl_38 #7

Open
VictoriaYyz opened this issue Apr 11, 2019 · 4 comments

@VictoriaYyz

I get a RuntimeError when training cifar10_rnn_gate_rl_38:

04-11-19 09:10:start training cifar10_rnn_gate_rl_38
04-11-19 09:10:=> loading checkpoint ./save_checkpoints/cifar10_rnn_gate_38/model_best.pth.tar
04-11-19 09:10:=> loaded checkpoint ./save_checkpoints/cifar10_rnn_gate_38/model_best.pth.tar (iter: 59000)
Files already downloaded and verified
Files already downloaded and verified
start: 0
04-11-19 09:10:Iter [0] learning rate = 0.0001
Traceback (most recent call last):
File "train_rl.py", line 492, in <module>
main()
File "train_rl.py", line 121, in main
run_training(args)
File "train_rl.py", line 235, in run_training
R = r + args.gamma * R
File "/seu_share/home/zhanjun/anaconda3/envs/pytorch0.2/lib/python3.6/site-packages/torch/tensor.py", line 293, in __add__
return self.add(other)
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:217

I didn't change any of the default configuration. Please help. Thanks.

@xinw1012
Collaborator

Hi Victoria, what's your PyTorch version? It's likely some APIs have changed in the newer version of PyTorch. I'm working on updating the code to the new version and hope to release it soon. Thanks!

@VictoriaYyz
Author

I use PyTorch 0.2 and Python 3.6. My CUDA version is 9.0.
I googled the error, and it seems that the following code causes the problem:
R = - pred_loss.data
R = r + args.gamma * R
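
A "sizes do not match" error on this addition usually means that r and R have shapes that cannot be broadcast against each other. The thread does not show the real shapes, so the sketch below only illustrates the NumPy-style broadcasting rule (which PyTorch adopted in 0.2) in plain Python. The helper name broadcast_shape and the example shapes are hypothetical; printing r.size() and R.size() just before line 235 of train_rl.py would give the actual values to check.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    result = []
    # Compare dimensions from the trailing end; a missing dimension counts as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        # Two dimensions are compatible if they are equal or one of them is 1.
        if da != db and da != 1 and db != 1:
            raise ValueError(f"sizes do not match: {tuple(a)} vs {tuple(b)}")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# Hypothetical shapes: a per-sample reward and a scalar-like return broadcast fine.
print(broadcast_shape((256, 1), (1,)))

# Hypothetical shapes: two different batch sizes cannot be broadcast.
try:
    broadcast_shape((128, 1), (256, 1))
except ValueError as err:
    print(err)
```

If the two shapes turn out to differ only in batch size, the mismatch would come from how r and pred_loss are produced upstream rather than from the addition itself.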

@13597862

Excuse me, I encountered a bug, too. Can you run the code normally with the command "python3 train_rl.py train cifar10_rnn_gate_rl_110 --resume resnet-110-rnn-sp-cifar10.pth.tar -d cifar10 --gate-type rnn"?

The error output is as follows:
Traceback (most recent call last):
File "train_rl.py", line 492, in <module>
main()
File "train_rl.py", line 121, in main
run_training(args)
File "train_rl.py", line 217, in run_training
output, masks, probs = model(input_var)
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
TypeError: Caught TypeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wym/skipnet-master/cifar/models.py", line 1243, in forward
mask, gprob = self.control(gate_feature)
File "/home/wym/anaconda3/envs/python_auto/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wym/skipnet-master/cifar/models.py", line 1136, in forward
action = bi_prob.multinomial()
TypeError: multinomial() missing 1 required positional arguments: "num_samples"
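
This TypeError is a different problem from the size mismatch above: recent PyTorch releases require an explicit num_samples argument to multinomial(), whereas the call in models.py was written for an older API. The sketch below shows the newer call on a made-up probability tensor (the values and the reward are placeholders, not taken from the repository). Note that stochastic nodes and .reinforce() were removed in PyTorch 0.4, so if the training script relies on them, passing num_samples alone may not be enough; torch.distributions.Categorical is one common replacement for that pattern, shown here as an assumption about what a port might look like.

```python
import torch
from torch.distributions import Categorical

# Hypothetical stand-in for bi_prob: one row of two action probabilities per sample.
bi_prob = torch.tensor([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.7, 0.3]])

# Newer releases need num_samples; 1 draws one action index per row.
action = bi_prob.multinomial(num_samples=1)  # LongTensor of shape (4, 1)

# Possible replacement for the removed .reinforce() pattern:
# sample from a Categorical distribution and keep the log-probabilities.
dist = Categorical(probs=bi_prob)
sampled = dist.sample()             # shape (4,)
log_prob = dist.log_prob(sampled)   # shape (4,)

reward = torch.ones(4)              # placeholder reward, one value per sample
loss = -(log_prob * reward).mean()  # REINFORCE-style surrogate loss
```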

@akinsanyaayomide

I use PyTorch 0.2 and Python 3.6. My cuda version is 9.0. I google the error, it seems that the following code causes the problem. R = - pred_loss.data R = r + args.gamma * R

Hello @VictoriaYyz, have you been able to resolve this issue?
