Problems about Usage of SyncSN #16

Open
stillwaterman opened this issue Mar 31, 2019 · 12 comments
Comments

@stillwaterman

stillwaterman commented Mar 31, 2019

Very nice work! I tried to use your training code from face_recognition, but I ran into some problems. First, rank = int(os.environ['RANK']) and world_size = int(os.environ['WORLD_SIZE']) fail because neither variable is set, so I added os.environ['RANK'] = str(0) and os.environ['WORLD_SIZE'] = str(4). Is that right? Second, my code gets stuck at dist.broadcast. There is no error message, it just hangs. Could you give me some advice?

@stillwaterman stillwaterman reopened this Apr 1, 2019
@stillwaterman stillwaterman changed the title Usage of SyncSN Problems about Usage of SyncSN Apr 1, 2019
@JiaminRen
Collaborator

What is the problem? I will provide an example soon.

@stillwaterman
Author

stillwaterman commented Apr 1, 2019

@JiaminRen my code is stuck at dist.broadcast with no error message. The backend is nccl. Have you tested the training code, or is there some configuration step I missed?

@JiaminRen
Collaborator

Which task did you test, imagenet or face recognition?

@stillwaterman
Author

@JiaminRen I tried to imitate your face recognition training code in order to use SyncSN in my own code, but I didn't succeed. I ran into two problems. First, rank = int(os.environ['RANK']) and world_size = int(os.environ['WORLD_SIZE']) fail because neither variable is set, so I added os.environ['RANK'] = str(0) and os.environ['WORLD_SIZE'] = str(4). Second, the code hangs at dist.broadcast.

@JiaminRen
Collaborator

Have you changed any code? Just running the script face_recognition/train.sh should work.

@stillwaterman
Author

stillwaterman commented Apr 1, 2019

@JiaminRen I quickly tested the face_recognition train.py and unfortunately hit the same problems. Maybe I missed some system configuration.

@stillwaterman
Author

@JiaminRen my system is Ubuntu 18.04 and I installed PyTorch with Anaconda. The program is stuck at dist.broadcast.

@JiaminRen
Collaborator

This is a distributed framework, and it should be run on multiple GPUs using torch.distributed.launch.
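
To illustrate the point, here is a minimal sketch, not the repository's actual code. It assumes the script is started with something like python -m torch.distributed.launch --nproc_per_node=4 train.py, which spawns one process per GPU and sets RANK and WORLD_SIZE for each of them. Setting RANK=0 and WORLD_SIZE=4 by hand in a single process instead leaves the process group waiting for three peers that never join, which is a likely cause of a silent hang at dist.broadcast. The helper names read_dist_env and init_distributed are hypothetical.

```python
import os


def read_dist_env(env=os.environ):
    """Return (rank, world_size) as provided by the launcher.

    Raises a descriptive error instead of a bare KeyError when the
    script was started without a launcher.
    """
    missing = [key for key in ("RANK", "WORLD_SIZE") if key not in env]
    if missing:
        raise RuntimeError(
            "missing " + ", ".join(missing) + "; start the script with "
            "`python -m torch.distributed.launch --nproc_per_node=N train.py`"
        )
    rank, world_size = int(env["RANK"]), int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return rank, world_size


def init_distributed(backend="nccl"):
    """Join the process group; every rank has to call this, or collectives block."""
    # Imported here so the environment check above works without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, world_size = read_dist_env()
    # Assumption: one process per GPU and the same GPU count on every node,
    # so the global rank modulo the device count picks the local GPU.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    # "env://" reads MASTER_ADDR and MASTER_PORT, which the launcher also sets.
    dist.init_process_group(
        backend=backend, init_method="env://", rank=rank, world_size=world_size
    )
    return rank, world_size
```

Calling init_distributed() once at the top of the training script, before any model is built, keeps the failure mode explicit: a run without the launcher stops with a clear message instead of hanging.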

@stillwaterman
Author

Thanks, torch.distributed.launch solves these problems. But the sync version consumes a lot of GPU memory, and I keep running out of memory.

@stillwaterman
Author

stillwaterman commented Jul 5, 2019

Sorry to bother you again. While using SyncSN I got some different errors. I tried to imitate the approach in your train.py, but my model outputs NaNs, which does not happen with SN. Another error is subprocess.CalledProcessError: Command returned non-zero exit 1. Do you have any idea? Thanks.

@henbucuoshanghai

Same error: the loss is NaN. Why?

@henbucuoshanghai

My model is very simple, like ResNet or LeNet.
After changing BN to SN, the loss is NaN. Why?


3 participants