Problems about Usage of SyncSN #16
Comments
What problem? I will give an example soon.
@JiaminRen My code is stuck at dist.broadcast with no error message; the backend is nccl. Did you test the training code, or is there some configuration I missed?
Which task did you test, ImageNet or face recognition?
@JiaminRen I just tried to imitate your train code in face recognition to use SyncSN in my code, but
Have you changed any code? Just running the script
@JiaminRen I quickly tested the face_recognition train.py and unfortunately hit the same problem. Maybe I missed some system configuration.
@JiaminRen My system is Ubuntu 18.04 and I used Anaconda to install PyTorch. The program is stuck at dist.broadcast.
This is a distributed framework, and it should be run on multiple GPUs by using
Thanks,
Sorry to bother you again. When I used SyncSN, I got some different errors. I tried to imitate the way you use it in train.py, but my model outputs NaNs, which does not happen with SN. Another error is subprocess.CalledProcessError: Command returned non-zero exit 1. Do you have any idea? Thanks
Same error: the loss is NaN. Why?
My model is very simple, like ResNet or LeNet.
Very nice work! I tried to use your training code in face_recognition, but I ran into some problems. First, `rank = int(os.environ['RANK'])` and `world_size = int(os.environ['WORLD_SIZE'])` have no values, so I added `os.environ['RANK'] = str(0)` and `os.environ['WORLD_SIZE'] = str(4)`. Is that right? Second, my code is stuck at `dist.broadcast`; there is no error message, it just hangs. Could you give me some advice?
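For anyone hitting the same hang: hard-coding `RANK=0` and `WORLD_SIZE=4` inside a single process tells the process group to expect four participants, so if only one process is actually running, a collective such as `dist.broadcast` will most likely wait forever for the three ranks that never start. Below is a minimal sketch of a guard that fails early instead of silently hanging. `read_dist_env` is a hypothetical helper, not part of this repository, and it assumes the launcher exports `RANK` and `WORLD_SIZE` for each process.

```python
import os


def read_dist_env(env=None):
    """Return (rank, world_size) from launcher-provided variables.

    Hypothetical helper for illustration; not part of the repository.
    Raises instead of falling back to hard-coded values, because a
    made-up WORLD_SIZE makes collectives wait for ranks that do not exist.
    """
    env = os.environ if env is None else env
    missing = [key for key in ("RANK", "WORLD_SIZE") if key not in env]
    if missing:
        raise RuntimeError(
            "Missing " + ", ".join(missing) + ": start one process per GPU "
            "with a distributed launcher instead of hard-coding these values."
        )
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK={rank} is outside [0, {world_size})")
    return rank, world_size


# Typical use at the top of a training script (assumes PyTorch is installed
# and that each process was started by a launcher that sets the variables):
#   rank, world_size = read_dist_env()
#   torch.distributed.init_process_group(
#       backend="nccl", init_method="env://",
#       rank=rank, world_size=world_size)
```

With a check like this, running the script directly as a single process stops with a clear message rather than blocking at the first broadcast.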