Problems about Usage of SyncSN #16

stillwaterman · 2019-03-31T02:12:14Z

Very nice work! I tried to use your training code from face_recognition, but I ran into some problems. First, rank = int(os.environ['RANK']) and world_size = int(os.environ['WORLD_SIZE']) have no values, so I added os.environ['RANK']=str(0) and os.environ['WORLD_SIZE']=str(4). Is that right? Second, my code is stuck at dist.broadcast; there is no error message, it just hangs. Could you give me some advice?


JiaminRen · 2019-04-01T07:16:04Z

What problem? I will give an example soon.

stillwaterman · 2019-04-01T07:25:08Z

@JiaminRen My code is stuck at dist.broadcast with no error message; the backend is nccl. Have you tested the training code, or is there some configuration I missed?

JiaminRen · 2019-04-01T07:37:18Z

Which task did you test? ImageNet or face recognition?

stillwaterman · 2019-04-01T07:44:27Z

@JiaminRen I tried to follow your face recognition training code to use SyncSN in my own code, but I didn't succeed. I ran into two problems. First, rank = int(os.environ['RANK']) and world_size = int(os.environ['WORLD_SIZE']) have no values, so I added os.environ['RANK']=str(0) and os.environ['WORLD_SIZE']=str(4). Second, the code gets stuck at dist.broadcast.

JiaminRen · 2019-04-01T07:47:32Z

Have you changed any code? Just running the script face_recognition/train.sh should work.

stillwaterman · 2019-04-01T08:15:46Z

@JiaminRen I quickly tested the face_recognition train.py; unfortunately, I hit the same problems. Maybe I missed some system configuration.

stillwaterman · 2019-04-01T08:21:46Z

@JiaminRen My system is Ubuntu 18.04 and I used Anaconda to install PyTorch. The program is stuck at dist.broadcast.

JiaminRen · 2019-04-01T08:27:30Z

This is a distributed framework, and it should be run on multiple GPUs using torch.distributed.launch.
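To illustrate the point about the launcher, here is a minimal sketch (not the repository's code) of reading the variables that torch.distributed.launch sets for each worker process. The helper name `read_dist_env` is hypothetical. Hard-coding RANK=0 and WORLD_SIZE=4 in a single process most likely causes the hang described above: the process group expects four participants, only one exists, and a collective such as dist.broadcast waits for peers that never arrive.

```python
import os

# Variables that torch.distributed.launch exports for every worker process.
REQUIRED_VARS = ("RANK", "WORLD_SIZE")


def read_dist_env(env=None):
    """Return (rank, world_size) from launcher-provided variables.

    Hypothetical helper: it fails loudly when the script was started
    without a launcher, instead of silently hard-coding values.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise RuntimeError(
            "missing " + ", ".join(missing) + "; start the script with "
            "`python -m torch.distributed.launch "
            "--nproc_per_node=<num_gpus> train.py`"
        )
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    # A valid rank lies in [0, world_size).
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return rank, world_size
```

In a training script, this would be called before torch.distributed.init_process_group(backend="nccl"), so that each of the launched processes picks up its own rank rather than all of them claiming rank 0.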

stillwaterman · 2019-04-01T09:26:14Z

Thanks, torch.distributed.launch solved the problems. But the synchronized version consumes a lot of GPU memory, and I always run out of memory.

stillwaterman · 2019-07-05T03:19:47Z

Sorry to bother you again. When I used SyncSN, I got some different errors. I tried to follow the way you use it in train.py, but my model outputs NaNs, which does not happen with SN. Another error is subprocess.CalledProcessError: Command returned non-zero exit 1. Do you have any idea? Thanks.

henbucuoshanghai · 2024-05-13T06:31:01Z

Same error: the loss is NaN. Why?

henbucuoshanghai · 2024-05-13T06:31:53Z

My model is very simple, like ResNet or LeNet.
After changing BN to SN, the loss is NaN. Why?

stillwaterman closed this as completed Mar 31, 2019

stillwaterman reopened this Apr 1, 2019

stillwaterman changed the title ~~Usage of SyncSN~~ Problems about Usage of SyncSN Apr 1, 2019

stillwaterman closed this as completed Apr 2, 2019

stillwaterman reopened this Jul 5, 2019
