imagenet in 18 minutes submission #53
Conversation
Thanks @yaroslavvb! I aim to review this and #54 by end of day Monday. However, I wouldn't expect the result to go live on the website until the end of the week (9/14). Please let me know if that is an issue.
9/14 sounds good to me
Looks good for the most part. I had some clarification questions and requested a few version numbers.
"momentum": 0.9, | ||
"weightDecay": 0.0001, | ||
"schedule": [ | ||
{"learning_rate": 1.8819957971572876, "example": 0}, |
What does example mean? Is this the same as iteration?
Example means image, i.e. at image 0 we used learning rate 1.8819957971572876.
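For intuition, here is a minimal sketch converting that counter into an approximate epoch; it assumes the standard ~1.28M-image ImageNet training set is seen once per epoch, which is my assumption rather than something stated in this submission:

```python
# Convert an "example" (image) counter from the schedule into an approximate
# epoch, assuming ~1.28M training images are seen per epoch (an assumption).
IMAGES_PER_EPOCH = 1_281_167

for example in (0, 7_389_440):
    print(f"example {example:>9,} ~ epoch {example / IMAGES_PER_EPOCH:.2f}")
```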
"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard", | ||
"authorEmail": "[email protected]", | ||
"framework": "PyTorch", | ||
"codeURL": "https://github.com/diux-dev/imagenet18", |
Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.
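A minimal sketch of one way to do that, assuming the repo is checked out locally; the JSON path below is a placeholder, and the field names follow the two options suggested above:

```python
# Record the current commit either inside codeURL or as a separate commitHash
# field. "ImageNet/train/submission.json" is a placeholder path.
import json
import subprocess

SUBMISSION = "ImageNet/train/submission.json"

commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with open(SUBMISSION) as f:
    sub = json.load(f)

# Option 1: point codeURL at the exact commit.
sub["codeURL"] = f"https://github.com/diux-dev/imagenet18/tree/{commit}"
# Option 2: keep codeURL as-is and add a separate field.
sub["commitHash"] = commit

with open(SUBMISSION, "w") as f:
    json.dump(sub, f, indent=2)
```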
{"learning_rate": 3.68815279006958, "example": 7389440}, | ||
{"learning_rate": 3.6901485919952393, "example": 7397632}, | ||
{"learning_rate": 3.6921443939208984, "example": 7405824}, | ||
{"learning_rate": 3.69414019584 |
To avoid any confusion, can you add the usedBlackList field as shown here? If you used all 50,000 images for validation, this value should be true. If not, please rerun with all 50,000 images in the validation set to be comparable with other results. More details are available in issue #36.
"optimizer": "SGD with Momentum", | ||
"momentum": 0.9, | ||
"weightDecay": 0.0001, | ||
"schedule": [ |
Your learning rate schedule looks complicated. Can you add a high-level description as a separate field in misc? Here is an example.
It does not have a simple description: some parts of the schedule were due to bugs in the learning rate scheduler, and fixing those bugs made convergence worse, so we kept the buggy version. I'll try my best though.
"version": "v1.0", | ||
"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard", | ||
"authorEmail": "[email protected]", | ||
"framework": "PyTorch", |
Please add the version number.
"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard", | ||
"authorEmail": "[email protected]", | ||
"framework": "PyTorch", | ||
"codeURL": "https://github.com/diux-dev/imagenet18", |
From the link, it looks like the following is used to reproduce this result:

    pip install -r requirements.txt
    aws configure (or set your AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION)
    python train.py  # pre-warming
    python train.py

If that is true, what does the pre-warming do?
We are using AWS io2 root disks initialized from an AMI, created on the fly. These disks are created lazily, so the first time you access them, the data is copied from S3, which adds 10 minutes to the run time. When you run again, the disks are reused, so you no longer pay the copy penalty.
That sounds reasonable to me. I wanted to make sure there was no pretraining or caching. Is it true that the data could be persisted on the io2 disks?
ptal
Yes, it is true. We are not loading from checkpoints; training starts from random parameters each time.
@yaroslavvb it seems like Travis is having some issues today, and the build for your commit failed but isn't showing on GitHub. Please fix these JSON issues. In case Travis fails again, you can run the tests locally by going to the root of this repo and running the following:

    pip install -r requirements.txt
    pytest
"optimizer": "SGD with Momentum", | ||
"momentum": 0.9, | ||
"weightDecay": 0.0001, | ||
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]" |
This line is missing a comma.
{"learning_rate": 3.672186851501465, "example": 7323904}, | ||
{"learning_rate": 3.674182653427124, "example": 7332096}, | ||
{"learning_rate": 3.676178455352783, "example": 7340288}, | ||
|
This line is missing a comma.
Fixed and ran through a JSON validator.
Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md: JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.
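As a small illustrative sketch of that convention (the author/model/hardware values below are placeholders, not the names this submission ended up using):

```python
# Build file names following the README convention:
#   [author name]_[model name]_[hardware tag]_[framework].json|tsv
author, model, hardware, framework = "bulatov", "resnet50", "16xp3.16xlarge", "pytorch"
stem = f"{author}_{model}_{hardware}_{framework}"
print(f"ImageNet/train/{stem}.json")
print(f"ImageNet/train/{stem}.tsv")
```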
"momentum": 0.9, | ||
"weightDecay": 0.0001, | ||
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]", | ||
"schedule": [ |
This is minor, but by any chance can you express the learning rate, batch size, and image size schedule in a format similar to #54?
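To illustrate one possible format, here is a minimal sketch that encodes the phase list from the schedule overview above as plain data and resolves the active settings at a given epoch. The resolver and its linear-ramp handling are my assumptions about the semantics, not the scheduler code used for this run:

```python
# The phase list from the "schedule overview" field, expressed as data.
lr = 1.88
phases = [
    {"ep": 0, "sz": 128, "bs": 64},
    {"ep": (0, 6), "lr": (lr, lr * 2)},          # linear warmup over epochs 0-6
    {"ep": 6, "bs": 128},
    {"ep": 6, "lr": lr * 2},
    {"ep": 16, "sz": 224, "bs": 64},
    {"ep": 16, "lr": lr},
    {"ep": 19, "bs": 192, "keep_dl": True},
    {"ep": 19, "lr": 2 * lr / (10 / 1.5)},
    {"ep": 31, "lr": 2 * lr / (100 / 1.5)},
    {"ep": 37, "sz": 288, "bs": 128, "min_scale": 0.5},
    {"ep": 37, "lr": 2 * lr / 100},
    {"ep": (38, 50), "lr": 2 * lr / 1000},
]

def settings_at(epoch):
    """Resolve the most recently applied sz/bs/lr settings at a given epoch."""
    state = {}
    for phase in phases:
        ep = phase["ep"]
        start, end = ep if isinstance(ep, tuple) else (ep, None)
        if epoch < start:
            continue
        for key, val in phase.items():
            if key == "ep":
                continue
            if isinstance(val, tuple) and end is not None:
                # Interpolate linearly across the epoch range (assumed semantics).
                frac = min((epoch - start) / (end - start), 1.0)
                state[key] = val[0] + frac * (val[1] - val[0])
            else:
                state[key] = val
    return state

print(settings_at(3))    # mid-warmup: 128px, bs=64
print(settings_at(20))   # 224px, bs=192, lr = 2*lr/(10/1.5)
print(settings_at(40))   # final phase: 288px, bs=128, lr = 2*lr/1000
```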
ptal
Looks good to me. I noticed the epochs in schedule, imageSize, and batchSize all add up to 40 while the TSV only has 38 lines. Was that because it reached 93% at 38 epochs and you didn't include the last 2 epochs in the TSV? I just want to make sure the learning rate schedule is correct.
Yes, that is correct; the TSV cuts off at the first epoch that reaches 93%.
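A minimal sketch of that truncation rule, assuming a TSV with a top-5 accuracy column; the file name and the top5Accuracy header below are placeholders rather than the actual submission files:

```python
# Keep TSV rows only up to the first epoch whose top-5 accuracy reaches 93%.
import csv

with open("ImageNet/train/submission.tsv") as f:       # placeholder path
    rows = list(csv.DictReader(f, delimiter="\t"))

kept = []
for row in rows:
    kept.append(row)
    if float(row["top5Accuracy"]) >= 93.0:             # placeholder column name
        break  # the first epoch at or above the target ends the TSV

print(f"kept {len(kept)} of {len(rows)} epochs")
```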
Great! Everything looks good. I'll merge everything in by tomorrow evening.
@yaroslavvb it is going to take me one more day to add this to the website. Sorry for the delay.
Hi @yaroslavvb, I had a couple of other questions on how to reproduce these results:

1. Do you have a list of AWS permissions needed to run this code? It seems that at minimum you need elasticfilesystem:DescribeFileSystems.
2. What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch?
3. Does the imagenet18 repository also need to be cloned on the AMI?

Thanks!
When I run this code, I get the following exception:
Any idea why? Thanks!
@deepakn94 Correct, our workflow has been to run on laptops.

"Do you have a list of AWS permissions needed to run this code -- it seems that at minimum you need elasticfilesystem:DescribeFileSystems?" I don't have a list of permissions; things just worked on my account from the beginning. Do you know a way to obtain the list of permissions I have? Filed tracking issue cybertronai/imagenet18_old#12.

"What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch?" Correct. The specification of the ImageNet folder structure is here: https://github.com/diux-dev/cluster/blob/master/pytorch/README.md#data-preparation It doesn't have to be source-compiled; PyTorch 0.4.1 also works, perhaps 6% slower.

"Does the imagenet18 repository also need to be cloned on the AMI?" No, the code is automatically shipped over EFS.

Regarding the error: this suggests that the account didn't get properly set up (no EFS was created). Added an issue to make this clearer: diux-dev/ncluster#14. Did you get any errors earlier in the process? It should complain when it fails to create EFS.
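For the permissions question above, one quick probe is a sketch like this; it checks only elasticfilesystem:DescribeFileSystems, not the full set of permissions the launcher needs:

```python
# Check whether the current AWS credentials can call DescribeFileSystems,
# the EFS permission the error above appears to hinge on.
import boto3
from botocore.exceptions import ClientError

try:
    boto3.client("efs").describe_file_systems(MaxItems=1)
    print("elasticfilesystem:DescribeFileSystems: allowed")
except ClientError as exc:
    print(f"elasticfilesystem:DescribeFileSystems: denied ({exc.response['Error']['Code']})")
```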
@deepakn94 to elaborate on permissions, it would be easier to track if you filed specific issues/suggestions on the https://github.com/diux-dev/imagenet18 repo.
Hi @yaroslavvb, I got this working yesterday after asking the questions on this thread -- it turns out that I just needed the
Interesting! Glad to hear you got it running, you might be the first external user.
Hi @yaroslavvb, I'm seeing a non-trivial number of runs hang -- both for 4 and 16 DGX-1s. This usually happens in the setup phase -- have you seen something like this? Also, what is the best way to run multiple trials of the same experiment? I have been running multiple Python processes, but this seems really wasteful, since the setup steps can / should be performed just once. Thanks again for the help!
I have observed a number of hangs early in experimentation; filed an issue here: pytorch/pytorch#9696. You may want to try "gdb -p" and look at the stack trace to see if it hangs in the same place. I have not seen these hangs in the last month or so, however; not sure if it's due to an upgrade of PyTorch (built from master) or something else. Occasionally there's a different kind of hang -- if any worker OOMs, the rest of the workers will hang. This was fixed by reducing batch size. Are you using actual DGX-1s or p3.16xlarge instances on AWS?
I'm using a
This seems to usually happen way before the PyTorch run starts. However, I just had a 16-machine run stall here while it was trying to switch datasets:
OK, this last one looks more like OOM. When you change the dataset, memory requirements change, and some versions of PyTorch (0.4.1) run out of memory. If you SSH into each of the 16 machines and attach to the tmux session, you'll probably find it crashed with OOM. The rest of the workers will hang forever. The version of PyTorch baked into the AMI (built from master a couple of weeks ago) shouldn't run out of memory.
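Rough arithmetic for why the dataset switch is where memory tips over, using the phase settings from the schedule overview: activation memory in a convnet grows roughly with batch size times image area, so the 288px / bs=128 phase is the heaviest. These are back-of-the-envelope ratios, not measured numbers:

```python
# Relative activation-memory estimate per phase, proportional to bs * sz^2.
# Phase settings are taken from the schedule overview earlier in the thread.
def rel_mem(bs, sz, ref_bs=64, ref_sz=224):
    return (bs * sz * sz) / (ref_bs * ref_sz * ref_sz)

for bs, sz in [(64, 128), (128, 128), (64, 224), (192, 224), (128, 288)]:
    print(f"bs={bs:3d} sz={sz}: {rel_mem(bs, sz):.2f}x the bs=64 @ 224px phase")
```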
I've actually hit this failure at this exact epoch quite frequently before upgrading PyTorch.
Ah, interesting. I'm using the AMI specified in the code (IMAGE_NAME = 'pytorch.imagenet.source.v7'). Is this not the right AMI to use?
That's the correct version. Can you try the 8-machine version and see if you still have hangs? That one should only be a minute slower.
I haven't run into any issues with the 8-machine version. We need to run the 8-machine and 16-machine versions for some experiments we're running internally. I guess I can try building my own AMI that uses PyTorch built from current master?
Also, do the hangs happen non-deterministically?
The other time hangs usually happen is here, during initialization (this is on a 4 machine run, one of the machines doesn't successfully initialize):

    2018-09-18 20:09:29.404726 2.imagenet: downloading /tmp/ncluster/2.imagenet.initialized
    2018-09-18 20:09:29.428946 3.imagenet: Checking for initialization status
    2018-09-18 20:09:29.429114 3.imagenet: downloading /tmp/ncluster/3.imagenet.initialized
    2018-09-18 20:09:29.431599 0.imagenet: Checking for initialization status
    2018-09-18 20:09:29.431674 0.imagenet: downloading /tmp/ncluster/0.imagenet.initialized
    2018-09-18 20:09:31.742750 2.imagenet: Initialize complete
    2018-09-18 20:09:31.743137 2.imagenet: To connect to 2.imagenet
    ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no ***@***.***
    tmux a
    2018-09-18 20:09:31.744758 3.imagenet: Initialize complete
    2018-09-18 20:09:31.744800 3.imagenet: To connect to 3.imagenet
    ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no ***@***.***
    tmux a
    2018-09-18 20:09:31.749249 0.imagenet: Initialize complete
    2018-09-18 20:09:31.749306 0.imagenet: To connect to 0.imagenet
    ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no ***@***.***
    tmux a
About the initial hang: sometimes Amazon gives you a bad machine. You can verify this by trying to SSH into the machine, and if that fails, going to the console, right-clicking on the instance, and choosing instance settings -> capture instance screenshot. If you see a bunch of error messages instead of a login prompt, it's an Amazon issue.

Also, occasionally (maybe 1 in 10 runs), I got a working but slow machine. Things just run, but everything is 3x slower.

When I dealt with PyTorch OOM errors, they indeed happened non-deterministically. Does your 16-machine run always hang at epoch 37? You can tell if your hang is due to OOM by changing the last couple of epochs to use a slightly smaller batch size, i.e. 96 instead of 128.

Note that the latest PyTorch switched to a new distributed backend last week, so if you build from source, the characteristics of the network exchange may be quite different.
Understood about the initial hang. I've tried the 16-machine run twice, and it's failed both times after epoch 37. I verified the second time that it was because of an OOM error. I will try to run these experiments with a batch size of 96 in the last phase then -- this would presumably change the convergence properties of the model slightly?
If you also lower the learning rate by a factor of 128/96, it should not affect convergence properties -- your SGD will traverse roughly the same distance after one epoch.
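As a concrete sketch of that scaling rule, using the last-phase learning rate from the schedule overview (the numbers are illustrative; the actual change would go wherever the phase list is defined):

```python
# Linear-scaling rule: if the last-phase batch size drops from 128 to 96,
# scale the learning rate by 96/128 (i.e. divide it by 128/96).
base_lr = 1.88
old_bs, new_bs = 128, 96

old_lr = 2 * base_lr / 100            # the ep-37 phase from the schedule overview
new_lr = old_lr * new_bs / old_bs

print(f"bs {old_bs} -> {new_bs}: lr {old_lr:.5f} -> {new_lr:.5f}")
```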
@codyaustun