
imagenet in 18 minutes submission #53

Merged: 9 commits into stanford-futuredata:master on Sep 16, 2018

Conversation

@yaroslavvb (Contributor) opened this pull request.

@codyaustun (Contributor):

Thanks @yaroslavvb! I aim to review this and #54 by end of day Monday. However, I wouldn't expect the result to go live on the website until the end of the week (9/14). Please let me know if that is an issue.

@yaroslavvb (Contributor Author) commented Sep 9, 2018 via email

@codyaustun (Contributor) left a comment:

Looks good for the most part. I had some clarification questions and requested a few version numbers.

"momentum": 0.9,
"weightDecay": 0.0001,
"schedule": [
{"learning_rate": 1.8819957971572876, "example": 0},

Contributor, on the lines above:

What does example mean? Is this the same as iteration?

Contributor Author:

Example means image, i.e., at image 0 we used learning rate 1.8819957971572876.
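
To make the relationship to iterations concrete, here is a small hedged sketch: the "example" values are cumulative image counts, so dividing by the global batch size gives an approximate iteration index. The batch size below is a placeholder, since the actual batch size changes across phases of this run.

# Hypothetical helper; global_batch_size is a placeholder, not the value used in the run.
def example_to_iteration(example, global_batch_size=8192):
    return example // global_batch_size

example_to_iteration(0)        # -> 0: the first schedule entry applies from the very first image
example_to_iteration(7389440)  # -> 902 with the placeholder batch size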

"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",
"codeURL": "https://github.com/diux-dev/imagenet18",

Contributor, on the lines above:

Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.
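
For illustration, the two options could look like this in the submission JSON (the hash below is a placeholder, not the actual commit):

"codeURL": "https://github.com/diux-dev/imagenet18/tree/<commit-hash>",

or

"codeURL": "https://github.com/diux-dev/imagenet18",
"commitHash": "<commit-hash>",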

{"learning_rate": 3.68815279006958, "example": 7389440},
{"learning_rate": 3.6901485919952393, "example": 7397632},
{"learning_rate": 3.6921443939208984, "example": 7405824},
{"learning_rate": 3.69414019584

Contributor, on the lines above:

To avoid any confusion, can you add the usedBlackList field as shown here? If you used all 50,000 images for validation, this value should be true. If not, please rerun with all 50,000 images in the validation set to be comparable with other results. More details are available in issue #36.
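
Following the instruction above, that would be one extra line in the submission JSON; the value shown assumes the all-50,000-image case described by the reviewer:

"usedBlackList": true,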

"optimizer": "SGD with Momentum",
"momentum": 0.9,
"weightDecay": 0.0001,
"schedule": [

Contributor, on the lines above:

Your learning rate schedule looks complicated. Can you add a high-level description as a separate field in misc? Here is an example.

Contributor Author:

It does not have a simple description, as some parts of the schedule were due to bugs in the learning rate scheduler -- fixing those bugs made convergence worse, so we kept the buggy version. I'll try my best though.

"version": "v1.0",
"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",

Contributor, on the lines above:

Please add the version number.

"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",
"codeURL": "https://github.com/diux-dev/imagenet18",

Contributor, on the lines above:

From the link, it looks like the following is used to reproduce this result:

pip install -r requirements.txt
aws configure  (or set your AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION)
python train.py  # pre-warming
python train.py 

If that is true, what does the pre-warming do?

@yaroslavvb (Contributor Author), Sep 10, 2018:

We are using AWS io2 root disks initialized from an AMI, created on the fly. These disks are created lazily, so the first time you access them the data is copied from S3, which adds 10 minutes to the run time. When you run again, the disks are reused, so you no longer pay the copy penalty.

Contributor:

That sounds reasonable to me. I wanted to make sure there was no pretraining or caching. Is it true that the data could be persisted on the io2 disks?

@yaroslavvb (Contributor Author):

ptal

@yaroslavvb (Contributor Author) commented Sep 10, 2018 via email

@codyaustun (Contributor) left a comment:

@yaroslavvb it seems like Travis is having some issues today, and the build for your commit failed but isn't showing on GitHub. Please fix these JSON issues. In case Travis fails again, you can run the tests locally by going to the root of this repo and running the following:

pip install -r requirements.txt
pytest

"optimizer": "SGD with Momentum",
"momentum": 0.9,
"weightDecay": 0.0001,
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]"

Contributor, on the line above:

This line is missing a comma

{"learning_rate": 3.672186851501465, "example": 7323904},
{"learning_rate": 3.674182653427124, "example": 7332096},
{"learning_rate": 3.676178455352783, "example": 7340288},

Contributor, on the lines above:

This line is missing a comma.

@yaroslavvb (Contributor Author):

fixed and ran through json validator

@codyaustun (Contributor) left a comment:

Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md:

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to
dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.
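
Applied to this submission, a renaming along those lines might look like the following; the model and hardware tags here are guesses purely for illustration, not the filenames actually used:

bulatov_resnet50_16xp3.16xlarge_pytorch.json
bulatov_resnet50_16xp3.16xlarge_pytorch.tsv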

"momentum": 0.9,
"weightDecay": 0.0001,
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]",
"schedule": [

Contributor, on the lines above:

This is minor, but by any chance can you express the learning rate, batch size, and image size schedule in a format similar to #54?

@yaroslavvb (Contributor Author):

ptal

@codyaustun (Contributor) left a comment:

Looks good to me. I noticed the epochs in schedule, imageSize, and batchSize all add up to 40 while the TSV only has 38 lines. Was that because it reached 93% at 38 epochs and you didn't include the last 2 epochs in the TSV? I just want to make sure the learning rate schedule is correct.

@yaroslavvb (Contributor Author) commented Sep 12, 2018 via email

@codyaustun (Contributor):

Great! Everything looks good. I'll merge everything in by tomorrow evening

@codyaustun (Contributor):

@yaroslavvb it is going to take me one more day to add this to the website. Sorry for the delay.

@deepakn94 (Contributor):

Hi @yaroslavvb, I had a couple of other questions on how to reproduce these results:

  • The train command can be run from a laptop, correct? It doesn't have to run on a worker machine on AWS?
  • Do you have a list of AWS permissions needed to run this code -- it seems that at minimum you need elasticfilesystem:DescribeFileSystems?
  • What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch? Does the imagenet18 repository also need to be cloned on the AMI, or is the code automatically shipped over to the worker machines?

Thanks!

@codyaustun merged commit e0d5423 into stanford-futuredata:master on Sep 16, 2018
@deepakn94 (Contributor):

When I run this code, I get the following exception:

2018-09-16 14:45:10.426365 0.imagenet: Setting up tmux
2018-09-16 14:45:13.052139 0.imagenet: Mounting EFS
Exception are  [KeyError('ncluster',)]
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 167, in main
    install_script=open('setup.sh').read())
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/ncluster.py", line 100, in make_job
    return _backend.make_job(name, run_name=run_name, num_tasks=num_tasks, install_script=install_script, **kwargs)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 775, in make_job
    raise exceptions[0]
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 761, in make_task_fn
    **kwargs)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 691, in make_task
    instance_type=instance_type)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 87, in __init__
    self._mount_efs()
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 139, in _mount_efs
    efs_id = u.get_efs_dict()[u.get_prefix()]
KeyError: 'ncluster'

Any idea why? Thanks!

@yaroslavvb (Contributor Author) commented Sep 17, 2018:

@deepakn94
"The train command can be run from a laptop, correct? It doesn't have to run on a worker machine on AWS?"

Correct, our workflow has been to run on laptops.

"Do you have a list of AWS permissions needed to run this code -- it seems that at minimum you need elasticfilesystem:DescribeFileSystems?"

I don't have a list of permissions; things just worked on my account from the beginning. Do you know a way to obtain the list of permissions I have? Filed tracking issue cybertronai/imagenet18_old#12.

"2. What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch?"

Correct. Specification of ImageNet folder structure is here: https://github.com/diux-dev/cluster/blob/master/pytorch/README.md#data-preparation

It doesn't have to be source-compiled; PyTorch 0.4.1 also works, perhaps 6% slower.

"2. Does the imagenet18 repository also need to be cloned on the AMI"

No, the code is automatically shipped over.

EFS error:

This suggests that the account didn't get properly set up (no EFS was created). Added an issue to make this clearer: diux-dev/ncluster#14.

Did you get any errors earlier in the process? It should complain when it fails to create the EFS.

@yaroslavvb (Contributor Author):

@deepakn94 to elaborate on permissions: train.py was created with the assumption of being run on a fresh Amazon account with admin-level permissions. In particular, it will try to create all infrastructure needed for the run, i.e. EFS, VPC, subnets, keypairs, and placement groups. So at a high level, write-level permissions for those resources are needed. EFS is needed to save persistent training logs. Creation of the VPC/subnets was needed to work on an EC2-Classic account which did not have a default VPC; perhaps this part is no longer necessary. Keypairs and placement groups are needed.
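
As a rough illustration of what write-level permissions for those resources could mean, here is a sketch of an IAM policy fragment. The action list is an assumption inferred from the resources named above, not taken from the repository, and is likely incomplete:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:CreateFileSystem",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:CreateMountTarget",
        "ec2:CreateVpc",
        "ec2:CreateSubnet",
        "ec2:CreateKeyPair",
        "ec2:CreatePlacementGroup",
        "ec2:CreateSecurityGroup",
        "ec2:RunInstances",
        "ec2:Describe*",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}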

It would be easier to track if you filed specific issues/suggestions on https://github.com/diux-dev/imagenet18 repo

@deepakn94 (Contributor):

Hi @yaroslavvb,
Thanks for the response!

I got this working yesterday after asking the questions on this thread -- it turns out that I just needed the elasticfilesystem:DescribeFileSystems privilege (I'm guessing other privileges are needed for launching instances, etc., but I already had those). The exception above was because I hadn't created an EFS with a Name tag of ncluster (I did this using the AWS web UI) -- it took me some time to figure out that I needed to do this, but things were smooth after that!
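
The same fix can also be scripted. A minimal sketch using boto3 (assuming credentials and region are already configured, and that an EFS with a Name tag of ncluster is what the lookup expects, as described above):

import boto3

# Create an EFS file system and tag it Name=ncluster, mirroring the manual web-UI step.
efs = boto3.client("efs")
fs = efs.create_file_system(CreationToken="ncluster", PerformanceMode="generalPurpose")
efs.create_tags(
    FileSystemId=fs["FileSystemId"],
    Tags=[{"Key": "Name", "Value": "ncluster"}],
)
# Mount targets in the cluster's VPC/subnets may also be required before
# instances can actually mount the file system.
print("Created", fs["FileSystemId"])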

@yaroslavvb (Contributor Author):

Interesting; train.py is supposed to create the EFS for you. However, I'm using heuristics to decide whether to create it here, so if it failed halfway through on the first run, it may decide not to create it.

Glad to hear you got it running; you might be the first external user.

@deepakn94 (Contributor):

Hi @yaroslavvb,

I'm seeing a non-trivial number of runs hang -- both for 4 and 16 DGX-1s. This usually happens in the setup phase -- have you seen something like this?

Also, what is the best way to run multiple trials of the same experiment? I have been running multiple Python processes but this seems really wasteful, since the setup steps can / should be performed just once.

Thanks again for the help!

@yaroslavvb (Contributor Author):

I observed a number of hangs early in experimentation; filed an issue here: pytorch/pytorch#9696.

You may want to try "gdb -p" and look at the stack trace to see if it hangs in the same place.

I have not seen these hangs in the last month or so, however; not sure if it's due to the upgrade of PyTorch (built from master) or something else.

Occasionally there's a different kind of hang -- if any worker OOMs, the rest of the workers will hang. This was fixed by reducing the batch size. Are you using actual DGX-1s or p3.16xlarge instances on AWS?

@deepakn94 (Contributor) commented Sep 19, 2018:

I'm using a p3.16xlarge instance on AWS.

This seems to usually happen way before the PyTorch run starts.

However, I just had a 16-machine run stall here while it was trying to switch datasets:

Epoch: [34][5/53]       Time 0.396 (1.135)      Loss 1.1164 (1.0954)    Acc@1 73.226 (73.747)   Acc@5 89.543 (89.967)   Data 0.030 (0.572)      BW 0.724 0.724
Epoch: [34][10/53]      Time 0.410 (0.784)      Loss 1.1066 (1.0946)    Acc@1 73.478 (73.752)   Acc@5 90.011 (90.008)   Data 0.034 (0.310)      BW 1.980 1.979
Epoch: [34][15/53]      Time 0.398 (0.654)      Loss 1.1052 (1.0971)    Acc@1 73.433 (73.646)   Acc@5 89.604 (89.984)   Data 0.010 (0.218)      BW 2.202 2.200
Epoch: [34][20/53]      Time 0.384 (0.589)      Loss 1.1084 (1.0991)    Acc@1 73.446 (73.580)   Acc@5 89.852 (89.955)   Data 0.041 (0.174)      BW 2.187 2.185
Epoch: [34][25/53]      Time 0.361 (0.546)      Loss 1.1007 (1.0995)    Acc@1 73.751 (73.601)   Acc@5 89.921 (89.945)   Data 0.010 (0.146)      BW 2.320 2.318
Epoch: [34][30/53]      Time 0.361 (0.519)      Loss 1.0831 (1.0985)    Acc@1 74.373 (73.644)   Acc@5 90.116 (89.946)   Data 0.019 (0.128)      BW 2.254 2.252
Epoch: [34][35/53]      Time 0.354 (0.500)      Loss 1.0976 (1.0980)    Acc@1 73.560 (73.645)   Acc@5 89.909 (89.956)   Data 0.009 (0.114)      BW 2.203 2.201
Epoch: [34][40/53]      Time 0.305 (0.482)      Loss 1.0833 (1.0980)    Acc@1 73.873 (73.640)   Acc@5 90.023 (89.955)   Data 0.009 (0.103)      BW 2.467 2.465
Epoch: [34][45/53]      Time 0.312 (0.465)      Loss 1.1108 (1.0985)    Acc@1 73.462 (73.629)   Acc@5 90.007 (89.962)   Data 0.007 (0.094)      BW 2.599 2.598
Epoch: [34][50/53]      Time 0.303 (0.452)      Loss 1.0847 (1.0982)    Acc@1 73.816 (73.622)   Acc@5 90.283 (89.958)   Data 0.007 (0.087)      BW 2.590 2.588
Epoch: [34][53/53]      Time 0.139 (0.443)      Loss 1.2487 (1.0987)    Acc@1 70.913 (73.605)   Acc@5 88.582 (89.950)   Data 0.007 (0.084)      BW 3.160 3.157
Test:  [34][2/2]        Time 0.147 (1.493)      Loss 1.0926 (1.0645)    Acc@1 72.937 (73.358)   Acc@5 91.093 (91.374)
~~34    0.30967         73.358          91.374

Epoch: [35][5/53]       Time 0.447 (1.124)      Loss 1.1017 (1.0925)    Acc@1 73.674 (73.824)   Acc@5 89.819 (89.978)   Data 0.037 (0.590)      BW 0.728 0.728
Epoch: [35][10/53]      Time 0.416 (0.779)      Loss 1.0988 (1.0929)    Acc@1 73.706 (73.820)   Acc@5 89.742 (89.960)   Data 0.037 (0.323)      BW 2.021 2.021
Epoch: [35][15/53]      Time 0.386 (0.651)      Loss 1.1113 (1.0938)    Acc@1 73.515 (73.786)   Acc@5 89.693 (89.968)   Data 0.018 (0.229)      BW 2.230 2.228
Epoch: [35][20/53]      Time 0.377 (0.586)      Loss 1.0699 (1.0932)    Acc@1 74.276 (73.771)   Acc@5 90.214 (89.990)   Data 0.021 (0.180)      BW 2.207 2.205
Epoch: [35][25/53]      Time 0.351 (0.545)      Loss 1.1054 (1.0926)    Acc@1 73.722 (73.782)   Acc@5 89.868 (89.997)   Data 0.012 (0.152)      BW 2.269 2.268
Epoch: [35][30/53]      Time 0.368 (0.518)      Loss 1.1176 (1.0931)    Acc@1 73.242 (73.759)   Acc@5 89.567 (89.994)   Data 0.018 (0.132)      BW 2.277 2.275
Epoch: [35][35/53]      Time 0.353 (0.500)      Loss 1.1104 (1.0932)    Acc@1 73.726 (73.764)   Acc@5 89.746 (89.992)   Data 0.016 (0.118)      BW 2.219 2.217
Epoch: [35][40/53]      Time 0.317 (0.481)      Loss 1.1106 (1.0937)    Acc@1 73.478 (73.766)   Acc@5 89.791 (89.990)   Data 0.012 (0.107)      BW 2.491 2.488
Epoch: [35][45/53]      Time 0.304 (0.464)      Loss 1.0941 (1.0931)    Acc@1 73.735 (73.773)   Acc@5 90.096 (89.994)   Data 0.006 (0.098)      BW 2.654 2.651
Epoch: [35][50/53]      Time 0.308 (0.451)      Loss 1.0968 (1.0933)    Acc@1 73.531 (73.760)   Acc@5 89.876 (89.990)   Data 0.006 (0.090)      BW 2.615 2.612
Epoch: [35][53/53]      Time 0.129 (0.442)      Loss 1.2904 (1.0941)    Acc@1 69.441 (73.741)   Acc@5 88.131 (89.979)   Data 0.006 (0.087)      BW 3.119 3.116
Test:  [35][2/2]        Time 0.232 (1.488)      Loss 1.0978 (1.0666)    Acc@1 72.864 (73.304)   Acc@5 91.012 (91.304)
~~35    0.31711         73.304          91.304

Epoch: [36][5/53]       Time 0.371 (1.174)      Loss 1.1028 (1.0829)    Acc@1 73.824 (74.083)   Acc@5 89.726 (90.086)   Data 0.020 (0.531)      BW 0.701 0.700
Epoch: [36][10/53]      Time 0.397 (0.796)      Loss 1.0840 (1.0861)    Acc@1 73.604 (73.918)   Acc@5 90.226 (90.116)   Data 0.027 (0.285)      BW 2.055 2.053
Epoch: [36][15/53]      Time 0.385 (0.659)      Loss 1.0857 (1.0860)    Acc@1 74.219 (73.953)   Acc@5 90.039 (90.094)   Data 0.012 (0.201)      BW 2.255 2.255
Epoch: [36][20/53]      Time 0.358 (0.591)      Loss 1.0993 (1.0885)    Acc@1 73.348 (73.876)   Acc@5 89.917 (90.051)   Data 0.012 (0.158)      BW 2.238 2.235
Epoch: [36][25/53]      Time 0.376 (0.552)      Loss 1.0885 (1.0874)    Acc@1 73.608 (73.896)   Acc@5 90.043 (90.052)   Data 0.024 (0.134)      BW 2.177 2.175
Epoch: [36][30/53]      Time 0.375 (0.523)      Loss 1.0936 (1.0868)    Acc@1 73.389 (73.880)   Acc@5 90.181 (90.076)   Data 0.035 (0.117)      BW 2.252 2.251
Epoch: [36][35/53]      Time 0.382 (0.502)      Loss 1.0706 (1.0873)    Acc@1 74.150 (73.869)   Acc@5 90.365 (90.069)   Data 0.013 (0.104)      BW 2.275 2.272
Epoch: [36][40/53]      Time 0.310 (0.483)      Loss 1.0875 (1.0881)    Acc@1 74.007 (73.858)   Acc@5 89.921 (90.054)   Data 0.011 (0.095)      BW 2.492 2.491
Epoch: [36][45/53]      Time 0.306 (0.466)      Loss 1.0923 (1.0881)    Acc@1 73.820 (73.852)   Acc@5 90.072 (90.064)   Data 0.007 (0.086)      BW 2.615 2.612
Epoch: [36][50/53]      Time 0.303 (0.452)      Loss 1.0981 (1.0878)    Acc@1 73.596 (73.854)   Acc@5 90.141 (90.073)   Data 0.007 (0.080)      BW 2.604 2.602
Epoch: [36][53/53]      Time 0.129 (0.442)      Loss 1.2620 (1.0881)    Acc@1 69.471 (73.844)   Acc@5 88.762 (90.069)   Data 0.007 (0.077)      BW 3.259 3.255
Test:  [36][2/2]        Time 0.108 (1.513)      Loss 1.0903 (1.0621)    Acc@1 73.025 (73.500)   Acc@5 91.117 (91.406)
~~36    0.32456         73.500          91.406

Dataset changed.
Image size: 288
Batch size: 128
Train Directory: /home/ubuntu/data/imagenet/train
Validation Directory: /home/ubuntu/data/imagenet/validation
Changing LR from 0.05639999999999999 to 0.037599999999999995

@yaroslavvb (Contributor Author):

OK, this last one looks more like OOM. When you change the dataset, memory requirements change, and some versions of PyTorch (0.4.1) run out of memory. If you SSH into each of the 16 machines and attach to the tmux session, you'll probably find one crashed with OOM. The rest of the workers will hang forever.

The version of PyTorch baked into the AMI (built from master a couple of weeks ago) shouldn't run out of memory.

@yaroslavvb (Contributor Author):

I've actually hit this failure at this exact epoch quite frequently before upgrading PyTorch.

@deepakn94 (Contributor) commented Sep 19, 2018 via email

@yaroslavvb (Contributor Author):

That's the correct version. Can you try the 8-machine version and see if you still have hangs? That one should only be a minute slower.

@deepakn94 (Contributor):

I haven't run into any issues with the 8-machine version.

We need to run the 8-machine and 16-machine versions for some experiments we're running internally. I guess I can try building my own AMI that uses PyTorch built from current master?

@deepakn94 (Contributor):

Also, do the hangs happen non-deterministically?

@deepakn94 (Contributor):

The other time hangs usually happen is here, during initialization (this is on a 4-machine run; one of the machines doesn't successfully initialize):

2018-09-18 20:09:29.404726 2.imagenet: downloading /tmp/ncluster/2.imagenet.initialized
2018-09-18 20:09:29.428946 3.imagenet: Checking for initialization status
2018-09-18 20:09:29.429114 3.imagenet: downloading /tmp/ncluster/3.imagenet.initialized
2018-09-18 20:09:29.431599 0.imagenet: Checking for initialization status
2018-09-18 20:09:29.431674 0.imagenet: downloading /tmp/ncluster/0.imagenet.initialized
2018-09-18 20:09:31.742750 2.imagenet: Initialize complete
2018-09-18 20:09:31.743137 2.imagenet: To connect to 2.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a
2018-09-18 20:09:31.744758 3.imagenet: Initialize complete
2018-09-18 20:09:31.744800 3.imagenet: To connect to 3.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a
2018-09-18 20:09:31.749249 0.imagenet: Initialize complete
2018-09-18 20:09:31.749306 0.imagenet: To connect to 0.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a

@yaroslavvb (Contributor Author) commented Sep 19, 2018 via email

@deepakn94 (Contributor):

Understood about the initial hang.

I've tried the 16-machine run twice, and it's failed both times after epoch 37. I verified the second time that it was because of an OOM error. I will try to run these experiments with a batch size of 96 in the last phase then -- this would presumably change the convergence properties of the model slightly?

@yaroslavvb (Contributor Author) commented Sep 19, 2018 via email
