
imagenet in 18 minutes submission #53

Merged: 9 commits into stanford-futuredata:master on Sep 16, 2018

Conversation

@yaroslavvb (Contributor) opened this pull request.

@codyaustun (Contributor):

Thanks @yaroslavvb! I aim to review this and #54 by end of day Monday. However, I wouldn't expect the result to go live on the website until the end of the week (9/14). Please let me know if that is an issue.

@yaroslavvb (Contributor Author) commented Sep 9, 2018 via email

@codyaustun (Contributor) left a comment:

Looks good for the most part. I had some clarification questions and requested a few version numbers.

"momentum": 0.9,
"weightDecay": 0.0001,
"schedule": [
{"learning_rate": 1.8819957971572876, "example": 0},

Contributor, on the lines above:

What does example mean? Is this the same as iteration?

Contributor Author:

Example means image, i.e., at image 0 we used learning rate 1.8819957971572876.
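
To make the relationship to iterations concrete, here is a small hedged sketch: the "example" values are cumulative image counts, so dividing by the global batch size gives an approximate iteration index. The batch size below is a placeholder, since the actual batch size changes across phases of this run.

# Hypothetical helper; global_batch_size is a placeholder, not the value used in the run.
def example_to_iteration(example, global_batch_size=8192):
    return example // global_batch_size

example_to_iteration(0)        # -> 0: the first schedule entry applies from the very first image
example_to_iteration(7389440)  # -> 902 with the placeholder batch size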

"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",
"codeURL": "https://github.com/diux-dev/imagenet18",

Contributor, on the lines above:

Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.
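
For illustration, the two options could look like this in the submission JSON (the hash below is a placeholder, not the actual commit):

"codeURL": "https://github.com/diux-dev/imagenet18/tree/<commit-hash>",

or

"codeURL": "https://github.com/diux-dev/imagenet18",
"commitHash": "<commit-hash>",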

{"learning_rate": 3.68815279006958, "example": 7389440},
{"learning_rate": 3.6901485919952393, "example": 7397632},
{"learning_rate": 3.6921443939208984, "example": 7405824},
{"learning_rate": 3.69414019584

Contributor, on the lines above:

To avoid any confusion, can you add the usedBlackList field as shown here? If you used all 50,000 images for validation, this value should be true. If not, please rerun with all 50,000 images in the validation set to be comparable with other results. More details are available in issue #36.
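
Following the instruction above, that would be one extra line in the submission JSON; the value shown assumes the all-50,000-image case described by the reviewer:

"usedBlackList": true,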

"optimizer": "SGD with Momentum",
"momentum": 0.9,
"weightDecay": 0.0001,
"schedule": [

Contributor, on the lines above:

Your learning rate schedule looks complicated. Can you add a high-level description as a separate field in misc? Here is an example.

Contributor Author:

It does not have a simple description, as some parts of the schedule were due to bugs in the learning rate scheduler -- fixing those bugs made convergence worse, so we kept the buggy version. I'll try my best though.

"version": "v1.0",
"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",

Contributor, on the lines above:

Please add the version number.

"author": "Yaroslav Bulatov, Andrew Shaw, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "PyTorch",
"codeURL": "https://github.com/diux-dev/imagenet18",

Contributor, on the lines above:

From the link, it looks like the following is used to reproduce this result:

pip install -r requirements.txt
aws configure  (or set your AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY/AWS_DEFAULT_REGION)
python train.py  # pre-warming
python train.py 

If that is true, what does the pre-warming do?

@yaroslavvb (Contributor Author), Sep 10, 2018:

We are using AWS io2 root disks initialized from an AMI, created on the fly. These disks are created lazily, so the first time you access them the data is copied from S3, which adds 10 minutes to the run time. When you run again, the disks are reused, so you no longer pay the copy penalty.

Contributor:

That sounds reasonable to me. I wanted to make sure there was no pretraining or caching. Is it true that the data could be persisted on the io2 disks?

@yaroslavvb (Contributor Author):

ptal

@yaroslavvb (Contributor Author) commented Sep 10, 2018 via email

@codyaustun (Contributor) left a comment:

@yaroslavvb it seems like Travis is having some issues today, and the build for your commit failed but isn't showing on GitHub. Please fix these JSON issues. In case Travis fails again, you can run the tests locally by going to the root of this repo and running the following:

pip install -r requirements.txt
pytest

"optimizer": "SGD with Momentum",
"momentum": 0.9,
"weightDecay": 0.0001,
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]"

Contributor, on the line above:

This line is missing a comma

{"learning_rate": 3.672186851501465, "example": 7323904},
{"learning_rate": 3.674182653427124, "example": 7332096},
{"learning_rate": 3.676178455352783, "example": 7340288},

Contributor, on the lines above:

This line is missing a comma.

@yaroslavvb (Contributor Author):

fixed and ran through json validator

@codyaustun (Contributor) left a comment:

Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md:

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to
dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.
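
Applied to this submission, a renaming along those lines might look like the following; the model and hardware tags here are guesses purely for illustration, not the filenames actually used:

bulatov_resnet50_16xp3.16xlarge_pytorch.json
bulatov_resnet50_16xp3.16xlarge_pytorch.tsv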

"momentum": 0.9,
"weightDecay": 0.0001,
"schedule overview": "Base learning rate lr=1.88, schedule consists of several linear scaling segments as well as manual learning rate changes, alongside changing image size and batch size: {'ep':0, 'sz':128, 'bs':64}, {'ep':(0,6), 'lr':(lr,lr*2)}, {'ep':6, 'bs':128,}, {'ep':6, 'lr':lr*2}, {'ep':16, 'sz':224,'bs':64}, {'ep':16, 'lr':lr}, {'ep':19, 'bs':192, 'keep_dl':True}, {'ep':19, 'lr':2*lr/(10/1.5)}, {'ep':31, 'lr':2*lr/(100/1.5)}, {'ep':37, 'sz':288, 'bs':128, 'min_scale':0.5}, {'ep':37, 'lr':2*lr/100}, {'ep':(38,50),'lr':2*lr/1000}]",
"schedule": [

Contributor, on the lines above:

This is minor, but by any chance can you express the learning rate, batch size, and image size schedule in a format similar to #54?

@yaroslavvb (Contributor Author):

ptal

@codyaustun (Contributor) left a comment:

Looks good to me. I noticed the epochs in schedule, imageSize, and batchSize all add up to 40 while the TSV only has 38 lines. Was that because it reached 93% at 38 epochs and you didn't include the last 2 epochs in the TSV? I just want to make sure the learning rate schedule is correct.

@yaroslavvb (Contributor Author) commented Sep 12, 2018 via email

@codyaustun (Contributor):

Great! Everything looks good. I'll merge everything in by tomorrow evening

@codyaustun (Contributor):

@yaroslavvb it is going to take me one more day to add this to the website. Sorry for the delay.

@deepakn94 (Contributor):

Hi @yaroslavvb, I had a couple of other questions on how to reproduce these results:

  • The train command can be run from a laptop, correct? It doesn't have to run on a worker machine on AWS?
  • Do you have a list of AWS permissions needed to run this code -- it seems that at minimum you need elasticfilesystem:DescribeFileSystems?
  • What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch? Does the imagenet18 repository also need to be cloned on the AMI, or is the code automatically shipped over to the worker machines?

Thanks!

@codyaustun merged commit e0d5423 into stanford-futuredata:master on Sep 16, 2018
@deepakn94 (Contributor):

When I run this code, I get the following exception:

2018-09-16 14:45:10.426365 0.imagenet: Setting up tmux
2018-09-16 14:45:13.052139 0.imagenet: Mounting EFS
Exception are  [KeyError('ncluster',)]
Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 167, in main
    install_script=open('setup.sh').read())
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/ncluster.py", line 100, in make_job
    return _backend.make_job(name, run_name=run_name, num_tasks=num_tasks, install_script=install_script, **kwargs)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 775, in make_job
    raise exceptions[0]
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 761, in make_task_fn
    **kwargs)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 691, in make_task
    instance_type=instance_type)
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 87, in __init__
    self._mount_efs()
  File "/Users/deepakn94/Documents/research/anaconda3/lib/python3.6/site-packages/ncluster/aws_backend.py", line 139, in _mount_efs
    efs_id = u.get_efs_dict()[u.get_prefix()]
KeyError: 'ncluster'

Any idea why? Thanks!

@yaroslavvb (Contributor Author) commented Sep 17, 2018:

@deepakn94
"The train command can be run from a laptop, correct? It doesn't have to run on a worker machine on AWS?"

Correct, our workflow has been to run on laptops.

"Do you have a list of AWS permissions needed to run this code -- it seems that at minimum you need elasticfilesystem:DescribeFileSystems?"

I don't have a list of permissions; things just worked on my account from the beginning. Do you know a way to obtain the list of permissions I have? Filed tracking issue cybertronai/imagenet18_old#12.

"2. What needs to be in the AMI used to run the job? A copy of the ImageNet dataset in the right format and a source-compiled PyTorch?"

Correct. Specification of ImageNet folder structure is here: https://github.com/diux-dev/cluster/blob/master/pytorch/README.md#data-preparation

It doesn't have to be source-compiled; PyTorch 0.4.1 also works, perhaps 6% slower.

"2. Does the imagenet18 repository also need to be cloned on the AMI"

No, the code is automatically shipped over.

EFS error:

This suggests that the account didn't get properly set up (no EFS was created). Added an issue to make this clearer: diux-dev/ncluster#14.

Did you get any errors earlier in the process? It should complain when it fails to create the EFS.

@yaroslavvb (Contributor Author):

@deepakn94 to elaborate on permissions: train.py was created with the assumption of being run on a fresh Amazon account with admin-level permissions. In particular, it will try to create all infrastructure needed for the run, i.e. EFS, VPC, subnets, keypairs, and placement groups. So at a high level, write-level permissions for those resources are needed. EFS is needed to save persistent training logs. Creation of the VPC/subnets was needed to work on an EC2-Classic account which did not have a default VPC; perhaps this part is no longer necessary. Keypairs and placement groups are needed.
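
As a rough illustration of what write-level permissions for those resources could mean, here is a sketch of an IAM policy fragment. The action list is an assumption inferred from the resources named above, not taken from the repository, and is likely incomplete:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticfilesystem:CreateFileSystem",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:CreateMountTarget",
        "ec2:CreateVpc",
        "ec2:CreateSubnet",
        "ec2:CreateKeyPair",
        "ec2:CreatePlacementGroup",
        "ec2:CreateSecurityGroup",
        "ec2:RunInstances",
        "ec2:Describe*",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }
  ]
}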

It would be easier to track if you filed specific issues/suggestions on https://github.com/diux-dev/imagenet18 repo

@deepakn94 (Contributor):

Hi @yaroslavvb,
Thanks for the response!

I got this working yesterday after asking the questions on this thread -- it turns out that I just needed the elasticfilesystem:DescribeFileSystems privilege (I'm guessing other privileges are needed for launching instances, etc., but I already had those). The exception above was because I hadn't created an EFS with a Name tag of ncluster (I did this using the AWS web UI) -- it took me some time to figure out that I needed to do this, but things were smooth after that!
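
The same fix can also be scripted. A minimal sketch using boto3 (assuming credentials and region are already configured, and that an EFS with a Name tag of ncluster is what the lookup expects, as described above):

import boto3

# Create an EFS file system and tag it Name=ncluster, mirroring the manual web-UI step.
efs = boto3.client("efs")
fs = efs.create_file_system(CreationToken="ncluster", PerformanceMode="generalPurpose")
efs.create_tags(
    FileSystemId=fs["FileSystemId"],
    Tags=[{"Key": "Name", "Value": "ncluster"}],
)
# Mount targets in the cluster's VPC/subnets may also be required before
# instances can actually mount the file system.
print("Created", fs["FileSystemId"])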

@yaroslavvb (Contributor Author):

Interesting; train.py is supposed to create the EFS for you. However, I'm using heuristics to decide whether to create it here, so if it failed halfway through on the first run, it may decide not to create it.

Glad to hear you got it running; you might be the first external user.

@deepakn94 (Contributor):

Hi @yaroslavvb,

I'm seeing a non-trivial number of runs hang -- both for 4 and 16 DGX-1s. This usually happens in the setup phase -- have you seen something like this?

Also, what is the best way to run multiple trials of the same experiment? I have been running multiple Python processes but this seems really wasteful, since the setup steps can / should be performed just once.

Thanks again for the help!

@yaroslavvb (Contributor Author):

I observed a number of hangs early in experimentation; filed an issue here: pytorch/pytorch#9696.

You may want to try "gdb -p" and look at the stack trace to see if it hangs in the same place.

I have not seen these hangs in the last month or so, however; not sure if it's due to the upgrade of PyTorch (built from master) or something else.

Occasionally there's a different kind of hang -- if any worker OOMs, the rest of the workers will hang. This was fixed by reducing the batch size. Are you using actual DGX-1s or p3.16xlarge instances on AWS?

@deepakn94 (Contributor) commented Sep 19, 2018:

I'm using a p3.16xlarge instance on AWS.

This seems to usually happen way before the PyTorch run starts.

However, I just had a 16-machine run stall here while it was trying to switch datasets:

Epoch: [34][5/53]       Time 0.396 (1.135)      Loss 1.1164 (1.0954)    Acc@1 73.226 (73.747)   Acc@5 89.543 (89.967)   Data 0.030 (0.572)      BW 0.724 0.724
Epoch: [34][10/53]      Time 0.410 (0.784)      Loss 1.1066 (1.0946)    Acc@1 73.478 (73.752)   Acc@5 90.011 (90.008)   Data 0.034 (0.310)      BW 1.980 1.979
Epoch: [34][15/53]      Time 0.398 (0.654)      Loss 1.1052 (1.0971)    Acc@1 73.433 (73.646)   Acc@5 89.604 (89.984)   Data 0.010 (0.218)      BW 2.202 2.200
Epoch: [34][20/53]      Time 0.384 (0.589)      Loss 1.1084 (1.0991)    Acc@1 73.446 (73.580)   Acc@5 89.852 (89.955)   Data 0.041 (0.174)      BW 2.187 2.185
Epoch: [34][25/53]      Time 0.361 (0.546)      Loss 1.1007 (1.0995)    Acc@1 73.751 (73.601)   Acc@5 89.921 (89.945)   Data 0.010 (0.146)      BW 2.320 2.318
Epoch: [34][30/53]      Time 0.361 (0.519)      Loss 1.0831 (1.0985)    Acc@1 74.373 (73.644)   Acc@5 90.116 (89.946)   Data 0.019 (0.128)      BW 2.254 2.252
Epoch: [34][35/53]      Time 0.354 (0.500)      Loss 1.0976 (1.0980)    Acc@1 73.560 (73.645)   Acc@5 89.909 (89.956)   Data 0.009 (0.114)      BW 2.203 2.201
Epoch: [34][40/53]      Time 0.305 (0.482)      Loss 1.0833 (1.0980)    Acc@1 73.873 (73.640)   Acc@5 90.023 (89.955)   Data 0.009 (0.103)      BW 2.467 2.465
Epoch: [34][45/53]      Time 0.312 (0.465)      Loss 1.1108 (1.0985)    Acc@1 73.462 (73.629)   Acc@5 90.007 (89.962)   Data 0.007 (0.094)      BW 2.599 2.598
Epoch: [34][50/53]      Time 0.303 (0.452)      Loss 1.0847 (1.0982)    Acc@1 73.816 (73.622)   Acc@5 90.283 (89.958)   Data 0.007 (0.087)      BW 2.590 2.588
Epoch: [34][53/53]      Time 0.139 (0.443)      Loss 1.2487 (1.0987)    Acc@1 70.913 (73.605)   Acc@5 88.582 (89.950)   Data 0.007 (0.084)      BW 3.160 3.157
Test:  [34][2/2]        Time 0.147 (1.493)      Loss 1.0926 (1.0645)    Acc@1 72.937 (73.358)   Acc@5 91.093 (91.374)
~~34    0.30967         73.358          91.374

Epoch: [35][5/53]       Time 0.447 (1.124)      Loss 1.1017 (1.0925)    Acc@1 73.674 (73.824)   Acc@5 89.819 (89.978)   Data 0.037 (0.590)      BW 0.728 0.728
Epoch: [35][10/53]      Time 0.416 (0.779)      Loss 1.0988 (1.0929)    Acc@1 73.706 (73.820)   Acc@5 89.742 (89.960)   Data 0.037 (0.323)      BW 2.021 2.021
Epoch: [35][15/53]      Time 0.386 (0.651)      Loss 1.1113 (1.0938)    Acc@1 73.515 (73.786)   Acc@5 89.693 (89.968)   Data 0.018 (0.229)      BW 2.230 2.228
Epoch: [35][20/53]      Time 0.377 (0.586)      Loss 1.0699 (1.0932)    Acc@1 74.276 (73.771)   Acc@5 90.214 (89.990)   Data 0.021 (0.180)      BW 2.207 2.205
Epoch: [35][25/53]      Time 0.351 (0.545)      Loss 1.1054 (1.0926)    Acc@1 73.722 (73.782)   Acc@5 89.868 (89.997)   Data 0.012 (0.152)      BW 2.269 2.268
Epoch: [35][30/53]      Time 0.368 (0.518)      Loss 1.1176 (1.0931)    Acc@1 73.242 (73.759)   Acc@5 89.567 (89.994)   Data 0.018 (0.132)      BW 2.277 2.275
Epoch: [35][35/53]      Time 0.353 (0.500)      Loss 1.1104 (1.0932)    Acc@1 73.726 (73.764)   Acc@5 89.746 (89.992)   Data 0.016 (0.118)      BW 2.219 2.217
Epoch: [35][40/53]      Time 0.317 (0.481)      Loss 1.1106 (1.0937)    Acc@1 73.478 (73.766)   Acc@5 89.791 (89.990)   Data 0.012 (0.107)      BW 2.491 2.488
Epoch: [35][45/53]      Time 0.304 (0.464)      Loss 1.0941 (1.0931)    Acc@1 73.735 (73.773)   Acc@5 90.096 (89.994)   Data 0.006 (0.098)      BW 2.654 2.651
Epoch: [35][50/53]      Time 0.308 (0.451)      Loss 1.0968 (1.0933)    Acc@1 73.531 (73.760)   Acc@5 89.876 (89.990)   Data 0.006 (0.090)      BW 2.615 2.612
Epoch: [35][53/53]      Time 0.129 (0.442)      Loss 1.2904 (1.0941)    Acc@1 69.441 (73.741)   Acc@5 88.131 (89.979)   Data 0.006 (0.087)      BW 3.119 3.116
Test:  [35][2/2]        Time 0.232 (1.488)      Loss 1.0978 (1.0666)    Acc@1 72.864 (73.304)   Acc@5 91.012 (91.304)
~~35    0.31711         73.304          91.304

Epoch: [36][5/53]       Time 0.371 (1.174)      Loss 1.1028 (1.0829)    Acc@1 73.824 (74.083)   Acc@5 89.726 (90.086)   Data 0.020 (0.531)      BW 0.701 0.700
Epoch: [36][10/53]      Time 0.397 (0.796)      Loss 1.0840 (1.0861)    Acc@1 73.604 (73.918)   Acc@5 90.226 (90.116)   Data 0.027 (0.285)      BW 2.055 2.053
Epoch: [36][15/53]      Time 0.385 (0.659)      Loss 1.0857 (1.0860)    Acc@1 74.219 (73.953)   Acc@5 90.039 (90.094)   Data 0.012 (0.201)      BW 2.255 2.255
Epoch: [36][20/53]      Time 0.358 (0.591)      Loss 1.0993 (1.0885)    Acc@1 73.348 (73.876)   Acc@5 89.917 (90.051)   Data 0.012 (0.158)      BW 2.238 2.235
Epoch: [36][25/53]      Time 0.376 (0.552)      Loss 1.0885 (1.0874)    Acc@1 73.608 (73.896)   Acc@5 90.043 (90.052)   Data 0.024 (0.134)      BW 2.177 2.175
Epoch: [36][30/53]      Time 0.375 (0.523)      Loss 1.0936 (1.0868)    Acc@1 73.389 (73.880)   Acc@5 90.181 (90.076)   Data 0.035 (0.117)      BW 2.252 2.251
Epoch: [36][35/53]      Time 0.382 (0.502)      Loss 1.0706 (1.0873)    Acc@1 74.150 (73.869)   Acc@5 90.365 (90.069)   Data 0.013 (0.104)      BW 2.275 2.272
Epoch: [36][40/53]      Time 0.310 (0.483)      Loss 1.0875 (1.0881)    Acc@1 74.007 (73.858)   Acc@5 89.921 (90.054)   Data 0.011 (0.095)      BW 2.492 2.491
Epoch: [36][45/53]      Time 0.306 (0.466)      Loss 1.0923 (1.0881)    Acc@1 73.820 (73.852)   Acc@5 90.072 (90.064)   Data 0.007 (0.086)      BW 2.615 2.612
Epoch: [36][50/53]      Time 0.303 (0.452)      Loss 1.0981 (1.0878)    Acc@1 73.596 (73.854)   Acc@5 90.141 (90.073)   Data 0.007 (0.080)      BW 2.604 2.602
Epoch: [36][53/53]      Time 0.129 (0.442)      Loss 1.2620 (1.0881)    Acc@1 69.471 (73.844)   Acc@5 88.762 (90.069)   Data 0.007 (0.077)      BW 3.259 3.255
Test:  [36][2/2]        Time 0.108 (1.513)      Loss 1.0903 (1.0621)    Acc@1 73.025 (73.500)   Acc@5 91.117 (91.406)
~~36    0.32456         73.500          91.406

Dataset changed.
Image size: 288
Batch size: 128
Train Directory: /home/ubuntu/data/imagenet/train
Validation Directory: /home/ubuntu/data/imagenet/validation
Changing LR from 0.05639999999999999 to 0.037599999999999995

@yaroslavvb (Contributor Author):

OK, this last one looks more like OOM. When you change the dataset, memory requirements change, and some versions of PyTorch (0.4.1) run out of memory. If you SSH into each of the 16 machines and attach to the tmux session, you'll probably find one crashed with OOM. The rest of the workers will hang forever.

The version of PyTorch baked into the AMI (built from master a couple of weeks ago) shouldn't run out of memory.

@yaroslavvb (Contributor Author):

I've actually hit this failure at this exact epoch quite frequently before upgrading PyTorch.

@deepakn94 (Contributor) commented Sep 19, 2018 via email

@yaroslavvb (Contributor Author):

That's the correct version. Can you try the 8-machine version and see if you still have hangs? That one should only be a minute slower.

@deepakn94 (Contributor):

I haven't run into any issues with the 8-machine version.

We need to run the 8-machine and 16-machine versions for some experiments we're running internally. I guess I can try building my own AMI that uses PyTorch built from current master?

@deepakn94 (Contributor):

Also, do the hangs happen non-deterministically?

@deepakn94 (Contributor):

The other time hangs usually happen is here, during initialization (this is on a 4-machine run; one of the machines doesn't successfully initialize):

2018-09-18 20:09:29.404726 2.imagenet: downloading /tmp/ncluster/2.imagenet.initialized
2018-09-18 20:09:29.428946 3.imagenet: Checking for initialization status
2018-09-18 20:09:29.429114 3.imagenet: downloading /tmp/ncluster/3.imagenet.initialized
2018-09-18 20:09:29.431599 0.imagenet: Checking for initialization status
2018-09-18 20:09:29.431674 0.imagenet: downloading /tmp/ncluster/0.imagenet.initialized
2018-09-18 20:09:31.742750 2.imagenet: Initialize complete
2018-09-18 20:09:31.743137 2.imagenet: To connect to 2.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a
2018-09-18 20:09:31.744758 3.imagenet: Initialize complete
2018-09-18 20:09:31.744800 3.imagenet: To connect to 3.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a
2018-09-18 20:09:31.749249 0.imagenet: Initialize complete
2018-09-18 20:09:31.749306 0.imagenet: To connect to 0.imagenet
ssh -i /Users/deepakn94/.ncluster/ncluster-deepakn94-491037173944-us-east-1.pem -o StrictHostKeyChecking=no [email protected]
tmux a

@yaroslavvb (Contributor Author) commented Sep 19, 2018 via email

@deepakn94 (Contributor):

Understood about the initial hang.

I've tried the 16-machine run twice, and it's failed both times after epoch 37. I verified the second time that it was because of an OOM error. I will try to run these experiments with a batch size of 96 in the last phase then -- this would presumably change the convergence properties of the model slightly?

@yaroslavvb (Contributor Author) commented Sep 19, 2018 via email
