Imagenet in 18 minutes entries for 4 and 8 machines #54

bearpelican · 2018-09-08T18:22:49Z

Using a similar schedule and training tricks as in PR #53.

codyaustun · 2018-09-09T03:46:03Z

@bearpelican please fix the failed checks. You need to change the column headers in the TSV files from top1 and top5 to top1Accuracy and top5Accuracy respectively.
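For reference, a minimal sketch of the header rename requested above, assuming the TSV files are tab-separated with a single header row. The function name `rename_tsv_headers` and the sample rows are hypothetical; only the four column names come from the comment.

```python
import csv
import io

# Mapping taken from the review comment: old header -> required header.
HEADER_RENAMES = {"top1": "top1Accuracy", "top5": "top5Accuracy"}


def rename_tsv_headers(tsv_text, renames=HEADER_RENAMES):
    """Return tsv_text with header-row names replaced; data rows are untouched."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    rows[0] = [renames.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical one-row sample; the "epoch" and "hours" columns are illustrative only.
sample = "epoch\thours\ttop1\ttop5\n1\t0.01\t10.0\t25.0\n"
fixed = rename_tsv_headers(sample)
```

Only the first row is rewritten, so the logged accuracy values stay exactly as recorded.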

codyaustun · 2018-09-10T23:13:22Z

ImageNet/train/dawn_4_machine_30min.json

+    "codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
+    "model": "Resnet 50",
+    "hardware": "32 * V100 (4 machines - AWS p3.16xlarge)",
+    "costPerHour": 24.48,


I think this is the wrong value for costPerHour. You used 4 p3.16xlarge machines, so shouldn't the cost be 97.92 (4x24.48)?

codyaustun · 2018-09-10T23:14:29Z

ImageNet/train/dawn_8_machine_19min.json

+    "codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
+    "model": "Resnet 50",
+    "hardware": "64 * V100 (8 machines - AWS p3.16xlarge)",
+    "costPerHour": 24.48,


I think this is the wrong value for costPerHour. You used 8 p3.16xlarge machines, so shouldn't the cost be 195.84 (8x24.48)?

bearpelican · 2018-09-11

Oops good catch!
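The arithmetic behind the two costPerHour corrections can be sketched as below, assuming costPerHour is meant to cover the whole cluster and using the 24.48 per-machine figure quoted in the submission. The helper name `cluster_cost_per_hour` is hypothetical.

```python
from decimal import Decimal

# Hourly price for one p3.16xlarge, as quoted in the submitted JSON.
P3_16XLARGE_PER_HOUR = Decimal("24.48")


def cluster_cost_per_hour(num_machines, per_machine=P3_16XLARGE_PER_HOUR):
    """Hourly cost of the whole cluster rather than of a single machine."""
    return per_machine * num_machines


cost_4 = cluster_cost_per_hour(4)  # 4 machines -> 97.92
cost_8 = cluster_cost_per_hour(8)  # 8 machines -> 195.84
```

`Decimal` is used here only so the products match the quoted figures exactly, without floating-point rounding.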

codyaustun · 2018-09-10T23:14:48Z

ImageNet/train/dawn_8_machine_19min.json

+    "version": "v1.0",
+    "author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
+    "authorEmail": "[email protected]",
+    "framework": "ncluster / pytorch",


Please add the PyTorch version number

codyaustun · 2018-09-10T23:15:28Z

ImageNet/train/dawn_8_machine_19min.json

+    "author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
+    "authorEmail": "[email protected]",
+    "framework": "ncluster / pytorch",
+    "codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",


Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.
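One way to do the first option is to swap the branch name in the codeURL for a commit hash, since GitHub blob URLs accept a commit in place of a branch. This is a sketch only: `pin_code_url` is a hypothetical helper and `0123abc` is a placeholder, not the actual commit used for the runs.

```python
def pin_code_url(code_url, commit_hash, branch="master"):
    """Replace the branch segment of a GitHub blob URL with a commit hash."""
    marker = f"/blob/{branch}/"
    if marker not in code_url:
        raise ValueError(f"expected {marker!r} in URL")
    # Replace only the first occurrence so the file path is left alone.
    return code_url.replace(marker, f"/blob/{commit_hash}/", 1)


code_url = "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py"
# Placeholder hash for illustration; substitute the commit the run actually used.
pinned = pin_code_url(code_url, "0123abc")
```

A pinned URL keeps pointing at the code that produced the result even after master moves on.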

codyaustun · 2018-09-10T23:15:38Z

ImageNet/train/dawn_4_machine_30min.json

+    "author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
+    "authorEmail": "[email protected]",
+    "framework": "ncluster / pytorch",
+    "codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",


Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.

codyaustun · 2018-09-10T23:15:53Z

ImageNet/train/dawn_4_machine_30min.json

+    "version": "v1.0",
+    "author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
+    "authorEmail": "[email protected]",
+    "framework": "ncluster / pytorch",


Please add the PyTorch version number

codyaustun · 2018-09-10T23:39:14Z

Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md:

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to
dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.
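The naming rule quoted above can be sketched as a small helper. `entry_filename` is a hypothetical name, and the sketch assumes the four fields are simply joined with underscores, as in the README example.

```python
def entry_filename(author, model, hardware_tag, framework, ext="json"):
    """Join the four README fields with underscores and append the extension."""
    if ext not in ("json", "tsv"):
        raise ValueError("extension must be 'json' or 'tsv'")
    return f"{author}_{model}_{hardware_tag}_{framework}.{ext}"


# The README's own example.
readme_example = entry_filename("dawn", "resnet56", "1k80-gc", "tensorflow")

# The hardware tag may itself contain underscores.
renamed = entry_filename("fastai", "resnet50", "p3_4_machine", "pytorch")
```

The first field is the submitting author, which is why a non-DAWN entry should not begin with `dawn`.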

bearpelican · 2018-09-11T01:02:13Z

07ec7d0 should have all the updates!

codyaustun

Thanks for addressing all of the feedback so far. I just want to make sure we have all the details for the learning rate schedule.

codyaustun · 2018-09-12T18:49:21Z

ImageNet/train/fastai_resnet50_p3_4_machine_pytorch.json

+          {"learning_rate": 1.75, "epochs": 3},
+          {"learning_rate": [1.75,0.175], "epochs": 7, "linear": true},
+          {"learning_rate": [0.175,0.0175], "epochs": 4, "linear": true},
+          {"learning_rate": [0.01,0.001], "epochs": 2, "linear": true}


The epochs don't add up to the same number here and don't match the TSV. Is there something missing? schedule has 29 epochs, imageSize has 30, batchSize has 30, and the TSV has 30 lines.

codyaustun · 2018-09-12T18:54:13Z

ImageNet/train/fastai_resnet50_p3_8_machine_pytorch.json

+          {"learning_rate": 1.92, "epochs": 3},
+          {"learning_rate": [1.92,0.336], "epochs": 6, "linear": true},
+          {"learning_rate": [0.336,0.0336], "epochs": 6, "linear": true},
+          {"learning_rate": [0.0192,0.00192], "epochs": 6, "linear": true}


batchSize, imageSize, and schedule all match with 35 epochs, but there are 36 rows in the TSV. What is the reason for the difference?
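The consistency checks in the two comments above can be sketched as follows. The function names and the toy data are hypothetical, and the sketch assumes the TSV has one header line followed by one row per epoch. The diff excerpts show only part of each schedule, and the layout of batchSize and imageSize is not shown, so neither is reproduced here.

```python
def schedule_epochs(schedule):
    """Total epochs covered by a list of phases that each carry an "epochs" count."""
    return sum(phase["epochs"] for phase in schedule)


def tsv_data_rows(tsv_text):
    """Number of non-empty rows after the header line."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def epoch_mismatch(schedule, tsv_text):
    """Return (schedule_total, tsv_rows) if they differ, else None."""
    total, rows = schedule_epochs(schedule), tsv_data_rows(tsv_text)
    return None if total == rows else (total, rows)


# Toy data, not the submitted files: 3 scheduled epochs against 4 logged rows.
toy_schedule = [
    {"learning_rate": 1.0, "epochs": 1},
    {"learning_rate": [1.0, 0.1], "epochs": 2, "linear": True},
]
toy_tsv = "epoch\ttop1Accuracy\n1\t10.0\n2\t20.0\n3\t30.0\n4\t35.0\n"
result = epoch_mismatch(toy_schedule, toy_tsv)
```

A non-`None` result flags the same kind of off-by-one gap between the schedule and the logged epochs that is being asked about here.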

bearpelican · 2018-09-12T21:57:35Z

The 4-machine entry had an off-by-one error, fixed in 52481c9. Math is hard for me.

The 8-machine entry should be fixed too. I forgot that run was from an older schedule (which had useless epochs at the end).

codyaustun · 2018-09-13T20:20:24Z

Great! Everything looks good. I'll merge everything in by tomorrow evening.

Adding 4 and 8 entries

9adda46

codyaustun self-assigned this Sep 9, 2018

codyaustun mentioned this pull request Sep 9, 2018

imagenet in 18 minutes submission #53

Merged

bearpelican added 2 commits September 8, 2018 21:11

Fixing column headers

14e27d9

Adding misc section

75cdf57

bearpelican force-pushed the master branch from be4658a to 75cdf57 Compare September 10, 2018 20:55

codyaustun requested changes Sep 10, 2018

bearpelican force-pushed the master branch from c989e07 to 43315f4 Compare September 10, 2018 23:39

Including pytorch version and fixing cost

07ec7d0

bearpelican force-pushed the master branch from 43315f4 to 07ec7d0 Compare September 10, 2018 23:42

codyaustun requested changes Sep 12, 2018

Fixing off by one errors in training schedule

52481c9

codyaustun approved these changes Sep 13, 2018

codyaustun merged commit 5f29e3c into stanford-futuredata:master Sep 16, 2018
