Imagenet in 18 minutes entries for 4 and 8 machines #54

codyaustun merged 5 commits into stanford-futuredata:master on Sep 16, 2018

bearpelican
Contributor

This uses a similar schedule and training tricks to PR #53.

@codyaustun
Contributor

@bearpelican please fix the failed checks. You need to change the column headers in the TSV files from top1 and top5 to top1Accuracy and top5Accuracy respectively.
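
The header rename requested above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample columns `epoch` and `hours` are assumptions, and only the `top1`/`top5` rename comes from the comment.

```python
import csv
import io

# Rename map taken from the review comment; any other column passes through unchanged.
HEADER_RENAMES = {"top1": "top1Accuracy", "top5": "top5Accuracy"}


def rename_tsv_headers(tsv_text: str) -> str:
    """Return tsv_text with its header row renamed; data rows are left untouched."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return tsv_text
    rows[0] = [HEADER_RENAMES.get(name, name) for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative input only; the real TSV files are not shown in this thread.
sample = "epoch\thours\ttop1\ttop5\n1\t0.01\t10.0\t25.0\n"
print(rename_tsv_headers(sample).splitlines()[0])
```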

codyaustun self-assigned this on Sep 9, 2018

"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
"model": "Resnet 50",
"hardware": "32 * V100 (4 machines - AWS p3.16xlarge)",
"costPerHour": 24.48,
Contributor

I think this is the wrong value for costPerHour. You used 4 p3.16xlarge machines, so shouldn't the cost be 97.92 (4x24.48)?

"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
"model": "Resnet 50",
"hardware": "64 * V100 (8 machines - AWS p3.16xlarge)",
"costPerHour": 24.48,
Contributor

I think this is the wrong value for costPerHour. You used 8 p3.16xlarge machines, so shouldn't the cost be 195.84 (8x24.48)?
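
The arithmetic behind both corrections is the per-machine rate times the machine count. A minimal sketch, assuming 24.48 is the hourly rate for one p3.16xlarge as the comments imply (the function name is hypothetical):

```python
from decimal import Decimal


def cluster_cost_per_hour(machines: int, rate_per_machine: str) -> Decimal:
    """Total hourly cost for a cluster of identical machines.

    Decimal avoids binary floating-point noise in the published figure.
    """
    return Decimal(rate_per_machine) * machines


print(cluster_cost_per_hour(4, "24.48"))  # 97.92
print(cluster_cost_per_hour(8, "24.48"))  # 195.84
```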

Contributor Author

Oops, good catch!

"version": "v1.0",
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
Contributor

Please add the PyTorch version number

"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
Contributor

Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.

"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
Contributor

Please include a commit hash. You can either update the codeURL field or add a separate commitHash field.
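
One way to satisfy the first option, updating codeURL, is to swap the branch name in the blob URL for a commit hash so the link keeps pointing at the same revision. A rough sketch: the helper name and the hash `0123abc` are placeholders, not values from this PR, and the pattern assumes a branch name without slashes.

```python
import re

# Matches https://github.com/<owner>/<repo>/blob/<ref>/<path>
# Assumption: <ref> contains no "/" (true for "master", not for every branch name).
_BLOB_URL = re.compile(r"^(https://github\.com/[^/]+/[^/]+/blob/)([^/]+)(/.+)$")


def pin_blob_url(url: str, commit_hash: str) -> str:
    """Replace the ref segment of a GitHub blob URL with a commit hash."""
    match = _BLOB_URL.match(url)
    if match is None:
        raise ValueError(f"not a GitHub blob URL: {url}")
    prefix, _ref, path = match.groups()
    return f"{prefix}{commit_hash}{path}"


url = "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py"
print(pin_blob_url(url, "0123abc"))  # "0123abc" is a placeholder hash
```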

"version": "v1.0",
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
Contributor

Please add the PyTorch version number

@codyaustun
Contributor

Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md:

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to
dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.
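
The naming rule quoted from the README can be expressed as a small helper. This is a sketch only; the function name is hypothetical and the example reuses the README's own sample rather than this PR's actual filenames.

```python
def submission_filenames(author: str, model: str, hardware_tag: str, framework: str):
    """Build the JSON and TSV filenames as [author]_[model]_[hardware tag]_[framework]."""
    parts = [author, model, hardware_tag, framework]
    if not all(parts):
        raise ValueError("all four name components are required")
    base = "_".join(parts)
    return f"{base}.json", f"{base}.tsv"


# Reproduces the example given in the README excerpt above.
print(submission_filenames("dawn", "resnet56", "1k80-gc", "tensorflow"))
```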

@bearpelican
Contributor Author

07ec7d0 should have all the updates!

Contributor

codyaustun left a comment

Thanks for addressing all of the feedback so far. I just want to make sure we have all the details for the learning rate schedule.

{"learning_rate": 1.75, "epochs": 3},
{"learning_rate": [1.75,0.175], "epochs": 7, "linear": true},
{"learning_rate": [0.175,0.0175], "epochs": 4, "linear": true},
{"learning_rate": [0.01,0.001], "epochs": 2, "linear": true}
Contributor

The epochs don't add up to the same number here and don't match the TSV. Is there something missing? schedule has 29 epochs, imageSize has 30, batchSize has 30, and the TSV has 30 lines.
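
A check like the one described in this comment could be automated roughly as follows. The function names are hypothetical, and the totals passed to `find_mismatches` are the ones the reviewer reports, not values recomputed from the files.

```python
from collections import Counter


def total_epochs(phases):
    """Sum the "epochs" field over a list of schedule phases."""
    return sum(phase["epochs"] for phase in phases)


def find_mismatches(totals):
    """Return the entries whose epoch total differs from the most common total."""
    expected, _count = Counter(totals.values()).most_common(1)[0]
    return {name: n for name, n in totals.items() if n != expected}


# Only the four entries quoted in the review; the reviewer's total of 29
# presumably covers the full schedule in the JSON file, which is not shown here.
quoted_excerpt = [
    {"learning_rate": 1.75, "epochs": 3},
    {"learning_rate": [1.75, 0.175], "epochs": 7, "linear": True},
    {"learning_rate": [0.175, 0.0175], "epochs": 4, "linear": True},
    {"learning_rate": [0.01, 0.001], "epochs": 2, "linear": True},
]
print(total_epochs(quoted_excerpt))  # 16

# Totals as reported in the comment above.
print(find_mismatches({"schedule": 29, "imageSize": 30, "batchSize": 30, "tsv": 30}))
# {'schedule': 29}
```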

{"learning_rate": 1.92, "epochs": 3},
{"learning_rate": [1.92,0.336], "epochs": 6, "linear": true},
{"learning_rate": [0.336,0.0336], "epochs": 6, "linear": true},
{"learning_rate": [0.0192,0.00192], "epochs": 6, "linear": true}
Contributor

batchSize, imageSize, and schedule all match with 35 epochs, but there are 36 rows in the TSV. What is the reason for the difference?
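
For readers unfamiliar with the schedule format, here is one plausible reading of an entry with "linear": true: the learning rate moves in equal steps from the first value to the second over the phase. This is an assumption, not the training script's confirmed behavior; the actual code may interpolate per iteration rather than per epoch, and the function name is hypothetical.

```python
def phase_learning_rates(phase):
    """Expand one schedule phase into a per-epoch list of learning rates.

    Assumption: a scalar learning_rate is held constant, and a [start, end]
    pair is interpolated linearly with both endpoints included.
    """
    lr = phase["learning_rate"]
    epochs = phase["epochs"]
    if isinstance(lr, (int, float)):
        return [float(lr)] * epochs
    start, end = lr
    if epochs == 1:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + step * i for i in range(epochs)]


# One of the entries quoted above.
rates = phase_learning_rates({"learning_rate": [1.92, 0.336], "epochs": 6, "linear": True})
print(len(rates), rates[0])
```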

@bearpelican
Contributor Author

The 4-machine entry had an off-by-one error, fixed in 52481c9. Math is hard for me.

The 8-machine entry should be fixed too. I forgot that run used an older schedule, which had useless epochs at the end.

@codyaustun
Contributor

Great! Everything looks good. I'll merge everything in by tomorrow evening.

codyaustun merged commit 5f29e3c into stanford-futuredata:master on Sep 16, 2018