Imagenet in 18 minutes entries for 4 and 8 machines #54
@bearpelican please fix the failed checks. You need to change the column headers in the TSV files from
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
"model": "Resnet 50",
"hardware": "32 * V100 (4 machines - AWS p3.16xlarge)",
"costPerHour": 24.48,
I think this is the wrong value for `costPerHour`. You used 4 p3.16xlarge machines, so shouldn't the cost be 97.92 (4x24.48)?
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
"model": "Resnet 50",
"hardware": "64 * V100 (8 machines - AWS p3.16xlarge)",
"costPerHour": 24.48,
I think this is the wrong value for `costPerHour`. You used 8 p3.16xlarge machines, so shouldn't the cost be 195.84 (8x24.48)?
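The arithmetic behind both `costPerHour` comments can be sketched as below. This is only an illustration: the function name `cluster_cost_per_hour` is made up for this example, and the per-machine price is simply the 24.48 figure quoted in the entries, not a value checked against current AWS pricing.

```python
# Hourly price of one p3.16xlarge, taken from the entries under review
# (an assumption carried over from the PR, not verified against AWS).
PRICE_PER_MACHINE_USD = 24.48


def cluster_cost_per_hour(num_machines: int,
                          price_per_machine: float = PRICE_PER_MACHINE_USD) -> float:
    """Total hourly cost for a cluster of identical machines, rounded to cents."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    return round(num_machines * price_per_machine, 2)


if __name__ == "__main__":
    # The two cluster sizes discussed in this review.
    for n in (4, 8):
        print(f"{n} machines: {cluster_cost_per_hour(n)} USD/hour")
```

The point is that `costPerHour` describes the whole cluster, so it scales with the machine count listed in `hardware`.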
Oops, good catch!
"version": "v1.0",
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
Please add the PyTorch version number.
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
Please include a commit hash. You can either update the `codeURL` field or add a separate `commitHash` field.
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
"codeURL": "https://github.com/diux-dev/imagenet18/blob/master/training/train_imagenet_nv.py",
Please include a commit hash. You can either update the `codeURL` field or add a separate `commitHash` field.
"version": "v1.0",
"author": "Andrew Shaw, Yaroslav Bulatov, Jeremy Howard",
"authorEmail": "[email protected]",
"framework": "ncluster / pytorch",
Please add the PyTorch version number
Can you change the filenames? They shouldn't start with dawn. Here are the instructions from the README.md
07ec7d0 should have all the updates!
Thanks for addressing all of the feedback so far. I just want to make sure we have all the details for the learning rate schedule.
{"learning_rate": 1.75, "epochs": 3},
{"learning_rate": [1.75,0.175], "epochs": 7, "linear": true},
{"learning_rate": [0.175,0.0175], "epochs": 4, "linear": true},
{"learning_rate": [0.01,0.001], "epochs": 2, "linear": true}
The epochs don't add up to the same number here and don't match the TSV. Is there something missing? `schedule` has 29 epochs, `imageSize` has 30, `batchSize` has 30, and the TSV has 30 lines.
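For readers trying to follow the schedule entries quoted above, here is one possible reading of the format, sketched in Python. It assumes that a scalar `learning_rate` is held constant for the phase and that a `[start, end]` pair with `"linear": true` is interpolated linearly across the phase's epochs. That interpretation, the helper name `lr_at_epoch`, and the choice to interpolate over fractional epochs are all assumptions for illustration; the training script may step the rate per iteration or handle phase boundaries differently. The list below contains only the four lines quoted in the review, not the full schedule.

```python
# The four schedule lines quoted in the review comment (an excerpt only).
QUOTED_PHASES = [
    {"learning_rate": 1.75, "epochs": 3},
    {"learning_rate": [1.75, 0.175], "epochs": 7, "linear": True},
    {"learning_rate": [0.175, 0.0175], "epochs": 4, "linear": True},
    {"learning_rate": [0.01, 0.001], "epochs": 2, "linear": True},
]


def lr_at_epoch(schedule, epoch):
    """Learning rate at a (possibly fractional) epoch, under the assumed semantics."""
    phase_start = 0
    for phase in schedule:
        phase_end = phase_start + phase["epochs"]
        if phase_start <= epoch < phase_end:
            lr = phase["learning_rate"]
            if isinstance(lr, (list, tuple)):
                if not phase.get("linear"):
                    raise ValueError("only linear ramps are handled in this sketch")
                lr_start, lr_end = lr
                # Fraction of the phase completed so far, in [0, 1).
                progress = (epoch - phase_start) / phase["epochs"]
                return lr_start + (lr_end - lr_start) * progress
            return lr  # constant phase
        phase_start = phase_end
    raise ValueError(f"epoch {epoch} is past the end of the schedule")


if __name__ == "__main__":
    for e in (0, 3, 6.5, 10, 14):
        print(e, lr_at_epoch(QUOTED_PHASES, e))
```

Walking the phases this way also makes the total epoch count explicit, which is the number being compared against `imageSize`, `batchSize`, and the TSV.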
{"learning_rate": 1.92, "epochs": 3},
{"learning_rate": [1.92,0.336], "epochs": 6, "linear": true},
{"learning_rate": [0.336,0.0336], "epochs": 6, "linear": true},
{"learning_rate": [0.0192,0.00192], "epochs": 6, "linear": true}
`batchSize`, `imageSize`, and `schedule` all match with 35 epochs, but there are 36 rows in the TSV. What is the reason for the difference?
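A consistency check along these lines could catch such mismatches before review. The sketch below is hypothetical: `total_schedule_epochs` and `find_epoch_mismatch` are names invented here, and how the per-field epoch counts are extracted from the JSON entry and the TSV is left out because the review does not show those formats in full. The counts in the usage section are the ones reported in the two comments above.

```python
from typing import Dict, List


def total_schedule_epochs(schedule: List[dict]) -> int:
    """Sum the "epochs" value over every phase of a schedule."""
    return sum(phase["epochs"] for phase in schedule)


def find_epoch_mismatch(counts: Dict[str, int]) -> Dict[str, int]:
    """Return all counts if they disagree, or an empty dict if they all match."""
    return dict(counts) if len(set(counts.values())) > 1 else {}


if __name__ == "__main__":
    # Counts as reported by the reviewer for the two entries.
    four_machine = {"schedule": 29, "imageSize": 30, "batchSize": 30, "tsv_rows": 30}
    eight_machine = {"schedule": 35, "imageSize": 35, "batchSize": 35, "tsv_rows": 36}
    for name, counts in (("4 machines", four_machine), ("8 machines", eight_machine)):
        mismatch = find_epoch_mismatch(counts)
        print(name, "->", mismatch if mismatch else "consistent")
```

Returning the full set of counts on a mismatch, rather than a boolean, keeps the report useful: the odd one out is visible at a glance.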
The 4-machine entry had an off-by-one error, fixed in 52481c9. Math is hard for me. The 8-machine entry should be fixed too: I forgot that run was from an older schedule (which had useless epochs at the end).
Great! Everything looks good. I'll merge everything in by tomorrow evening.
Using a similar schedule and training tricks as this PR: #53