Reproduced BPC of 1.077 using model with one attention layer #3
Thanks for running this from scratch and I'm glad you're already almost there on replicating it! I'll be able to help you get the rest of the way, as the mistakes are issues in the commands I've included in the README :)

The faster training time is expected: as noted in the paper, for the single headed SHA-LSTM "each epoch took almost exactly 1800 ± 1 seconds (30 minutes) compared to the 4 headed SHA-LSTM which took 4020 seconds (67 minutes)", and the code currently runs the single-headed SHA-LSTM. That maps fairly well to your 5493 batches * 0.27 seconds per batch ~= 25 minutes per epoch.

If you'd like to use the full 4 headed SHA-LSTM (which requires a batch size of 8 on the Titan V and gets a slightly better result as noted in the paper, but is twice as slow - your V100 may be able to fit a larger batch though!) you can enable attention on every layer in the model code.

The commands supplied were originally written for the 4 layer SHA-LSTM where each layer contains an attention mechanism, not the single headed SHA-LSTM. The batch size 16 model requires a few extra epochs as there are fewer training steps per epoch compared to the batch size 8 model. As such, the limit of 14 epochs for training is incorrect; it was in reference to the full 4 layer SHA-LSTM (i.e. reproducing Figure 3 of the paper), which only used 19 epochs total - 16 before dropping the learning rate.

For reproducing with the single headed SHA-LSTM, train with the first command until the validation bpc stops improving (which takes more than the 14 epochs currently listed), then continue with the second command at the lower learning rate.
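For anyone sanity-checking those per-epoch numbers, a quick back-of-the-envelope calculation lines up with both figures. It assumes enwik8's usual 90M-character training split and a BPTT window of 1024, which are not confirmed anywhere in this thread:

```python
# Back-of-the-envelope check of the per-epoch numbers above.
# Assumptions (not stated in this thread): enwik8's usual 90M-character
# training split and a BPTT window of 1024 tokens.
train_chars = 90_000_000
bptt = 1024

for batch_size in (16, 8):
    steps_per_epoch = train_chars // (batch_size * bptt)
    print(f"batch_size={batch_size:2d}: ~{steps_per_epoch} steps per epoch")
# batch_size=16: ~5493 steps, batch_size=8: ~10986 steps

# At ~0.27 seconds per step, an epoch at batch size 16 takes roughly:
seconds = 5493 * 0.27
print(f"~{seconds:.0f} s per epoch (~{seconds / 60:.0f} minutes)")
```

Since the batch size 16 run takes only about half as many optimizer steps per epoch as the batch size 8 run, it needs a few extra epochs to see the same number of updates, which is why a hard 14 epoch limit undershoots.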
If you still have the log you can pull the per-epoch validation numbers back out of it. That set of 27 epochs is about 13.5 hours of compute and gets to approximately your number. I killed training at that stage as the validation perplexity stopped improving. Then I resumed training with the second command at a lower learning rate.
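The resume step itself is plain PyTorch bookkeeping. The sketch below is a minimal, generic illustration rather than the repo's actual command: the model, optimizer, file name, and checkpoint keys are all placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer so the sketch is self-contained; in practice
# these are the SHA-LSTM and optimizer that the training script already builds.
model = nn.LSTM(input_size=1024, hidden_size=4096, num_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)

# Hypothetical checkpoint file and keys; the repo's own save format may differ.
checkpoint = torch.load("ENWIK8.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])

# Lower the learning rate before continuing, e.g. halving it once the
# validation bpc has plateaued, then resume the normal training loop.
for group in optimizer.param_groups:
    group["lr"] /= 2
```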
As noted, in the case above all the benefit comes in the first epoch; sometimes the drop is spread over a few epochs. There are likely better ways to decay the learning rate but I haven't explored them. The above model resulted in a test bpc of 1.078.

If you have spare compute and want to try it again, do so and get back to me. Otherwise I'll be repeating the experiment myself overnight (... it's 5am but I'll pretend it's overnight lol ...) and report back. Do also note whether you have a preference for the faster model or the slightly better but slower / heavier model - I might have gone the wrong direction by setting the slightly faster one as the default for the codebase. Thanks for your experiment! ^_^
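One candidate for a "better way" is to let a scheduler perform the drop automatically, e.g. PyTorch's ReduceLROnPlateau. A minimal sketch follows, with a placeholder model and optimizer and a hypothetical per-epoch evaluation helper:

```python
import torch
import torch.nn as nn

# Placeholder model and optimizer; in practice these are the SHA-LSTM and the
# optimizer the training script already builds.
model = nn.LSTM(input_size=1024, hidden_size=4096, num_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)

# Halve the learning rate whenever validation bpc stops improving, instead of
# killing the run and resuming it by hand.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=1
)

for epoch in range(32):
    val_bpc = run_epoch_and_validate()  # hypothetical helper returning val bpc
    scheduler.step(val_bpc)
```

Whether an automatic schedule matches the hand-timed drop described above, where nearly all of the gain lands in the first post-drop epoch, would need to be checked empirically.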
Thanks for the quick and detailed reply. I was foolish enough to run this in the terminal, so the logs are mostly lost to the limited terminal scrollback window :(. As I watched the first run, I'm pretty sure it plateaued over the last few epochs. Validation bpcs were lagging a bit, but fairly close to those plotted in the article. Yes, it makes sense to do the batch of 16 on the second run with just a few epochs. If I have time I'll repeat the experiment and report here.

Love the speed and high GPU utilization. Thank you for publishing this. Not everyone has 1K TPUs; this thing gives us poor guys some hope ;)
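For a repeat run, one way to avoid losing the logs to terminal scrollback is to mirror the training output to a file while still watching it live. A minimal sketch, where the script name and flags are placeholders for the actual training invocation:

```python
import subprocess
import sys

# Placeholder command; substitute the actual training invocation and its flags.
cmd = [sys.executable, "-u", "main.py"]

with open("train.log", "a") as log_file:
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # keep watching progress live in the terminal
        log_file.write(line)    # ...while also keeping a permanent copy
    proc.wait()
```

A saved log also makes it easy to pull the per-epoch validation bpc back out afterwards.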
I did this:
The latest trained model can be downloaded from here. Summary: all works as advertised!
Above is the output at the end of the second training run, as in the README.
My setup:
Trained model is here (205 MB)
Other notes: