Transformer implementation seemingly not corresponding to paper #46
Comments
I believe
ought to be
So a rather minor thing, but just FYI
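For concreteness, the change being proposed presumably looks something like the following sketch of an Annotated-Transformer-style sublayer wrapper. The class and method names here are illustrative assumptions, not the repo's actual code:

```python
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual wrapper around a sublayer (attention or feed-forward).

    Illustrative sketch only: names and structure are assumed,
    not copied from allRank's transformer.py.
    """

    def __init__(self, size: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Placement the issue points at: LayerNorm applied to the
        # sublayer's input ("pre-norm").
        return x + self.dropout(sublayer(self.norm(x)))

    def forward_as_written_in_paper(self, x, sublayer):
        # Placement as described in the paper: LayerNorm applied to the
        # output of the residual connection ("post-norm").
        return self.norm(x + self.dropout(sublayer(x)))
```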
Hi, thank you for the issue and the PR. You're right that there is a discrepancy between our implementation and the paper in terms of the placement of the layer norm, and your proposed fix would indeed resolve that discrepancy. However, having consulted with the rest of the team working on the paper, we decided it's best to leave the code as is.

The reason is that even though in the paper we have written that the layer norm is applied as the very last layer, on the output of the residual connection, the results reported in the paper were obtained with the implementation as in this repo, i.e. with layer norm applied to the output of the FFN, before dropout and the residual connection. Following the steps from our reproducibility guide https://github.com/allegro/allRank/tree/master/reproducibility one can easily obtain the exact same results as reported in the paper. If we were to change the order of the layer norm, even though we'd make the implementation follow what's written in the paper, we would break the reproducibility of the results. Hence we decided against it.

Once again, thanks for pointing out the issue, and I hope you understand our rationale. Best,
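To make the two orderings described in this reply concrete, here is a small sketch; the variable names, tensor shapes, and dropout rate are made up for illustration, and this paraphrases the comment rather than quoting the repo's code:

```python
import torch
import torch.nn as nn

d_model = 8
x = torch.randn(2, 5, d_model)  # (batch, slate_length, d_model) -- shapes are illustrative
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
norm = nn.LayerNorm(d_model)
drop = nn.Dropout(0.1)

# Ordering that produced the reported results, as described in the reply above:
# LayerNorm on the FFN output, then dropout, then the residual sum.
out_as_implemented = x + drop(norm(ffn(x)))

# Ordering as written in the paper: LayerNorm applied last,
# to the output of the residual connection.
out_as_in_paper = norm(x + drop(ffn(x)))
```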
That is perfectly reasonable and understandable. Thank you for taking the time to look into it and for getting back to me on this. Best,
Hi there,
First off - thank you for open sourcing this very interesting work(!).
I'm having a look at this repo along with the paper "Context-Aware Learning to Rank with Self-Attention", and it seems there is a bug, as the code doesn't seem to correspond exactly to what's written in the paper:
https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs to the MultiHeadAttention and Fully-Connected layers.
However, in the paper (and, from what I gather, in general practice for Transformers), you've written that LayerNorm is applied to the outputs of these layers.
It seems that this repo isn't actively worked on, but nonetheless I thought I'd let you know.
Thanks & best wishes,
Patrick