Transformer implementation seemingly not corresponding to paper #46
Comments
I believe
ought to be
So a rather minor thing, but just FYI
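For concreteness, the change being proposed presumably looks something like the following sketch of an Annotated-Transformer-style sublayer wrapper. The class and method names here are illustrative assumptions, not the repo's actual code:

```python
import torch.nn as nn


class SublayerConnection(nn.Module):
    """Residual wrapper around a sublayer (attention or feed-forward).

    Illustrative sketch only: names and structure are assumed,
    not copied from allRank's transformer.py.
    """

    def __init__(self, size: int, dropout: float):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # Placement the issue points at: LayerNorm applied to the
        # sublayer's input ("pre-norm").
        return x + self.dropout(sublayer(self.norm(x)))

    def forward_as_written_in_paper(self, x, sublayer):
        # Placement as described in the paper: LayerNorm applied to the
        # output of the residual connection ("post-norm").
        return self.norm(x + self.dropout(sublayer(x)))
```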
Hi, thank you for the issue and the PR. You're right that there is a discrepancy between our implementation and the paper in terms of the placement of the layer norm, and your proposed fix would indeed resolve that discrepancy. However, having consulted with the rest of the team working on the paper, we decided it's best to leave the code as is.

The reason is that even though in the paper we have written that the layer norm is applied as the very last layer, on the output of the residual connection, the results reported in the paper were obtained with the implementation as in this repo, i.e. with layer norm applied to the output of the FFN, before dropout and the residual connection. Following the steps from our reproducibility guide https://github.com/allegro/allRank/tree/master/reproducibility one can easily obtain the exact same results as reported in the paper. If we were to change the order of the layer norm, even though we'd make the implementation follow what's written in the paper, we would break the reproducibility of the results. Hence we decided against it.

Once again, thanks for pointing out the issue, and I hope you understand our rationale. Best,
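To make the two orderings described in this reply concrete, here is a small sketch; the variable names, tensor shapes, and dropout rate are made up for illustration, and this paraphrases the comment rather than quoting the repo's code:

```python
import torch
import torch.nn as nn

d_model = 8
x = torch.randn(2, 5, d_model)  # (batch, slate_length, d_model) -- shapes are illustrative
ffn = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(), nn.Linear(32, d_model))
norm = nn.LayerNorm(d_model)
drop = nn.Dropout(0.1)

# Ordering that produced the reported results, as described in the reply above:
# LayerNorm on the FFN output, then dropout, then the residual sum.
out_as_implemented = x + drop(norm(ffn(x)))

# Ordering as written in the paper: LayerNorm applied last,
# to the output of the residual connection.
out_as_in_paper = norm(x + drop(ffn(x)))
```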
That is perfectly reasonable and understandable. Thank you for taking the time to look into it and for getting back to me on this. Best,
Hi there,
First off - thank you for open sourcing this very interesting work(!).
I'm having a look at this repo along with the paper "Context-Aware Learning to Rank with Self-Attention", and it seems there is a bug, as the code doesn't seem to correspond exactly to what's written in the paper:
https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs to the MultiHeadAttention and Fully-Connected layers.
However, in the paper (and, from what I gather, in general practice for Transformers), you've written that LayerNorm is applied to the outputs of these layers.
It seems that this repo isn't actively worked on, but nonetheless I thought I'd let you know.
Thanks & best wishes,
Patrick