Efficiency claims on attention module used #15

Open
munael opened this issue Nov 1, 2020 · 1 comment

munael commented Nov 1, 2020

[Screenshot of Figure 1 from the paper]

In Figure 1 there is a claim that the attention module is "highly efficient". This is explained by the removal of the need for K/V transforms.
Then, for the attention scores block, it is said:

The A block represents scaled dot product attention, a vector-vector operation

This seems misleading, as the overall complexity of the A block is still a large N^2 matrix-matrix product, which is usually the highest-complexity part of the classical attention module.
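
For concreteness, a rough PyTorch sketch of that cost (illustrative shapes only, not the actual code in this repo):

```python
import torch

N, d = 1024, 512                  # sequence length, model dimension
q = torch.randn(N, d)             # queries
k = torch.randn(N, d)             # keys
v = torch.randn(N, d)             # values

# The attention score block: an N x N matrix-matrix product,
# i.e. O(N^2 * d) multiply-adds, regardless of how Q/K/V were produced.
scores = (q @ k.t()) / d ** 0.5   # shape (N, N)
attn = torch.softmax(scores, dim=-1)
out = attn @ v                    # another O(N^2 * d) product
```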

Can you clarify? :D

Smerity (Owner) commented Jan 17, 2021

Apologies for the delayed reply.

You are correct that dot product attention requires N by N dot products to compute the attention.

The claim of attention efficiency for the SHA-RNN is along the lines of Shazeer's One Write-Head is All You Need. Since the keys and values do not require a matrix multiplication, there are substantial computational savings, with only the queries requiring one. That's why I note the vector-vector operation.
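
Roughly, a sketch of where the saving comes from (illustrative only, not the exact code in this repository):

```python
import torch
import torch.nn as nn

N, d = 1024, 512
x = torch.randn(N, d)             # hidden states used as the attention memory

# Standard attention: learned projections for queries, keys and values,
# i.e. three N x d by d x d matrix multiplications per layer.
w_q = nn.Linear(d, d, bias=False)
w_k = nn.Linear(d, d, bias=False)
w_v = nn.Linear(d, d, bias=False)
q_std, k_std, v_std = w_q(x), w_k(x), w_v(x)

# Single-headed-attention style saving: only the queries go through a
# matrix multiplication; the keys and values reuse the memory directly
# (at most modulated by cheap elementwise/vector operations), so their
# per-token cost is O(d) rather than O(d^2).
q = w_q(x)
k = v = x
```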

To reduce the N by N attention component, you would indeed need to look towards other potential solutions (approximate attention, sparse attention, ...).
