In Figure 1 there's a claim that the attention module is "highly efficient". This is explained by removing the need for K/V transforms.
Then, for the attention scores block, it is said: "The A block represents scaled dot product attention, a vector-vector operation."
This seems misleading, as the overall complexity of the A block is still a large N^2 matrix-matrix product. This is usually the most expensive part of the classical attention module.
Can you clarify? :D
You are correct that dot product attention requires an N by N matrix of dot products to compute the attention scores.
The claim of attention efficiency for the SHA-RNN is along the lines of Shazeer's One Write-Head Is All You Need. Since the keys and values do not require a matrix multiplication, there are substantial computational savings; only the queries require one. That's why I note the vector-vector operation.
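As a rough illustration, here is a minimal PyTorch sketch of that idea. It is not the actual sha-rnn code; the bare query projection, the shapes, and the omission of the module's gating and normalization are simplifying assumptions to show where the savings come from.

```python
import torch

def single_headed_attention(x, W_q):
    """Sketch: queries get a learned projection; keys and values reuse x directly."""
    n, d = x.shape
    q = x @ W_q                                    # the only (N, d) x (d, d) matmul
    k, v = x, x                                    # no key/value projections

    scores = (q @ k.t()) / d ** 0.5                # still an (N, N) score matrix
    causal = torch.triu(torch.ones(n, n), diagonal=1).bool()
    scores = scores.masked_fill(causal, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                             # (N, d) attended output

x = torch.randn(128, 512)                          # N=128 tokens, d=512 model dim
W_q = torch.randn(512, 512)
out = single_headed_attention(x, W_q)              # (128, 512)
```

Note that the score matrix is still (N, N): the savings are in dropping the key/value projections over the model dimension, not in the attention scores themselves.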
To reduce the N by N attention component you would indeed need to look towards other solutions (approximate attention, sparse attention, ...).
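For instance, one common form of sparse attention is strictly local (chunked) attention. The following minimal sketch is not from this repo, and the window size, chunking scheme, and lack of causal masking are assumptions; it only shows how the score tensor shrinks from (N, N) to (N/window, window, window).

```python
import torch

def chunked_local_attention(q, k, v, window=64):
    """Each position attends only within its own chunk of `window` tokens,
    so the score tensor is (N/window, window, window) instead of (N, N).
    Causal masking is omitted to keep the sketch short."""
    n, d = q.shape
    assert n % window == 0, "sketch assumes N is a multiple of the window"
    qc = q.view(n // window, window, d)
    kc = k.view(n // window, window, d)
    vc = v.view(n // window, window, d)
    scores = qc @ kc.transpose(-2, -1) / d ** 0.5   # (chunks, window, window)
    weights = torch.softmax(scores, dim=-1)
    return (weights @ vc).view(n, d)

q = k = v = torch.randn(1024, 512)
out = chunked_local_attention(q, k, v)              # O(N * window) scores, not O(N^2)
```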