The paper indeed shows 4 forms of attention:
a) Dense attention.
b) Window attention.
c) Window attention with re-calculations.
d) Window attention with sink tokens.
And I only benchmark 3 of these: a (transformers), b (windowed) and d (attention_sinks).
The only missing one is c: window attention with re-calculation. To be clear, that is not the variant that differs from attention sinks only by the 4 sink keys; that variant is regular window attention (b).
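For reference, the difference between b and d really does come down to which entries of the KV cache are kept. A minimal sketch, assuming a toy cache that is just a list of (key, value) pairs per position; the function names are illustrative and not the actual transformers / attention_sinks API:

```python
# Toy cache policies, assuming `kv_cache` is a list of (key, value) tensors,
# one entry per cached position. Illustrative only.

def window_cache(kv_cache, window_size):
    # (b) Plain window attention: keep only the most recent `window_size` entries.
    return kv_cache[-window_size:]

def sink_cache(kv_cache, window_size, num_sinks=4):
    # (d) Attention sinks: keep the first `num_sinks` entries (the sink tokens)
    # plus the most recent `window_size - num_sinks` entries.
    if len(kv_cache) <= window_size:
        return kv_cache
    return kv_cache[:num_sinks] + kv_cache[-(window_size - num_sinks):]
```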
Window attention with re-calculation seems more involved to implement, though I'm not actually very familiar with it. That's one of the two reasons it's not included in my experiments.
The other reason is that, in my opinion, it's a less interesting comparison than dense and regular window attention: it should perform on par with attention sinks in quality, just a lot slower.
I've been a bit busy this week with work, so I haven't been able to research how to implement this attention approach properly.
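My rough mental model of it, though, is something like the sketch below: every generation step re-encodes the most recent window of tokens from scratch, with no KV cache carried over, which is exactly why it's so much slower. This is a hedged sketch assuming a Hugging Face causal LM with greedy decoding; `model`, `input_ids`, `window_size` and `max_new_tokens` are placeholders, not benchmark code from this repo:

```python
import torch

@torch.no_grad()
def generate_with_recompute(model, input_ids, window_size, max_new_tokens):
    # (c) Window attention with re-computation (rough sketch).
    for _ in range(max_new_tokens):
        # Re-encode the last `window_size` tokens from scratch at every step,
        # discarding any KV cache. Cost is roughly O(T * window_size^2) overall,
        # versus O(T * window_size) when a cache is reused.
        window = input_ids[:, -window_size:]
        logits = model(window, use_cache=False).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```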
As the title suggests, we are missing benchmarks on SWA (sliding window attention with re-computation). From my understanding, the only difference is the 4 sink tokens that we keep.