
Add benchmarks comparing against Sliding Window Attention #10

Open
casper-hansen opened this issue Oct 5, 2023 · 1 comment
@casper-hansen

As the title suggests, we are missing benchmarks against SWA (Sliding Window Attention). From my understanding, the only difference is the 4 tokens that are kept as sinks.

@tomaarsen
Owner

tomaarsen commented Oct 5, 2023

Hello!

The paper indeed shows 4 forms of attention:
a) Dense attention.
b) Window attention.
c) Window attention with re-calculations.
d) Window attention with sink tokens.

I only benchmark 3 of these: a (`transformers`), b (windowed) and d (`attention_sinks`).
The only one missing is c: window attention with re-calculation. Note that this is not the variant that only differs by the 4 sink keys; that one is regular window attention (b).
Window attention with re-calculation seems more involved to implement, and I'm not very familiar with it. That's one of the two reasons it's not included in my experiments.
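
To make the distinction concrete, here is a minimal sketch of the two cache policies as I understand them; the function and argument names are illustrative and not taken from this repository:

```python
# Illustrative only: `past_keys` stands in for the cached key (and value) states.

def window_attention_cache(past_keys, cache_size):
    """(b) Plain window attention: keep only the most recent entries."""
    return past_keys[-cache_size:]


def attention_sink_cache(past_keys, cache_size, num_sink_tokens=4):
    """(d) Window attention with sink tokens: always keep the first few
    tokens (the "sinks") plus the most recent entries, same total budget."""
    if len(past_keys) <= cache_size:
        return past_keys
    return past_keys[:num_sink_tokens] + past_keys[-(cache_size - num_sink_tokens):]
```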

The other reason is that it's (in my opinion) a less interesting comparison than the ones against dense and regular window attention, as it should perform equivalently but just be a lot slower.

I've been a bit busy this week with work, so I haven't been able to research how to implement this attention approach properly.
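
For what it's worth, my rough understanding of (c) is that the recent window is re-encoded from scratch at every generation step instead of reusing a KV cache, which is where the slowdown comes from. A sketch of that idea, with `next_token_fn` as a hypothetical stand-in for a full cache-less forward pass:

```python
def generate_with_recomputation(next_token_fn, prompt_tokens, window_size, max_new_tokens):
    """(c) Window attention with re-calculation, as I understand it.

    `next_token_fn(tokens)` is a hypothetical helper that runs a full forward
    pass over `tokens` without any cached state and returns the next token id.
    """
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Re-encode only the most recent window from scratch on every step:
        # quality stays close to dense attention inside the window, but each
        # step costs O(window_size^2) instead of O(window_size) with a cache.
        window = tokens[-window_size:]
        tokens.append(next_token_fn(window))
    return tokens
```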
