GPTQ models support #31

Open
synacktraa opened this issue Nov 20, 2023 · 5 comments

Comments

@synacktraa

Can it handle GPTQ models like the transformers library's AutoModelForCausalLM does?
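
For context, a minimal sketch of the usage being asked about, assuming attention_sinks exposes a drop-in `AutoModelForCausalLM` like transformers does and that `optimum` and `auto-gptq` are installed so the quantized checkpoint loads as usual. The checkpoint name and the `attention_sink_*` kwargs are illustrative, not confirmed in this thread:

```python
# Sketch only: assumes attention_sinks mirrors the transformers auto-class API
# and that optimum + auto-gptq are installed for GPTQ checkpoints.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",       # example GPTQ checkpoint, for illustration
    device_map="auto",
    attention_sink_size=4,            # assumed kwargs for the sliding attention-sink cache
    attention_sink_window_size=1020,
)
```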

@synacktraa
Author

It's working without any problems, but why is the generation speed slow compared to non-quantized models?

@tomaarsen
Owner

Hello!

There shouldn't be any major changes in generation, but attention_sinks doesn't support flash attention in any of its models right now. Perhaps that's the difference in generation speed that you're experiencing?

  • Tom Aarsen
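
To illustrate the comparison: in plain transformers, flash attention can be requested at load time, which is one speed advantage a baseline model may have over attention_sinks here. This is a sketch of the plain-transformers baseline, not of attention_sinks; the exact flag depends on the installed transformers version, and the model name is illustrative:

```python
# Sketch of the plain-transformers baseline with flash attention enabled.
# attention_sinks itself does not support flash attention, per the comment above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                # illustrative checkpoint
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",   # transformers >= 4.36; older releases used use_flash_attention_2=True
)
```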

@synacktraa
Author

Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.

@Minami-su

It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

@synacktraa
Author

> It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Thank you 🙏
