GPTQ models support #31

Open
synacktraa opened this issue Nov 20, 2023 · 5 comments

Comments

@synacktraa

Can it handle GPTQ models like the transformers library's AutoModelForCausalLM does?
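
For context, a minimal sketch of the usage being asked about, assuming attention_sinks exposes a drop-in `AutoModelForCausalLM` like transformers does and that `optimum` and `auto-gptq` are installed so the quantized checkpoint loads as usual. The checkpoint name and the `attention_sink_*` kwargs are illustrative, not confirmed in this thread:

```python
# Sketch only: assumes attention_sinks mirrors the transformers auto-class API
# and that optimum + auto-gptq are installed for GPTQ checkpoints.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",       # example GPTQ checkpoint, for illustration
    device_map="auto",
    attention_sink_size=4,            # assumed kwargs for the sliding attention-sink cache
    attention_sink_window_size=1020,
)
```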

@synacktraa
Author

It's working without any problems, but why is the generation speed slow compared to non-quantized models?

@tomaarsen
Owner

Hello!

There shouldn't be any major changes in generation, but attention_sinks doesn't support flash attention in any of its models right now. Perhaps that's the difference in generation speed that you're experiencing?

  • Tom Aarsen
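
To illustrate the comparison: in plain transformers, flash attention can be requested at load time, which is one speed advantage a baseline model may have over attention_sinks here. This is a sketch of the plain-transformers baseline, not of attention_sinks; the exact flag depends on the installed transformers version, and the model name is illustrative:

```python
# Sketch of the plain-transformers baseline with flash attention enabled.
# attention_sinks itself does not support flash attention, per the comment above.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                # illustrative checkpoint
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",   # transformers >= 4.36; older releases used use_flash_attention_2=True
)
```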

@synacktraa
Author

Thanks for the fast response. Do you plan to work on it someday? I can implement it if you can explain flash attention a little bit.

@Minami-su

It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

@synacktraa
Author

> It's available at this branch: https://github.com/Minami-su/attention_sinks_autogptq @synacktraa

Thank you 🙏
