Fix gptq params #1284

Merged · 3 commits · Aug 22, 2023
docs/source/llm_quantization/usage_guides/quantization.mdx (2 additions & 1 deletion)

@@ -2,13 +2,14 @@

## AutoGPTQ Integration

- 🤗 Optimum collaborated with the [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits. This comes without a big drop in performance and with faster inference speed. This is supported by most GPU hardware.
+ 🤗 Optimum collaborated with the [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits. This comes without a big drop in performance and with faster inference speed. This is supported by most GPU hardware.

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).

To learn more about the quantization technique used in GPTQ, please refer to:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend

Note that the AutoGPTQ library provides more advanced features (triton backend, fused attention, fused MLP) that are not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.
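
For a quick illustration of the API described above, here is a minimal sketch; the model checkpoint and calibration dataset are arbitrary choices for the example, not requirements of the library.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.gptq import GPTQQuantizer

# Arbitrary example checkpoint; any causal language model should work similarly.
model_name = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Quantize to 4 bits using the "c4" calibration dataset
# (see the quantizer signature below for the remaining parameters and their defaults).
quantizer = GPTQQuantizer(bits=4, dataset="c4")
quantized_model = quantizer.quantize_model(model, tokenizer)
```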

### Requirements
optimum/gptq/quantizer.py (7 additions & 7 deletions)

@@ -58,8 +58,8 @@ def __init__(
         bits: int,
         dataset: Optional[Union[List[str], str]] = None,
         group_size: int = 128,
-        damp_percent: float = 0.01,
-        desc_act: bool = True,
+        damp_percent: float = 0.1,
+        desc_act: bool = False,
         sym: bool = True,
         true_sequential: bool = True,
         use_cuda_fp16: bool = False,
@@ -81,9 +81,9 @@ def __init__(
                 in GPTQ paper ['wikitext2','c4','c4-new','ptb','ptb-new'].
             group_size (int, defaults to 128):
                 The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
-            damp_percent (`float`, defaults to `0.01`):
-                The percent of the average Hessian diagonal to use for dampening, recommended value is 0.01.
-            desc_act (`bool`, defaults to `True`):
+            damp_percent (`float`, defaults to `0.1`):
+                The percent of the average Hessian diagonal to use for dampening, recommended value is 0.1.
+            desc_act (`bool`, defaults to `False`):
                 Whether to quantize columns in order of decreasing activation size.
                 Setting it to False can significantly speed up inference but the perplexity may become slightly worse.
                 Also known as act-order.
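
A short sketch of how these two parameters are typically passed; the explicit values in the second call simply mirror the previous defaults and are shown for illustration only.

```python
from optimum.gptq import GPTQQuantizer

# Rely on the new defaults: damp_percent=0.1 and desc_act=False.
quantizer = GPTQQuantizer(bits=4, dataset="c4")

# The earlier behaviour can still be requested explicitly.
act_order_quantizer = GPTQQuantizer(
    bits=4,
    dataset="c4",
    damp_percent=0.01,  # more conservative Hessian dampening
    desc_act=True,      # act-order: quantize columns by decreasing activation size
)
```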
@@ -124,8 +124,8 @@ def __init__(
         self.pad_token_id = pad_token_id
         self.disable_exllama = disable_exllama

-        if self.bits not in [2, 4, 6, 8]:
-            raise ValueError("only support quantize to [2,4,6,8] bits.")
+        if self.bits not in [2, 3, 4, 8]:
+            raise ValueError("only support quantize to [2,3,4,8] bits.")
         if self.group_size != -1 and self.group_size <= 0:
             raise ValueError("group_size must be greater than 0 or equal to -1")
         if not (0 < self.damp_percent < 1):
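
A minimal sketch of what the relaxed check means in practice, assuming the quantizer is constructed directly:

```python
from optimum.gptq import GPTQQuantizer

# 3-bit quantization now passes the constructor's validation.
three_bit = GPTQQuantizer(bits=3, dataset="c4")

# An unsupported width is still rejected at construction time.
try:
    GPTQQuantizer(bits=6, dataset="c4")
except ValueError as err:
    print(err)  # only support quantize to [2,3,4,8] bits.
```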