Fix gptq params (#1284)
* fix bits

* space

* fix damp
SunMarc authored Aug 22, 2023
1 parent f600bc6 commit d99a418
Showing 2 changed files with 9 additions and 8 deletions.
3 changes: 2 additions & 1 deletion docs/source/llm_quantization/usage_guides/quantization.mdx
@@ -2,13 +2,14 @@

## AutoGPTQ Integration

- 🤗 Optimum collaborated with [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits. This comes without a big drop of performance and with faster inference speed. This is supported by most GPU hardware.
+ 🤗 Optimum collaborated with [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 4, 3 or even 2 bits. This comes without a big drop of performance and with faster inference speed. This is supported by most GPU hardware.

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).

To learn more about the quantization technique used in GPTQ, please refer to:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend

Note that the AutoGPTQ library provides more advanced usage (triton backend, fused attention, fused MLP) that is not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.

### Requirements
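To make the documented workflow concrete, here is a minimal sketch of quantizing a model through Optimum's `GPTQQuantizer` (the model id and calibration dataset are illustrative choices, not part of this commit):

```python
# Minimal sketch; the model id and dataset choice are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "facebook/opt-125m"  # hypothetical small model used for demonstration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# 4-bit quantization with the "c4" calibration set; 8, 3 and 2 bits are the
# other settings accepted after this fix (6-bit is not supported).
quantizer = GPTQQuantizer(bits=4, dataset="c4", group_size=128)
quantized_model = quantizer.quantize_model(model, tokenizer)
```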
14 changes: 7 additions & 7 deletions optimum/gptq/quantizer.py
@@ -58,8 +58,8 @@ def __init__(
bits: int,
dataset: Optional[Union[List[str], str]] = None,
group_size: int = 128,
- damp_percent: float = 0.01,
- desc_act: bool = True,
+ damp_percent: float = 0.1,
+ desc_act: bool = False,
sym: bool = True,
true_sequential: bool = True,
use_cuda_fp16: bool = False,
@@ -81,9 +81,9 @@
in GPTQ paper ['wikitext2','c4','c4-new','ptb','ptb-new'].
group_size (int, defaults to 128):
The group size to use for quantization. Recommended value is 128 and -1 uses per-column quantization.
- damp_percent (`float`, defaults to `0.01`):
-     The percent of the average Hessian diagonal to use for dampening, recommended value is 0.01.
- desc_act (`bool`, defaults to `True`):
+ damp_percent (`float`, defaults to `0.1`):
+     The percent of the average Hessian diagonal to use for dampening, recommended value is 0.1.
+ desc_act (`bool`, defaults to `False`):
Whether to quantize columns in order of decreasing activation size.
Setting it to False can significantly speed up inference but the perplexity may become slightly worse.
Also known as act-order.
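
For illustration, with the corrected defaults a plain construction now uses `damp_percent=0.1` and `desc_act=False`, and both can still be overridden explicitly (a minimal sketch; the values passed are only examples):

```python
from optimum.gptq import GPTQQuantizer

# Uses the corrected defaults: damp_percent=0.1, desc_act=False.
default_quantizer = GPTQQuantizer(bits=4)

# Explicit overrides remain possible, e.g. re-enabling act-order (desc_act=True),
# which may improve perplexity at the cost of slower inference.
act_order_quantizer = GPTQQuantizer(bits=4, damp_percent=0.05, desc_act=True)
```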
@@ -124,8 +124,8 @@ def __init__(
self.pad_token_id = pad_token_id
self.disable_exllama = disable_exllama

- if self.bits not in [2, 4, 6, 8]:
-     raise ValueError("only support quantize to [2,4,6,8] bits.")
+ if self.bits not in [2, 3, 4, 8]:
+     raise ValueError("only support quantize to [2,3,4,8] bits.")
if self.group_size != -1 and self.group_size <= 0:
raise ValueError("group_size must be greater than 0 or equal to -1")
if not (0 < self.damp_percent < 1):
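A quick check of the corrected validation (a minimal sketch; the printed message assumes the error text above):

```python
from optimum.gptq import GPTQQuantizer

# 3-bit is now accepted, while 6-bit is rejected with a ValueError.
GPTQQuantizer(bits=3)

try:
    GPTQQuantizer(bits=6)
except ValueError as err:
    print(err)  # only support quantize to [2,3,4,8] bits.
```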
