Fix GPTQ doc
regisss committed Aug 11, 2023
1 parent 9f2943e commit 6f788c6
Showing 2 changed files with 11 additions and 11 deletions.
docs/source/concept_guides/quantization.mdx (2 changes: 1 addition & 1 deletion)
@@ -185,7 +185,7 @@ models while respecting accuracy and latency constraints.
[PyTorch quantization functions](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-quantize-fx)
to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two
mentioned above, giving more flexibility, but requiring more work on your end (see the sketch after this list).
-- The `optimum.llm_quantization` package allows you to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization)
+- The `optimum.gptq` package allows you to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) with GPTQ.
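For a sense of what "lower-level, more work on your end" means for the graph-mode path, here is a minimal sketch that calls PyTorch's FX quantization API directly (not the Optimum wrappers this bullet refers to); the toy module, backend string, and example input are placeholder assumptions:

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Placeholder model: any FX-traceable nn.Module follows the same flow.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()

# You pick the quantization config and supply example inputs yourself.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)  # insert observers
with torch.inference_mode():
    prepared(*example_inputs)                                  # calibration pass
quantized = convert_fx(prepared)                               # materialize int8 ops
```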

## Going further: How do machines represent numbers?

docs/source/llm_quantization/usage_guides/quantization.mdx (20 changes: 10 additions & 10 deletions)
@@ -4,32 +4,32 @@

🤗 Optimum has collaborated with the [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization to language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits, with little drop in performance and faster inference speed. This is supported by most GPU hardware.

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).

To learn more about the quantization technique used in GPTQ, please refer to:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
Note that the AutoGPTQ library provides more advanced features (Triton backend, fused attention, fused MLP) that are not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.

### Requirements

You need to have the following requirements installed to run the code below:

- AutoGPTQ library:
`pip install auto-gptq`

- Optimum library:
`pip install --upgrade optimum`

- Install the latest `transformers` library from source:
`pip install --upgrade git+https://github.com/huggingface/transformers.git`

- Install the latest `accelerate` library:
`pip install --upgrade accelerate`

### Load and quantize a model

-The [`~optimum.gptq.GPTQQuantizer`] class is used to quantize your model. To do so, you need to provide a few arguments:
+The [`~gptq.GPTQQuantizer`] class is used to quantize your model. To do so, you need to provide a few arguments (a minimal sketch follows the list below):
- the number of bits: `bits`
- the dataset used to calibrate the quantization: `dataset`
- the model sequence length used to process the dataset: `model_seqlen`
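The full quantization example is collapsed in this diff, so the following is only a minimal sketch of how these arguments might be passed; the model name and calibration dataset (`facebook/opt-125m`, `c4`) are placeholders rather than values taken from this commit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_name = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# bits, dataset and model_seqlen are the arguments described above.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)
```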
@@ -55,15 +55,15 @@ GPTQ quantization only works for text model for now. Futhermore, the quantizatio

### Save the model

-To save your model, use the save method from the [`~optimum.gptq.GPTQQuantizer`] class. It will create a folder with your model state dict along with the quantization config.
+To save your model, use the save method from the [`~gptq.GPTQQuantizer`] class. It will create a folder with your model state dict along with the quantization config.
```python
save_folder = "/path/to/save_folder/"
quantizer.save(model, save_folder)
```

### Load quantized weights

-You can load your quantized weights by using the [`~optimum.gptq.load_quantized_model`] function.
+You can load your quantized weights by using the [`~gptq.load_quantized_model`] function.
Through the Accelerate library, it is possible to load a model faster and with lower memory usage. The model needs to be initialized with empty weights, and the real weights are loaded in a second step.
```python
from accelerate import init_empty_weights
@@ -75,7 +75,7 @@
quantized_model = load_quantized_model(empty_model, save_folder=save_folder, dev
```
### Exllama kernels for faster inference

-For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. If you want to change this behavior, pass `disable_exllama` to [`~optimum.gptq.load_quantized_model`]. In order to use these kernels, you need to have the entire model on GPUs.
+For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. If you want to change this behavior, pass `disable_exllama` to [`~gptq.load_quantized_model`]. In order to use these kernels, you need to have the entire model on GPUs.

```py
from optimum.gptq import GPTQQuantizer, load_quantized_model
```
@@ -90,9 +90,9 @@
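The body of this example is collapsed above; as a rough sketch, and assuming the same `save_folder` and empty-model setup as in the previous section, disabling the kernels only changes the final call:

```py
# Sketch, not the exact collapsed example: reuse `empty_model` and `save_folder`
# from the loading section and pass the flag that turns the exllama kernels off.
quantized_model = load_quantized_model(
    empty_model, save_folder=save_folder, device_map="auto", disable_exllama=True
)
```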

Note that only 4-bit models are supported with exllama kernels for now. Furthermore, it is recommended to disable the exllama kernel when you are finetuning your model with peft.

#### Fine-tune a quantized model

With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
Please have a look at the [`peft`](https://github.com/huggingface/peft) library for more details.
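As an illustrative sketch only (the LoRA hyperparameters below are placeholders and are not taken from the Optimum docs), attaching an adapter to the quantized model could look like this:

```py
from peft import LoraConfig, get_peft_model

# Placeholder LoRA settings; choose target modules that match your architecture.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(quantized_model, lora_config)  # quantized_model from above
model.print_trainable_parameters()  # only the adapter weights are trainable
```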

### References
