
Commit

add doc
SunMarc committed Jul 28, 2023
1 parent 28acd3c commit ae77ffa
Showing 3 changed files with 102 additions and 0 deletions.
5 changes: 5 additions & 0 deletions docs/source/_toctree.yml
@@ -125,6 +125,11 @@
isExpanded: false
title: BetterTransformer
isExpanded: false
- sections:
- local: optimization_toolbox/usage_guides/quantization
title: How to quantize a model?
title: Optimization toolbox
isExpanded: false
- sections:
- local: utils/dummy_input_generators
title: Dummy input generators
@@ -0,0 +1,17 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

[[autodoc]] gptq.GPTQQuantizer
- all

[[autodoc]] gptq.load_quantized_model
- all
80 changes: 80 additions & 0 deletions docs/source/optimization_toolbox/usage_guides/quantization.mdx
@@ -0,0 +1,80 @@
# Quantization

## `AutoGPTQ` Integration

🤗 Optimum collaborated with the `AutoGPTQ` library to provide a simple API that performs GPTQ quantization on language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits. This comes without a big drop in performance and with faster inference speed. This is supported by most GPU hardware.

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).
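
For reference, that route boils down to passing a `GPTQConfig` when calling `from_pretrained`. Below is a minimal sketch; it assumes a recent `transformers` release with GPTQ support, and the model id is only illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Quantize on the fly while loading, using the "c4" dataset for calibration.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
```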

To learn more about the quantization method, check out:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend

### Requirements

You need the following requirements installed to run the code below (a quick verification sketch follows the list):

- Install the latest `AutoGPTQ` library:
`pip install auto-gptq`

- Install the latest `optimum`:
`pip install --upgrade optimum`

- Install the latest `transformers`:
`pip install --upgrade transformers`

- Install the latest `accelerate`:
`pip install --upgrade accelerate`
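
To verify that the requirements above are installed, here is a minimal sketch (the `getattr` fallback covers packages that do not expose `__version__` at the top level):

```python
# Note: the package is installed as `auto-gptq` but imported as `auto_gptq`.
import accelerate
import auto_gptq
import optimum
import transformers

for module in (auto_gptq, optimum, transformers, accelerate):
    print(module.__name__, getattr(module, "__version__", "unknown"))
```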

### Load and quantize model

The [`~optimum.gptq.GPTQQuantizer`] class is used to quantize your model. In order to quantize your model, you need to provide a few arguments:
- the number of bits: `bits`
- the dataset used to calibrate the quantization: `dataset`
- the model sequence length used to process the dataset: `model_seqlen`
- the block name to quantize: `block_name_to_quantize`

With the 🤗 Transformers integration, you don't need to pass `block_name_to_quantize` and `model_seqlen`, as we can retrieve them automatically. However, for a custom model, you need to specify them. Also, make sure that your model is converted to `torch.float16` before quantization.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer, load_quantized_model

model_name = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# `block_name_to_quantize` must match the architecture: for BLOOM models the
# decoder blocks live under `transformer.h`.
quantizer = GPTQQuantizer(bits=4, dataset="c4", block_name_to_quantize="transformer.h", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)
```
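
Once quantization finishes, the returned model behaves like any other 🤗 Transformers model, so a quick generation sanity check looks like this (the prompt is only illustrative):

```python
# Run a short generation to confirm the quantized model still produces sensible text.
inputs = tokenizer("GPTQ quantization is", return_tensors="pt").to(quantized_model.device)
outputs = quantized_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```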

<Tip warning={true}>
GPTQ quantization only works for text models for now. Furthermore, the quantization process can take a lot of time depending on one's hardware (a 175B model takes about 4 GPU-hours on an NVIDIA A100). Please check on the Hugging Face Hub whether a GPTQ-quantized version of the model already exists (a small search sketch follows this tip). If not, you can submit a request on GitHub.
</Tip>
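
To see whether a quantized checkpoint already exists, you can search the Hub by name. A minimal sketch, assuming a recent `huggingface_hub` release (the query string is only illustrative):

```python
from huggingface_hub import HfApi

api = HfApi()
# List up to five model repos whose name matches the query.
for model_info in api.list_models(search="bloom-1b7 gptq", limit=5):
    print(model_info.id)
```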

### Save model

To save your model, use the `save` method of the [`~optimum.gptq.GPTQQuantizer`] class. It will create a folder containing your model's state dict along with the quantization config.
```python
save_folder = "/path/to/save_folder/"
quantizer.save(model, save_folder)
```

### Load quantized weights

You can load your quantized weights by using the [`~optimum.gptq.load_quantized_model`] function.

```python
import torch
from transformers import AutoModelForCausalLM
from optimum.gptq import load_quantized_model

# Reload the original (fp16) model, then swap in the quantized weights saved earlier.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model_from_saved = load_quantized_model(model, save_folder=save_folder, device_map="auto")
```

You can also load your model faster and without using extra memory with `accelerate`: initialize an empty model on the `meta` device, so no weights are materialized, and then load the quantized weights into it.

```python
from accelerate import init_empty_weights

# Create the model skeleton on the meta device so no real weights are allocated,
# then load the quantized weights directly onto the right devices.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
empty_model.tie_weights()
model_from_saved = load_quantized_model(empty_model, save_folder=save_folder, device_map="auto")
```
