Fix GPTQ doc
regisss committed Aug 11, 2023
1 parent 9f2943e commit 6f788c6
Showing 2 changed files with 11 additions and 11 deletions.
docs/source/concept_guides/quantization.mdx (2 changes: 1 addition & 1 deletion)
@@ -185,7 +185,7 @@ models while respecting accuracy and latency constraints.
[PyTorch quantization functions](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-quantize-fx)
to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two
mentioned above, giving more flexibility, but requiring more work on your end (see the sketch after this list).
-- The `optimum.llm_quantization` package allows you to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization)
+- The `optimum.gptq` package allows you to [quantize and run LLM models](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) with GPTQ.
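For a sense of what "lower-level, more work on your end" means for the graph-mode path, here is a minimal sketch that calls PyTorch's FX quantization API directly (not the Optimum wrappers this bullet refers to); the toy module, backend string, and example input are placeholder assumptions:

```python
import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

# Placeholder model: any FX-traceable nn.Module follows the same flow.
model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()

# You pick the quantization config and supply example inputs yourself.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
example_inputs = (torch.randn(1, 16),)

prepared = prepare_fx(model, qconfig_mapping, example_inputs)  # insert observers
with torch.inference_mode():
    prepared(*example_inputs)                                  # calibration pass
quantized = convert_fx(prepared)                               # materialize int8 ops
```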

## Going further: How do machines represent numbers?

docs/source/llm_quantization/usage_guides/quantization.mdx (20 changes: 10 additions & 10 deletions)
@@ -4,32 +4,32 @@

🤗 Optimum has collaborated with the [AutoGPTQ library](https://github.com/PanQiWei/AutoGPTQ) to provide a simple API that applies GPTQ quantization to language models. With GPTQ quantization, you can quantize your favorite language model to 8, 6, 4 or even 2 bits, with little drop in performance and faster inference speed. This is supported by most GPU hardware.

If you want to quantize 🤗 Transformers models with GPTQ, follow this [documentation](https://huggingface.co/docs/transformers/main_classes/quantization).

To learn more about the quantization technique used in GPTQ, please refer to:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
- the [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
Note that the AutoGPTQ library provides more advanced features (Triton backend, fused attention, fused MLP) that are not integrated with Optimum. For now, we leverage only the CUDA kernel for GPTQ.

### Requirements

You need to have the following requirements installed to run the code below:

- AutoGPTQ library:
`pip install auto-gptq`

- Optimum library:
`pip install --upgrade optimum`

- Install the latest `transformers` library from source:
`pip install --upgrade git+https://github.com/huggingface/transformers.git`

- Install the latest `accelerate` library:
`pip install --upgrade accelerate`

### Load and quantize a model

-The [`~optimum.gptq.GPTQQuantizer`] class is used to quantize your model. To do so, you need to provide a few arguments:
+The [`~gptq.GPTQQuantizer`] class is used to quantize your model. To do so, you need to provide a few arguments (a minimal sketch follows the list below):
- the number of bits: `bits`
- the dataset used to calibrate the quantization: `dataset`
- the model sequence length used to process the dataset: `model_seqlen`
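The full quantization example is collapsed in this diff, so the following is only a minimal sketch of how these arguments might be passed; the model name and calibration dataset (`facebook/opt-125m`, `c4`) are placeholders rather than values taken from this commit:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_name = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# bits, dataset and model_seqlen are the arguments described above.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)
```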
@@ -55,15 +55,15 @@ GPTQ quantization only works for text model for now. Futhermore, the quantizatio

### Save the model

-To save your model, use the save method from the [`~optimum.gptq.GPTQQuantizer`] class. It will create a folder with your model state dict along with the quantization config.
+To save your model, use the save method from the [`~gptq.GPTQQuantizer`] class. It will create a folder with your model state dict along with the quantization config.
```python
save_folder = "/path/to/save_folder/"
quantizer.save(model, save_folder)
```

### Load quantized weights

-You can load your quantized weights by using the [`~optimum.gptq.load_quantized_model`] function.
+You can load your quantized weights by using the [`~gptq.load_quantized_model`] function.
Through the Accelerate library, it is possible to load a model faster and with lower memory usage. The model needs to be initialized with empty weights, and the real weights are loaded in a second step.
```python
from accelerate import init_empty_weights
@@ -75,7 +75,7 @@
quantized_model = load_quantized_model(empty_model, save_folder=save_folder, dev
```
### Exllama kernels for faster inference

-For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. If you want to change this behavior, pass `disable_exllama` to [`~optimum.gptq.load_quantized_model`]. In order to use these kernels, you need to have the entire model on GPUs.
+For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. If you want to change this behavior, pass `disable_exllama` to [`~gptq.load_quantized_model`]. In order to use these kernels, you need to have the entire model on GPUs.

```py
from optimum.gptq import GPTQQuantizer, load_quantized_model
```
@@ -90,9 +90,9 @@
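The body of this example is collapsed above; as a rough sketch, and assuming the same `save_folder` and empty-model setup as in the previous section, disabling the kernels only changes the final call:

```py
# Sketch, not the exact collapsed example: reuse `empty_model` and `save_folder`
# from the loading section and pass the flag that turns the exllama kernels off.
quantized_model = load_quantized_model(
    empty_model, save_folder=save_folder, device_map="auto", disable_exllama=True
)
```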

Note that only 4-bit models are supported with exllama kernels for now. Furthermore, it is recommended to disable the exllama kernel when you are finetuning your model with peft.

#### Fine-tune a quantized model

With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
Please have a look at the [`peft`](https://github.com/huggingface/peft) library for more details.
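As an illustrative sketch only (the LoRA hyperparameters below are placeholders and are not taken from the Optimum docs), attaching an adapter to the quantized model could look like this:

```py
from peft import LoraConfig, get_peft_model

# Placeholder LoRA settings; choose target modules that match your architecture.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(quantized_model, lora_config)  # quantized_model from above
model.print_trainable_parameters()  # only the adapter weights are trainable
```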

### References
