diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index e7ed353ffa..7d62f00720 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -126,9 +126,9 @@
       title: BetterTransformer
     isExpanded: false
   - sections:
-    - local: optimization_toolbox/usage_guides/quantization
+    - local: llm_quantization/usage_guides/quantization
       title: GPTQ quantization
-    title: Optimization toolbox
+    title: LLM quantization
     isExpanded: false
   - sections:
     - local: utils/dummy_input_generators
diff --git a/docs/source/concept_guides/quantization.mdx b/docs/source/concept_guides/quantization.mdx
index f751e9d47a..b9aca25ee9 100644
--- a/docs/source/concept_guides/quantization.mdx
+++ b/docs/source/concept_guides/quantization.mdx
@@ -185,6 +185,7 @@
 models while respecting accuracy and latency constraints.
 [PyTorch quantization functions](https://pytorch.org/docs/stable/quantization-support.html#torch-quantization-quantize-fx)
 to allow graph-mode quantization of 🤗 Transformers models in PyTorch. This is a lower-level API compared to the two
 mentioned above, giving more flexibility, but requiring more work on your end.
+- The `optimum.llm_quantization` package allows you to [quantize and run LLMs](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization)
 
 ## Going further: How do machines represent numbers?
diff --git a/docs/source/optimization_toolbox/usage_guides/quantization.mdx b/docs/source/llm_quantization/usage_guides/quantization.mdx
similarity index 100%
rename from docs/source/optimization_toolbox/usage_guides/quantization.mdx
rename to docs/source/llm_quantization/usage_guides/quantization.mdx
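The concept guide touched above describes affine quantization (mapping real-valued tensors to low-bit integers via a scale and zero-point), which is the scheme GPTQ and most LLM quantization methods build on. The following is a minimal pure-Python sketch of that mapping for illustration only; it is not part of the `optimum` API, and the function names here (`quantize`, `dequantize`) are hypothetical.

```python
# Illustrative affine (asymmetric) quantization: floats are mapped to
# unsigned integers via a scale and zero-point, then mapped back with
# a bounded round-trip error. Not optimum's API; a sketch of the concept.

def quantize(values, num_bits=8):
    """Map floats to [0, 2**num_bits - 1] integers plus (scale, zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid zero scale for constant input
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, -0.5, 0.0, 0.5, 2.0]
q, scale, zp = quantize(vals)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(vals, restored))
```

Because 0.0 is forced into the representable range, it maps exactly to the zero-point, and any in-range value round-trips with error at most half the scale.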