Add model quantization instructions to ONNX documentation (#2570)
AndreSlavescu committed Aug 22, 2024
1 parent 9960861 commit 9b1ba34
Showing 2 changed files with 68 additions and 4 deletions.
48 changes: 44 additions & 4 deletions docs/onnx-conversion.md
@@ -44,7 +44,7 @@ To run the script and produce the onnx model, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 convert_hf_model_to_onnx.py --model_name naver/splade-cocondenser-ensembledistil
+python convert_hf_model_to_onnx.py --model_name naver/splade-cocondenser-ensembledistil
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:
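
Before walking through those parts, the sketch below shows what a Hugging Face-to-ONNX export of this kind typically looks like with `torch.onnx.export`. The checkpoint name and output path follow the SPLADE++ example above; the input/output names, dynamic axes, and opset version are illustrative assumptions and may differ from what the actual script does.

```python
# Illustrative sketch of an HF-to-ONNX export; the real convert_hf_model_to_onnx.py
# may choose different input/output names, dynamic axes, or opset version.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False makes the model return plain tuples, which the ONNX tracer expects.
model = AutoModelForMaskedLM.from_pretrained(model_name, return_dict=False)
model.eval()

# Trace the model with a dummy input so the exporter can record the graph.
dummy = tokenizer("a sample query", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "models/splade-cocondenser-ensembledistil.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)
```
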
@@ -186,9 +186,9 @@ To run the script and produce the optimized onnx model, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx
+python optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx
# To run the script that produces the graph summary for the un-optimized and optimized graphs, run the following:
-python3 optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx --stats
+python optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx --stats
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:
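
Before those sections, here is a minimal sketch of BERT-style graph optimization with ONNX Runtime's transformer optimizer. The head count and hidden size assume the BERT-base encoder underlying SPLADE++; the options actually used by `optimize_onnx_model.py` may differ.

```python
# Minimal sketch of graph optimization with ONNX Runtime's transformer optimizer
# (illustrative; optimize_onnx_model.py may use different options).
from onnxruntime.transformers import optimizer

# Fuse attention, layer normalization, and GELU subgraphs for a BERT-style graph.
optimized_model = optimizer.optimize_model(
    "models/splade-cocondenser-ensembledistil.onnx",
    model_type="bert",
    num_heads=12,     # BERT-base configuration assumed
    hidden_size=768,  # BERT-base configuration assumed
)
optimized_model.save_model_to_file("models/splade-cocondenser-ensembledistil-optimized.onnx")
```
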
@@ -256,7 +256,7 @@ To run the script for running inference, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 run_onnx_model_inference.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx \
+python run_onnx_model_inference.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx \
--model_name naver/splade-cocondenser-ensembledistil
```

@@ -318,6 +318,46 @@ Sparse vector output after thresholding: [[[0. 0.23089279 0.14276895 ...

All of these definitions are modularized in `run_onnx_inference(model_path, model_name, text, threshold)`.
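
As a rough picture of how such a function can be assembled from ONNX Runtime and a Hugging Face tokenizer, consider the sketch below. The input-name matching and the thresholding step are illustrative assumptions rather than the exact contents of `run_onnx_model_inference.py`.

```python
# Illustrative sketch of an ONNX Runtime inference helper; the actual
# run_onnx_model_inference.py may differ in input handling and post-processing.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

def run_onnx_inference(model_path, model_name, text, threshold=0.0):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    session = ort.InferenceSession(model_path)

    # Tokenize the query and feed only the arrays the ONNX graph declares as inputs.
    encoded = tokenizer(text, return_tensors="np")
    ort_inputs = {inp.name: encoded[inp.name]
                  for inp in session.get_inputs() if inp.name in encoded}
    logits = session.run(None, ort_inputs)[0]

    # Zero out activations at or below the threshold to obtain a sparse vector.
    return np.where(logits > threshold, logits, 0.0)

sparse_vector = run_onnx_inference(
    "models/splade-cocondenser-ensembledistil-optimized.onnx",
    "naver/splade-cocondenser-ensembledistil",
    "what is the capital of france?",
)
```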

## Quantization

### Run End-to-End Quantization

Loading and quantizing the model is done easily with argparse in the following script:
```
src/main/python/onnx/quantize_onnx_model.py
```

For this example, we will continue with the SPLADE++ Ensemble Distil model.

To run the quantization script, run the following sequence of commands:
```bash
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
python quantize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:

### Quantizing the Model to 8-bit

As seen below, the base file name and extension are extracted from the provided optimized ONNX model path, and an output name with an `-8bit` suffix is constructed.

In terms of the quantization call itself, only `model_input` and `model_output` are required to identify the source and target models. The other two arguments specify the desired weight data type (`weight_type=QuantType.QInt8`) and the default tensor type (`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}`).

```python
base, ext = os.path.splitext(onnx_model_path)
quantized_model_path = f"{base}-8bit{ext}"

quantize_dynamic(
    model_input=onnx_model_path,
    model_output=quantized_model_path,
    weight_type=QuantType.QInt8,
    extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}
)
```
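
As a quick sanity check (not part of the script above), the quantized file can be compared against the optimized one and loaded through the same ONNX Runtime API; the paths below follow the SPLADE++ example.

```python
# Quick sanity check on the quantized model (not part of quantize_onnx_model.py).
import os
import onnxruntime as ort

optimized_path = "models/splade-cocondenser-ensembledistil-optimized.onnx"
quantized_path = "models/splade-cocondenser-ensembledistil-optimized-8bit.onnx"

# 8-bit weights should shrink the file to roughly a quarter of the fp32 size.
print(f"optimized: {os.path.getsize(optimized_path) / 1e6:.1f} MB")
print(f"quantized: {os.path.getsize(quantized_path) / 1e6:.1f} MB")

# The quantized model loads and runs through the same InferenceSession API as before.
session = ort.InferenceSession(quantized_path)
print([inp.name for inp in session.get_inputs()])
```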

## Concluding Remarks

Now that we have successfully gone through a complete reproduction of converting SPLADE++ Ensemble Distil from PyTorch to ONNX and have run inference with the optimized model, the same scripts can be used to convert any model available on Hugging Face.
24 changes: 24 additions & 0 deletions src/main/python/onnx/quantize_onnx_model.py
@@ -0,0 +1,24 @@
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import argparse
import os

def quantize_model(onnx_model_path):
    # Derive the output path by appending an "-8bit" suffix to the input file name.
    base, ext = os.path.splitext(onnx_model_path)
    quantized_model_path = f"{base}-8bit{ext}"

    # Dynamically quantize the model weights to signed 8-bit integers.
    quantize_dynamic(
        model_input=onnx_model_path,
        model_output=quantized_model_path,
        weight_type=QuantType.QInt8,
        extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}
    )

    print(f"Quantized model saved to {quantized_model_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Quantize ONNX model to 8-bit")
    parser.add_argument("--model_path", type=str, required=True, help="Path to ONNX model")
    args = parser.parse_args()

    quantize_model(args.model_path)
