Add model quantization instructions to ONNX documentation (#2570)
AndreSlavescu committed Aug 22, 2024
1 parent 9960861 commit 9b1ba34
Showing 2 changed files with 68 additions and 4 deletions.
48 changes: 44 additions & 4 deletions docs/onnx-conversion.md
@@ -44,7 +44,7 @@ To run the script and produce the onnx model, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 convert_hf_model_to_onnx.py --model_name naver/splade-cocondenser-ensembledistil
+python convert_hf_model_to_onnx.py --model_name naver/splade-cocondenser-ensembledistil
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:
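
Before walking through those parts, the sketch below shows what a Hugging Face-to-ONNX export of this kind typically looks like with `torch.onnx.export`. The checkpoint name and output path follow the SPLADE++ example above; the input/output names, dynamic axes, and opset version are illustrative assumptions and may differ from what the actual script does.

```python
# Illustrative sketch of an HF-to-ONNX export; the real convert_hf_model_to_onnx.py
# may choose different input/output names, dynamic axes, or opset version.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False makes the model return plain tuples, which the ONNX tracer expects.
model = AutoModelForMaskedLM.from_pretrained(model_name, return_dict=False)
model.eval()

# Trace the model with a dummy input so the exporter can record the graph.
dummy = tokenizer("a sample query", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "models/splade-cocondenser-ensembledistil.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)
```
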
@@ -186,9 +186,9 @@ To run the script and produce the optimized onnx model, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx
+python optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx
# To run the script that produces the graph summary for the un-optimized and optimized graphs, run the following:
-python3 optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx --stats
+python optimize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil.onnx --stats
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:
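
Before those sections, here is a minimal sketch of BERT-style graph optimization with ONNX Runtime's transformer optimizer. The head count and hidden size assume the BERT-base encoder underlying SPLADE++; the options actually used by `optimize_onnx_model.py` may differ.

```python
# Minimal sketch of graph optimization with ONNX Runtime's transformer optimizer
# (illustrative; optimize_onnx_model.py may use different options).
from onnxruntime.transformers import optimizer

# Fuse attention, layer normalization, and GELU subgraphs for a BERT-style graph.
optimized_model = optimizer.optimize_model(
    "models/splade-cocondenser-ensembledistil.onnx",
    model_type="bert",
    num_heads=12,     # BERT-base configuration assumed
    hidden_size=768,  # BERT-base configuration assumed
)
optimized_model.save_model_to_file("models/splade-cocondenser-ensembledistil-optimized.onnx")
```
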
@@ -256,7 +256,7 @@ To run the script for running inference, run the following sequence of commands:
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
-python3 run_onnx_model_inference.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx \
+python run_onnx_model_inference.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx \
--model_name naver/splade-cocondenser-ensembledistil
```

@@ -318,6 +318,46 @@ Sparse vector output after thresholding: [[[0. 0.23089279 0.14276895 ...

All of these definitions are modularized in `run_onnx_inference(model_path, model_name, text, threshold)`.
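
As a rough picture of how such a function can be assembled from ONNX Runtime and a Hugging Face tokenizer, consider the sketch below. The input-name matching and the thresholding step are illustrative assumptions rather than the exact contents of `run_onnx_model_inference.py`.

```python
# Illustrative sketch of an ONNX Runtime inference helper; the actual
# run_onnx_model_inference.py may differ in input handling and post-processing.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

def run_onnx_inference(model_path, model_name, text, threshold=0.0):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    session = ort.InferenceSession(model_path)

    # Tokenize the query and feed only the arrays the ONNX graph declares as inputs.
    encoded = tokenizer(text, return_tensors="np")
    ort_inputs = {inp.name: encoded[inp.name]
                  for inp in session.get_inputs() if inp.name in encoded}
    logits = session.run(None, ort_inputs)[0]

    # Zero out activations at or below the threshold to obtain a sparse vector.
    return np.where(logits > threshold, logits, 0.0)

sparse_vector = run_onnx_inference(
    "models/splade-cocondenser-ensembledistil-optimized.onnx",
    "naver/splade-cocondenser-ensembledistil",
    "what is the capital of france?",
)
```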

## Quantization

### Run End-to-End Quantization

Loading and quantizing the model is done easily with argparse in the following script:
```
src/main/python/onnx/quantize_onnx_model.py
```

For this example, we will continue with the SPLADE++ Ensemble Distil model.

To run the quantization script, run the following sequence of commands:
```bash
# Begin by going to the appropriate directory
cd src/main/python/onnx
# Now run the script
python quantize_onnx_model.py --model_path models/splade-cocondenser-ensembledistil-optimized.onnx
```

So what actually happens under the hood? The following sections will discuss the key parts of the above script:

### Quantizing the Model to 8-bit

As seen below, the base file name and extension are extracted from the provided optimized ONNX model path, and an output name with an `-8bit` suffix is constructed.

In terms of the quantization call itself, only `model_input` and `model_output` are required to identify the source and target models. The other two arguments specify the desired weight data type (`weight_type=QuantType.QInt8`) and the default tensor type (`extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}`).

```python
base, ext = os.path.splitext(onnx_model_path)
quantized_model_path = f"{base}-8bit{ext}"

quantize_dynamic(
    model_input=onnx_model_path,
    model_output=quantized_model_path,
    weight_type=QuantType.QInt8,
    extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}
)
```
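
As a quick sanity check (not part of the script above), the quantized file can be compared against the optimized one and loaded through the same ONNX Runtime API; the paths below follow the SPLADE++ example.

```python
# Quick sanity check on the quantized model (not part of quantize_onnx_model.py).
import os
import onnxruntime as ort

optimized_path = "models/splade-cocondenser-ensembledistil-optimized.onnx"
quantized_path = "models/splade-cocondenser-ensembledistil-optimized-8bit.onnx"

# 8-bit weights should shrink the file to roughly a quarter of the fp32 size.
print(f"optimized: {os.path.getsize(optimized_path) / 1e6:.1f} MB")
print(f"quantized: {os.path.getsize(quantized_path) / 1e6:.1f} MB")

# The quantized model loads and runs through the same InferenceSession API as before.
session = ort.InferenceSession(quantized_path)
print([inp.name for inp in session.get_inputs()])
```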

## Concluding Remarks

Now that we have successfully gone through a complete reproduction of converting SPLADE++ Ensemble Distil from PyTorch to ONNX and have run inference with the optimized model, the same scripts can be used to convert any model available on Hugging Face.
24 changes: 24 additions & 0 deletions src/main/python/onnx/quantize_onnx_model.py
@@ -0,0 +1,24 @@
import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType
import argparse
import os

def quantize_model(onnx_model_path):
    # Derive the output path by appending an "-8bit" suffix to the input file name.
    base, ext = os.path.splitext(onnx_model_path)
    quantized_model_path = f"{base}-8bit{ext}"

    # Dynamically quantize the model weights to signed 8-bit integers.
    quantize_dynamic(
        model_input=onnx_model_path,
        model_output=quantized_model_path,
        weight_type=QuantType.QInt8,
        extra_options={'DefaultTensorType': onnx.TensorProto.FLOAT}
    )

    print(f"Quantized model saved to {quantized_model_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Quantize ONNX model to 8-bit")
    parser.add_argument("--model_path", type=str, required=True, help="Path to ONNX model")
    args = parser.parse_args()

    quantize_model(args.model_path)
