Neural Compressor

An open-source Python library supporting popular model compression techniques for ONNX

Neural Compressor provides popular model compression techniques inherited from Intel Neural Compressor, with a focus on ONNX model quantization through ONNX Runtime, such as SmoothQuant and weight-only quantization. The sections below cover its key features, typical examples, and channels for open collaboration.

Installation

Install from source

git clone https://github.com/onnx/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
pip install .

Note: further installation methods can be found in the Installation Guide.

Getting Started

Setting up the environment:

pip install onnx-neural-compressor "onnxruntime>=1.17.0" onnx

After successfully installing these packages, try your first quantization program.

Note: until the formal PyPI release, please install from source.
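A quick sanity check, as a minimal sketch, is to import the packages installed above and print the runtime versions (onnx_neural_compressor is the import name of the package installed above):

import onnx
import onnxruntime
import onnx_neural_compressor  # verifies the library itself imports cleanly

print("onnx:", onnx.__version__)
print("onnxruntime:", onnxruntime.__version__)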

Weight-Only Quantization (LLMs)

The following example demonstrates weight-only quantization on LLMs using the RTN (round-to-nearest) algorithm. When multiple devices are available, the most efficient device is selected automatically.

Run the example:

from onnx_neural_compressor.quantization import matmul_nbits_quantizer

# model: path to the FP32 ONNX model, or a loaded onnx.ModelProto
algo_config = matmul_nbits_quantizer.RTNWeightOnlyQuantConfig()
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model,
    n_bits=4,           # quantize weights to 4 bits
    block_size=32,      # number of weight elements sharing one scale
    is_symmetric=True,  # use symmetric quantization
    algo_config=algo_config,
)
quant.process()
best_model = quant.model
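For end-to-end use, the input model can be loaded with onnx and the quantized result written back out. The paths below are placeholders, and this assumes best_model is returned as an onnx.ModelProto:

import onnx

model = onnx.load("model_fp32.onnx")      # placeholder input path
# ... run the quantizer as above ...
onnx.save(best_model, "model_int4.onnx")  # placeholder output path

For LLM-sized models, saving with external data (onnx.save(..., save_as_external_data=True)) may be necessary to stay under the 2 GB protobuf limit.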

Static Quantization

from onnx_neural_compressor.quantization import quantize, config
from onnx_neural_compressor import data_reader


class DataReader(data_reader.CalibrationDataReader):
    def __init__(self):
        self.encoded_list = []
        # append calibration samples (dicts mapping input names to
        # numpy arrays) to self.encoded_list

        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        # return the next calibration sample, or None when exhausted
        return next(self.iter_next, None)

    def rewind(self):
        # restart iteration from the first sample
        self.iter_next = iter(self.encoded_list)


# named to avoid shadowing the imported data_reader module
calibration_data_reader = DataReader()
qconfig = config.StaticQuantConfig(calibration_data_reader=calibration_data_reader)
quantize(model, output_model_path, qconfig)
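As a concrete illustration, here is a minimal sketch of a reader pre-filled with random calibration data. The input name and shape are hypothetical placeholders and must match your model's actual inputs:

import numpy as np

from onnx_neural_compressor import data_reader


class RandomDataReader(data_reader.CalibrationDataReader):
    def __init__(self, num_samples=8):
        # "input" and (1, 3, 224, 224) are placeholders; substitute your
        # model's real input names and shapes.
        self.encoded_list = [
            {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
            for _ in range(num_samples)
        ]
        self.iter_next = iter(self.encoded_list)

    def get_next(self):
        return next(self.iter_next, None)

    def rewind(self):
        self.iter_next = iter(self.encoded_list)

In practice, calibration data representative of the real inference distribution will generally give better quantization accuracy than random tensors.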

Documentation

Overview
  • Architecture
  • Workflow
  • Examples

Feature
  • Quantization
  • SmoothQuant
  • Weight-Only Quantization (INT8/INT4)
  • Layer-Wise Quantization

Additional Content

Communication

  • GitHub Issues: mainly for bug reports, new feature requests, and questions.
  • Email: we welcome interesting research ideas on model compression techniques; reach out by email to discuss collaborations.