
Intel® auto-round v0.2 Release

@wenhuach21 wenhuach21 released this 30 May 02:13
· 85 commits to main since this release
aafb82e

Overview

This release adds support for the Intel XPU format and implements lm-head quantization and inference, which reduces the model size from 5.4GB to 4.7GB for LLAMA3 at W4G128 (4-bit weights, group size 128). It also supports both local datasets and mixed online datasets for calibration. With memory usage and tuning cost optimized, calibration with 512 samples now takes approximately 20 minutes for 7B models and 2.5 hours for 70B models when `disable_low_gpu_mem_usage` is set.
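As a rough sanity check on the size reduction, the saving from quantizing lm-head can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not auto-round code. It assumes the model is Llama-3-8B (the release only says LLAMA3), with a vocabulary of 128,256 tokens and a hidden size of 4,096, an FP16 lm-head before quantization, and one FP16 scale plus one 4-bit zero point per group of 128 weights. The function names are hypothetical.

```python
# Back-of-the-envelope estimate of lm-head size before and after W4G128 quantization.
# Assumptions (not stated in the release notes): Llama-3-8B shapes, FP16 baseline,
# one FP16 scale and one 4-bit zero point per group.

VOCAB_SIZE = 128_256   # Llama-3 vocabulary size (assumed model)
HIDDEN_SIZE = 4_096    # Llama-3-8B hidden size (assumed model)
GROUP_SIZE = 128       # "G128"
BITS = 4               # "W4"


def lm_head_bytes_fp16(vocab: int, hidden: int) -> float:
    """Size of an unquantized lm-head stored in FP16 (2 bytes per weight)."""
    return vocab * hidden * 2


def lm_head_bytes_quantized(vocab: int, hidden: int, bits: int, group_size: int) -> float:
    """Approximate size of a group-quantized lm-head, including per-group metadata."""
    n_weights = vocab * hidden
    n_groups = n_weights // group_size
    weight_bytes = n_weights * bits / 8      # packed low-bit weights
    scale_bytes = n_groups * 2               # one FP16 scale per group
    zero_point_bytes = n_groups * bits / 8   # one low-bit zero point per group
    return weight_bytes + scale_bytes + zero_point_bytes


fp16_gb = lm_head_bytes_fp16(VOCAB_SIZE, HIDDEN_SIZE) / 1e9
quant_gb = lm_head_bytes_quantized(VOCAB_SIZE, HIDDEN_SIZE, BITS, GROUP_SIZE) / 1e9
saved_gb = fp16_gb - quant_gb

print(f"lm-head FP16:   {fp16_gb:.2f} GB")
print(f"lm-head W4G128: {quant_gb:.2f} GB")
print(f"estimated saving: {saved_gb:.2f} GB")
```

Under these assumptions the estimated saving is in the same range as the 0.7GB difference reported above; the exact figure depends on the packing format and on how the checkpoint size is measured.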

Others:

More accuracy data is presented in the [paper](https://arxiv.org/pdf/2309.05516) and on the [low_bit_open_llm_leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard).

More technical details are presented in the [paper](https://arxiv.org/pdf/2309.05516).

Known issues:

In some scenarios there is a large discrepancy between the GPTQ-format model and the QDQ model under asymmetric quantization. We are working on it.
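For readers unfamiliar with the term, the sketch below illustrates what a quantize-dequantize (QDQ) pass with asymmetric quantization typically looks like for one group of weights. It assumes "qdq" here means quantize-dequantize, and it is a generic textbook formulation in plain Python, not auto-round's implementation and not an explanation of the discrepancy. The function name `qdq_asym` is hypothetical.

```python
# Generic asymmetric quantize-dequantize (QDQ) for one group of weights.
# Illustrative only: not auto-round's code and not a diagnosis of the known issue.

def qdq_asym(values, bits=4):
    """Quantize to unsigned integers with a zero point, then map back to floats."""
    qmax = 2 ** bits - 1
    # Extend the range to include 0.0 so the zero point stays within [0, qmax].
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)

    dequantized = []
    for v in values:
        q = round(v / scale) + zero_point   # quantize
        q = min(max(q, 0), qmax)            # clamp to the representable range
        dequantized.append((q - zero_point) * scale)  # dequantize
    return dequantized, scale, zero_point


group = [-0.8, -0.1, 0.0, 0.35, 1.2]
restored, scale, zero_point = qdq_asym(group, bits=4)
for original, approx in zip(group, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

A QDQ model keeps the dequantized values in floating point, whereas a packed format stores the integers, scales, and zero points and dequantizes at inference time, so the two should agree only if both follow the same rounding and zero-point conventions.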