[BUG] Qwen 2.5 32B returns garbage at certain quantization levels, but not others #628
Closed
Labels: bug
OS: Linux
GPU Library: CUDA 12.x
Python version: 3.12
Pytorch version: 2.3, 2.4, 2.6 nightly; flash-attn and xformers built from source; exllama built from the master branch
Describe the bug
Qwen 2.5 32B returns garbage output with certain quantizations above 4bpw, but not with ones below 4bpw.
Possibly related to #621 or #627
What's unusual is that lower quantizations work, but higher ones do not.
These two quants work for me:
https://huggingface.co/Downtown-Case/Qwen_Qwen2.5-32B-Base-exl2-3.92bpw
https://huggingface.co/Downtown-Case/Qwen_Qwen2.5-32B-Base-exl2-3.75bpw
While this one (and a 4.04bpw quant I had locally) returns garbage:
Here's an example command I used for quantization:
```
python convert.py --in_dir "/home/down/Models/Raw/Qwen_Qwen2.5-32B" -o "/home/down/FastStorage/scratch2" -m "/home/down/Models/calibration/Q32-base.json" -b 4.0 -hb 6 -cf "/home/down/Models/exllama/Qwen_Qwen2.5-32B-exl2-4.0bpw" -nr --fast_safetensors
```
Re-doing the calibration from scratch doesn't seem to make a difference, and that same calibration was used for the sub-4bpw quantizations.
I tried quantizing at 4.1/4.04 bpw in multiple pytorch environments, with different versions of flash-attention installed, remaking the measurement json from scratch, and so on. My test is a 75K-context story at Q4 cache quantization, simply continuing it in exui. Again, the sub-4bpw quantizations continue it coherently, while the ones over 4bpw return garbled English, with no errors in the console.
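For anyone who wants to reproduce this outside exui, here's a minimal sketch of the same test using exllamav2's Python API with a Q4 cache. The model path, prompt file, and context-length override are placeholders, and the config/loader calls assume a recent exllamav2 build, so treat it as an assumption-laden example rather than the exact exui code path:

```python
# Hedged repro sketch -- paths and the prompt file are placeholders, and the
# config/loader calls assume a recent exllamav2 build (API details can shift
# between versions).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "/home/down/Models/exllama/Qwen_Qwen2.5-32B-exl2-4.0bpw"  # the quant that misbehaves

config = ExLlamaV2Config(model_dir)
# config.max_seq_len = 81920  # adjust if the model config caps context below ~75K tokens

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)   # Q4 cache, matching the exui test
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Load the long story (placeholder file) and ask the model to continue it.
with open("story.txt") as f:
    prompt = f.read()

output = generator.generate(prompt=prompt, max_new_tokens=200, add_bos=True)
print(output[-2000:])  # the tail shows whether the continuation is coherent or garbled
```

If this script continues the story coherently from the 3.75/3.92bpw quants but returns garbage from the 4.0+ ones, that would also rule out exui itself as the culprit.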
I'm running through more troubleshooting steps now (like trying different levels of cache quantization and making more quantizations), but figured I'd post early since others seem to be having issues with Qwen.