[Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering (#146)

* Refactor BatchMatMulEmitter and BatchMatMulSelector for improved readability and maintainability

* Refactor import statements for improved readability and maintainability

* Refactor import statements for improved readability and maintainability

* disable failure email for ci

* remove email notifications.

* move relax pass from testing to mlc_llm

* Refactor scripts with the check_eual_ref_scripts_with_emitter function

* Lint Fix

* Refactor scripts with the check_eual_ref_scripts_with_emitter function

* bug fix in test

* lint fix.

* test cuda i4 kernel

* Refactor copyright notice in i4matmul.hpp

* Refactor BitBLASLinear test module for improved readability and maintainability

* refactor test as version below python 3.9 cannot handle int32 overflow.

* format lint for test

* Refactor test_int4b_fp16_convert.py for improved readability and maintainability

* remove unused design file

* move tile device from package to base

* dummy impl for codegen

* Refactor file structure for ladder_permutate module

* Refactor backend class and fix typos in comments

* Deep refactor Lib related code.

* remove ci pull.

* LintFix

* refactor builder for whl build

* Refactor TIRWrapper.wrap() method to include an assertion for the optimized module

* Refactor lib_generator to set library and source paths

* lint fix

* BitNet vllm integration

* chore: update codespell to version 2.3.0

* Lintfix

* Bump version to 0.0.1.dev13

* lint fix

* disable fast decoding [u]int4xint8 by default.

* optimize from dict design in Hint

* Implement SplitK

* bitnet benchmark generation.

* Add benchmark script for BitNet integration

* AtomicAdd Support

* LintFix

* ci fix when 3rdparty tvm is initialized.

* bug fix for setup

* fix a bug in block reduce

* typo fix

* BUG Fix for block reduce.

* Lint fix

* Refactor block reduce schedule template

* transform branch from bitblas to bitblas_tl

* Fix subproject commit reference in 3rdparty/tvm

* chore: update submodule branch from bitblas to bitblas_tl

* force update config.cmake

* Bug fix

* Fix subproject commit reference in 3rdparty/cutlass

* chore: Add submodule for cutlass library

* update tl cutlass path

* Refactor BitBLASLinear test module for improved readability and maintainability

* format fix

* Copy CUTLASS to the package directory

* Refactor setup.py to include additional TVM header files

* lint fix

* bug fix

* Refactor BitBLASLinear test module for improved readability and maintainability

* Implement Matmul Benchmark Design

* chore: Update BitBLAS Matmul benchmark script

* lint fix

* Refactor BitBLASMatmulOpsBenchmark for improved readability and maintainability

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* lint fix

* Benchmark bot test

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* int8 test case

* Refactor compare_benchmark.py to handle missing benchmark results gracefully

* ci fix

* disable ci for test benchmark

* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run

* remove cli installation

* chore: Create virtual environment and install dependencies for benchmark

* chore: Update benchmark workflow to include comparison step

* Lint fix

* update tvm commit

* Improve lower warp memory pass

* Bug fix

* Enhance to support warp schedule.

* Enhance LOP3 Instructions

* Enhance LOP3 Instructions

* add test for stage3 propagate

* implement propagate func

* Stage3 Ladder Permutate integration

* get_ladder_stage3_propagate

* comment out benchmark scripts as the setting is too big

* ci fix for benchmark

* lint fix

* chore: Update benchmark workflow to trigger on pull request comments

* Add LDMatrix Transform 3

* Support GPTQ Test

* Fuse BlockReduce Schedule

* Support mma propagate 3

* Support MMA Propagate Stage 3

* Lint Fix

* Merge block reduce for dequantize config.

* fix codeql

* chore: Update submodule reference to latest commit

* chore: Disable common subexpression elimination in TIR passes

* Lint Fix

* 4bit related lop3 updates.

* lint fix

* gptq test fix

* Fix for test

* lint fix

* lint fix

* typofix

* QuantCompress Test

* chore: Refactor quant_compress_impl.py for readability and maintainability

* Enhance docs to update latest works.

* Refactor weight executors in Matmul class for improved readability and maintainability

* Refactor weight executors in Matmul class for improved readability and maintainability

* Refactor weight executors in Matmul class for improved readability and maintainability

* removed legacy operator

* Refactor weight executors in Matmul class for improved readability and maintainability

* LintFix

* Fix GPTQ Repack with the latest weight transform

* lint fix

* bug fix for rescale dequantize

* test fix

* typo fix

* lint fix

* Set default weight propagate kind into LDMatrixTransform

* lint fix

* bug fix

* bug fix for test

* set default to stage3

* revert change

* lint fix

* case fix

* bug fix

* fix for legalize

* bug fix

* chore: Clear global operator cache before running tests

* revert optimize_stratety into SingleBatchDecodeOnly

* typofix

* update benchmark scripts

* chore: Refactor benchmark scripts and fix typos

* fix for testing

* lint fix

* fix import.

* typo

* operator benchmark

* optimize

* always with shared.dyn

* optimize cache.

* dsl fix

* tqdm

* chore: Add serialize_results method to benchmark_matmul_strategies.py

* fix performance issue for dynamic async copy

* chore: Refactor benchmark_matmul_strategies.py for improved performance and code readability

* bug fix

* update readme

* disable block reduce for int8

* bugfix for bitnet

* annotate todo.

* lint fix

* register fast_decode for int8xint4

* Refactor CUDA code to use sm architecture instead of compute architecture

* compress qkv and gate up for bitnet

* improve elementwise schedule

* Refactor BitNet model checkpoint generation scripts

* cross thread reduce for tl
LeiWang1999 authored Aug 18, 2024
1 parent 01f8eb7 commit 01c7a80
Showing 5 changed files with 132 additions and 3 deletions.
2 changes: 0 additions & 2 deletions bitblas/module/__init__.py
@@ -52,8 +52,6 @@ def unpack_qweight(qweight, bits):
i = col % elems_per_int8
unpacked_weight[:, col] = (qweight[:, col // elems_per_int8] >> (bits * i))

# Follow the instruction in AutoGPTQ qlinear_cuda_old.py line 303
# NOTE: It appears that casting after the `unpacked_zeros + 1` is important.
return torch.bitwise_and(unpacked_weight, 2**bits - 1)


91 changes: 91 additions & 0 deletions integration/BitNet/maint/README.md
@@ -0,0 +1,91 @@
---
license: mit
---


This is a BitBLAS implementation of the reproduced 1.58-bit model from [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B). We replaced the original simulated Int8x3bit quantized inference kernel with the BitBLAS INT8xINT2 kernel. We also evaluated the model's correctness and performance through `eval_correctness.py` and `benchmark_inference_latency.py`.

## Latest News

- 08/09/2024 ✨: We provide a more efficient BitNet implementation with vLLM, which requires special model checkpoints. To create the checkpoints and learn how to deploy them, please check out [Make Checkpoints for vLLM](#make-checkpoints-for-vllm).

## Make Checkpoints for vLLM

We provide two scripts to make the checkpoints for vLLM. The first, `generate_bitnet_model_native_format.sh`, creates a checkpoint with fp16 uncompressed metadata; the main difference from the original checkpoint is the `quant_config.json`, which allows vLLM to load the model and execute it with the quant extension.

```bash
# move to the integration directory
cd /root/to/BitBLAS/integration/BitNet
# make the checkpoint
./maint/generate_bitnet_model_native_format.sh
# the output ckpt will be saved in the `./models/ckpt_bitnet_b1_58-3B` directory
```

The second script, `generate_bitnet_model_bitblas_format.sh`, creates a checkpoint with BitBLAS compressed metadata, which avoids the online dequantization stage during vLLM profiling and leads to more efficient memory utilization.

```bash
./maint/generate_bitnet_model_bitblas_format.sh ./models/ckpt_bitnet_b1_58-3B ./models/ckpt_bitnet_b1_58-3B_bitblas
# the output ckpt will be saved in the `./models/ckpt_bitnet_b1_58-3B_bitblas` directory
```

Finally, you can use the checkpoints in vLLM with:

```bash
cd vllm_workspace
# inference with the ckpt with fp16 uncompressed metadata
python3 inference_with_native_format.py
# inference with the ckpt with BitBLAS compressed metadata
python3 inference_with_bitblas_format.py
```

## BitBLAS Results

### Performance

**Note:** To reproduce the BitBLAS results, please check out `benchmark_inference_latency.py`; a minimal invocation sketch follows the table below. To reproduce the results of the original model, please check out the [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) repo.

| Model | Device | batchsize | in_seq | arch | bitnet-1.58b-3b-huggingface | bitnet-1.58b-3b-bitblas |
|:---------------:|:------:|:---------:|:------:|:--------:|:---------------------------:|:-----------------------:|
| bitnet_b1_58-3B | A100 | 1 | 1 | LLAMA-3B | 177.6729107 | 64.17962909 |
| bitnet_b1_58-3B | A100 | 128 | 1 | LLAMA-3B | 188.6145592 | 63.48158518 |
| bitnet_b1_58-3B | A100 | 1 | 2048 | LLAMA-3B | 348.7066031 | 202.6877999 |
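
A possible invocation, assuming the script runs without extra arguments (check the file for its actual options):

```bash
# run from the BitNet integration directory
cd /root/to/BitBLAS/integration/BitNet
python3 benchmark_inference_latency.py
```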

### On-the-Fly GPU Memory Footprint

We measured the GPU memory footprint with the `nvidia-smi` command. Please check out `nvidia_measure_memory.sh` to record real-time GPU memory usage, and then start a `benchmark_model_10k_loops.py` workload to measure the overall GPU memory usage; a minimal sketch of this flow is shown below.
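
A minimal sketch of this measurement flow, assuming a single-GPU machine and that `benchmark_model_10k_loops.py` needs no extra arguments (the repository scripts may differ in detail):

```bash
# poll used GPU memory once per second in the background while the workload runs
nvidia-smi --query-gpu=memory.used --format=csv,noheader --id=0 -l 1 > memory_usage.log &
SMI_PID=$!
python3 benchmark_model_10k_loops.py
kill $SMI_PID
```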

| **Model** | **Device** | **batchsize** | **in_seq** | **bitnet-1.58b-3b-huggingface** | **bitnet-1.58b-3b-bitblas** |
|:---------------:|:----------:|:-------------:|:----------:|:-------------------------------:|:---------------------------:|
| bitnet_b1_58-3B | A100 | 1 | 1 | 7595 MB | 1729 MB |
| bitnet_b1_58-3B | A100 | 128 | 1 | 7677 MB | 1789 MB |
| bitnet_b1_58-3B | A100 | 1 | 2048 | 8731 MB | 3163 MB |

## PPL and Zero-shot Accuracy

The numbers are reported from [1bitLLM/bitnet_b1_58-3B](https://huggingface.co/1bitLLM/bitnet_b1_58-3B); please check out `eval_ppl.py` to reproduce them.

PPL and zero-shot accuracy:
| Models | PPL | ARCe | ARCc | HS | BQ | OQ | PQ | WGe | Avg |
|--------|-----|------|------|----|----|----|----|-----|-----|
| FP16 700M (reported) | 12.33 | 54.7 | 23.0 | 37.0 | 60.0 | 20.2 | 68.9 | 54.8 | 45.5 |
| BitNet b1.58 700M (reported) | 12.87 | 51.8 | 21.4 | 35.1 | 58.2 | 20.0 | 68.1 | 55.2 | 44.3 |
| BitNet b1.58 700M (reproduced) | 12.78 | 51.4 | 21.8 | 35.0 | 59.6 | 20.6 | 67.5 | 55.4 | 44.5 |
| FP16 1.3B (reported) | 11.25 | 56.9 | 23.5 | 38.5 | 59.1 | 21.6 | 70.0 | 53.9 | 46.2 |
| BitNet b1.58 1.3B (reported) | 11.29 | 54.9 | 24.2 | 37.7 | 56.7 | 19.6 | 68.8 | 55.8 | 45.4 |
| BitNet b1.58 1.3B (reproduced) | 11.19 | 55.8 | 23.7 | 37.6 | 59.0 | 20.2 | 69.2 | 56.0 | 45.9 |
| FP16 3B (reported) | 10.04 | 62.1 | 25.6 | 43.3 | 61.8 | 24.6 | 72.1 | 58.2 | 49.7 |
| BitNet b1.58 3B (reported) | 9.91 | 61.4 | 28.3 | 42.9 | 61.5 | 26.6 | 71.5 | 59.3 | 50.2 |
| BitNet b1.58 3B (reproduced) | 9.88 | 60.9 | 28.0 | 42.3 | 58.3 | 26.0 | 71.4 | 60.3 | 49.6 |

The differences between the reported numbers and the reproduced results likely stem from variance in training data processing, random seeds, or other random factors.

## Citations

```bibtex
@article{ma2024era,
title={The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits},
author={Ma, Shuming and Wang, Hongyu and Ma, Lingxiao and Wang, Lei and Wang, Wenhui and Huang, Shaohan and Dong, Li and Wang, Ruiping and Xue, Jilong and Wei, Furu},
journal={arXiv preprint arXiv:2402.17764},
year={2024}
}
```
@@ -19,6 +19,9 @@ git clone https://huggingface.co/1bitLLM/bitnet_b1_58-3B ckpt_bitnet_b1_58-3B --
# copy quantized config into the model directory
cp ../maint/quantize_config.json ckpt_bitnet_b1_58-3B

# copy README.md into the model directory
cp ../maint/README.md ckpt_bitnet_b1_58-3B

# get the realpath of the model directory
MODEL_DIR=$(realpath ckpt_bitnet_b1_58-3B)

37 changes: 37 additions & 0 deletions integration/BitNet/maint/upload_models.sh
@@ -0,0 +1,37 @@
#!/bin/bash

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
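
# Usage: ./upload_models.sh <local_model_dir> <remote_git_url>
# Example (the remote URL below is illustrative):
#   ./maint/upload_models.sh ./models/ckpt_bitnet_b1_58-3B_bitblas https://huggingface.co/<username>/bitnet_b1_58-3B_bitblas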

MODEL_DIR=$1
REMOTE_DIR=$2

if [ ! -d "$MODEL_DIR" ]; then
echo "Model directory does not exist!"
exit 1
fi

cd "$MODEL_DIR"
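# remove any existing git history so the checkpoint can be pushed as a fresh repository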
if [ ! -d ".git" ]; then
rm -rf .git
fi

git init

git checkout -b main

git lfs install

git lfs track "*.bin"

git lfs track "*.safetensors"

git add .

git commit -m "Initial commit"

git remote add origin "$REMOTE_DIR"

git fetch origin

git push -f --set-upstream origin main
