microsoft · LeiWang1999 · Apr 16, 2024
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@ Some of the key features of BitBLAS include:
     - $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication including FP16xINT4/2/1, INT8xINT4/2/1, etc. Please checkout [support matrix](#support-matrix) for detailed data types support.
     - Matrix multiplication like FP16xFP16 and INT8xINT8.
   - Auto-Tensorization for TensorCore-like hardware instructions.
-  - Implemented [integration](./integration/) to [PyTorch](https://pytorch.org/), [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) and [vLLM](https://github.com/vllm-project/vllm) for LLM deployment. Please checkout [benchmark summary](#benchmark-summary) for detailed end2end LLM inference performance.
+  - Implemented [integration](/integration/) with [PyTorch](https://pytorch.org/), [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ) and [vLLM](https://github.com/vllm-project/vllm) for LLM deployment. Please check out the [benchmark summary](#benchmark-summary) for detailed end-to-end LLM inference performance.
   - BitBLAS first implemented $W_{INT2}A_{INT8}$ GEMV/GEMM in [BitNet-b1.58](https://arxiv.org/abs/2402.17764) with 8x/2x speedup over cuBLAS $W_{FP16}A_{FP16}$ on A100, please checkout [op_benchmark_a100_int2_scaling](images/figures/op_benchmark_a100_int2_scaling.png) for detailed benchmark results.
   - Support customizing mixed-precision DNN operations for your specific scenarios via the flexible DSL (TIR Script).
 
@@ -68,16 +68,16 @@ We are continuously expanding the support matrix. If you have any specific requi
 
 ## Getting Started
 
-- [Installation](./docs/Installation.md):
-  To install BitBLAS, please checkout the document [installation](./docs/Installation.md). Also Make sure you already have the cuda toolkit (version >= 11) installed in the system. Or you can easily install from `pip install bitblas` in the root directory. 
+- [Installation](/docs/Installation.md):
+  To install BitBLAS, please check out the [installation](/docs/Installation.md) document. Also make sure the CUDA toolkit (version >= 11) is already installed on the system. Alternatively, you can install with `pip install bitblas` from the root directory.
 
-- [QuickStart](./docs/QuickStart.md): BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication:
+- [QuickStart](/docs/QuickStart.md): BitBLAS provides two Python APIs to perform mixed-precision matrix multiplication:
   - ```bitblas.Matmul``` implements the $W_{wdtype}A_{adtype}$ mixed-precision matrix multiplication of $C_{cdtype}[M, N] = A_{adtype}[M, K] \times W_{wdtype}[N, K]$.
   - ```bitblas.Linear``` is a PyTorch ```nn.Linear```-like module to support a Linear of mixed-precision.
 
-- [Integration](./integration/): Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.
+- [Integration](/integration/): Explore how BitBLAS seamlessly integrates with LLM deployment frameworks through our examples. Discover the ease of integrating BitBLAS with PyTorch, AutoGPTQ, and vLLM in the 3rd-party integration examples.
 
-- [Customization](./docs/ExtendOperatorsWithDSL.md): BitBLAS supports implementing customized mixed-precision DNN operations rather than matrix multiplication with the flexible DSL (TIR Script).
+- [Customization](/docs/ExtendOperatorsWithDSL.md): BitBLAS supports implementing customized mixed-precision DNN operations other than matrix multiplication with the flexible DSL (TIR Script).
 
 ## Contributing
 

diff --git a/benchmark/README.md b/benchmark/README.md
@@ -80,9 +80,13 @@ The benchmark configurations for each test scenario are detailed below:
 
 ## Benchmark Images
 
-INT8xINT1 Matmul BS Scaling on A100.
+BitNet-b1.58 INT8xINT2 Matmul BS Scaling on A100.
 
-![int8xint1_scaling](../images/figures/op_benchmark_a100_int1_scaling.png)
+![int8xint2_scaling](../images/figures/op_benchmark_a100_int2_scaling.png)
+
+INT8xUINT1 Matmul BS Scaling on A100.
+
+![int8xuint1_scaling](../images/figures/op_benchmark_a100_uint1_scaling.png)
 
 3090 Related benchmark numbers
 

diff --git a/docs/PythonAPI.md b/docs/PythonAPI.md
@@ -16,17 +16,19 @@
 - **K** *(int)*: The common dimension of matrices A and W.
 - **A_dtype** *(str, default='float16')*: The data type of matrix A.
     - Choices: `'float16'`, `'int8'`.
-- **W_dtype** *(str, default='float16')*: The data type of matrix W. Also acts as a wrapper for source_format and bit.
-    - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'fp4_e2m1'`, `'nf4'`.
+- **W_dtype** *(str, optional)*: Data type of the weights. Default: `'float16'`.
+    - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'uint4'`, `'uint2'`, `'uint1'`, `'fp4_e2m1'`, `'nf4'`.
+    - Value ranges of the INT formats:
+        - `'int4'`: [-8, 7]
+        - `'int2'`: [-2, 1]
+        - `'int1'`: [-1, 1]
 - **accum_dtype** *(str, default='float16')*: The data type used for accumulation during the matrix multiplication.
     - Choices: `'float16'`, `'int32'`.
 - **out_dtype** *(str, default='float16')*: The data type of the output matrix.
     - Choices: `'float32'`, `'float16'`, `'int8'`, `'int32'`.
 - **layout** *(Literal['nn', 'nt', 'tn', 'tt'], default='nt')*: The layout of the matrix multiplication operation. The matrix is stored in row-major.
     - `'nn'`: Both matrices are non-transposed.
     - `'nt'`: Matrix A is non-transposed, and matrix W is transposed.
-    - `'tn'`: Matrix A is transposed, and matrix W is non-transposed.
-    - `'tt'`: Both matrices are transposed.
 - **with_bias** *(bool, default=False)*: Indicates whether a bias vector is added to the output.
 - **group_size** *(int, default=-1)*: The group size for quantization, -1 indicates no grouping.
 - **with_scaling** *(bool, default=False)*: Indicates whether scaling is applied during quantization.
@@ -90,7 +92,11 @@ Applies a linear transformation to the incoming data: $out[M, N] = A[M, K] \time
 - **A_dtype** *(str, optional)*: Data type of the input tensor. Default: `'float16'`.
     - Choices: `'float16'`, `'int8'`.
 - **W_dtype** *(str, optional)*: Data type of the weights. Default: `'float16'`.
-    - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'fp4_e2m1'`, `'af4'`.
+    - Choices: `'float16'`, `'int8'`, `'int4'`, `'int2'`, `'int1'`, `'uint4'`, `'uint2'`, `'uint1'`, `'fp4_e2m1'`, `'nf4'`.
+    - Value ranges of the INT formats:
+        - `'int4'`: [-8, 7]
+        - `'int2'`: [-2, 1]
+        - `'int1'`: [-1, 1]
 - **accum_dtype** *(str, optional)*: Data type for accumulation. Default: `'float16'`.
     - Choices: `'float16'`, `'int32'`.
 - **out_dtype** *(str, optional)*: Data type of the output tensor. Default: `'float16'`.
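
The signed ranges listed for `'int4'` and `'int2'` match the usual two's-complement bounds of an n-bit integer. The following is a minimal sketch under that assumption. The helper names are hypothetical and not part of BitBLAS, the unsigned bounds are an assumption because the docs above do not list them, and `'int1'` is left out because its documented range [-1, 1] does not follow this formula.

```python
def signed_range(bits: int) -> tuple[int, int]:
    """Two's-complement bounds of a signed integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits: int) -> tuple[int, int]:
    """Bounds of an unsigned integer with `bits` bits (assumed for uintN)."""
    return 0, (1 << bits) - 1


if __name__ == "__main__":
    for bits in (4, 2):
        print(f"int{bits}: {signed_range(bits)}  uint{bits}: {unsigned_range(bits)}")
```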

diff --git a/images/figures/op_benchmark_a100_int2_scaling.png b/images/figures/op_benchmark_a100_int2_scaling.png
diff --git a/images/figures/op_benchmark_a100_uint1_scaling.png b/images/figures/op_benchmark_a100_uint1_scaling.png
diff --git a/python/bitblas/gpu/intrin/lop3.py b/python/bitblas/gpu/intrin/lop3.py
@@ -633,14 +633,14 @@
     static constexpr uint immLut = (0xf0 & 0xcc) | 0xaa; // 0b11101010
     static constexpr uint BOTTOM_MASK = 0x03030303;      // 0xf -> 0b11 select 0,3
     static constexpr uint I8s_MAGIC_NUM = 0x00000000;    // 1024
-    static constexpr uint MEDIAN_NUM = 0x01010101;
+    static constexpr uint MEDIAN_NUM = 0x02020202;
 #pragma unroll
     for (int i = 0; i < (N / 4); i++)
     {
         asm volatile("lop3.b32 %0, %1, %2, %3, %4;\\n"
                      : "=r"(i8s[i])
                      : "r"(i2b >> (2 * i)), "n"(BOTTOM_MASK), "n"(I8s_MAGIC_NUM), "n"(immLut));
-        i8s[i] = __vsubss4(i8s[i], MEDIAN_NUM);
+        i8s[i] = __vsub4(i8s[i], MEDIAN_NUM);
     }
 }
 template <typename T1, typename T2>
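
The effect of the `MEDIAN_NUM` change can be checked with a small host-side emulation of the loop above. This is a sketch, not the BitBLAS implementation: it assumes `i2b` is a 32-bit word holding sixteen 2-bit fields, models `lop3.b32` with this `immLut` as `(a & b) | c`, and emulates `__vsub4` as a per-byte wrapping subtraction. All Python function names are hypothetical.

```python
IMM_LUT = (0xF0 & 0xCC) | 0xAA   # 0xEA, selects (a & b) | c in lop3
BOTTOM_MASK = 0x03030303         # keeps the low 2 bits of each byte
I8S_MAGIC_NUM = 0x00000000       # c operand, so the OR is a no-op
MEDIAN_NUM = 0x02020202          # new offset: subtract 2 from every byte


def vsub4(a: int, b: int) -> int:
    """Per-byte wrapping subtraction of two packed 32-bit words."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        diff = (((a >> shift) & 0xFF) - ((b >> shift) & 0xFF)) & 0xFF
        out |= diff << shift
    return out


def to_int8_lanes(word: int) -> list[int]:
    """Split a 32-bit word into four signed 8-bit values, lowest byte first."""
    lanes = []
    for lane in range(4):
        byte = (word >> (8 * lane)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def decode_i2_to_i8(i2b: int, n: int = 16) -> list[int]:
    """Emulate the unrolled loop: mask out 2-bit fields, then subtract the offset."""
    out = []
    for i in range(n // 4):
        masked = ((i2b >> (2 * i)) & BOTTOM_MASK) | I8S_MAGIC_NUM
        out.extend(to_int8_lanes(vsub4(masked, MEDIAN_NUM)))
    return out


if __name__ == "__main__":
    print(decode_i2_to_i8(0x00000000))  # every field 0b00
    print(decode_i2_to_i8(0xFFFFFFFF))  # every field 0b11
```

Under these assumptions, stored fields 0 through 3 decode to -2 through 1, which agrees with the `'int2'` range [-2, 1] documented in `docs/PythonAPI.md`. With the previous offset of `0x01010101` the same fields would have decoded to -1 through 2.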