This section benchmarks BitBLAS against vendor and community libraries (cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, and Marlin) across matrix operation types (GEMM, GEMV) and data formats (float16xfloat16, int8xint8, float16xint4/nf4). The benchmarks were run on NVIDIA GPUs, a 24GB RTX 3090 and an 80GB A100, with CUDA 12.1 installed.
- Operations: GEMM (General Matrix Multiply) and GEMV (General Matrix-Vector Multiply)
- Data formats: float16xfloat16, int8xint8, float16xint4/nf4
- Devices: NVIDIA RTX 3090 (24GB) and NVIDIA A100 (80GB)
- CUDA 12.1
- Compared libraries: cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, Marlin
- Library versions (commit IDs):
  - bitsandbytes == 0.43.0
  - vLLM: 865732342b4e3b8a4ef38f28a2a5bdb87cf3f970
  - FasterTransformer: 1afbf20129647a35d108152fc6789bc1d029cda5
  - TensorRT-LLM: 2bf3a0a4287069ac55ee3304c285b08592d3d1bc
  - CUTLASS: 629f4653c3ea3db3264030382956fabe715f3436
  - Marlin: 512f1b1ba39ff708bcc95419f11cfd1285cd31b3
Key results:
- Float16 and int8 GEMM with Tensor Cores: BitBLAS matches the performance of cuBLAS and CUTLASS.
- Float16xnf4 GEMV and GEMM: BitBLAS runs up to 2x faster than bitsandbytes and 4x faster than the base float16 implementation.
- Float16xint4 GEMM: BitBLAS delivers the best performance among the compared libraries.
- Int4 dequantize performance: BitBLAS outperforms bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, and Marlin.
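To make the float16xint4 numbers concrete, the sketch below shows the *reference semantics* of a weight-only int4 dequantize GEMV in NumPy: two signed 4-bit weights are packed per byte, unpacked, scaled per output channel, and multiplied against a float16 activation vector. This is an illustration of what the benchmarked kernels compute, not BitBLAS's actual implementation; the packing layout (low nibble first) and per-channel scales are assumptions for the example.

```python
import numpy as np

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack two signed 4-bit values from each uint8 (low nibble first)."""
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend 4-bit values to int8 (range [-8, 7]).
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def gemv_fp16_int4(a, packed_w, scales):
    """y = a @ dequant(W)^T with per-output-channel scales.
    a: (K,) float16, packed_w: (N, K//2) uint8, scales: (N,) float16."""
    w = unpack_int4(packed_w).astype(np.float16) * scales[:, None]
    # Accumulate in float32, as Tensor Core kernels typically do.
    return (a.astype(np.float32) @ w.T.astype(np.float32)).astype(np.float16)

# Tiny demo shape; the real configs below use e.g. N=16384, K=16384.
rng = np.random.default_rng(0)
K, N = 8, 4
a = rng.standard_normal(K).astype(np.float16)
w_int4 = rng.integers(-8, 8, size=(N, K), dtype=np.int8)
packed = ((w_int4[:, 1::2].astype(np.uint8) & 0xF) << 4) | (w_int4[:, 0::2].astype(np.uint8) & 0xF)
scales = np.full(N, 0.1, dtype=np.float16)
y = gemv_fp16_int4(a, packed, scales)
print(y.shape)
```

The speedup of such kernels comes from reading 4x less weight memory than float16 and dequantizing in registers, which matters most in the memory-bound M=1 (GEMV) configs below.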
The benchmark configurations for each test scenario are detailed below:
| Config | Provider | M | N | K |
|---|---|---|---|---|
| V0 | None | 1 | 16384 | 16384 |
| V1 | BLOOM | 1 | 43008 | 14336 |
| V2 | BLOOM | 1 | 14336 | 14336 |
| V3 | BLOOM | 1 | 57344 | 14336 |
| V4 | BLOOM | 1 | 14336 | 57344 |
| V5 | OPT | 1 | 9216 | 9216 |
| V6 | OPT | 1 | 36864 | 9216 |
| V7 | OPT | 1 | 9216 | 36864 |
| V8 | LLAMA | 1 | 22016 | 8192 |
| V9 | LLAMA | 1 | 8192 | 22016 |
| V10 | LLAMA-2 | 1 | 8192 | 8192 |
| V11 | LLAMA-2 | 1 | 28672 | 8192 |
| V12 | LLAMA-2 | 1 | 8192 | 28672 |
| M0 | None | 16384 | 16384 | 16384 |
| M1 | BLOOM | 8192 | 43008 | 14336 |
| M2 | BLOOM | 8192 | 14336 | 14336 |
| M3 | BLOOM | 8192 | 57344 | 14336 |
| M4 | BLOOM | 8192 | 14336 | 57344 |
| M5 | OPT | 8192 | 9216 | 9216 |
| M6 | OPT | 8192 | 36864 | 9216 |
| M7 | OPT | 8192 | 9216 | 36864 |
| M8 | LLAMA | 8192 | 22016 | 8192 |
| M9 | LLAMA | 8192 | 8192 | 22016 |
| M10 | LLAMA-2 | 8192 | 8192 | 8192 |
| M11 | LLAMA-2 | 8192 | 28672 | 8192 |
| M12 | LLAMA-2 | 8192 | 8192 | 28672 |
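V-prefixed configs have M=1 (GEMV shapes); M-prefixed configs are large GEMMs. A common way to compare libraries across such shapes is effective TFLOPS, derived from a measured latency via the standard 2*M*N*K FLOP count for matrix multiplication. A minimal sketch of that conversion (the 50 ms latency used in the demo is a made-up placeholder, not a measured result):

```python
def gemm_tflops(m: int, n: int, k: int, latency_s: float) -> float:
    """Effective throughput of one (M, N, K) GEMM call.

    A GEMM performs M*N*K multiply-accumulates, i.e. 2*M*N*K FLOPs.
    """
    return 2 * m * n * k / latency_s / 1e12

# Hypothetical example: config M0 (16384^3) measured at 50 ms.
print(round(gemm_tflops(16384, 16384, 16384, 0.050), 1))  # -> 175.9
```

For the V-prefixed (M=1) shapes this metric is less informative, since GEMV is memory-bandwidth-bound; there, achieved GB/s of weight traffic is the more telling number.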
Note: To reproduce the third-party frameworks' benchmark results, please refer to mlc-benchmark.
Benchmark figures (plots omitted):
- 3090 benchmark numbers
- A100 benchmark results, including:
  - BitNet 1.58B INT8xINT2 Matmul BS scaling on A100
  - INT8xUINT1 Matmul BS scaling on A100
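The INT8 x low-bit matmuls referenced above operate on sub-byte weights, which must be bit-packed in memory. As an illustration of the INT8xINT2 case, the sketch below packs four signed 2-bit values per byte (BitNet 1.58b weights are the ternary subset {-1, 0, 1}) and checks an int8-activation matmul against the unpacked reference. The packing order (lowest bit pair first) is an assumption for the example, not BitBLAS's actual layout.

```python
import numpy as np

def pack_int2(w: np.ndarray) -> np.ndarray:
    """Pack four signed 2-bit values (range [-2, 1]) per uint8."""
    u = (w & 0x3).astype(np.uint8).reshape(*w.shape[:-1], -1, 4)
    return u[..., 0] | (u[..., 1] << 2) | (u[..., 2] << 4) | (u[..., 3] << 6)

def unpack_int2(p: np.ndarray) -> np.ndarray:
    """Inverse of pack_int2: recover int8 values from packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    u = ((p[..., None] >> shifts) & 0x3).astype(np.int8)
    u = np.where(u >= 2, u - 4, u)  # sign-extend 2-bit -> int8
    return u.reshape(*p.shape[:-1], -1)

rng = np.random.default_rng(0)
N, K = 4, 8
w = rng.integers(-1, 2, size=(N, K), dtype=np.int8)  # ternary weights
a = rng.integers(-128, 128, size=K, dtype=np.int8)   # int8 activations
# Reference matmul with int32 accumulation over the unpacked weights.
y = a.astype(np.int32) @ unpack_int2(pack_int2(w)).astype(np.int32).T
print(y.shape)
```

Real kernels avoid the explicit unpack by dequantizing or sign-extending the 2-bit lanes in registers; the 4x (INT2) or 8x (UINT1) reduction in weight traffic is what drives the batch-size scaling shown in the plots.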