This section benchmarks BitBLAS against vendor and community libraries (cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, and Marlin) across matrix operation types (GEMM, GEMV) and data formats (float16xfloat16, int8xint8, float16xint4/nf4). The benchmarks were run on NVIDIA GPUs, a 24GB RTX 3090 and an 80GB A100, with CUDA 12.1 installed.
- Operations: GEMM (General Matrix Multiply) and GEMV (General Matrix-Vector Multiply)
- Data formats: float16xfloat16, int8xint8, float16xint4/nf4
- Devices: NVIDIA RTX 3090 (24GB) and NVIDIA A100 (80GB)
- CUDA 12.1
- Compared libraries: cuBLAS, CUTLASS, bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, Marlin
- Library versions (commit IDs):
  - bitsandbytes == 0.43.0
  - vLLM: 865732342b4e3b8a4ef38f28a2a5bdb87cf3f970
  - FasterTransformer: 1afbf20129647a35d108152fc6789bc1d029cda5
  - TensorRT-LLM: 2bf3a0a4287069ac55ee3304c285b08592d3d1bc
  - CUTLASS: 629f4653c3ea3db3264030382956fabe715f3436
  - Marlin: 512f1b1ba39ff708bcc95419f11cfd1285cd31b3
Key results:
- Float16 and int8 GEMM with Tensor Cores: BitBLAS matches the performance of cuBLAS and CUTLASS.
- Float16xnf4 GEMV and GEMM: BitBLAS runs up to 2x faster than bitsandbytes and 4x faster than the base float16 implementation.
- Float16xint4 GEMM: BitBLAS delivers the best performance among the compared libraries.
- Int4 dequantize performance: BitBLAS outperforms bitsandbytes, FasterTransformer, TensorRT-LLM, vLLM, and Marlin.
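To make the float16xint4 numbers concrete, the sketch below shows the *reference semantics* of a weight-only int4 dequantize GEMV in NumPy: two signed 4-bit weights are packed per byte, unpacked, scaled per output channel, and multiplied against a float16 activation vector. This is an illustration of what the benchmarked kernels compute, not BitBLAS's actual implementation; the packing layout (low nibble first) and per-channel scales are assumptions for the example.

```python
import numpy as np

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Unpack two signed 4-bit values from each uint8 (low nibble first)."""
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend 4-bit values to int8 (range [-8, 7]).
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

def gemv_fp16_int4(a, packed_w, scales):
    """y = a @ dequant(W)^T with per-output-channel scales.
    a: (K,) float16, packed_w: (N, K//2) uint8, scales: (N,) float16."""
    w = unpack_int4(packed_w).astype(np.float16) * scales[:, None]
    # Accumulate in float32, as Tensor Core kernels typically do.
    return (a.astype(np.float32) @ w.T.astype(np.float32)).astype(np.float16)

# Tiny demo shape; the real configs below use e.g. N=16384, K=16384.
rng = np.random.default_rng(0)
K, N = 8, 4
a = rng.standard_normal(K).astype(np.float16)
w_int4 = rng.integers(-8, 8, size=(N, K), dtype=np.int8)
packed = ((w_int4[:, 1::2].astype(np.uint8) & 0xF) << 4) | (w_int4[:, 0::2].astype(np.uint8) & 0xF)
scales = np.full(N, 0.1, dtype=np.float16)
y = gemv_fp16_int4(a, packed, scales)
print(y.shape)
```

The speedup of such kernels comes from reading 4x less weight memory than float16 and dequantizing in registers, which matters most in the memory-bound M=1 (GEMV) configs below.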
The benchmark configurations for each test scenario are detailed below:
| Config | Provider | M | N | K |
|---|---|---|---|---|
| V0 | None | 1 | 16384 | 16384 |
| V1 | BLOOM | 1 | 43008 | 14336 |
| V2 | BLOOM | 1 | 14336 | 14336 |
| V3 | BLOOM | 1 | 57344 | 14336 |
| V4 | BLOOM | 1 | 14336 | 57344 |
| V5 | OPT | 1 | 9216 | 9216 |
| V6 | OPT | 1 | 36864 | 9216 |
| V7 | OPT | 1 | 9216 | 36864 |
| V8 | LLAMA | 1 | 22016 | 8192 |
| V9 | LLAMA | 1 | 8192 | 22016 |
| V10 | LLAMA-2 | 1 | 8192 | 8192 |
| V11 | LLAMA-2 | 1 | 28672 | 8192 |
| V12 | LLAMA-2 | 1 | 8192 | 28672 |
| M0 | None | 16384 | 16384 | 16384 |
| M1 | BLOOM | 8192 | 43008 | 14336 |
| M2 | BLOOM | 8192 | 14336 | 14336 |
| M3 | BLOOM | 8192 | 57344 | 14336 |
| M4 | BLOOM | 8192 | 14336 | 57344 |
| M5 | OPT | 8192 | 9216 | 9216 |
| M6 | OPT | 8192 | 36864 | 9216 |
| M7 | OPT | 8192 | 9216 | 36864 |
| M8 | LLAMA | 8192 | 22016 | 8192 |
| M9 | LLAMA | 8192 | 8192 | 22016 |
| M10 | LLAMA-2 | 8192 | 8192 | 8192 |
| M11 | LLAMA-2 | 8192 | 28672 | 8192 |
| M12 | LLAMA-2 | 8192 | 8192 | 28672 |
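V-prefixed configs have M=1 (GEMV shapes); M-prefixed configs are large GEMMs. A common way to compare libraries across such shapes is effective TFLOPS, derived from a measured latency via the standard 2*M*N*K FLOP count for matrix multiplication. A minimal sketch of that conversion (the 50 ms latency used in the demo is a made-up placeholder, not a measured result):

```python
def gemm_tflops(m: int, n: int, k: int, latency_s: float) -> float:
    """Effective throughput of one (M, N, K) GEMM call.

    A GEMM performs M*N*K multiply-accumulates, i.e. 2*M*N*K FLOPs.
    """
    return 2 * m * n * k / latency_s / 1e12

# Hypothetical example: config M0 (16384^3) measured at 50 ms.
print(round(gemm_tflops(16384, 16384, 16384, 0.050), 1))  # -> 175.9
```

For the V-prefixed (M=1) shapes this metric is less informative, since GEMV is memory-bandwidth-bound; there, achieved GB/s of weight traffic is the more telling number.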
Note: To reproduce the third-party frameworks' benchmark results, please refer to mlc-benchmark.
Benchmark figures (plots omitted):
- 3090 benchmark numbers
- A100 benchmark results, including:
  - BitNet 1.58B INT8xINT2 Matmul BS scaling on A100
  - INT8xUINT1 Matmul BS scaling on A100
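The INT8 x low-bit matmuls referenced above operate on sub-byte weights, which must be bit-packed in memory. As an illustration of the INT8xINT2 case, the sketch below packs four signed 2-bit values per byte (BitNet 1.58b weights are the ternary subset {-1, 0, 1}) and checks an int8-activation matmul against the unpacked reference. The packing order (lowest bit pair first) is an assumption for the example, not BitBLAS's actual layout.

```python
import numpy as np

def pack_int2(w: np.ndarray) -> np.ndarray:
    """Pack four signed 2-bit values (range [-2, 1]) per uint8."""
    u = (w & 0x3).astype(np.uint8).reshape(*w.shape[:-1], -1, 4)
    return u[..., 0] | (u[..., 1] << 2) | (u[..., 2] << 4) | (u[..., 3] << 6)

def unpack_int2(p: np.ndarray) -> np.ndarray:
    """Inverse of pack_int2: recover int8 values from packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    u = ((p[..., None] >> shifts) & 0x3).astype(np.int8)
    u = np.where(u >= 2, u - 4, u)  # sign-extend 2-bit -> int8
    return u.reshape(*p.shape[:-1], -1)

rng = np.random.default_rng(0)
N, K = 4, 8
w = rng.integers(-1, 2, size=(N, K), dtype=np.int8)  # ternary weights
a = rng.integers(-128, 128, size=K, dtype=np.int8)   # int8 activations
# Reference matmul with int32 accumulation over the unpacked weights.
y = a.astype(np.int32) @ unpack_int2(pack_int2(w)).astype(np.int32).T
print(y.shape)
```

Real kernels avoid the explicit unpack by dequantizing or sign-extending the 2-bit lanes in registers; the 4x (INT2) or 8x (UINT1) reduction in weight traffic is what drives the batch-size scaling shown in the plots.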