[Dev][TL] Add TL BaseScheduler and Library Generator #200

LeiWang1999 · 2024-09-29T07:01:46Z

This pull request introduces several changes to the bitblas project, focusing on enhancing the library generation and wrapper functionality, as well as updating the module handling and optimization strategies. The key changes include the addition of a new wrapper for TileLang, updates to the library generation process, and the introduction of a base scheduler. These changes aim to improve code maintainability, extend functionality, and ensure better performance.

Wrapper Enhancements:

Added a new TLWrapper class in bitblas/builder/wrapper/tl.py to support TileLang ([bitblas/builder/wrapper/tl.pyR1-R193](https://github.com/microsoft/BitBLAS/pull/200/files#diff-7a06aea7d0ad014e71fea5e2754bc701039d6257e0bf23ef7e420571e585a064R1-R193)).
Refactored TIRWrapper to use scheduled_ir_module instead of optimized_mod for better clarity and consistency ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L51-R37), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L193-R179), [[3]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L390-R386)).

Library Generation:

Updated compile_lib method in bitblas/builder/lib_generator/__init__.py to include an optional with_tl parameter, enabling support for TileLang ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2L29-R30), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2L48-R64)).
Added import alias for os.path as osp to simplify path handling ([bitblas/builder/lib_generator/__init__.pyR7](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2R7)).
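As a rough illustration of what an optional with_tl switch can control, the sketch below assembles an nvcc command line and adds the TileLang and CUTLASS include directories only when with_tl is set. The helper name build_nvcc_command, its parameters, and the directory arguments are hypothetical and are not taken from the bitblas sources; only the nvcc flags themselves are standard.

```python
from typing import List, Optional


def build_nvcc_command(
    src_path: str,
    out_path: str,
    arch: str,
    with_tl: bool = False,
    tl_include: Optional[str] = None,
    cutlass_include: Optional[str] = None,
) -> List[str]:
    """Assemble an nvcc invocation for a shared library (hypothetical helper)."""
    command = [
        "nvcc",
        "-std=c++17",
        "-Xcompiler", "-fPIC",   # position-independent host code for a .so
        "--shared",
        "-gencode", f"arch=compute_{arch},code=sm_{arch}",
    ]
    if with_tl:
        # Assumption: TileLang-generated sources need extra header directories,
        # as in the -I flags shown in the nvcc logs of this PR.
        if tl_include is None or cutlass_include is None:
            raise ValueError("with_tl=True requires tl_include and cutlass_include")
        command += [f"-I{tl_include}", f"-I{cutlass_include}"]
    command += [src_path, "-o", out_path]
    return command
```

Returning a list of arguments instead of a single string lets the caller hand it straight to subprocess.run without shell quoting issues.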

Module Handling:

Modified bitblas/cache/operator.py to use scheduled_ir_module instead of optimized_mod when saving operator configurations ([bitblas/cache/operator.pyL111-R112](https://github.com/microsoft/BitBLAS/pull/200/files#diff-f8a4e09cbf6dfcad69926fd793e0d0e61ce69b4732bfa515c16fded27079b3c5L111-R112)).
Updated bitblas/ops/general_matmul/__init__.py to include a scheduler selection method and refactored backend handling ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL51-L55), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL360-R356), [[3]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL384-R379), [[4]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edR575-R591)).

Optimization Strategies:

Introduced OptimizeStrategy, TransformKind, and BackendKind enums in bitblas/ops/common.py to standardize optimization strategies and transformations ([bitblas/ops/common.pyR1-R21](https://github.com/microsoft/BitBLAS/pull/200/files#diff-b4984b795537d5afdee6f0d9040991fbf45e129353e701f8e028b2085e7345a6R1-R21)).
Updated references to TransformKind in bitblas/gpu/matmul_mma.py and bitblas/gpu/matmul_mma_dequantize.py to use the new common module ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-22e295329a3b938cfdeeb390d41cf0e088c65553911a3d0a2b13d6947b1a5894L11-R11), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-4a893772b7972bd9794f2e7f8a6702f11a047da82505d431b7ad9d14fd9d98fdL12-R12)).
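For readers unfamiliar with this pattern, a shared enum module might look like the sketch below. The three class names come from this PR; the member names and integer values are assumptions made for illustration and may not match bitblas/ops/common.py.

```python
from enum import IntEnum


class OptimizeStrategy(IntEnum):
    # Member names are illustrative assumptions.
    SingleBatchDecodeOnly = 0
    ContiguousBatching = 1


class TransformKind(IntEnum):
    # Ordered so that a larger value means a more aggressive layout transform (assumption).
    NonTransform = 0
    InterWarpTransform = 1
    IntraWarpTransform = 2
    LDMatrixTransform = 3


class BackendKind(IntEnum):
    TIR = 0
    TileLang = 1


def needs_layout_transform(kind: TransformKind) -> bool:
    """Hypothetical helper: anything above NonTransform rewrites the layout."""
    return kind > TransformKind.NonTransform
```

Keeping the enums in one module means that files such as matmul_mma.py and matmul_mma_dequantize.py import the same definitions instead of each carrying their own copy.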

Base Scheduler:

Added a new BaseScheduler class in bitblas/ops/base_scheduler.py to provide a simplified interface for scheduling transformations ([bitblas/ops/base_scheduler.pyR1-R46](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1ce8a161a5b58d85057d647e262724e64934c352b3199a7d085c40c1d70c296dR1-R46)).
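One way such a base class can be laid out is sketched below, without any TVM dependency. Only the name BaseScheduler comes from this PR; the method names with_default_config and apply_config, the simplify_result decorator, and the toy subclass are hypothetical and stand in for whatever the real class exposes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable


def simplify_result(post_process: Callable[[Any], Any]):
    """Decorator factory: run post_process on whatever the wrapped method returns."""
    def decorator(method):
        @wraps(method)
        def wrapper(self, *args, **kwargs):
            return post_process(method(self, *args, **kwargs))
        return wrapper
    return decorator


@dataclass
class BaseScheduler(ABC):
    """Hypothetical base: subclasses describe how to build a kernel from a config."""

    @abstractmethod
    def with_default_config(self) -> Any:
        ...

    @abstractmethod
    def apply_config(self, **config: Any) -> Any:
        ...


@dataclass
class ToyMatmulScheduler(BaseScheduler):
    M: int
    N: int
    K: int

    def with_default_config(self) -> dict:
        return self.apply_config(block_m=64, block_n=64)

    # Post-processing step: drop unset entries (stands in for an IR simplification pass).
    @simplify_result(lambda d: {k: v for k, v in d.items() if v is not None})
    def apply_config(self, block_m=None, block_n=None, block_k=None) -> dict:
        return {
            "shape": (self.M, self.N, self.K),
            "block_m": block_m,
            "block_n": block_n,
            "block_k": block_k,
        }
```

The decorator keeps the post-processing step out of each subclass body, so every scheduler gets the same cleanup applied to its output.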

LeiWang1999 · 2024-09-29T07:07:04Z

Some Notes:

The TL and TVM paths use a CUBIN or fatbin as the bridge between Python and the source code, which lets nvcc set the global macro __CUDA_ARCH__ itself. That step is not available when building a shared library or an executable.

For example:

nvcc -std=c++17 -Xcompiler="-D__CUDA_ARCH__=890" -I/root/BitBLAS/3rdparty/tvm/src/tl -I/root/BitBLAS/bitblas/../3rdparty/cutlass/include -gencode arch=compute_89,code=sm_89 -v /tmp/tmpxq8vr711.cu

# the output is
#$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__  -I"/root/BitBLAS/3rdparty/tvm/src/tl" -I"/root/BitBLAS/bitblas/../3rdparty/cutlass/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=4 -D__CUDACC_VER_BUILD__=131 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=4 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/tmpxq8vr711.cu" -o "/tmp/tmpxft_00147295_00000000-5_tmpxq8vr711.cpp4.ii"

nvcc -ptx -std=c++17 -I/root/BitBLAS/3rdparty/tvm/src/tl -I/root/BitBLAS/bitblas/../3rdparty/cutlass/include -gencode arch=compute_89,code=sm_89 -v /tmp/tmpxq8vr711.cu
# the output is
#$ gcc -std=c++17 -D__CUDA_ARCH__=890 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  -I"/root/BitBLAS/3rdparty/tvm/src/tl" -I"/root/BitBLAS/bitblas/../3rdparty/cutlass/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=4 -D__CUDACC_VER_BUILD__=131 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=4 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/tmpxq8vr711.cu" -o "/tmp/tmpxft_001476d8_00000000-7_tmpxq8vr711.cpp1.ii"
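The value 890 in the logs above follows from the target architecture: nvcc defines __CUDA_ARCH__ only during device compilation passes, as the compute capability major number times 100 plus the minor number times 10. A minimal sketch of that mapping is given below; the helper names cuda_arch_macro and gencode_flag are hypothetical and not part of bitblas or nvcc.

```python
import re


def cuda_arch_macro(arch: str) -> int:
    """Map an arch name such as 'sm_89' or 'compute_89' to its __CUDA_ARCH__ value."""
    match = re.fullmatch(r"(?:sm|compute)_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized arch name: {arch!r}")
    # '89' means major 8, minor 9, and 8 * 100 + 9 * 10 == 89 * 10.
    return int(match.group(1)) * 10


def gencode_flag(arch_digits: str) -> str:
    """Build the -gencode value used in the commands above, e.g. for '89'."""
    return f"arch=compute_{arch_digits},code=sm_{arch_digits}"
```

For instance, cuda_arch_macro("sm_89") gives 890, matching the -D__CUDA_ARCH__=890 that appears in the -ptx run.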

LeiWang1999 added 7 commits September 28, 2024 07:43

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f3b1eb9

remove un-implemented code.

730d13e

Implement BaseScheduler to wrap some related items.

8047ee7

lint fix

64db065

test skip

cef04a8

Refactor tilelang dequantize module and add matmul_blocked_weight_only function

f1652e9

Merge branch 'main' of https://github.com/microsoft/BitBLAS into tl_ops_dynamic

4f6c545

test fix

c485b68

LeiWang1999 merged commit 69350cb into microsoft:main Sep 29, 2024
5 of 6 checks passed
