[Dev][TL] Add TL BaseScheduler and Library Generator #200

Merged
merged 8 commits into from
Sep 29, 2024

LeiWang1999
Contributor

This pull request makes several changes to BitBLAS, focusing on library generation and wrapper functionality, along with updates to module handling and optimization strategies. The key changes are a new wrapper for TileLang, updates to the library generation process, and a new base scheduler. They aim to improve code maintainability, extend functionality, and improve performance.

Wrapper Enhancements:

  • Added a new TLWrapper class in bitblas/builder/wrapper/tl.py to support TileLang ([bitblas/builder/wrapper/tl.pyR1-R193](https://github.com/microsoft/BitBLAS/pull/200/files#diff-7a06aea7d0ad014e71fea5e2754bc701039d6257e0bf23ef7e420571e585a064R1-R193)).
  • Refactored TIRWrapper to use scheduled_ir_module instead of optimized_mod for better clarity and consistency ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L51-R37), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L193-R179), [[3]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-c10e48719ba0d10a0fd37550438b6d70f459a7b8fab7db66a460bbf5e5960fd5L390-R386)).

Library Generation:

  • Updated compile_lib method in bitblas/builder/lib_generator/__init__.py to include an optional with_tl parameter, enabling support for TileLang ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2L29-R30), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2L48-R64)).
  • Added import alias for os.path as osp to simplify path handling ([bitblas/builder/lib_generator/__init__.pyR7](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1cb4cef867511b586a5170e07b0b282d566ff7726909dab599c31c481db456c2R7)).
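
As a rough illustration of how an optional with_tl flag could influence compilation, the sketch below assembles an nvcc command line and adds the TileLang and CUTLASS include paths only when the flag is set. The function name build_nvcc_command, its parameters, and the assumed 3rdparty/ directory layout are hypothetical and are not taken from the BitBLAS source; only the os.path alias and the with_tl name come from the description above.

```python
import os.path as osp


def build_nvcc_command(src_path, out_path, arch, with_tl=False, project_root="."):
    """Hypothetical helper: return an nvcc argument list for a shared library."""
    command = [
        "nvcc",
        "-std=c++17",
        "-Xcompiler", "-fPIC",
        "--shared",
        src_path,
        "-gencode", f"arch=compute_{arch},code=sm_{arch}",
        "-o", out_path,
    ]
    if with_tl:
        # Assumed layout: TileLang templates and CUTLASS headers live under 3rdparty/.
        tl_include = osp.join(project_root, "3rdparty", "tvm", "src", "tl")
        cutlass_include = osp.join(project_root, "3rdparty", "cutlass", "include")
        command += [f"-I{tl_include}", f"-I{cutlass_include}"]
    return command


if __name__ == "__main__":
    print(" ".join(build_nvcc_command("kernel.cu", "kernel.so", "89", with_tl=True)))
```

Keeping the flag optional with a default of False means existing callers that only build TIR kernels would not need to change.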

Module Handling:

  • Modified bitblas/cache/operator.py to use scheduled_ir_module instead of optimized_mod when saving operator configurations ([bitblas/cache/operator.pyL111-R112](https://github.com/microsoft/BitBLAS/pull/200/files#diff-f8a4e09cbf6dfcad69926fd793e0d0e61ce69b4732bfa515c16fded27079b3c5L111-R112)).
  • Updated bitblas/ops/general_matmul/__init__.py to include a scheduler selection method and refactored backend handling ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL51-L55), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL360-R356), [[3]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edL384-R379), [[4]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-74fe5dd2824cb03a0fb2b0a913a2fc5caeb9c08e5368c318cd32b3af7e6f52edR575-R591)).

Optimization Strategies:

  • Introduced OptimizeStrategy, TransformKind, and BackendKind enums in bitblas/ops/common.py to standardize optimization strategies and transformations ([bitblas/ops/common.pyR1-R21](https://github.com/microsoft/BitBLAS/pull/200/files#diff-b4984b795537d5afdee6f0d9040991fbf45e129353e701f8e028b2085e7345a6R1-R21)).
  • Updated references to TransformKind in bitblas/gpu/matmul_mma.py and bitblas/gpu/matmul_mma_dequantize.py to use the new common module ([[1]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-22e295329a3b938cfdeeb390d41cf0e088c65553911a3d0a2b13d6947b1a5894L11-R11), [[2]](https://github.com/microsoft/BitBLAS/pull/200/files#diff-4a893772b7972bd9794f2e7f8a6702f11a047da82505d431b7ad9d14fd9d98fdL12-R12)).
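
A minimal sketch of what such a shared enum module might look like follows. The three class names come from the description above; the member names and values are assumptions chosen for illustration and may not match the actual definitions in bitblas/ops/common.py.

```python
from enum import IntEnum


class OptimizeStrategy(IntEnum):
    # Assumed members, for illustration only.
    SingleBatchDecodeOnly = 0
    ContiguousBatching = 1


class TransformKind(IntEnum):
    # Assumed members, for illustration only.
    NonTransform = 0
    InterWarpTransform = 1
    IntraWarpTransform = 2
    LDMatrixTransform = 3


class BackendKind(IntEnum):
    # Assumed members, for illustration only.
    TIR = 0
    TileLang = 1


def normalize_transform_kind(kind):
    """Accept either a plain int or a TransformKind and return a TransformKind."""
    return kind if isinstance(kind, TransformKind) else TransformKind(kind)


if __name__ == "__main__":
    print(normalize_transform_kind(2).name)
```

Defining the enums in one module lets files such as matmul_mma.py and matmul_mma_dequantize.py import the same type instead of each carrying its own copy.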

Base Scheduler:

  • Added a new BaseScheduler class in bitblas/ops/base_scheduler.py to provide a simplified interface for scheduling transformations ([bitblas/ops/base_scheduler.pyR1-R46](https://github.com/microsoft/BitBLAS/pull/200/files#diff-1ce8a161a5b58d85057d647e262724e64934c352b3199a7d085c40c1d70c296dR1-R46)).
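
To make the idea of a base scheduler concrete, here is a small sketch under stated assumptions. Only the class name BaseScheduler comes from the description above; the methods default_config, apply_config, and with_default_config, the config field, and the ToyMatmulScheduler subclass are hypothetical and do not reproduce the contents of bitblas/ops/base_scheduler.py.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class BaseScheduler(ABC):
    """Hypothetical base class: subclasses turn a tuning config into a schedule."""

    config: dict = field(default_factory=dict)

    @abstractmethod
    def default_config(self) -> dict:
        """Return the tuning parameters used when none are supplied."""

    @abstractmethod
    def apply_config(self, **config):
        """Build a schedule from explicit tuning parameters."""

    def with_default_config(self):
        # Convenience path: schedule with the subclass's default parameters.
        return self.apply_config(**self.default_config())


@dataclass
class ToyMatmulScheduler(BaseScheduler):
    """Toy subclass used only to exercise the interface."""

    M: int = 16
    N: int = 16
    K: int = 16

    def default_config(self) -> dict:
        return {"block_M": 64, "block_N": 64, "block_K": 32}

    def apply_config(self, **config):
        # A real scheduler would emit a TileLang or TIR function here;
        # this toy version just returns a description of the schedule.
        return {"shape": (self.M, self.N, self.K), **config}


if __name__ == "__main__":
    print(ToyMatmulScheduler(M=128, N=256, K=64).with_default_config())
```

The base class fixes the entry points, so code that selects a scheduler can call with_default_config or apply_config without knowing which operator it is scheduling.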

@LeiWang1999
Contributor Author

Some Notes:

  • The TL and TVM code paths use a cubin or fatbin as the bridge between Python and the generated source code. In that flow nvcc sets the global macro __CUDA_ARCH__, but the same mechanism is not available when building a shared library or an executable.

For example:

nvcc -std=c++17 -Xcompiler="-D__CUDA_ARCH__=890" -I/root/BitBLAS/3rdparty/tvm/src/tl -I/root/BitBLAS/bitblas/../3rdparty/cutlass/include -gencode arch=compute_89,code=sm_89 -v /tmp/tmpxq8vr711.cu

# the output is
#$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__  -I"/root/BitBLAS/3rdparty/tvm/src/tl" -I"/root/BitBLAS/bitblas/../3rdparty/cutlass/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=4 -D__CUDACC_VER_BUILD__=131 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=4 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/tmpxq8vr711.cu" -o "/tmp/tmpxft_00147295_00000000-5_tmpxq8vr711.cpp4.ii"
nvcc -ptx -std=c++17 -I/root/BitBLAS/3rdparty/tvm/src/tl -I/root/BitBLAS/bitblas/../3rdparty/cutlass/include -gencode arch=compute_89,code=sm_89 -v /tmp/tmpxq8vr711.cu
# the output is
#$ gcc -std=c++17 -D__CUDA_ARCH__=890 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__  -I"/root/BitBLAS/3rdparty/tvm/src/tl" -I"/root/BitBLAS/bitblas/../3rdparty/cutlass/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=4 -D__CUDACC_VER_BUILD__=131 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=4 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -m64 "/tmp/tmpxq8vr711.cu" -o "/tmp/tmpxft_001476d8_00000000-7_tmpxq8vr711.cpp1.ii"
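
One way to compare the two verbose outputs above is to extract the -D flags from each preprocessing line and check whether __CUDA_ARCH__ is among them. The helper below is a sketch written for this comparison; extract_defines is a hypothetical name, not part of nvcc or BitBLAS, and the two sample lines are shortened versions of the output shown above.

```python
import shlex


def extract_defines(verbose_line):
    """Return {macro: value} for every -D flag in one line of `nvcc -v` output."""
    line = verbose_line.strip()
    if line.startswith("#$"):
        # nvcc prefixes each echoed sub-command with "#$".
        line = line[2:]
    defines = {}
    for token in shlex.split(line):
        if token.startswith("-D") and len(token) > 2:
            name, _, value = token[2:].partition("=")
            defines[name] = value
    return defines


if __name__ == "__main__":
    # Shortened from the first (executable build) output above.
    exe_line = '#$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ -m64 "/tmp/tmpxq8vr711.cu"'
    # Shortened from the second (-ptx build) output above.
    ptx_line = '#$ gcc -std=c++17 -D__CUDA_ARCH__=890 -D__CUDA_ARCH_LIST__=890 -D__NV_LEGACY_LAUNCH -E -x c++ -D__CUDACC__ -D__NVCC__ -m64 "/tmp/tmpxq8vr711.cu"'
    print("__CUDA_ARCH__" in extract_defines(exe_line))
    print("__CUDA_ARCH__" in extract_defines(ptx_line))
```

For these two lines, the macro is absent from the first and present in the second, which matches the difference the note describes.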

@LeiWang1999 LeiWang1999 merged commit 69350cb into microsoft:main Sep 29, 2024
5 of 6 checks passed