…in Readme (#127)
* Refactor BatchMatMulEmitter and BatchMatMulSelector for improved readability and maintainability
* Refactor import statements for improved readability and maintainability
* Refactor import statements for improved readability and maintainability
* disable failure email for ci
* remove email notifications.
* move relax pass from testing to mlc_llm
* Refactor scripts with se check_eual_ref_scripts_with_emitter function
* Lint Fix
* Refactor scripts with se check_eual_ref_scripts_with_emitter function
* bug fix in test
* lint fix.
* test cuda i4 kernel
* Refactor copyright notice in i4matmul.hpp
* Refactor BitBLASLinear test module for improved readability and maintainability
* refactor test, as Python versions below 3.9 cannot handle int32 overflow.
* format lint for test
* Refactor test_int4b_fp16_convert.py for improved readability and maintainability
* remove unused design file
* move tile device from package to base
* dummy impl for codegen
* Refactor file structure for ladder_permutate module
* Refactor backend class and fix typos in comments
* Deep refactor Lib related code.
* remove ci pull.
* LintFix
* refactor builder for whl build
* Refactor TIRWrapper.wrap() method to include an assertion for the optimized module
* Refactor lib_generator to set library and source paths
* lint fix
* BitNet vllm integration
* chore: update codespell to version 2.3.0
* Lintfix
* Bump version to 0.0.1.dev13
* lint fix
* disable fast decoding [u]int4xint8 by default.
* optimize from dict design in Hint
* Implement SplitK
* bitnet benchmark generation.
* Add benchmark script for BitNet integration
* AtomicAdd Support
* LintFix
* ci fix when 3rdparty tvm is initialized.
* bug fix for setup
* fix a bug in block reduce
* typo fix
* BUG Fix for block reduce.
* Lint fix
* Refactor block reduce schedule template
* transform branch from bitblas to bitblas_tl
* Fix subproject commit reference in 3rdparty/tvm
* chore: update submodule branch from bitblas to bitblas_tl
* force update config.cmake
* Bug fix
* Fix subproject commit reference in 3rdparty/cutlass
* chore: Add submodule for cutlass library
* update tl cutlass path
* Refactor BitBLASLinear test module for improved readability and maintainability
* format fix
* Copy CUTLASS to the package directory
* Refactor setup.py to include additional TVM header files
* lint fix
* bug fix
* Refactor BitBLASLinear test module for improved readability and maintainability
* Implement Matmul Benchmark Design
* chore: Update BitBLAS Matmul benchmark script
* lint fix
* Refactor BitBLASMatmulOpsBenchmark for improved readability and maintainability
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* lint fix
* Benchmark bot test
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* int8 test case
* Refactor compare_benchmark.py to handle missing benchmark results gracefully
* ci fix
* disable ci for test benchmark
* Refactor BitBLASMatmulOpsBenchmark to disable tuning during benchmark run
* remove cli installation
* chore: Create virtual environment and install dependencies for benchmark
* chore: Update benchmark workflow to include comparison step
* Lint fix
* update tvm commit
* Improve lower warp memory pass
* Bug fix
* Enhance to support warp schedule.
* Enhance LOP3 Instructions
* Enhance LOP3 Instructions
* add test for stage3 propagate
* implement propagate func
* Stage3 Ladder Permutate integration
* get_ladder_stage3_propagate
* comment out benchmark scripts as the setting is too big
* ci fix for benchmark
* lint fix
* chore: Update benchmark workflow to trigger on pull request comments
* Add LDMatrix Transform 3
* Support GPTQ Test
* Fuse BlockReduce Schedule
* Support mma propagate 3
* Support MMA Propagate Stage 3
* Lint Fix
* Merge block reduce for dequantize config.
* fix codeql
* chore: Update submodule reference to latest commit
* chore: Disable common subexpression elimination in TIR passes
* Lint Fix
* 4bit related lop3 updates.
* lint fix
* gptq test fix
* Fix for test
* lint fix
* lint fix
* typofix
* QuantCompress Test
* chore: Refactor quant_compress_impl.py for readability and maintainability
* Update docs to reflect the latest work.