diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index a533028c5..fa25b190e 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,14 +7,7 @@ That would be awesome if you want to contribute something to BitBLAS!
 - [Asking Questions](contributing.md#asking-questions)
 - [Submitting Pull Requests](contributing.md#submitting-pull-requests)
 - [Repository Setup](contributing.md#repository-setup)
-  - [Running Examples](contributing.md#running-examples)
 - [Running Tests](contributing.md#running-tests)
-  - [Testing Input Methods](contributing.md#testing-input-methods)
-- [Publishing Releases](contributing.md#publishing-releases)
-  - [Publishing Normal `@latest` Release](contributing.md#publishing-normal-latest-release)
-  - [Publishing `@next` Release](contributing.md#publishing-next-release)
-  - [Publishing `@experimental` Release](contributing.md#publishing-experimental-release)
-  - [Running Prerelease Script](contributing.md#running-prerelease-script)
 
 ## Reporting Bugs
 
diff --git a/README.md b/README.md
index 4447d112c..24a0d1bf4 100644
--- a/README.md
+++ b/README.md
@@ -7,14 +7,11 @@ Some of the key features of BitBLAS include:
 - High Performance (Not only FP16xFP16, INT8xINT8, but also FP16xINT4/2/1, INT8xINT4/2/1).
 - With the flexible DSL (TIR Script) to effortlessly craft domain-specific kernels for your situations.
 - Support with dynamic symbolic throuth tvm unity -> generate source code with dynamic shape.
-
-Latest News 🔥
-
-- 2023-03-03: BitBLAS first proposed int8xint1 gemv/gemm with 10x/2x speedup over float16xfloat16 on A100, please checkout [op_benchmark_a100_int1_scaling](images/figures/op_benchmark_a100_int1_scaling.png) for detailed input scaling benchmark results.
+  - BitBLAS first proposed int8xint1 gemv/gemm with 10x/2x speedups over float16xfloat16 on A100; please check out [op_benchmark_a100_int1_scaling](images/figures/op_benchmark_a100_int1_scaling.png) for detailed input scaling benchmark results.
 
 ## Benchmark
 
-BitBLAS can achieve optimal performance across various compute pattern:
+BitBLAS can achieve optimal performance across various compute patterns:
 - GTX 3090
   - FLOAT16xFLOAT16 with TensorCore
 ![3090-gemm-fp16](./images/figures/op_benchmark_3090_fp16_gemm.png)
@@ -52,5 +49,6 @@ This project may contain trademarks or logos for projects, products, or services
 ## Acknowledgement
 
 We learned a lot from the following projects.
-- [Apache TVM](https://github.com/apache/tvm): We use TensorIR as our DSL currently, and we customized tvm from unity branch to support some features we needed.
-- [Microsoft Roller](https://github.com/microsoft/nnfusion/tree/roller): The design and algo inspiration of hardware aware tuning comes from Roller.
+
+- [Apache TVM](https://github.com/apache/tvm): BitBLAS has adopted TensorIR as its DSL. Additionally, we have customized TVM from the unity branch to incorporate specific features required for our project.
+- [Microsoft Roller](https://github.com/microsoft/nnfusion/tree/roller): The design and algorithm inspiration for hardware-aware tuning in BitBLAS comes from Roller.