# Logs/notes

Triton performs poorly on older GPUs such as Turing and Volta (triton-lang/triton#2377).


CUTLASS resources:

- The Flash Attention repository (https://github.com/Dao-AILab/flash-attention) is a really good example of CUTLASS in practice.


The Triton implementation spends most of its time on the CPU. See the profile screenshot below:

*(Profiler trace screenshot, 2024-01-15 12-39-28.)*

The CUDA stream (bottom) is almost entirely empty.
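
A minimal sketch of how such a trace can be captured with `torch.profiler`; `run_kernel` is just a stand-in for the actual Triton kernel launch, and the sizes are arbitrary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in for the Triton kernel under test; swap the real launcher in here.
def run_kernel(x, w):
    return torch.matmul(x, w)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up so compilation/autotuning does not pollute the trace.
for _ in range(3):
    run_kernel(x, w)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        run_kernel(x, w)
    torch.cuda.synchronize()

# The exported trace (viewable in Perfetto / chrome://tracing) shows how much
# of the wall time the CUDA stream is actually busy vs. idle.
prof.export_chrome_trace("matmul_trace.json")
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=10))
```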


*(Benchmark plot: matmul-performance-torch.)*

The above plot is for 0% sparsity. For small problem sizes, the Triton implementation is terrible, most likely due to the CPU overhead; for large problem sizes, it still doesn't match the Dense baseline, most likely because of Triton's poor support for Turing GPUs.
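
For reference, the kind of sweep behind a plot like this, measured with CUDA events; `triton_matmul` below is only a placeholder for the actual kernel, and the sizes/dtype are illustrative:

```python
import torch

def bench_ms(fn, *args, iters=50, warmup=5):
    # Wall-clock time per call measured with CUDA events; idle gaps caused by
    # CPU-side launch overhead are included, which is what hurts small sizes.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def triton_matmul(a, b):
    # Placeholder: the real sweep calls the (sparse) Triton kernel here.
    return torch.matmul(a, b)

def tflops(size, ms):
    return 2 * size**3 / (ms * 1e-3) / 1e12

for size in (256, 512, 1024, 2048, 4096):
    a = torch.randn(size, size, device="cuda", dtype=torch.float16)
    b = torch.randn(size, size, device="cuda", dtype=torch.float16)
    dense = tflops(size, bench_ms(torch.matmul, a, b))
    triton = tflops(size, bench_ms(triton_matmul, a, b))
    print(f"M=N=K={size:5d}  dense {dense:6.2f} TFLOP/s  triton {triton:6.2f} TFLOP/s")
```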

All of the above suggests that I need to use CUDA and C++ to remove all the overhead.

## Forward & backward pass details

Forward pass:

- $Y = X W$
  - Shape: (M N) = (M K) (N K)
  - Layout: Row = Row x Row
  - Sparse version: Dense = Sparse x Dense

Backward pass:

- $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} W^T$
  - Shape: (M K) = (M N) (K N)
  - Layout: Row = Row x Col
  - Sparse version: Sparse = Dense x Dense
- $\frac{\partial L}{\partial W}^T = X^T \frac{\partial L}{\partial Y}$
  - Shape: (K N) = (K M) (N M)
  - Layout: Col = Col x Col
  - Sparse version: Dense = Sparse x Dense
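
To keep the shape bookkeeping above straight, here is a dense PyTorch sketch of the same three contractions as a custom autograd function. It assumes the weight is stored as (N, K) (the nn.Linear convention), which is how I read the (M K) (N K) shape line; the sparse kernels would replace the dense matmuls, and the class name is only illustrative.

```python
import torch

class SparseLinearSketch(torch.autograd.Function):
    """Dense stand-in for the sparse kernels; only the shape bookkeeping matters here."""

    @staticmethod
    def forward(ctx, x, w):
        # x: (M, K) row-major -- the sparse operand in the real kernel
        # w: (N, K) row-major -- nn.Linear-style weight storage
        ctx.save_for_backward(x, w)
        # Y: (M, N) = (M, K) @ (K, N)                      -> Dense = Sparse x Dense
        return x @ w.t()

    @staticmethod
    def backward(ctx, grad_y):
        x, w = ctx.saved_tensors
        # dL/dX: (M, K) = (M, N) @ (N, K)                  -> Sparse = Dense x Dense
        grad_x = grad_y @ w
        # dL/dW^T = X^T dL/dY: (K, N) = (K, M) @ (M, N)    -> Dense = Sparse x Dense
        grad_w_t = x.t() @ grad_y
        # autograd expects the gradient w.r.t. w itself, i.e. (N, K)
        return grad_x, grad_w_t.t()


# Quick check that the transposes are consistent with the dense reference.
M, K, N = 32, 64, 16
x = torch.randn(M, K, dtype=torch.float64, requires_grad=True)
w = torch.randn(N, K, dtype=torch.float64, requires_grad=True)
assert torch.autograd.gradcheck(SparseLinearSketch.apply, (x, w))
```

The gradcheck only confirms the contractions are consistent; the layout and sparsity annotations above are what the real kernels have to respect.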