Implementation of FlashAttention in PyCUDA
- Implement a simple attention mechanism in Python using NumPy (first sketch below)
- Extend it to multi-head attention (second sketch below)
- Refactor for modularity and add shape/dtype checks
- Wrap the computation in a PyTorch attention module (third sketch below)
- Implement a naive attention kernel in PyCUDA (fourth sketch below)
- Tile the computation into blocks with an online softmax, so scores are processed blockwise (fifth sketch below)
- Fuse the matmul, softmax, and linear-layer kernels into a single pass (sixth sketch below)
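The sketches below are illustrative, not this repo's final code; all names, shapes, and parameters are assumptions unless stated otherwise.

First, scaled dot-product attention in plain NumPy. Subtracting the row max before `exp` is a standard numerical-stability trick.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for 2-D Q, K, V of shape (N, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # stability: shift before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 8 tokens with 16-dim embeddings.
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(8, 16)),
                rng.normal(size=(8, 16)),
                rng.normal(size=(8, 16)))
```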
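Multi-head attention splits the model dimension across heads, runs attention per head, and concatenates the results. The `num_heads` parameter and the reshape/transpose layout here are assumed conventions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stability shift
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    N, d_model = Q.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split(x):                                  # (N, d_model) -> (H, N, d_head)
        return x.reshape(N, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, N, N)
    out = softmax(scores) @ Vh                              # (H, N, d_head)
    return out.transpose(1, 0, 2).reshape(N, d_model)       # concat heads
```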
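A PyTorch wrapper might look like the following; the class name `SelfAttention` and the four linear projections are assumptions about the eventual module design. PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention` makes a good reference to compare against.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal single-head self-attention over (batch, seq, d_model) inputs."""

    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        return self.out_proj(scores.softmax(dim=-1) @ v)
```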
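For the naive GPU step, a first PyCUDA kernel could compute only the raw score matrix Q K^T, one thread per output element, leaving the softmax and final matmul on the host. The kernel name, block shape, and sizes below are illustrative.

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scores(const float *Q, const float *K, float *S,
                       int N, int d, float scale)
{
    // One thread per entry of S = scale * Q K^T.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < d; ++k)
            acc += Q[row*d + k] * K[col*d + k];
        S[row*N + col] = acc * scale;
    }
}
""")
scores_gpu = mod.get_function("scores")

N, d = 128, 64
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d)).astype(np.float32)
K = rng.normal(size=(N, d)).astype(np.float32)
S = np.empty((N, N), dtype=np.float32)

block = (16, 16, 1)
grid = ((N + 15) // 16, (N + 15) // 16)
scores_gpu(cuda.In(Q), cuda.In(K), cuda.Out(S),
           np.int32(N), np.int32(d), np.float32(1.0 / np.sqrt(d)),
           block=block, grid=grid)
# The softmax and the @ V step stay on the host at this stage.
```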
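Tiling means streaming over blocks of K and V while keeping a running row max and softmax denominator per query (the online softmax at the heart of FlashAttention), so the full N x N score matrix is never materialized. `block_size` is an assumed tuning knob, and a real kernel would tile over Q as well.

```python
import numpy as np

def flash_attention_tiled(Q, K, V, block_size=64):
    """FlashAttention-style tiling in NumPy: stream over K/V blocks,
    maintaining a running max (m) and softmax denominator (l) per row."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)          # running row-wise max of the scores
    l = np.zeros(N)                  # running softmax denominator
    for j in range(0, N, block_size):
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        S = (Q @ Kj.T) * scale                   # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])           # block softmax numerator
        corr = np.exp(m - m_new)                 # rescale the old statistics
        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]
```

This should agree with the naive NumPy `attention` above up to floating-point error, which makes the pair a convenient correctness oracle before porting the loop into a kernel.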
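Finally, a fused PyCUDA kernel can do the score matmul, the softmax, and the value matmul in one launch, again without materializing the score matrix. The one-thread-per-query-row layout below is the simplest correct fusion, not the shared-memory tiling FlashAttention actually uses, and the output linear layer is left out; all names and launch parameters are assumptions.

```python
import numpy as np
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

mod = SourceModule("""
// One thread per query row: fuse Q K^T, softmax, and (softmax) @ V.
__global__ void fused_attention_row(const float *Q, const float *K,
                                    const float *V, float *O,
                                    int N, int d, float scale)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;

    // Pass 1: online softmax statistics (running max m, denominator l).
    float m = -1e30f, l = 0.0f;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[row*d + k] * K[j*d + k];
        s *= scale;
        float m_new = fmaxf(m, s);
        l = l * expf(m - m_new) + expf(s - m_new);
        m = m_new;
    }

    // Pass 2: accumulate softmax(S) @ V using the finished statistics.
    for (int k = 0; k < d; ++k) O[row*d + k] = 0.0f;
    for (int j = 0; j < N; ++j) {
        float s = 0.0f;
        for (int k = 0; k < d; ++k) s += Q[row*d + k] * K[j*d + k];
        float p = expf(s * scale - m) / l;
        for (int k = 0; k < d; ++k) O[row*d + k] += p * V[j*d + k];
    }
}
""")
fused = mod.get_function("fused_attention_row")

N, d = 128, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(N, d)).astype(np.float32) for _ in range(3))
O = np.empty((N, d), dtype=np.float32)
fused(cuda.In(Q), cuda.In(K), cuda.In(V), cuda.Out(O),
      np.int32(N), np.int32(d), np.float32(1.0 / np.sqrt(d)),
      block=(128, 1, 1), grid=((N + 127) // 128, 1))
```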