Introduce autotuning to `conv2d` and `conv_transpose2d` with a new im2col/GEMM algorithm #2287
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #2287      +/-   ##
==========================================
- Coverage   85.67%   85.21%   -0.47%
==========================================
  Files         760      766       +6
  Lines       99082   100293    +1211
==========================================
+ Hits        84888    85462     +574
- Misses      14194    14831     +637
```

☔ View full report in Codecov by Sentry.
LGTM, waiting for @louisfd to review before merging, but thanks a lot 🙏
I have a few questions, but overall it's very good, awesome!
```rust
/**************************** Bounds Check + CMMA Op *********************************/
if a_row < gemm_m && k < gemm_k && b_col < gemm_n {
    cmma::load(&matrix_a, input_tile.as_slice(), CMMA_K);
```
When doing the matmul, I found you can safely reuse a fragment for several executions, so this line could go in the outer loop.
I'm not sure what you mean? I am reusing the fragments, but I still need to load a tile and execute the matmul for each k.
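The pattern under discussion can be sketched with a scalar analogue (plain Rust with hypothetical names, not the actual GPU kernel code): the accumulator is allocated once per output element and reused across the entire k-loop, while the input tiles change every iteration, so their loads and the multiply-accumulate must stay inside the loop.

```rust
// Scalar analogue of the fragment-reuse pattern (hypothetical names,
// not the real CMMA kernel). Computes C = A * B over K in tiles of TILE.
const TILE: usize = 4;

fn tiled_matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            // Accumulator "fragment": initialized once, reused for every k-tile.
            let mut acc = 0.0f32;
            for kt in (0..k).step_by(TILE) {
                // The input "tiles" differ per iteration, so their loads
                // (and the multiply-accumulate) remain inside the k-loop.
                for kk in kt..(kt + TILE).min(k) {
                    acc += a[row * k + kk] * b[kk * n + col];
                }
            }
            c[row * n + col] = acc;
        }
    }
    c
}

fn main() {
    // Identity * B == B, checked against the tiled result.
    let a = vec![1.0, 0.0, 0.0, 1.0];
    let b = vec![3.0, 4.0, 5.0, 6.0];
    assert_eq!(tiled_matmul(&a, &b, 2, 2, 2), vec![3.0, 4.0, 5.0, 6.0]);
    println!("ok");
}
```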
LGTM
Checklist

- [x] The `run-checks all` script has been executed.

Related Issues/PRs
This goes some way to resolving this old enhancement suggestion: #805
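As background for the autotuning described in the changes below, here is a minimal, self-contained sketch of shape-keyed kernel selection (a hypothetical standalone harness, not burn's actual autotune API): each candidate kernel is timed once per input shape, and the winner is cached and dispatched on subsequent calls.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical autotune harness for illustration only; the real
// infrastructure in the PR differs. Two interchangeable candidate
// "kernels" stand in for the direct and im2col conv implementations.
type Kernel = fn(&[f32]) -> f32;

fn sum_iter(data: &[f32]) -> f32 { data.iter().sum() }
fn sum_fold(data: &[f32]) -> f32 { data.iter().fold(0.0, |s, x| s + x) }

struct Autotuner {
    kernels: Vec<Kernel>,
    cache: HashMap<usize, usize>, // input length -> index of fastest kernel
}

impl Autotuner {
    fn run(&mut self, data: &[f32]) -> f32 {
        let key = data.len();
        if !self.cache.contains_key(&key) {
            // First sighting of this shape: benchmark every candidate.
            let mut best = (0, Duration::MAX);
            for (i, kernel) in self.kernels.iter().enumerate() {
                let start = Instant::now();
                let _ = kernel(data);
                let elapsed = start.elapsed();
                if elapsed < best.1 {
                    best = (i, elapsed);
                }
            }
            self.cache.insert(key, best.0);
        }
        // Dispatch to the cached winner.
        (self.kernels[self.cache[&key]])(data)
    }
}

fn main() {
    let mut tuner = Autotuner {
        kernels: vec![sum_iter, sum_fold],
        cache: HashMap::new(),
    };
    let data = vec![1.0f32, 2.0, 3.0];
    assert_eq!(tuner.run(&data), 6.0); // tunes on the first call
    assert_eq!(tuner.run(&data), 6.0); // cache hit on the second call
    println!("ok");
}
```

Since both candidates compute the same result, the choice only affects speed, which is what makes the tuning transparent to callers.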
Changes

Adds the required infrastructure to autotune `conv2d` and `conv_transpose2d`, as well as a second algorithm based on `im2col`, which provides significant speedups at the cost of memory usage. I'd like to add more algorithms when I have time, but this already puts the infrastructure in place to make that much easier and less breaking.

Update: now also includes implicit GEMM when:

- `batch_size * out_h * out_w` is divisible by 16
- `out_channels` is divisible by 16
- `in_channels * kernel_h * kernel_w` is divisible by 16

Testing

The autotuned, direct (current) and `im2col` implementations pass all existing tests (except `matches_reference_backend`, see below).

Benchmarking
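To make the im2col path being benchmarked concrete, here is a minimal CPU sketch (plain Rust, stride 1, no padding, a single image; the PR's real implementation is a GPU kernel): each receptive field is unfolded into a column, and the convolution then becomes a single GEMM, trading extra memory for a faster, more regular computation.

```rust
/// Direct 2D convolution: single image, stride 1, no padding.
fn direct_conv2d(
    input: &[f32], weight: &[f32],
    c_in: usize, h: usize, w: usize,
    c_out: usize, kh: usize, kw: usize,
) -> Vec<f32> {
    let (oh, ow) = (h - kh + 1, w - kw + 1);
    let mut out = vec![0.0f32; c_out * oh * ow];
    for co in 0..c_out {
        for y in 0..oh {
            for x in 0..ow {
                let mut acc = 0.0f32;
                for ci in 0..c_in {
                    for ky in 0..kh {
                        for kx in 0..kw {
                            acc += input[ci * h * w + (y + ky) * w + (x + kx)]
                                * weight[co * c_in * kh * kw + ci * kh * kw + ky * kw + kx];
                        }
                    }
                }
                out[co * oh * ow + y * ow + x] = acc;
            }
        }
    }
    out
}

/// im2col: unfold every kernel-sized patch into a column, producing a
/// (c_in*kh*kw) x (oh*ow) matrix. This is the extra memory being traded.
fn im2col(input: &[f32], c_in: usize, h: usize, w: usize, kh: usize, kw: usize) -> Vec<f32> {
    let (oh, ow) = (h - kh + 1, w - kw + 1);
    let mut cols = vec![0.0f32; c_in * kh * kw * oh * ow];
    for ci in 0..c_in {
        for ky in 0..kh {
            for kx in 0..kw {
                let row = ci * kh * kw + ky * kw + kx;
                for y in 0..oh {
                    for x in 0..ow {
                        cols[row * oh * ow + y * ow + x] =
                            input[ci * h * w + (y + ky) * w + (x + kx)];
                    }
                }
            }
        }
    }
    cols
}

/// Naive GEMM: (m x k) * (k x n) -> (m x n).
fn gemm(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for kk in 0..k {
            for j in 0..n {
                c[i * n + j] += a[i * k + kk] * b[kk * n + j];
            }
        }
    }
    c
}

fn main() {
    // 1 input channel, 3x3 image, 1 output channel, 2x2 kernel.
    let input: Vec<f32> = (1..=9).map(|v| v as f32).collect();
    let weight = vec![1.0, 0.0, 0.0, 1.0];
    let direct = direct_conv2d(&input, &weight, 1, 3, 3, 1, 2, 2);
    let cols = im2col(&input, 1, 3, 3, 2, 2);
    let via_gemm = gemm(&weight, &cols, 1, 4, 4);
    assert_eq!(direct, via_gemm); // both algorithms agree
    println!("ok");
}
```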
Conv2d

Direct

- batch_size = 4: `jit<wgpu>`, `fusion<jit<wgpu>>`, `jit<cuda>`
- batch_size = 16: `jit<wgpu>`, `fusion<jit<wgpu>>`, `jit<cuda>`

Im2col

- batch_size = 4: `jit<wgpu>`, `fusion<jit<wgpu>>`, `jit<cuda>`
- batch_size = 16 (split into 4 sub-batches): `jit<wgpu>`, `fusion<jit<wgpu>>`, `jit<cuda>`

Implicit GEMM

- batch_size = 16: `jit<cuda>`