[LinalgToXeGPU] Lower linalg.matmul_transpose_b into xegpu.dpas #347

Merged (11 commits) on Sep 30, 2024

Conversation

@dchigarev (Contributor) commented on Sep 17, 2024

Closes #340

Support lowering of linalg.matmul_transpose_b, as well as linalg.transpose %b + linalg.matmul %a %b, into xegpu.dpas.

To perform the transposed multiplication we load B chunks with the xegpu.load_nd ... <transpose = array<i64: 1, 0>> attribute and also change the iteration dimension (from rows to columns).
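For context, a minimal scalar sketch (illustration only, not code from this PR) of the computation being lowered, assuming row-major buffers:

#include <cstddef>
#include <vector>

// Scalar reference of linalg.matmul_transpose_b: C[m][n] += A[m][k] * B[n][k],
// i.e. B (shaped N x K) is consumed along its rows, which is why the lowering
// walks B's tiles along a different dimension than a plain matmul would.
void matmulTransposeB(const std::vector<float> &A, // M x K, row-major
                      const std::vector<float> &B, // N x K, row-major
                      std::vector<float> &C,       // M x N, row-major
                      std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t k = 0; k < K; ++k)
        C[m * N + n] += A[m * K + k] * B[n * K + k];
}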

@dchigarev changed the title from "[LinalgToXeGPU] Support linalg.matmul_transpose_b via xegpu.dpas" to "[LinalgToXeGPU] Lower linalg.matmul_transpose_b into xegpu.dpas" on Sep 17, 2024
@dchigarev mentioned this pull request on Sep 17, 2024
@@ -669,6 +669,9 @@ static SmallVector<Value> createDescriptorTiles(PatternRewriter &rewriter,
  Value newRowOffs = rewriter.create<arith::ConstantIndexOp>(loc, i);
  for (int j = 0; j < loadShape[1]; j += descTile[1] * arrayLength) {
    Value newColOffs = rewriter.create<arith::ConstantIndexOp>(loc, j);
    if (transpose) {
      std::swap(newRowOffs, newColOffs);
@dchigarev (Contributor, Author) commented on this change:

Changes the iteration dimension for B chunks.
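For illustration only, a standalone toy version of that swap (with made-up shapes, independent of the pass) shows the effect: with the swap, the inner loop advances the row offset while the column offset stays fixed, so the emitted tile offsets walk down a column of B instead of across a row:

#include <cstdio>
#include <utility>

int main() {
  const int loadShape[2] = {32, 32}; // hypothetical chunk shape
  const int descTile[2] = {16, 16};  // hypothetical descriptor tile shape
  const int arrayLength = 1;
  const bool transpose = true;

  for (int i = 0; i < loadShape[0]; i += descTile[0]) {
    for (int j = 0; j < loadShape[1]; j += descTile[1] * arrayLength) {
      int rowOffs = i, colOffs = j;
      if (transpose)
        std::swap(rowOffs, colOffs); // iterate down columns instead of rows
      std::printf("tile offset: (%d, %d)\n", rowOffs, colOffs);
    }
  }
  return 0;
}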

@kurapov-peter (Contributor) left a comment:

This seems to be almost generic enough to support transpose A and the batched transposed versions. Could you try making the interfaces amenable to all the matmul variations? The implementation can stay as is.
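One purely illustrative way to make the interface cover all variations (the names below are hypothetical and not part of this PR) would be to pass a small descriptor instead of separate bool flags:

// Hypothetical sketch only; none of these names exist in the PR.
struct MatmulVariant {
  bool transposeA = false; // A is consumed transposed (matmul_transpose_a)
  bool transposeB = false; // B is consumed transposed (matmul_transpose_b)
  bool batched = false;    // batch_matmul / batch_reduce_matmul flavors
};

// Helpers could then accept the descriptor rather than a lone `bool transpose`,
// e.g. createDescriptorTiles(rewriter, loc, src, loadShape, variant, ...).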

"Transpose result is used by more than one matmul operation");
}
} else if (isa<memref::DeallocOp>(trUser) ||
isa<memref::AllocaOp>(trUser)) {
A contributor asked:

What would alloca mean?

Contributor Author (@dchigarev) replied:

Nothing; it was added here by mistake. I'll remove it in the next commit.

@dchigarev (Contributor, Author):

> This seems to be almost generic enough to support transpose A and the batched transposed versions. Could you try making the interfaces amenable to all the matmul variations? The implementation can stay as is.

  1. Regarding batched transpose: at the moment there is no support for either linalg.batch_matmul or linalg.batch_reduce_matmul in the linalg-to-xegpu pass. Naively adding those ops to the list of allowed ops causes a segfault, so they probably need different handling. In any case, the 'transpose_b' logic shouldn't change at all once createDPASKernel() supports batched matmuls; you would just need to add BatchMatmulOp everywhere and it should work.

  2. Regarding 'transpose A': this one also requires special handling. With the current logic of createDPASKernel() we can't use the transpose attribute of xegpu.load_nd to transpose the A operand, since it can only transpose 32-bit blocks (and our lowering only supports f16). This means that instead of xegpu.load_nd <transpose> we would need a vector.transpose on the loaded vector. Since that is unrelated to the changes in this PR and looks like a separate feature, I propose tracking it as a separate issue ([LinalgToXeGPU] Support linalg.matmul_transpose_a #361). The general logic of detecting whether there is a transpose before the matmul stays the same; I added an argument to findAndReplaceTranspose() indicating which matmul operand is being processed. (A standalone sketch of the 32-bit pairing issue follows after this list.)

    Example of why a 32-bit transpose causes problems for f16 data:
    A(f16): // original matrix
    [1,   2,  3,  4]
    [5,   6,  7,  8]
    [9,  10, 11, 12]
    [13, 14, 15, 16]
    
    xegpu.load_nd A(f16) <transpose>: // transposed 32bit blocks
    [1, 2,  9, 10]
    [5, 6, 13, 14]
    [3, 4, 11, 12]
    [7, 8, 15, 16]
    
    vector.transpose A(f16): // expected transpose result
    [1, 5,  9, 13]
    [2, 6, 10, 14]
    [3, 7, 11, 15]
    [4, 8, 12, 16]
    
    Why it doesn't cause any problems for the B operand

    In contrast to the A operand, we load B with the 'packed' (VNNI) attribute, which involves special handling that happens to line up with what we need for the B transpose.

    P.S. There is a transpose_bit_width attribute that should in theory solve this problem, but setting it to 16 does nothing. Not sure if this is intended behavior or a bug in the xegpu-to-vc-func lowering. Submitted a question to IMEX ([xegpu-to-vc-func] Is transpose_bit_width=16 supported? mlir-extensions#895).
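As mentioned above, here is a standalone sketch (illustration only, independent of the pass and of the exact hardware tile layout) of why a transpose that operates on 32-bit units is not an element-wise f16 transpose: each pair of adjacent f16 values travels together, so 1 stays next to 2 instead of ending up next to 5.

#include <array>
#include <cstdint>
#include <cstdio>

// Illustration only: emulate a "32-bit" transpose of a 4x4 16-bit matrix by
// grouping each row into 2-element (32-bit) lanes and transposing the lanes.
// The exact tile layout produced by the hardware load differs, but the core
// problem is the same: adjacent f16 elements stay glued together.
int main() {
  const std::array<std::array<std::uint16_t, 4>, 4> a = {{{1, 2, 3, 4},
                                                          {5, 6, 7, 8},
                                                          {9, 10, 11, 12},
                                                          {13, 14, 15, 16}}};

  // Element-wise transpose (what vector.transpose gives): row 0 is 1 5 9 13.
  std::array<std::array<std::uint16_t, 4>, 4> elem{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      elem[j][i] = a[i][j];

  // Lane-wise ("32-bit") transpose: view each row as two 32-bit lanes,
  // i.e. a 4x2 lane matrix, and transpose it into a 2x4 lane matrix.
  std::uint16_t lanes[2][4][2]; // [lane row][lane col][element inside lane]
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 2; ++j) {
      lanes[j][i][0] = a[i][2 * j];
      lanes[j][i][1] = a[i][2 * j + 1];
    }

  std::printf("element-wise transpose, row 0: %d %d %d %d\n",
              elem[0][0], elem[0][1], elem[0][2], elem[0][3]); // 1 5 9 13
  std::printf("lane-wise transpose, start of lane row 0: %d %d %d %d\n",
              lanes[0][0][0], lanes[0][0][1], lanes[0][1][0],
              lanes[0][1][1]); // 1 2 5 6 -- '1' and '2' remain neighbors
  return 0;
}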

@kurapov-peter (Contributor):

Sure, I did not suggest adding support for other flavors right away, only to make the interfaces friendly. Supporting other matmuls is a separate issue.

@kurapov-peter (Contributor) left a comment:

There's also the bool transpose param in a couple of places that could be generalized, but that's a minor thing.

@dchigarev (Contributor, Author):

> There's also the bool transpose param in a couple of places that could be generalized, but that's a minor thing.

This bool parameter is used in functions that also take a matmul operand, so it can be used with either operand (and with any matmul op). Is this generic enough, or did you have something else in mind?

bool transposeA = true;
bool transposeB = false;

auto tilesA = createTiles(matA, ..., /*transpose=*/transposeA);
auto tilesB = createTiles(matB, ..., /*transpose=*/transposeB);

@kurapov-peter (Contributor):

Yup. Sounds good!

@dchigarev merged commit 1fee896 into intel:main on Sep 30, 2024
6 checks passed
Merging this pull request closes: [LinalgToXeGPU] Support conversion for linalg.matmul with transpose_b (#340)