[LinalgToXeGPU] Lower linalg.matmul_transpose_b into xegpu.dpas #347

Merged (11 commits) on Sep 30, 2024

Conversation

@dchigarev (Contributor) commented on Sep 17, 2024

Closes #340

Support lowering of linalg.matmul_transpose_b, as well as linalg.transpose %b + linalg.matmul %a %b, into xegpu.dpas.

To perform the transposed multiplication we load B chunks with the xegpu.load_nd ... <transpose = array<i64: 1, 0>> attribute and also change the iteration dimension (from rows to columns).
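For context, a minimal scalar sketch (illustration only, not code from this PR) of the computation being lowered, assuming row-major buffers:

#include <cstddef>
#include <vector>

// Scalar reference of linalg.matmul_transpose_b: C[m][n] += A[m][k] * B[n][k],
// i.e. B (shaped N x K) is consumed along its rows, which is why the lowering
// walks B's tiles along a different dimension than a plain matmul would.
void matmulTransposeB(const std::vector<float> &A, // M x K, row-major
                      const std::vector<float> &B, // N x K, row-major
                      std::vector<float> &C,       // M x N, row-major
                      std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t m = 0; m < M; ++m)
    for (std::size_t n = 0; n < N; ++n)
      for (std::size_t k = 0; k < K; ++k)
        C[m * N + n] += A[m * K + k] * B[n * K + k];
}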

@dchigarev changed the title from "[LinalgToXeGPU] Support linalg.matmul_transpose_b via xegpu.dpas" to "[LinalgToXeGPU] Lower linalg.matmul_transpose_b into xegpu.dpas" on Sep 17, 2024
@dchigarev mentioned this pull request on Sep 17, 2024
@@ -669,6 +669,9 @@ static SmallVector<Value> createDescriptorTiles(PatternRewriter &rewriter,
  Value newRowOffs = rewriter.create<arith::ConstantIndexOp>(loc, i);
  for (int j = 0; j < loadShape[1]; j += descTile[1] * arrayLength) {
    Value newColOffs = rewriter.create<arith::ConstantIndexOp>(loc, j);
    if (transpose) {
      std::swap(newRowOffs, newColOffs);
@dchigarev (Contributor, Author) commented on this change:

Changes the iteration dimension for B chunks.
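For illustration only, a standalone toy version of that swap (with made-up shapes, independent of the pass) shows the effect: with the swap, the inner loop advances the row offset while the column offset stays fixed, so the emitted tile offsets walk down a column of B instead of across a row:

#include <cstdio>
#include <utility>

int main() {
  const int loadShape[2] = {32, 32}; // hypothetical chunk shape
  const int descTile[2] = {16, 16};  // hypothetical descriptor tile shape
  const int arrayLength = 1;
  const bool transpose = true;

  for (int i = 0; i < loadShape[0]; i += descTile[0]) {
    for (int j = 0; j < loadShape[1]; j += descTile[1] * arrayLength) {
      int rowOffs = i, colOffs = j;
      if (transpose)
        std::swap(rowOffs, colOffs); // iterate down columns instead of rows
      std::printf("tile offset: (%d, %d)\n", rowOffs, colOffs);
    }
  }
  return 0;
}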

@kurapov-peter (Contributor) left a comment:

This seems to be almost generic enough to support transpose A and the batched transposed versions. Could you try making the interfaces amenable to all the matmul variations? The implementation can stay as is.
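One purely illustrative way to make the interface cover all variations (the names below are hypothetical and not part of this PR) would be to pass a small descriptor instead of separate bool flags:

// Hypothetical sketch only; none of these names exist in the PR.
struct MatmulVariant {
  bool transposeA = false; // A is consumed transposed (matmul_transpose_a)
  bool transposeB = false; // B is consumed transposed (matmul_transpose_b)
  bool batched = false;    // batch_matmul / batch_reduce_matmul flavors
};

// Helpers could then accept the descriptor rather than a lone `bool transpose`,
// e.g. createDescriptorTiles(rewriter, loc, src, loadShape, variant, ...).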

"Transpose result is used by more than one matmul operation");
}
} else if (isa<memref::DeallocOp>(trUser) ||
isa<memref::AllocaOp>(trUser)) {
A contributor asked:

What would alloca mean?

Contributor Author (@dchigarev) replied:

Nothing; it was added here by mistake. I'll remove it in the next commit.

@dchigarev (Contributor, Author):

> This seems to be almost generic enough to support transpose A and the batched transposed versions. Could you try making the interfaces amenable to all the matmul variations? The implementation can stay as is.

  1. Regarding batched transpose: at the moment there is no support for either linalg.batch_matmul or linalg.batch_reduce_matmul in the linalg-to-xegpu pass. Naively adding those ops to the list of allowed ops causes a segfault, so they probably need different handling. In any case, the 'transpose_b' logic shouldn't change at all once createDPASKernel() supports batched matmuls; you would just need to add BatchMatmulOp everywhere and it should work.

  2. Regarding 'transpose A': this one also requires special handling. With the current logic of createDPASKernel() we can't use the transpose attribute of xegpu.load_nd to transpose the A operand, since it can only transpose 32-bit blocks (and our lowering only supports f16). This means that instead of xegpu.load_nd <transpose> we would need a vector.transpose on the loaded vector. Since that is unrelated to the changes in this PR and looks like a separate feature, I propose tracking it as a separate issue ([LinalgToXeGPU] Support linalg.matmul_transpose_a #361). The general logic of detecting whether there is a transpose before the matmul stays the same; I added an argument to findAndReplaceTranspose() indicating which matmul operand is being processed. (A standalone sketch of the 32-bit pairing issue follows after this list.)

    Example of why a 32-bit transpose causes problems for f16 data:
    A(f16): // original matrix
    [1,   2,  3,  4]
    [5,   6,  7,  8]
    [9,  10, 11, 12]
    [13, 14, 15, 16]
    
    xegpu.load_nd A(f16) <transpose>: // transposed 32bit blocks
    [1, 2,  9, 10]
    [5, 6, 13, 14]
    [3, 4, 11, 12]
    [7, 8, 15, 16]
    
    vector.transpose A(f16): // expected transpose result
    [1, 5,  9, 13]
    [2, 6, 10, 14]
    [3, 7, 11, 15]
    [4, 8, 12, 16]
    
    Why it doesn't cause any problems for the B operand

    In contrast to the A operand, we load B with the 'packed' (VNNI) attribute, which involves special handling that happens to line up with what we need for the B transpose.

    P.S. There is a transpose_bit_width attribute that should in theory solve this problem, but setting it to 16 does nothing. Not sure if this is intended behavior or a bug in the xegpu-to-vc-func lowering. Submitted a question to IMEX ([xegpu-to-vc-func] Is transpose_bit_width=16 supported? mlir-extensions#895).
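As mentioned above, here is a standalone sketch (illustration only, independent of the pass and of the exact hardware tile layout) of why a transpose that operates on 32-bit units is not an element-wise f16 transpose: each pair of adjacent f16 values travels together, so 1 stays next to 2 instead of ending up next to 5.

#include <array>
#include <cstdint>
#include <cstdio>

// Illustration only: emulate a "32-bit" transpose of a 4x4 16-bit matrix by
// grouping each row into 2-element (32-bit) lanes and transposing the lanes.
// The exact tile layout produced by the hardware load differs, but the core
// problem is the same: adjacent f16 elements stay glued together.
int main() {
  const std::array<std::array<std::uint16_t, 4>, 4> a = {{{1, 2, 3, 4},
                                                          {5, 6, 7, 8},
                                                          {9, 10, 11, 12},
                                                          {13, 14, 15, 16}}};

  // Element-wise transpose (what vector.transpose gives): row 0 is 1 5 9 13.
  std::array<std::array<std::uint16_t, 4>, 4> elem{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      elem[j][i] = a[i][j];

  // Lane-wise ("32-bit") transpose: view each row as two 32-bit lanes,
  // i.e. a 4x2 lane matrix, and transpose it into a 2x4 lane matrix.
  std::uint16_t lanes[2][4][2]; // [lane row][lane col][element inside lane]
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 2; ++j) {
      lanes[j][i][0] = a[i][2 * j];
      lanes[j][i][1] = a[i][2 * j + 1];
    }

  std::printf("element-wise transpose, row 0: %d %d %d %d\n",
              elem[0][0], elem[0][1], elem[0][2], elem[0][3]); // 1 5 9 13
  std::printf("lane-wise transpose, start of lane row 0: %d %d %d %d\n",
              lanes[0][0][0], lanes[0][0][1], lanes[0][1][0],
              lanes[0][1][1]); // 1 2 5 6 -- '1' and '2' remain neighbors
  return 0;
}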

@kurapov-peter (Contributor):

Sure, I did not suggest adding support for other flavors right away, only to make the interfaces friendly. Supporting other matmuls is a separate issue.

@kurapov-peter (Contributor) left a comment:

There's also the bool transpose param in a couple of places that could be generalized, but that's a minor thing.

@dchigarev (Contributor, Author):

> There's also the bool transpose param in a couple of places that could be generalized, but that's a minor thing.

This bool parameter is used in functions that also take a matmul operand, so it can be used with either operand (and with any matmul op). Is this generic enough, or did you have something else in mind?

bool transposeA = true;
bool transposeB = false;

auto tilesA = createTiles(matA, ..., /*transpose=*/transposeA);
auto tilesB = createTiles(matB, ..., /*transpose=*/transposeB);

@kurapov-peter (Contributor):

Yup. Sounds good!

@dchigarev merged commit 1fee896 into intel:main on Sep 30, 2024
6 checks passed
Merging this pull request closes: [LinalgToXeGPU] Support conversion for linalg.matmul with transpose_b (#340)