[mlir][scf] Extend fuse producer to multi-level candidates case #97803
Open
Yun-Fly wants to merge 2 commits into llvm:main from Yun-Fly:yunfei/fuse_producer_multilevel_slice
86 changes: 86 additions & 0 deletions
86
mlir/test/Interfaces/TilingInterface/tile-and-fuse-producer.mlir
// RUN: mlir-opt --transform-interpreter --cse --split-input-file %s | FileCheck %s

#map = affine_map<(d0) -> (d0 * 128)>
module {
  func.func @gemm_fill_fusion_multi_level_extract_slice(%arg0: tensor<256x512xf32>, %arg1: tensor<512x256xf32>, %arg2: tensor<256x256xf32>) -> tensor<256x256xf32> {
    %c0 = arith.constant 0 : index
    %c64 = arith.constant 64 : index
    %c128 = arith.constant 128 : index
    %cst = arith.constant 0.000000e+00 : f32
    %dest0 = tensor.empty() : tensor<256x256xf32>
    %dest1 = linalg.fill ins(%cst : f32) outs(%dest0 : tensor<256x256xf32>) -> tensor<256x256xf32>
    %1 = scf.forall (%arg3, %arg4) in (2, 2) shared_outs(%arg5 = %dest1) -> tensor<256x256xf32> {
      %iv0 = affine.apply #map(%arg3)
      %iv1 = affine.apply #map(%arg4)
      %extracted_slice_1 = tensor.extract_slice %arg5[%iv0, %iv1] [128, 128] [1, 1] : tensor<256x256xf32> to tensor<128x128xf32>
      %extracted_slice_2 = tensor.extract_slice %arg0[%iv0, 0] [128, 512] [1, 1] : tensor<256x512xf32> to tensor<128x512xf32>
      %extracted_slice_3 = tensor.extract_slice %arg1[0, %iv1] [512, 128] [1, 1] : tensor<512x256xf32> to tensor<512x128xf32>
      %2 = scf.for %arg6 = %c0 to %c128 step %c64 iter_args(%arg7 = %extracted_slice_1) -> (tensor<128x128xf32>) {
        %3 = scf.for %arg8 = %c0 to %c128 step %c64 iter_args(%arg9 = %arg7) -> (tensor<128x128xf32>) {
          %extracted_slice_4 = tensor.extract_slice %arg9[%arg6, %arg8] [64, 64] [1, 1] : tensor<128x128xf32> to tensor<64x64xf32>
          %extracted_slice_5 = tensor.extract_slice %extracted_slice_2[%arg6, 0] [64, 512] [1, 1] : tensor<128x512xf32> to tensor<64x512xf32>
          %extracted_slice_6 = tensor.extract_slice %extracted_slice_3[0, %arg8] [512, 64] [1, 1] : tensor<512x128xf32> to tensor<512x64xf32>
          %4 = linalg.matmul ins(%extracted_slice_5, %extracted_slice_6 : tensor<64x512xf32>, tensor<512x64xf32>) outs(%extracted_slice_4 : tensor<64x64xf32>) -> tensor<64x64xf32>
          %insert_slice = tensor.insert_slice %4 into %arg9[%arg6, %arg8] [64, 64] [1, 1] : tensor<64x64xf32> into tensor<128x128xf32>
          scf.yield %insert_slice : tensor<128x128xf32>
        }
        scf.yield %3 : tensor<128x128xf32>
      }
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %2 into %arg5[%iv0, %iv1] [128, 128] [1, 1] : tensor<128x128xf32> into tensor<256x256xf32>
      }
    }
    return %1 : tensor<256x256xf32>
  }
}

module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1 : !transform.any_op {transform.readonly}) {
    %matmul = transform.structured.match ops{["linalg.matmul"]} in %arg1
      : (!transform.any_op) -> !transform.any_op
    %yield = transform.get_producer_of_operand %matmul[2]
      : (!transform.any_op) -> !transform.any_op
    %a, %b = transform.test.fuse_producer %yield
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    transform.yield
  }
}

// CHECK: #[[MAP0:.*]] = affine_map<(d0) -> (d0 * 128)>
// CHECK: func.func @gemm_fill_fusion_multi_level_extract_slice(
// CHECK-SAME: %[[ARG0:[a-zA-Z0-9]+]]: tensor<256x512xf32>
// CHECK-SAME: %[[ARG1:[a-zA-Z0-9]+]]: tensor<512x256xf32>
// CHECK-SAME: %[[ARG2:[a-zA-Z0-9]+]]: tensor<256x256xf32>
// CHECK: %[[C0:.*]] = arith.constant 0 : index
// CHECK: %[[dest0:.*]] = tensor.empty() : tensor<256x256xf32>
// CHECK: %[[FORALL_RESULT:.*]] = scf.forall (%[[IV1:.*]], %[[IV2:.*]]) in (2, 2)
// CHECK-SAME: shared_outs(%[[INIT_ARG0:.*]] = %[[dest0]])
// CHECK-SAME: {
// CHECK: %[[AFFINE_IV1:.*]] = affine.apply #[[MAP0]](%[[IV1]])
// CHECK: %[[AFFINE_IV2:.*]] = affine.apply #[[MAP0]](%[[IV2]])
// CHECK: %[[FILL_OUT_SLICE0:.*]] = tensor.extract_slice %[[INIT_ARG0]][%[[AFFINE_IV1]], %[[AFFINE_IV2]]] [128, 128] [1, 1]
// CHECK: %[[INPUT_SLICE0:.*]] = tensor.extract_slice %[[ARG0]][%[[AFFINE_IV1]], 0] [128, 512] [1, 1]
// CHECK: %[[WEIGHT_SLICE0:.*]] = tensor.extract_slice %[[ARG1]][0, %[[AFFINE_IV2]]] [512, 128] [1, 1]
// CHECK: %[[LOOP_RESULT1:.*]] = scf.for %[[IV3:.*]] = %[[C0]]
// CHECK-SAME: iter_args(%[[INIT_ARG1:.*]] = %[[FILL_OUT_SLICE0]])
// CHECK-SAME: {
// CHECK: %[[LOOP_RESULT2:.*]] = scf.for %[[IV4:.*]] = %[[C0]]
// CHECK-SAME: iter_args(%[[INIT_ARG2:.*]] = %[[INIT_ARG1]])
// CHECK-SAME: {
// CHECK: %[[FILL_OUT_SLICE1:.*]] = tensor.extract_slice %[[INIT_ARG2]][%[[IV3]], %[[IV4]]] [64, 64] [1, 1]
// CHECK: %[[TILED_FILL_OUT:.*]] = linalg.fill
// CHECK-SAME: outs(%[[FILL_OUT_SLICE1]] :
// CHECK: %[[INPUT_SLICE1:.*]] = tensor.extract_slice %[[INPUT_SLICE0]][%[[IV3]], 0] [64, 512] [1, 1]
// CHECK: %[[WEIGHT_SLICE1:.*]] = tensor.extract_slice %[[WEIGHT_SLICE0]][0, %[[IV4]]] [512, 64] [1, 1]
// CHECK: %[[TILED_MAT_OUT:.*]] = linalg.matmul
// CHECK-SAME: outs(%[[TILED_FILL_OUT]] :
// CHECK: %[[INSERT_MAT:.*]] = tensor.insert_slice %[[TILED_MAT_OUT]] into %[[INIT_ARG2]][%[[IV3]], %[[IV4]]] [64, 64] [1, 1]
// CHECK: scf.yield %[[INSERT_MAT]] :
// CHECK: }
// CHECK: scf.yield %[[LOOP_RESULT2]] :
// CHECK: }
// CHECK: scf.forall.in_parallel {
// CHECK: tensor.parallel_insert_slice %[[LOOP_RESULT1]] into %[[INIT_ARG0]][%[[AFFINE_IV1]], %[[AFFINE_IV2]]] [128, 128] [1, 1]
// CHECK: }
// CHECK: }
// CHECK: return %[[FORALL_RESULT]] :
Instead of doing this, would it help if you just folded the extract_slices using something like this pattern?
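For reference, a minimal hypothetical example of the folding such a pattern performs when the two slices are directly consecutive, with no loop in between (the names and shapes here are illustrative, not taken from this PR):

```mlir
func.func @fold_consecutive_slices(%src: tensor<256x256xf32>) -> tensor<64x64xf32> {
  // Two back-to-back slices of the same value ...
  %a = tensor.extract_slice %src[64, 0] [128, 128] [1, 1] : tensor<256x256xf32> to tensor<128x128xf32>
  %b = tensor.extract_slice %a[32, 16] [64, 64] [1, 1] : tensor<128x128xf32> to tensor<64x64xf32>
  // ... are equivalent to a single slice of %src with the offsets combined:
  //   tensor.extract_slice %src[96, 16] [64, 64] [1, 1]
  // which is the rewrite a merge-consecutive-extract-slice style pattern performs.
  return %b : tensor<64x64xf32>
}
```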
I know that we can merge consecutive extractSlice ops, but we have to keep these consecutive extractSlice ops around for selection. The most intuitive example is sketched below: in some cases, the producer can only be fused into the outer candidate slice because of the semantics of the producer, i.e. ops like reduce/pack/unpack have specific demands on the tiling size. If we merge the two consecutive extractSlice ops, it may break this fusion. That is why we emphasize multi-level candidates here.
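The inline IR referenced in this comment is not preserved on this page. What follows is a minimal hypothetical sketch of the shape being described, modeled on the test above; the SSA names %2/%4 and the linalg.exp consumer are illustrative and only chosen to match the later discussion:

```mlir
#map = affine_map<(d0) -> (d0 * 128)>
func.func @multi_level_candidates(%arg0: tensor<256x256xf32>) -> tensor<256x256xf32> {
  %c0 = arith.constant 0 : index
  %c64 = arith.constant 64 : index
  %c128 = arith.constant 128 : index
  %cst = arith.constant 0.000000e+00 : f32
  %empty = tensor.empty() : tensor<256x256xf32>
  // Untiled producer whose result is sliced at two nesting levels below.
  %1 = linalg.fill ins(%cst : f32) outs(%empty : tensor<256x256xf32>) -> tensor<256x256xf32>
  %0 = scf.forall (%i, %j) in (2, 2) shared_outs(%out = %arg0) -> (tensor<256x256xf32>) {
    %iv0 = affine.apply #map(%i)
    %iv1 = affine.apply #map(%j)
    // Outer candidate slice (%2): a 128x128 tile of the producer result.
    %2 = tensor.extract_slice %1[%iv0, %iv1] [128, 128] [1, 1] : tensor<256x256xf32> to tensor<128x128xf32>
    %3 = scf.for %k = %c0 to %c128 step %c64 iter_args(%acc = %2) -> (tensor<128x128xf32>) {
      // Inner candidate slice (%4): taken from the loop iter_arg, so the untiled
      // producer %1 is only reachable by looking through %2 and the init of %acc.
      %4 = tensor.extract_slice %acc[%k, 0] [64, 128] [1, 1] : tensor<128x128xf32> to tensor<64x128xf32>
      %dest = tensor.empty() : tensor<64x128xf32>
      %5 = linalg.exp ins(%4 : tensor<64x128xf32>) outs(%dest : tensor<64x128xf32>) -> tensor<64x128xf32>
      %6 = tensor.insert_slice %5 into %acc[%k, 0] [64, 128] [1, 1] : tensor<64x128xf32> into tensor<128x128xf32>
      scf.yield %6 : tensor<128x128xf32>
    }
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %3 into %out[%iv0, %iv1] [128, 128] [1, 1] : tensor<128x128xf32> into tensor<256x256xf32>
    }
  }
  return %0 : tensor<256x256xf32>
}
```

Both %2 and %4 are candidate extract_slice ops for fusing the fill, but only the inner %4 is visible from the operand of the tiled consumer.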
That is OK, but the point here is that to fuse with %2 you don't need to look through extract_slices... I am just not sure why there would be a case where you need to look through extract_slices to get to the producer but you can't collapse all the extract slices to begin with.
That is just a simple test. If we have one more candidate slice, as in the sketch above where %2 and %4 are both valid candidates for fusing the producer, the user may just want to manually pass %4 to tileAndFuseProducerOfSlice, because %4 usually has the smaller tile size in most cases. Then the current getUntiledProducerFromSliceSource will fail to get the real producer without looking through extract_slices.

Another use case is someone wanting to enable automatic fusion starting from the tiled consumer, just like what tileConsumerAndFuseProducersUsingSCF does. The available extract slice we can find via the operand of the tiled consumer is only the nearest (inner-most) candidate, so we also need to look through all the extract_slices to get the real producer.

Meanwhile, we don't know which candidate slice can be fused until we know what the exact producer is. Thus, it is unreasonable to collapse all the extract slices before fusion.
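A sketch of how the inner candidate could be targeted manually, mirroring the transform script of the test above but applied to the hypothetical IR sketched earlier (the linalg.exp match and the operand index are assumptions of that sketch, not part of this PR):

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%arg1 : !transform.any_op {transform.readonly}) {
    // Find the tiled consumer and take the slice feeding its input operand:
    // this yields the nearest (inner-most) candidate %4, not the producer itself.
    %consumer = transform.structured.match ops{["linalg.exp"]} in %arg1
      : (!transform.any_op) -> !transform.any_op
    %slice = transform.get_producer_of_operand %consumer[0]
      : (!transform.any_op) -> !transform.any_op
    // Fusing through this slice requires looking through the outer candidate %2
    // (and the loop iter_arg) to reach the real producer %1.
    %a, %b = transform.test.fuse_producer %slice
      : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
    transform.yield
  }
}
```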
I am not sure this reasoning holds. If you want to fuse %4 and not %2, you are effectively "fusing" the slices and then doing the producer fusion. It would be better to stage it: fuse the slices first, so that you get the producer directly, and then do the fusion. Effectively this change is trying to do both of these in code, which is just added complexity. It would be easier to combine the extract slices first and then fuse the producer.
Besides, I am worried that merging two candidates interleaved by an scf.for, with its inits involved, is not trivial...
Hi, @MaheshRavishankar. Since PR #108318 has been broken down into three parts, and the first part regarding scf.for has been merged (great thanks for your review!), I think the second part is similar to this patch, no matter whether it is producer or consumer fusion. In general, this patch focuses on how to deal with consecutive candidates interleaved by an scf.for.

Let's look back at the major remaining argument, to speed up our next stage of review. The answer is listed in the several threads above; the root cause behind it is that there are two different solutions to the multi-level candidates case: merging the consecutive extract_slices first, or looking through them iteratively.

IMO, the biggest concern about the first solution is that it may introduce too many additional transforms regarding the scf.for. E.g., as I explained above, this scenario is a little different from what MergeConsecutiveExtractSlice expects, due to the scf.for. If we just merge the consecutive extract slices (BTW, here we also need to look through and penetrate the scf.for), the problem with the possible resultant IR (sketched below) is how we modify the new inits of the scf.for with the latest dpsInit of the fused producerOp (getUntiledProducerFromSliceSource will fail to find a destinationIterArg). On the other hand, if we modify %3 = %2 at the same time, this transform introduces even more code changes and relies on the assumption that %3 has only a single user. Moreover, considering the semantics of scf.for, it seems we would also need to modify the yield value %x (and even its producer, like the insert_slice) to make the result of the scf.for match its inits.

In summary, I am afraid that this kind of complexity is much more than the second solution, which is a simple iterative process reusing the existing function.
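The "possible resultant IR" from this comment is not preserved here; the following is a hypothetical fragment of the earlier sketch after naively folding the two slices, with the surrounding function and definitions unchanged:

```mlir
%2 = tensor.extract_slice %1[%iv0, %iv1] [128, 128] [1, 1] : tensor<256x256xf32> to tensor<128x128xf32>
%3 = scf.for %k = %c0 to %c128 step %c64 iter_args(%acc = %2) -> (tensor<128x128xf32>) {
  // The merged slice now reads the untiled producer %1 directly with combined offsets ...
  %off = arith.addi %iv0, %k : index
  %4 = tensor.extract_slice %1[%off, %iv1] [64, 128] [1, 1] : tensor<256x256xf32> to tensor<64x128xf32>
  // ... but the loop still carries %acc, whose init is %2. After fusing the producer
  // on %4, the scf.for inits/yields and the insert_slice below would all need to be
  // rewritten so the loop-carried value matches the fused producer's dpsInit
  // (getUntiledProducerFromSliceSource no longer finds a destinationIterArg here).
  %dest = tensor.empty() : tensor<64x128xf32>
  %5 = linalg.exp ins(%4 : tensor<64x128xf32>) outs(%dest : tensor<64x128xf32>) -> tensor<64x128xf32>
  %6 = tensor.insert_slice %5 into %acc[%k, 0] [64, 128] [1, 1] : tensor<64x128xf32> into tensor<128x128xf32>
  scf.yield %6 : tensor<128x128xf32>
}
```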
Thanks for the explanation. I see the issues a bit more clearly now. I haven't reviewed the code yet, and I don't know if this has been rebased on top of what was submitted.

But here is the complexity that I am concerned about. In your example, you need to keep track somewhere of the sequence of extract slices you have to walk to get to the producer, because the actual offset and size you need are obtained by "combining" the offsets and sizes of all the slices into the "real" offset and size. Also, I am not convinced that fusing the first extract slice + producer and then doing the second extract slice + producer is not feasible. That should always be possible: starting from the multi-level slice IR, you should always be able to fuse the producer through the outer slice first, and then repeat the step through the inner slice.

The amount of state you need to carry during the transformation to fuse with one extract slice is prohibitively high IMO. That will make it hard to change/fix the transformation.
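The three IR snippets this comment refers to are likewise not preserved; a hypothetical fragment of the earlier sketch after the first staged step (fusing the fill through the outer slice %2 only, surrounding function unchanged) might look like:

```mlir
// The fill is now tiled to the 128x128 outer tile and feeds the loop init directly.
%2 = tensor.extract_slice %empty[%iv0, %iv1] [128, 128] [1, 1] : tensor<256x256xf32> to tensor<128x128xf32>
%fill_outer = linalg.fill ins(%cst : f32) outs(%2 : tensor<128x128xf32>) -> tensor<128x128xf32>
%3 = scf.for %k = %c0 to %c128 step %c64 iter_args(%acc = %fill_outer) -> (tensor<128x128xf32>) {
  // A second, independent fusion step through %4 would then tile the fill
  // further down to the 64x128 inner tile.
  %4 = tensor.extract_slice %acc[%k, 0] [64, 128] [1, 1] : tensor<128x128xf32> to tensor<64x128xf32>
  %dest = tensor.empty() : tensor<64x128xf32>
  %5 = linalg.exp ins(%4 : tensor<64x128xf32>) outs(%dest : tensor<64x128xf32>) -> tensor<64x128xf32>
  %6 = tensor.insert_slice %5 into %acc[%k, 0] [64, 128] [1, 1] : tensor<64x128xf32> into tensor<128x128xf32>
  scf.yield %6 : tensor<128x128xf32>
}
```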
Not rebased yet, because I think it is more important to reach agreement on the solution before pushing...

Yes, from the view of consequentialism, the real offset is indeed combined. However, there is something else we need to address, like what I explained above regarding the loop transform: each time two extract_slices interleaved by an scf.for are merged, that scf.for (and its yield value, etc.) needs to be transformed as well. Is the amount of state that has to be carried really any smaller than what I do in the iterative fashion?

Whether the staged fusion is feasible is decided by the concrete semantics of the tilable op, such as a reduce/pack/unpack op. Let's say we fuse a producer unpack into a loop nest like the one sketched below: the tile size coming from extract_slice2 is <32, 16>, but tensor.unpack prefers the perfect tiling case, i.e. the tile size should be exactly divisible by inner_tiles. So it may not be feasible in this case. BTW, fusing a consumer reduce may be more intuitive: we could not further fuse the reduce into the inner-most insert slice, otherwise it would lead to a partial reduction (but that is another topic).

Yes, fusing through the outer slice first and then through the inner one is exactly what I do now. IIRC, that is also what you suggested before in the nested consumer fusion thread... (In fact, my first version of the nested consumer fusion implementation just combined the offsets and sizes of the several candidate slices, but it was rejected because of too many code changes...) The current iterative method breaks the whole transformation down into several repetitions of the existing tileAndFuseProducerOfSlice logic, which should be much easier to debug, at least from my view.
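The unpack loop referenced above is not preserved on this page; a hypothetical, self-contained sketch of the situation being described (the shapes, the inner_tiles = [32, 32], and the linalg.exp consumer are all illustrative assumptions):

```mlir
func.func @unpack_producer_fusion(%src: tensor<8x8x32x32xf32>, %out: tensor<256x256xf32>) -> tensor<256x256xf32> {
  %c0 = arith.constant 0 : index
  %c16 = arith.constant 16 : index
  %c32 = arith.constant 32 : index
  %c256 = arith.constant 256 : index
  %empty = tensor.empty() : tensor<256x256xf32>
  // Producer: unpack with inner_tiles = [32, 32]; it prefers tile sizes that are
  // exact multiples of the inner tiles.
  %unpacked = tensor.unpack %src inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %empty
      : tensor<8x8x32x32xf32> -> tensor<256x256xf32>
  %0 = scf.for %i = %c0 to %c256 step %c32 iter_args(%arg2 = %out) -> (tensor<256x256xf32>) {
    // extract_slice1: a 32x256 tile, a multiple of the 32x32 inner tiles, so the
    // unpack can be fused at this level.
    %slice1 = tensor.extract_slice %unpacked[%i, 0] [32, 256] [1, 1] : tensor<256x256xf32> to tensor<32x256xf32>
    %1 = scf.for %j = %c0 to %c256 step %c16 iter_args(%arg3 = %arg2) -> (tensor<256x256xf32>) {
      // extract_slice2: a 32x16 tile, not divisible by inner_tiles = [32, 32], so
      // fusing the unpack at this inner level is not cleanly feasible.
      %slice2 = tensor.extract_slice %slice1[0, %j] [32, 16] [1, 1] : tensor<32x256xf32> to tensor<32x16xf32>
      %dest = tensor.empty() : tensor<32x16xf32>
      %exp = linalg.exp ins(%slice2 : tensor<32x16xf32>) outs(%dest : tensor<32x16xf32>) -> tensor<32x16xf32>
      %2 = tensor.insert_slice %exp into %arg3[%i, %j] [32, 16] [1, 1] : tensor<32x16xf32> into tensor<256x256xf32>
      scf.yield %2 : tensor<256x256xf32>
    }
    scf.yield %1 : tensor<256x256xf32>
  }
  return %0 : tensor<256x256xf32>
}
```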
Ping