[Transform][Vectorization] canonicalize vector with physical vector #100
Conversation
Example: given matmul + relu:

func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
    -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>
// Elementwise addition.
%biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
// Elementwise max with 0 (ReLU).
%c0f = arith.constant 0.0 : f32
// expected-remark @below {{elementwise binary}}
%relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
ins(%biased, %c0f : tensor<512x512xf32>, f32)
outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
func.return %relued : tensor<512x512xf32>
}

// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
%cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
%cst_0 = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
// The matmul op is left for brgemm lowering to do the optimization.
%0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
%1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
%2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
%3 = arith.addf %1, %2 : vector<512x512xf32>
%4 = arith.maximumf %3, %cst : vector<512x512xf32>
%5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
return %5 : tensor<512x512xf32>
}

// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //
func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
%c16 = arith.constant 16 : index
%c512 = arith.constant 512 : index
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%cst = arith.constant dense<0.000000e+00> : vector<16xf32>
%cst_0 = arith.constant 0.000000e+00 : f32
// The matmul op is left for brgemm lowering to do the optimization.
%0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
%1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
%2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
%3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
%4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
%5 = arith.addf %4, %3 : vector<16xf32>
%6 = arith.maximumf %5, %cst : vector<16xf32>
%7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
scf.yield %7 : tensor<512x512xf32>
}
scf.yield %2 : tensor<512x512xf32>
}
return %1 : tensor<512x512xf32>
}
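The second dump above shows what the pass does: the 512x512 virtual vector from the first dump is split into a loop nest that processes one 16-lane physical vector per inner iteration. As a rough scalar sketch of the computation that tiled loop nest performs (names, sizes, and the helper function are illustrative, not part of the pass):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Scalar sketch of the tiled bias-add + ReLU in the IR dump above:
// the outer scf.for walks rows (step 1), the inner scf.for walks
// columns in steps of 16, and each inner iteration models one
// vector<16xf32> transfer_read / addf / maximumf / transfer_write.
constexpr int kDim = 512;   // tensor<512x512xf32>
constexpr int kVecLen = 16; // vector<16xf32>, one physical register

void biasAddRelu(const std::vector<float> &matmulOut,
                 const std::vector<float> &bias,
                 std::vector<float> &out) {
  for (int i = 0; i < kDim; ++i) {            // scf.for %arg4 (step %c1)
    for (int j = 0; j < kDim; j += kVecLen) { // scf.for %arg6 (step %c16)
      for (int l = 0; l < kVecLen; ++l) {     // lanes of one vector<16xf32>
        float sum = matmulOut[i * kDim + j + l] + bias[i * kDim + j + l];
        out[i * kDim + j + l] = std::max(sum, 0.0f); // maximumf with 0 (ReLU)
      }
    }
  }
}
```

The key effect is that no vector value wider than the 16-lane physical register is ever live, while the memory access pattern stays row-contiguous.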
@BRUCE11111, do we need a microkernel definition for this first?
Hi~ Peter! Thanks for the suggestion! What does microkernel mean here? And why do you think we need it?
I think Petr's question comes from your example: to fully handle matmul lowering, we need a microkernel definition to provide brgemm lowering. Matmul lowering and brgemm lowering are not part of this PR. Please consider providing another example to avoid confusion, like RMSNorm.
lib/gc/Transforms/Pipeline.cpp
Outdated
Please add the related pass into the pipeline.
lib/gc/Transforms/TilingVector.h
Outdated
  Value forResult);
};

class VectorOperationAnalyzer : virtual public CanonicalizerCommonUsedData {
It's better to implement it as an analysis in MLIR and separate it into the analysis folder. https://mlir.llvm.org/docs/PassManagement/#pass-manager
  arith::TruncFOp, arith::TruncIOp

#define NOT_NEED_TO_PROCESS_OP \
  linalgx::BatchReduceMatmulVnniOp, linalgx::MultiBatchMatmulOp, \
The linalgx ops will be deprecated; please switch to the linalg.generic op.
  // on it. Therefore, `empty tensor`, `transfer_write` and `transfer_read`
  // need to be inserted at the target place.
  if (enableDebugPrinter) {
    printGroupOps(getFusionStrategy().getOpGroups());
Please use LLVM_DEBUG(llvm::dbgs() << ...) for the debug print.
    // invalid hardware
    LDBG("Please check the hardware information.");
    assert(false && "Invalid hardware.");
Please use llvm_unreachable("...")
    return;
  }
  // affineApply operation is always used by other operations.
  std::function<bool(Operation *)> candidateFunc = isUsedByOtherOp;
Better to rename this function or use a lambda function here; isUsedByOtherOp is confusing here.
  TypedAttr attr;
  if (isa<FloatType>(newOperandType.getElementType())) {
    getConstantDenseAttr<DenseFPElementsAttr>(attr, newOperandType, valueType);
The braces are unneeded.
  mlir::FailureOr<VectorType> baseType = getOperationVectorType(op);
  if (failed(baseType)) {
    LDBG("Failed to get vector type for operation: " << *op << "\n");
    assert(0 && "Failed to get vector type for operation");
llvm_unreachable(...)
/// Constructs the 16 bit representation for a half precision value from a float
/// value. This implementation is adapted from Eigen.
uint16_t float2half(float floatValue) {
Better to move this code to a util.cpp.
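For reference, a self-contained sketch of what such a helper in a util.cpp could look like. This is not the PR's Eigen-derived code; it is a hypothetical float-to-binary16 conversion (round to nearest even for normal halves, truncation for subnormals) for illustration only:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical float -> IEEE-754 binary16 conversion. Illustrative only;
// the PR's actual implementation is adapted from Eigen.
uint16_t float2half(float floatValue) {
  uint32_t bits;
  std::memcpy(&bits, &floatValue, sizeof(bits)); // type-pun without UB

  uint16_t sign = static_cast<uint16_t>((bits >> 16) & 0x8000u);
  int32_t exponent = static_cast<int32_t>((bits >> 23) & 0xFFu) - 127;
  uint32_t mantissa = bits & 0x7FFFFFu;

  if (exponent == 128) // Inf or NaN.
    return sign | 0x7C00u | (mantissa ? 0x200u : 0u);
  if (exponent > 15) // Too large for half: +/-Inf.
    return sign | 0x7C00u;
  if (exponent < -24) // Too small even for a subnormal half: +/-0.
    return sign;
  if (exponent < -14) { // Subnormal half (lost bits are truncated here).
    uint32_t sig = mantissa | 0x800000u; // restore the implicit leading 1
    return sign | static_cast<uint16_t>(sig >> (-exponent - 1));
  }
  // Normal half: re-biased exponent plus the top 10 mantissa bits.
  uint16_t half = sign |
                  static_cast<uint16_t>((exponent + 15) << 10) |
                  static_cast<uint16_t>(mantissa >> 13);
  uint32_t rest = mantissa & 0x1FFFu; // the 13 bits we dropped
  if (rest > 0x1000u || (rest == 0x1000u && (half & 1u)))
    ++half; // round to nearest even; a carry overflows into the exponent
  return half;
}
```

Keeping it in a util.cpp lets other passes that materialize f16 constants reuse it.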
Waiting for the community's PR to merge to fix the remaining errors on CI.
Performance data:
Tracking issue 331
Tasks:
- vector.multi_reduction with graph compiler reduce implementation.
- vector.transpose with graph compiler transpose implementation.
- vector.broadcast.
- vector.shapecast with graph compiler reorder implementation.

Performance data: