
[Transform][Vectorization] canonicalize vector with physical vector #100

Open · wants to merge 72 commits into main
Conversation

@BRUCE11111 (Contributor) commented May 27, 2024

Tracking issue: #331

Tasks:

  • Lower linalg named operations and some tensor operations to math/arith.
  • Lower elementwise operations to physical vectors.
  • Fuse elementwise operations.
  • Migrate vector.multi_reduction to the graph compiler reduce implementation.
  • Migrate vector.transpose to the graph compiler transpose implementation.
  • Optimize vector.broadcast.
  • Migrate vector.shape_cast to the graph compiler reorder implementation.
  • Fuse reduce operations with elementwise operations.
  • Fuse reorder operations.
  • Fuse transpose operations.
  • Fuse broadcast operations.

Performance data:

| Operation | Shape | Without this PR | With this PR | Improvement | Comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |

@BRUCE11111 BRUCE11111 added the WIP work in progress label May 27, 2024
@BRUCE11111 BRUCE11111 changed the title [Transform][Vectorize] lower linalg named op to math/arith and canonicalize vector with physical vector [Transform][Vectorization] lower linalg named op to math/arith and canonicalize vector with physical vector May 27, 2024
@BRUCE11111 (Contributor, Author) commented May 27, 2024

Example:

Given matmul + relu:

func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}

// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //

func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // the matmul op is left in place; it will be lowered to brgemm for optimization
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}

// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //

func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // the matmul op is left in place; it will be lowered to brgemm for optimization
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
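Note how the pass tiles the 512x512 virtual vector into 16-element physical vectors, i.e. one 512-bit register of f32 lanes per iteration of the inner loop. A minimal sketch of how such a step could be derived (the register-width parameter is an assumption about the target description, not this PR's API):

```cpp
#include "mlir/IR/Types.h"
#include <cstdint>

// Sketch: physical vector length = register width / element width.
// regWidthBits would come from the target description, e.g. 512 on AVX-512.
int64_t getPhysicalVectorStep(mlir::Type elemType, int64_t regWidthBits) {
  int64_t elemBits = elemType.getIntOrFloatBitWidth(); // 32 for f32
  return regWidthBits / elemBits;                      // 512 / 32 = 16
}
```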

@kurapov-peter (Contributor) commented:

@BRUCE11111, do we need a microkernel definition for this first?

@BRUCE11111 (Contributor, Author) commented:

> @BRUCE11111, do we need a microkernel definition for this first?

Hi~ Peter! Thanks for the suggestion! What does "microkernel" mean here, and why do you think we need it?

@ZhennanQin (Contributor) commented:

> @BRUCE11111, do we need a microkernel definition for this first?

> Hi~ Peter! Thanks for the suggestion! What does "microkernel" mean here, and why do you think we need it?

I think Petr's question comes from your example: to fully handle the matmul lowering, we need a microkernel definition to provide the brgemm lowering.

Matmul and brgemm lowering are not part of this PR. Please consider providing another example, such as RMSNorm, to avoid confusion.

@BRUCE11111 BRUCE11111 added ready to review and removed WIP work in progress labels Sep 10, 2024
Member (review comment):

Please add the related pass into the pipeline.

Value forResult);
};

class VectorOperationAnalyzer : virtual public CanonicalizerCommonUsedData {
Member (review comment):

It's better to implement this as an analysis in MLIR and separate it into the analysis folder: https://mlir.llvm.org/docs/PassManagement/#pass-manager
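For reference, a minimal sketch of that shape (class and member names here are illustrative, not this PR's API):

```cpp
#include "mlir/IR/Operation.h"

// Sketch: MLIR analyses are constructed from the operation they inspect
// and are cached by the pass manager.
struct VectorOperationAnalysis {
  explicit VectorOperationAnalysis(mlir::Operation *root) {
    root->walk([&](mlir::Operation *op) {
      // ... collect per-op vector type / fusion info here ...
    });
  }
};

// Inside a pass, the cached result would then be obtained with:
//   auto &analysis = getAnalysis<VectorOperationAnalysis>();
```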

arith::TruncFOp, arith::TruncIOp

#define NOT_NEED_TO_PROCESS_OP \
linalgx::BatchReduceMatmulVnniOp, linalgx::MultiBatchMatmulOp, \
Member (review comment):

The linalgx ops will be deprecated; please switch to the linalg.generic op.

// on it. Therefore, `empty tensor`, `transfer_write` and `transfer_read`
// need to be inserted at target place.
if (enableDebugPrinter) {
printGroupOps(getFusionStrategy().getOpGroups());
Member (review comment):

Please use LLVM_DEBUG(llvm::dbgs() << ...) for debug printing.
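For instance (a sketch; the DEBUG_TYPE string is illustrative):

```cpp
#include "llvm/Support/Debug.h"

#define DEBUG_TYPE "cpu-physical-register-pass"

// Compiled out in release builds; enabled at runtime with
// -debug or -debug-only=cpu-physical-register-pass.
LLVM_DEBUG(llvm::dbgs() << "Op groups after analysis:\n");
LLVM_DEBUG(printGroupOps(getFusionStrategy().getOpGroups()));
```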


// invalid hardware
LDBG("Please check the hardware information.");
assert(false && "Invalid hardware.");
Member (review comment):

Please use llvm_unreachable("...").
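i.e. (sketch of the suggested replacement):

```cpp
#include "llvm/Support/ErrorHandling.h"

// invalid hardware
llvm_unreachable("Invalid hardware; please check the hardware information.");
```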

return;
}
// affineApply operation is always used by other operations.
std::function<bool(Operation *)> candidateFunc = isUsedByOtherOp;
Member (review comment):

Better to rename this function or use a lambda here; isUsedByOtherOp is a confusing name in this context.
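For example, an inline lambda keeps the intent readable at the use site (a sketch; the predicate body is an assumption about what isUsedByOtherOp checks):

```cpp
// Sketch: spell out the predicate instead of the ambiguous name.
auto candidateFunc = [](Operation *op) {
  // Assumed intent: the affineApply result feeds at least one other op.
  return !op->use_empty();
};
```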


TypedAttr attr;
if (isa<FloatType>(newOperandType.getElementType())) {
getConstantDenseAttr<DenseFPElementsAttr>(attr, newOperandType, valueType);
Member (review comment):

The braces are unneeded.
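i.e., per LLVM style, a single-statement body drops the braces:

```cpp
if (isa<FloatType>(newOperandType.getElementType()))
  getConstantDenseAttr<DenseFPElementsAttr>(attr, newOperandType, valueType);
```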

mlir::FailureOr<VectorType> baseType = getOperationVectorType(op);
if (failed(baseType)) {
LDBG("Failed to get vector type for operation: " << *op << "\n");
assert(0 && "Failed to get vector type for operation");
Member (review comment):

Use llvm_unreachable(...) here as well.


/// Constructs the 16 bit representation for a half precision value from a float
/// value. This implementation is adapted from Eigen.
uint16_t float2half(float floatValue) {
Member (review comment):

Better to move this code to a utils.cpp.
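e.g., a shared declaration along these lines (file and namespace names are placeholders):

```cpp
// VectorUtils.h (hypothetical file name)
#include <cstdint>

namespace utils {
/// Constructs the 16-bit half-precision representation of a float value.
uint16_t float2half(float floatValue);
} // namespace utils
```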

@BRUCE11111 (Contributor, Author) commented:

Waiting for the community PR to merge, which will fix the remaining CI errors.

@BRUCE11111 (Contributor, Author) commented Sep 20, 2024

Performance data:

| Operation | Shape | Without this PR | With this PR | Improvement | Comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |
