
[Transform][Vectorization] canonicalize vector with physical vector #100

Open · wants to merge 72 commits into main
Conversation

@BRUCE11111 (Contributor) commented May 27, 2024

Tracking issue: #331

Tasks:

  • Lower linalg named operations and some tensor operations to math/arith.
  • Lower elementwise operations to physical vectors.
  • Fuse elementwise operations.
  • Migrate vector.multi_reduction to the graph compiler reduce implementation.
  • Migrate vector.transpose to the graph compiler transpose implementation.
  • Optimize vector.broadcast.
  • Migrate vector.shape_cast to the graph compiler reorder implementation.
  • Fuse reduce operations with elementwise operations.
  • Fuse reorder operations.
  • Fuse transpose operations.
  • Fuse broadcast operations.

Performance data:

| Operation | Shape | Without this PR | With this PR | Improvement | Comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |

@BRUCE11111 BRUCE11111 added the WIP work in progress label May 27, 2024
@BRUCE11111 BRUCE11111 changed the title [Transform][Vectorize] lower linalg named op to math/arith and canonicalize vector with physical vector [Transform][Vectorization] lower linalg named op to math/arith and canonicalize vector with physical vector May 27, 2024
@BRUCE11111 (Contributor, Author) commented May 27, 2024

Example:

Given matmul + relu:

func.func @fc_relu(%lhs: tensor<512x512xf32>, %rhs: tensor<512x512xf32>,
                   %bias: tensor<512x512xf32>, %output: tensor<512x512xf32>)
                   -> tensor<512x512xf32> {
  %matmul = linalg.matmul ins(%lhs, %rhs: tensor<512x512xf32>, tensor<512x512xf32>)
                          outs(%output: tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise addition.
  %biased = linalg.elemwise_binary { fun = #linalg.binary_fn<add> }
    ins(%matmul, %bias : tensor<512x512xf32>, tensor<512x512xf32>)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>

  // Elementwise max with 0 (ReLU).
  %c0f = arith.constant 0.0 : f32
  // expected-remark @below {{elementwise binary}}
  %relued = linalg.elemwise_binary { fun = #linalg.binary_fn<max_signed> }
    ins(%biased, %c0f : tensor<512x512xf32>, f32)
    outs(%output : tensor<512x512xf32>) -> tensor<512x512xf32>
  func.return %relued : tensor<512x512xf32>
}

// -----// IR Dump After LowerToTileVector (lower-to-tile-vector) //----- //

func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %cst = arith.constant dense<0.000000e+00> : vector<512x512xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %c0 = arith.constant 0 : index
  // the matmul op is left in place; it will be lowered to brgemm for optimization
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = vector.transfer_read %0[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %2 = vector.transfer_read %arg2[%c0, %c0], %cst_0 {in_bounds = [true, true]} : tensor<512x512xf32>, vector<512x512xf32>
  %3 = arith.addf %1, %2 : vector<512x512xf32>
  %4 = arith.maximumf %3, %cst : vector<512x512xf32>
  %5 = vector.transfer_write %4, %arg3[%c0, %c0] {in_bounds = [true, true]} : vector<512x512xf32>, tensor<512x512xf32>
  return %5 : tensor<512x512xf32>
}

// -----// IR Dump After CPUPhysicalRegisterPass (CPU-physical-register-pass) //----- //

func.func @fc_relu(%arg0: tensor<512x512xf32>, %arg1: tensor<512x512xf32>, %arg2: tensor<512x512xf32>, %arg3: tensor<512x512xf32>) -> tensor<512x512xf32> {
  %c16 = arith.constant 16 : index
  %c512 = arith.constant 512 : index
  %c1 = arith.constant 1 : index
  %c0 = arith.constant 0 : index
  %cst = arith.constant dense<0.000000e+00> : vector<16xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  // the matmul op is left in place; it will be lowered to brgemm for optimization
  %0 = linalg.matmul ins(%arg0, %arg1 : tensor<512x512xf32>, tensor<512x512xf32>) outs(%arg3 : tensor<512x512xf32>) -> tensor<512x512xf32>
  %1 = scf.for %arg4 = %c0 to %c512 step %c1 iter_args(%arg5 = %arg3) -> (tensor<512x512xf32>) {
    %2 = scf.for %arg6 = %c0 to %c512 step %c16 iter_args(%arg7 = %arg5) -> (tensor<512x512xf32>) {
      %3 = vector.transfer_read %arg2[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %4 = vector.transfer_read %0[%arg4, %arg6], %cst_0 {in_bounds = [true]} : tensor<512x512xf32>, vector<16xf32>
      %5 = arith.addf %4, %3 : vector<16xf32>
      %6 = arith.maximumf %5, %cst : vector<16xf32>
      %7 = vector.transfer_write %6, %arg7[%arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<512x512xf32>
      scf.yield %7 : tensor<512x512xf32>
    }
    scf.yield %2 : tensor<512x512xf32>
  }
  return %1 : tensor<512x512xf32>
}
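Note how the pass tiles the 512x512 virtual vector into 16-element physical vectors, i.e. one 512-bit register of f32 lanes per iteration of the inner loop. A minimal sketch of how such a step could be derived (the register-width parameter is an assumption about the target description, not this PR's API):

```cpp
#include "mlir/IR/Types.h"
#include <cstdint>

// Sketch: physical vector length = register width / element width.
// regWidthBits would come from the target description, e.g. 512 on AVX-512.
int64_t getPhysicalVectorStep(mlir::Type elemType, int64_t regWidthBits) {
  int64_t elemBits = elemType.getIntOrFloatBitWidth(); // 32 for f32
  return regWidthBits / elemBits;                      // 512 / 32 = 16
}
```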

@kurapov-peter (Contributor) commented:

@BRUCE11111, do we need a microkernel definition for this first?

@BRUCE11111 (Contributor, Author) commented:

> @BRUCE11111, do we need a microkernel definition for this first?

Hi~ Peter! Thanks for the suggestion! What does "microkernel" mean here, and why do you think we need it?

@ZhennanQin (Contributor) commented:

> @BRUCE11111, do we need a microkernel definition for this first?

> Hi~ Peter! Thanks for the suggestion! What does "microkernel" mean here, and why do you think we need it?

I think Petr's question comes from your example: to fully handle the matmul lowering, we need a microkernel definition to provide the brgemm lowering.

Matmul and brgemm lowering are not part of this PR. Please consider providing another example, such as RMSNorm, to avoid confusion.

@BRUCE11111 BRUCE11111 added ready to review and removed WIP work in progress labels Sep 10, 2024
Member (review comment):

Please add the related pass into the pipeline.

Value forResult);
};

class VectorOperationAnalyzer : virtual public CanonicalizerCommonUsedData {
Member (review comment):

It's better to implement this as an analysis in MLIR and separate it into the analysis folder: https://mlir.llvm.org/docs/PassManagement/#pass-manager
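For reference, a minimal sketch of that shape (class and member names here are illustrative, not this PR's API):

```cpp
#include "mlir/IR/Operation.h"

// Sketch: MLIR analyses are constructed from the operation they inspect
// and are cached by the pass manager.
struct VectorOperationAnalysis {
  explicit VectorOperationAnalysis(mlir::Operation *root) {
    root->walk([&](mlir::Operation *op) {
      // ... collect per-op vector type / fusion info here ...
    });
  }
};

// Inside a pass, the cached result would then be obtained with:
//   auto &analysis = getAnalysis<VectorOperationAnalysis>();
```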

arith::TruncFOp, arith::TruncIOp

#define NOT_NEED_TO_PROCESS_OP \
linalgx::BatchReduceMatmulVnniOp, linalgx::MultiBatchMatmulOp, \
Member (review comment):

The linalgx ops will be deprecated; please switch to the linalg.generic op.

// on it. Therefore, `empty tensor`, `transfer_write` and `transfer_read`
// need to be inserted at target place.
if (enableDebugPrinter) {
printGroupOps(getFusionStrategy().getOpGroups());
Member (review comment):

Please use LLVM_DEBUG(llvm::dbgs() << ...) for debug printing.
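For instance (a sketch; the DEBUG_TYPE string is illustrative):

```cpp
#include "llvm/Support/Debug.h"

#define DEBUG_TYPE "cpu-physical-register-pass"

// Compiled out in release builds; enabled at runtime with
// -debug or -debug-only=cpu-physical-register-pass.
LLVM_DEBUG(llvm::dbgs() << "Op groups after analysis:\n");
LLVM_DEBUG(printGroupOps(getFusionStrategy().getOpGroups()));
```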


// invalid hardware
LDBG("Please check the hardware information.");
assert(false && "Invalid hardware.");
Member (review comment):

Please use llvm_unreachable("...").
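i.e. (sketch of the suggested replacement):

```cpp
#include "llvm/Support/ErrorHandling.h"

// invalid hardware
llvm_unreachable("Invalid hardware; please check the hardware information.");
```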

return;
}
// affineApply operation is always used by other operations.
std::function<bool(Operation *)> candidateFunc = isUsedByOtherOp;
Member (review comment):

Better to rename this function or use a lambda here; isUsedByOtherOp is a confusing name in this context.
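For example, an inline lambda keeps the intent readable at the use site (a sketch; the predicate body is an assumption about what isUsedByOtherOp checks):

```cpp
// Sketch: spell out the predicate instead of the ambiguous name.
auto candidateFunc = [](Operation *op) {
  // Assumed intent: the affineApply result feeds at least one other op.
  return !op->use_empty();
};
```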


TypedAttr attr;
if (isa<FloatType>(newOperandType.getElementType())) {
getConstantDenseAttr<DenseFPElementsAttr>(attr, newOperandType, valueType);
Member (review comment):

The braces are unneeded.
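i.e., per LLVM style, a single-statement body drops the braces:

```cpp
if (isa<FloatType>(newOperandType.getElementType()))
  getConstantDenseAttr<DenseFPElementsAttr>(attr, newOperandType, valueType);
```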

mlir::FailureOr<VectorType> baseType = getOperationVectorType(op);
if (failed(baseType)) {
LDBG("Failed to get vector type for operation: " << *op << "\n");
assert(0 && "Failed to get vector type for operation");
Member (review comment):

Use llvm_unreachable(...) here as well.


/// Constructs the 16 bit representation for a half precision value from a float
/// value. This implementation is adapted from Eigen.
uint16_t float2half(float floatValue) {
Member (review comment):

Better to move this code to a utils.cpp.
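e.g., a shared declaration along these lines (file and namespace names are placeholders):

```cpp
// VectorUtils.h (hypothetical file name)
#include <cstdint>

namespace utils {
/// Constructs the 16-bit half-precision representation of a float value.
uint16_t float2half(float floatValue);
} // namespace utils
```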

@BRUCE11111 (Contributor, Author) commented:

Waiting for the community PR to merge, which will fix the remaining CI errors.

@BRUCE11111 (Contributor, Author) commented Sep 20, 2024

Performance data:

| Operation | Shape | Without this PR | With this PR | Improvement | Comments |
| --- | --- | --- | --- | --- | --- |
| linalg.transpose | 16x1024xf32 | 0.006 | 0.003 | +50% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.transpose | 1024x1024xf32 | 0.99 | 0.75 | +24% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.broadcast + linalg.add | 1024xf32 broadcast to 1024x1024, then add | 1.13 | 0.9 | +20% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| tensor.pack | 1x1024x1024 | 1.91 | 0.6 | +68% | Tested on SPR machine-10223, on branch yifei/mlp_benching_new |
| linalg.reduce | 128x1024x1024xf32 → 128xf32 | 5.28 | 2.8 | +46% | Tested on ICX, current branch vs. main |
