
Using microkernels for single core code. #153

Open
powderluv opened this issue Feb 10, 2024 · 20 comments

@powderluv
Contributor

No description provided.

@MaheshRavishankar
Collaborator

IREE CPU already uses microkernels, use that as the baseline to get this flow working.

Here is a gist that explains how this is plumbed through on the CPU side: https://gist.github.com/bjacob/2c160c102cee33562826c730945b49f2

@MaheshRavishankar changed the title to "IREE CPU already uses microkernels, use that as the baseline to get this flow working" and back to "Using microkernels for single core code." Feb 14, 2024
@Abhishek-Varma
Contributor

I've raised a PR to tackle the first step, --iree-amdaie-lower-to-ukernels: #175

@Abhishek-Varma
Contributor

So, I experimented with the simple-pack pipeline (enabling ukernel config lowering by default) and have the passes stacked like this:

    <ALL TILE+FUSE+PACK+TRANSPOSE PASSES>
    
    iree-amdaie-lower-to-ukernels

    <IREEComprehensiveBufferize PASSES>

    iree-lower-ukernel-ops-to-calls

I confirm that we are getting func.call @iree_amdaie_uk_matmul.

It was failing in the air-dependency pass.

I then worked on triaging it and have added a fix in the mlir-air repository.

With the fix added, we are able to get the IR all the way to the stage that all of our e2e lit tests reach.

I've added the fix as well as the mlir-print-ir-before-all log here: Fix + IR log.

@Abhishek-Varma
Contributor

Short summary:
I'm now able to generate the following construct:

aie.device(ipu) {
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
}

(As expected, each core has a reference to the microkernel it individually runs.)

We are successfully able to reach this stage starting from the following input IR:

func.func @matmul_small(%lhs : tensor<8x16xbf16>,
    %rhs : tensor<16x32xbf16>) -> tensor<8x32xbf16> {
  %empty = tensor.empty() : tensor<8x32xbf16>
  %cst = arith.constant 0.0 : bf16
  %fill = linalg.fill ins(%cst : bf16) outs(%empty : tensor<8x32xbf16>) -> tensor<8x32xbf16>
  %2 = linalg.matmul ins(%lhs, %rhs : tensor<8x16xbf16>, tensor<16x32xbf16>)
      outs(%fill : tensor<8x32xbf16>) -> tensor<8x32xbf16>
  return %2 : tensor<8x32xbf16>
}
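As a plain-language sanity reference (not part of the thread itself), the computation this input IR expresses - a zero fill followed by a matmul accumulation - can be sketched in C++, with float standing in for bf16 purely for host-side illustration:

```cpp
#include <cassert>
#include <vector>

// Reference semantics of the IR above: zero-fill the MxN output
// (linalg.fill), then accumulate the MxK x KxN product into it
// (linalg.matmul). Row-major layout; float stands in for bf16.
std::vector<float> matmul_small(const std::vector<float>& lhs,
                                const std::vector<float>& rhs,
                                int M = 8, int K = 16, int N = 32) {
  std::vector<float> out(M * N, 0.0f);  // linalg.fill with 0.0
  for (int m = 0; m < M; ++m)           // linalg.matmul accumulation
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        out[m * N + n] += lhs[m * K + k] * rhs[k * N + n];
  return out;
}
```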

Long summary:

  1. I went through some basics of the AIR/AIE dialects to understand better how to position the func.call.

  2. I figured the first place to fiddle with was air-to-aie - I was able to make some progress there by changing the aie::CoreOp creation itself.

  3. I tried backtracking and found the actual source where a fix would get us through to the above construct. It was in the air-par-to-herd pass - I updated the creation of air::HerdOp itself.

  4. I've added all my local fixes to MLIR-AIR along with the current IR log in the same gist as yesterday's: Fix + IR log.

@Abhishek-Varma
Contributor

Abhishek-Varma commented Feb 26, 2024

Updates:

  1. Added support for linalg.matmul along with the already added support for linalg.generic.
    Here is the WIP branch.

  2. After installing mlir-aie, I had difficulty getting hold of mm.o.
    I was able to get through the issue with a workaround - the root cause is that xchesscc_wrapper doesn't seem to use the aietools path passed to it, and xchesscc is not found on the xcorad* server.
    I've added the workaround here as a_workaround_for_generating_mm.o.sh.

  3. Updated the function signature of the matmul_scalar_* - both at mm.c as well as the IREE ukernel op I'm forming.
    I've added the updated mm.c file here mm.c.
    Was able to generate mm.o file after updating too (with the fix in point 2 above).

  4. Tried the e2e invocation but it fails at the XCLBin generation.
    Have added the e2e log here as ukernel_e2e_runtime.mlir.
    I even tried setting the absolute path to mm.o at the IR's link_with construct but to no avail.

Following are the commands which MLIR-AIE seems to use for testing microkernels via the Ryzen AI chess compiler:

1. RUN: xchesscc_wrapper aie2 -I aietools/include -c mm.cc -o ./mm.o
2. RUN: python aie2.py > ./aie.mlir
3. RUN: python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu \
                                       --no-compile-host --xclbin-name=aie.xclbin \
                                       --ipu-insts-name=insts.txt ./aie.mlir
4. RUN: g++-13 test.cpp -o test.exe -std=c++23 -Wall \
                            xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem
5. RUN: run_on_ipu ./test.exe -x aie.xclbin -k MLIR_AIE -i insts.txt

NOTE: The links attached are the same gist link (with updated content) that I've previously used to provide updates to this tracker.

@Abhishek-Varma
Contributor

Updates:

  1. The reason for the failure was error: Dialect 'hal' not found for custom op.
On triaging the issue, I found the main reason why this is happening. Take a look at the following:
func.func @matmul_small_dispatch_0_matmul_8x32x16_bf16(%arg0: memref<64xi32>, %arg1: memref<256xi32>, %arg2: memref<128xi32>) {
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xbf16>
  memref.assume_alignment %0, 64 : memref<8x16xbf16>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x32xbf16>
  memref.assume_alignment %1, 64 : memref<16x32xbf16>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x32xbf16>
  memref.assume_alignment %2, 64 : memref<8x32xbf16>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 1, 8, 8][0, 0, 8]) {id = 0 : i64, metadata = @airMemcpyId4} : memref<64xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 4, 16, 4][0, 4, 16]) {id = 1 : i64, metadata = @airMemcpyId5} : memref<256xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][1, 4, 8, 4][0, 4, 16]) {id = 2 : i64, metadata = @airMemcpyId20} : memref<128xi32>
  aiex.ipu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
  return
}

The hal constructs above weren't getting canonicalized away. Ideally, the following is the IR we should have gotten:

func.func @matmul_small_dispatch_0_matmul_8x32x16_bf16(%arg0: memref<64xbf16>, %arg1: memref<256xbf16>, %arg2: memref<128xbf16>) {
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xbf16>
  memref.assume_alignment %arg0, 64 : memref<8x16xbf16>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x32xbf16>
  memref.assume_alignment %arg1, 64 : memref<16x32xbf16>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x32xbf16>
  memref.assume_alignment %arg2, 64 : memref<8x32xbf16>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 1, 8, 8][0, 0, 8]) {id = 0 : i64, metadata = @airMemcpyId4} : memref<64xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 4, 16, 4][0, 4, 16]) {id = 1 : i64, metadata = @airMemcpyId5} : memref<256xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][1, 4, 8, 4][0, 4, 16]) {id = 2 : i64, metadata = @airMemcpyId20} : memref<128xi32>
  aiex.ipu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
  return
}

Note the changes I made above:
a. i32 -> bf16 for function arguments.
b. memref.assume_alignment %arg*, 64 : memref<8x32xbf16> instead of memref.assume_alignment %2, 64 : memref<8x32xbf16>.

You may take a look at the e2e log for this stage.

  2. On further triaging, AIRRtToIpuPass seemed to be the culprit, and therefore I tried commenting out this as well as this. It led to the following error: error: 'aiex.ipu.dma_memcpy_nd' op must be used with memref type with element width 32.

You may take a look at the e2e log for this stage.

  3. I switched over to instead using i32 as the element type.
    This entailed modifying the --iree-amdaie-lower-to-ukernel as well as changing mm.c and regenerating mm.o.

    IT WORKED!

    You may take a look at the IR log for the e2e matmul scalar i32.

Note: As for the next step, I'm going to prepare PRs to ensure matmul_scalar_i32 is checked in as the first ukernel-based approach. For this I also need to generalize certain hardcoded parameters and see how to get mm.o to build. The above works for the Pad based pipeline. After it gets checked in, I'll target the pack based approach for matmul i32.

@jtuyls
Contributor

jtuyls commented Feb 27, 2024

Hi @Abhishek-Varma, my understanding is that aiex.ipu.dma_memcpy_nd always has to operate on 32-bit data types right now (as AIE DMAs have 4-byte granularity), so the bf16 data type is interpreted as i32. I think we have the following options to solve this:

  1. Update the hal ops to work on i32 as well
  2. Update mlir-air/aie to accept other data types as well, but under certain constraints, like strides and sizes need to be multiples of 4 byte words.

Option 1 is probably the easiest way to get things working, as air/aie seems to have taken the 'reinterpret data type' approach to ensure multiples of 4-byte words in DMA operations. But I'm not sure what the plans are there longer term. @erwei-xilinx Any thoughts?

cc @MaheshRavishankar
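To make the 4-byte-granularity point concrete, here is a small hypothetical helper (my own sketch, not code from mlir-air or mlir-aie) that reinterprets an element count of a narrow type as a count of 32-bit words, and rejects transfers that are not word-aligned:

```cpp
#include <stdexcept>

// Hypothetical sketch of the 'reinterpret data type' approach: a DMA with
// 4-byte granularity sees a buffer of narrow elements (e.g. bf16, 2 bytes)
// as i32 words. Throws if the transfer is not a multiple of 4 bytes, which
// is the constraint option 2 above would have to enforce on sizes/strides.
int elementsToI32Words(int numElements, int elemBytes) {
  int totalBytes = numElements * elemBytes;
  if (totalBytes % 4 != 0)
    throw std::invalid_argument("transfer is not a multiple of 4 bytes");
  return totalBytes / 4;
}
```

For example, 128 bf16 elements (an 8x16 buffer, 256 bytes) map to 64 i32 words, which matches the memref<64xi32> signatures seen earlier in this thread.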

@newling
Contributor

newling commented Feb 28, 2024

memref.assume_alignment %2, 64 : memref<8x32xbf16>

Oh no you went down the same rabbit hole as I did 😄 The pass I added to fix it was Xilinx/mlir-air@e8a32b0 (it should be live in the latest iree-amd-aie now).

@MaheshRavishankar also mentioned a pass in IREE for this, could be https://github.com/openxla/iree/blob/09deadfb8a58d17a4cf136ce916a661836eff2cf/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp#L595

@newling
Contributor

newling commented Feb 28, 2024

> I think we have following options to solve this:
>
>   1. Update the hal ops to work on i32 as well
>   2. Update mlir-air/aie to accept other data types as well, but under certain constraints, like strides and sizes need to be multiples of 4 byte words.

My understanding is that option 2 is already done

@MaheshRavishankar
Collaborator

MaheshRavishankar commented Feb 28, 2024

https://github.com/openxla/iree/blob/main/compiler%2Fsrc%2Firee%2Fcompiler%2FCodegen%2FCommon%2FCleanupBufferAllocViewPass.cpp is the pass in IREE that removes hal.interface.binding ops that are dead (i.e. whose only uses are in alignment operations).

I haven't looked at this PR itself to get all the details

@Abhishek-Varma
Contributor

So reinitializing the mlir-air submodules in iree-amd-aie helped.

The error reported earlier is not there anymore. @newling @MaheshRavishankar

For bf16 I'm now getting:

ld.lld: error: undefined symbol: me_primitive::control_rnd
>>> referenced by mm.o
>>>               /proj/xcohdstaff6/abhvarma/iree/build/mm.o:(void matmul_scalar<bfloat16, bfloat16, 4, 4, 4>(bfloat16*, unsigned int, bfloat16*, unsigned int, bfloat16*, unsigned int))
clang-17: error: ld.lld command failed with exit code 1 (use -v to see invocation)

I'm not sure why it complains within the same mm.o that I used for the i32 case too.

I'm attaching the IR e2e log for reference.

@Abhishek-Varma
Contributor

Updates besides the bf16 case:

  1. I've updated the revision incorporating all the "generalizability" I was aiming for in the iree-amdaie-lower-to-ukernel pass.
    It's ready for re-review : [AMD-AIE] Add a pass to lower to Ukernels #175

  2. I've raised a draft PR on MLIR-AIR : [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463
    Here we need to converge on two points:
    a. How do we glue in mm.o (building and all) to verify e2e?
    b. Who amongst IREE/MLIR-AIR/MLIR-AIE should take the onus of adding the path? In the draft PR I've listed possible solutions and have gone ahead with one of them - we can discuss this during sync.

@Abhishek-Varma
Contributor

Updates:

  1. I worked on updating --iree-amdaie-lower-to-ukernel to handle the linalg.generic we get at the end of the simple-pack pipeline.
  2. I also worked on figuring out how to eventually build the e2e test, which requires mm.o. Since we need link_with, I found the answer to the question I posed yesterday: I'm indeed letting IREE take the onus of spilling out where the ukernel is. The best way to do this was updating getFnAndDefAttrs of --iree-amdaie-lower-to-ukernel.

I'm attaching the IR e2e log - you'd see that the link_with construct now gets relayed cleanly from iree_codegen.ukernel -> func.call.

I'm happy because this further integrates cleanly with the MLIR-AIR changes for which I raised the draft PR. :)

We need to discuss how to glue in mm.o (and the corresponding changes) for e2e - I've created the following write up for us to discuss better.

RFC - DISCUSS UKERNEL INTEGRATION.

So, we are using mm.c to put down the pieces for ukernel-based lowering via IREE codegen. The idea is to first do the bare minimum required, i.e. a non-performant matmul run via the microkernel.

Section A (walk-through):

Following is how it has been tested to work (taking the matmul_scalar_i32_i32 API as an example):

  1. Through iree-amdaie-lower-to-ukernels we will create:
iree_codegen.ukernel "matmul_scalar_i32_i32" (
                      lhs_memref_base_pointer, lhs_memref_offset,
                      rhs_memref_base_pointer, rhs_memref_offset,
                      out_memref_base_pointer, out_memref_offset )
  2. The above op is converted to a func.call @matmul_scalar_i32_i32(.....).
  3. This is then 1:1 linked with the microkernel defined in the object file of mm.c, and it works.

Section B (fiddling with mlir-aie):

Now, let's look at the changes required to make this happen:

  1. The function declaration/definition of matmul_scalar was changed to include *_memref_offset arguments:
template <typename T_in, typename T_out, int M, int K, int N>
void matmul_scalar(T_in *a, unsigned offsetA, T_in *b, unsigned offsetB, T_out *c, unsigned offsetC) {
  event0();
  for (int row = 0; row < M; row++) {
    for (int col = 0; col < N; col++) {
      T_out running_sum = 0;
      for (int i = 0; i < K; i++) {
        running_sum += a[offsetA + row * K + i] * b[offsetB + i * N + col];
      }
      c[offsetC + row * N + col] += running_sum;
    }
  }
  event1();
}
  2. Arrangements to include/support i32 via the microkernel were made (the matmul_variant related APIs were commented out):
#define combos(X)                                                              \
  X(int32, i32, int32, i32, 4, 4, 4)                                           \
  X(int16, i16, int16, i16, 4, 4, 4)                                           \
  X(bfloat16, bf16, bfloat16, bf16, 4, 8, 4)                                   \
  X(bfloat16, bf16, float, f32, 4, 8, 4)
  3. The following command was run to generate mm.o (I might've missed it, but I couldn't find mm.o being generated while building mlir-aie - therefore I had to generate it explicitly after fighting with mlir-aie/build/bin/xchesscc_wrapper):
./build/bin/xchesscc_wrapper AIE2 -I \
         /proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/include \
         -c reference_designs/ipu-xrt/matrix_multiplication/mm.cc -o mm.o
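For what it's worth, the matmul_scalar template from B.1 can be exercised off-device by stubbing the event0/event1 AIE event intrinsics; the stubs and test sizes below are my assumptions for a host-side check, not how the kernel runs on the core:

```cpp
// Host-side stubs for the AIE event intrinsics so the template compiles
// off-device (on the AIE core these are real tracing intrinsics).
static void event0() {}
static void event1() {}

// The matmul_scalar template as shown in B.1 above.
template <typename T_in, typename T_out, int M, int K, int N>
void matmul_scalar(T_in *a, unsigned offsetA, T_in *b, unsigned offsetB,
                   T_out *c, unsigned offsetC) {
  event0();
  for (int row = 0; row < M; row++) {
    for (int col = 0; col < N; col++) {
      T_out running_sum = 0;
      for (int i = 0; i < K; i++)
        running_sum += a[offsetA + row * K + i] * b[offsetB + i * N + col];
      // Accumulates into c, so c must be zero-initialized by the caller.
      c[offsetC + row * N + col] += running_sum;
    }
  }
  event1();
}
```

Note that the output is accumulated (`+=`), which is why the surrounding IR pairs this call with a zero fill of the output buffer.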

Section C (point-of-convergence)

We need to converge on how we would want to use microkernels (starting with the non-performant variant) in iree-amd-aie's CI.

  1. We clearly need to deal with Sections B.1 and B.2 - can this be handled in MLIR-AIE, or do we have the required version in iree-amd-aie?
  2. With respect to Section B.3 - can we simply take the xchesscc_wrapper and keep it within iree-amd-aie? And then invoke that with Vitis and the microkernel architecture to generate mm.o (object files) while we build IREE?
  3. With respect to Section B.2 (required for the simple-pack pipeline) - we also need to perform a similar modification to matmul_vector - any suggestions?

The following comment, which I added in my WIP branch, helps reinforce why I'm suggesting Section C.2:

static std::string getPathToUKernelObjectFile(std::string ukernelObjectFile) {
  // TODO(avarma): The idea is that we build the microkernel object files while
  // building IREE with iree-amd-aie plugin. This way we know where the object
  // files is going to reside and by extension the onus of spilling out the path
  // to the same is going to be on IREE(iree-amd-aie); therefore we would want
  // this path to bear that onus. For now, since it is a WIP/RFC, I'm hardcoding
  // it to the path on my xcorad* machine.
  std::string pathToParentFolder = "/proj/xcohdstaff6/abhvarma/iree/build/";
  return pathToParentFolder + ukernelObjectFile;
}

I've added the above function to the iree-amdaie-lower-to-ukernel pass, and that's the "beautiful" aspect of it: all information required by the lower stack is now captured by iree_codegen.ukernel.

@Abhishek-Varma
Contributor

Updates:

  1. While trying to integrate yesterday's --iree-amdaie-lower-to-ukernels with mlir-air, I found that one more round of updates was required in MLIR-AIR to be able to fetch the link_with construct properly.
    I worked on implementing that and have pushed my changes to MLIR-AIR along with test cases: [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463

  2. I was experimenting with my RFC suggestion from yesterday and have added a WIP which we can use for the sync discussion - Compile microkernel during build.

@Abhishek-Varma
Contributor

Updates:

  1. Was working on basic integration - figured it's best to add another flag to point to the ukernel path.
  2. Cleaned up the implementation and have raised a PR: [AMD-AIE] Update lower-to-ukernel pass #193
  3. Was trying to build MLIR-AIR - ran into a few build issues - will ask around.
  4. Tried instead to look into the CI failure log and have addressed it: [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463
  5. Trying to see how to use iree_cc_library instead of add_custom_target/function to generate the .o file - the base idea is going to remain the same (for now) w.r.t. my proposed design - WIP of .o file's generation via iree-amd-aie.

@Abhishek-Varma
Contributor

My updates:

  1. Using @daveliddell's pointers, I first tried generating the xclbin via aiecc.py - no issues there.
    With the same system setup I retried generating the xclbin via IREE - it still led to the same issue reported earlier:
    ld.lld: error: undefined symbol: me_primitive::control_rnd
>>> referenced by mm.o
>>>               /proj/xcohdstaff6/abhvarma/iree/build/mm.o:(void matmul_scalar<bfloat16, bfloat16, 4, 4, 4>(bfloat16*, unsigned int, bfloat16*, unsigned int, bfloat16*, unsigned int))
clang-17: error: ld.lld command failed with exit code 1 (use -v to see invocation)
NOTE: `me_primitive::control_rnd` is a Vitis-specific symbol.
  2. On testing the updated MLIR-AIR logic for the Pad based pipeline, I bumped into an issue of link_with not getting relayed to aie.core.
    I have raised a PR that fixes this: [AIR][AMD-AIE] Fix link_with relaying for nested func.call ops Xilinx/mlir-air#473

@Abhishek-Varma
Contributor

My updates for the day are as follows:

  1. Added required changes and updated mlir-air/mlir-aie submodules - [AMD-AIE] Update mlir-air/mlir-aie submodules #203
  2. Worked on the CMakeLists.txt that I was adding to help compile .o microkernel object file while building MLIR-AIE : [AMD-AIE] Build microkernels for AMD-AIE Xilinx/mlir-aie#1109
  3. Used 1 and 2 to further confirm that the first e2e vanilla microkernel of i32 type for the Pad based pipeline can soon be reproducible by all of us!

CC: @MaheshRavishankar

@Abhishek-Varma
Contributor

My updates :-

  1. I have addressed the review comment on MLIR-AIE .o ukernel building PR and it is merged now - [AMD-AIE] Build microkernels for AMD-AIE Xilinx/mlir-aie#1109

  2. The issue with bf16 was that the compile command required --iree-amd-aie-enable-chess - it took time to figure this out since I was chasing red herrings.
    This wasn't the case with the i32 type, since no chess intrinsics were involved there - it worked well with Peano!

  3. Despite the above, there was the following error:

WorkersError in "/home/user/mlir-aie/reference_designs/ipu-xrt/matrix_multiplication/build/aie.mlir.prj/segment_0_core_0_2.bcf" 
line 41, column 15: syntax error

I worked on narrowing it down by executing aiecc.py in parallel and manually comparing the logs.
I compared the failing command and then the individual contents of the files, and was able to figure out that the issue is with the parsing of absolute paths by Chess. This again wasn't the case with Peano - but since this is the first time the link_with construct is being used with Chess, the issue has now popped up.

  4. I further confirmed my speculation by reproducing it with aiecc.py itself (which was otherwise working) - I have therefore raised an issue in MLIR-AIE, since I wasn't aware of a better place to do so: Chess issue with absolute path link_with | _include _file Xilinx/mlir-aie#1120

  5. I then used mm.o instead of the absolute path to mm.o and confirmed that we are able to generate a .vmfb for bf16 types! It is getting stuck at runtime though.
    Here is the e2e IR for bf16.

CC: @MaheshRavishankar

@Abhishek-Varma
Contributor

Adding an update on the current state of ukernel support for Pad-Pack:

  1. We will have a performant Pad-Pack ukernel working once the Generalize tile/pack PR and the Update Pad-Pack ukernel in MLIR-AIE build PR get merged.

  2. @erwei-xilinx pointed out a performance mismatch in the current implementation and rightly noted that it's because of the non-performant linalg.fill.

    We have a performant ukernel variant that addresses the slow zero fill issue.

    I worked on updating the --iree-amdaie-lower-to-ukernels pass to lower a zero-filling linalg.fill to a ukernel.

    Attaching my WIP branch, which lowers the linalg.fill to the ukernel as intended.

  3. The above leads to an issue in MLIR-AIR - I worked on triaging it, and it originates in the air-dependency pass (I had to add a fix to this pass previously while working on ukernels, and it looks like I might have to update it further).

Input that is leading to the issue:

%base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %alloc_9 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
func.call @zero_bf16(%base_buffer, %c0_7) : (memref<bf16, 2 : i32>, index) -> ()
scf.for %arg19 = %c0_7 to %c256_8 step %c8 {
  %7 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg19]
  %alloc_10 = memref.alloc() : memref<8x16x4x8xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_10[%c0_7] [%c4096] [%c1_6], %arg16[%c0_7, %5, %7] [%c8, %c64, %c8] [%c8, %c2048_5, %c1_6]) : (memref<8x16x4x8xbf16, 2 : i32>, memref<128x2048xbf16, 1 : i32>)
  %alloc_11 = memref.alloc() : memref<16x8x8x4xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_11[%c0_7] [%c4096] [%c1_6], %arg17[%c0_7, %7, %6] [%c16_4, %c64, %c4] [%c4, %c128_3, %c1_6]) : (memref<16x8x8x4xbf16, 2 : i32>, memref<2048x128xbf16, 1 : i32>)
  %base_buffer_12, %offset_13, %sizes_14:4, %strides_15:4 = memref.extract_strided_metadata %alloc_10 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_16, %offset_17, %sizes_18:4, %strides_19:4 = memref.extract_strided_metadata %alloc_11 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  func.call @matmul_bf16_bf16(%base_buffer_12, %c0_7, %base_buffer_16, %c0_7, %base_buffer, %c0_7) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  memref.dealloc %alloc_10 : memref<8x16x4x8xbf16, 2 : i32>
  memref.dealloc %alloc_11 : memref<16x8x8x4xbf16, 2 : i32>
}

Input that generates correct IR:

linalg.fill ins(%cst : bf16) outs(%alloc_9 : memref<16x16x4x4xbf16, 2 : i32>)
scf.for %arg19 = %c0_7 to %c256_8 step %c8 {
  %7 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg19]
  %alloc_10 = memref.alloc() : memref<8x16x4x8xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_10[%c0_7] [%c4096] [%c1_6], %arg16[%c0_7, %5, %7] [%c8, %c64, %c8] [%c8, %c2048_5, %c1_6]) : (memref<8x16x4x8xbf16, 2 : i32>, memref<128x2048xbf16, 1 : i32>)
  %alloc_11 = memref.alloc() : memref<16x8x8x4xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_11[%c0_7] [%c4096] [%c1_6], %arg17[%c0_7, %7, %6] [%c16_4, %c64, %c4] [%c4, %c128_3, %c1_6]) : (memref<16x8x8x4xbf16, 2 : i32>, memref<2048x128xbf16, 1 : i32>)
  %base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %alloc_10 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_12, %offset_13, %sizes_14:4, %strides_15:4 = memref.extract_strided_metadata %alloc_11 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_16, %offset_17, %sizes_18:4, %strides_19:4 = memref.extract_strided_metadata %alloc_9 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  func.call @matmul_bf16_bf16(%base_buffer, %c0_7, %base_buffer_12, %c0_7, %base_buffer_16, %c0_7) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  memref.dealloc %alloc_10 : memref<8x16x4x8xbf16, 2 : i32>
  memref.dealloc %alloc_11 : memref<16x8x8x4xbf16, 2 : i32>
}

What is happening here is that, while constructing air.execute dependencies, the pass doesn't take into account the air.execute that wraps the output buffer, and as a result it fails at a much later stage of the pipeline, in the DmaToChannel pass.
Both of the above inputs are VALID - air-dependency is failing to handle the former case.

Pasting below the invalid air.execute:

Current case (I'm intentionally showing the pretty/assembly format of the invalid IR):

  %async_token_47 = air.execute [%arg20, %async_token_45, %async_token_43] {
    func.call @matmul_bf16_bf16(%results_42#0, %c0_25, %results_44#0, %c0_25, %results_34#0, %c0_25) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  } {id = 22 : i32}

Expected case :-

  %async_token_47 = air.execute [%arg20, %async_token_45, %async_token_43, %async_token_31] {
    func.call @matmul_bf16_bf16(%results_42#0, %c0_25, %results_44#0, %c0_25, %results_34#0, %c0_25) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  } {id = 22 : i32}
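The difference between the two air.execute ops above is one missing token (%async_token_31, the producer of the output buffer). A toy model of the dependency collection at issue - my own sketch, not the actual MLIR-AIR implementation - shows how only walking operands defined inside the loop drops the hoisted output buffer's token:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model of an operand of the func.call: the async token of its producing
// air.execute, and whether the producer sits inside the scf.for body.
struct Value {
  std::string token;
  bool definedInLoop;
};

// Buggy variant (mirrors the "current case" above): only operands produced
// inside the loop contribute their tokens, so the hoisted output buffer's
// token (async_token_31 in the IR above) is dropped.
std::set<std::string> collectDepsBuggy(const std::vector<Value>& operands) {
  std::set<std::string> deps;
  for (const auto& v : operands)
    if (v.definedInLoop) deps.insert(v.token);
  return deps;
}

// Fixed variant (mirrors the "expected case"): every operand with an async
// producer contributes its token, regardless of where it was defined.
std::set<std::string> collectDepsFixed(const std::vector<Value>& operands) {
  std::set<std::string> deps;
  for (const auto& v : operands) deps.insert(v.token);
  return deps;
}
```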

I'll dig further on this and try adding a fix for the same as well.

@erwei-xilinx
Contributor

@Abhishek-Varma Thanks for looking into this.
