
Using microkernels for single core code. #153

Open
powderluv opened this issue Feb 10, 2024 · 20 comments

@powderluv
Contributor

No description provided.

@MaheshRavishankar
Collaborator

IREE CPU already uses microkernels, use that as the baseline to get this flow working.

Here is a gist that explains how this is plumbed through on the CPU side: https://gist.github.com/bjacob/2c160c102cee33562826c730945b49f2

@MaheshRavishankar changed the title to "IREE CPU already uses microkernels, use that as the baseline to get this flow working" and back to "Using microkernels for single core code." Feb 14, 2024
@Abhishek-Varma
Contributor

I've raised a PR to tackle the first step, --iree-amdaie-lower-to-ukernels: #175

@Abhishek-Varma
Contributor

So, I experimented with the simple-pack pipeline (enabling ukernel config lowering by default) and have the passes stacked like this:

    <ALL TILE+FUSE+PACK+TRANSPOSE PASSES>
    
    iree-amdaie-lower-to-ukernels

    <IREEComprehensiveBufferize PASSES>

    iree-lower-ukernel-ops-to-calls

I confirm that we are getting func.call @iree_amdaie_uk_matmul.

It was failing in the air-dependency pass.

I then worked on triaging it and have added a fix in the mlir-air repository.

With the fix added, we are able to get the IR all the way to the stage that all of our e2e lit tests reach.

I've added the fix as well as the mlir-print-ir-before-all log here: Fix + IR log.

@Abhishek-Varma
Contributor

Short summary:
I'm now able to generate the following construct:

aie.device(ipu) {
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
    aie.core(...) {
          func.call @matmul_scalar_bf16_bf16(...)
    } { link_with = "mm.o" }
}

(As expected, each core has a reference to the microkernel it individually runs.)

We are successfully able to reach this stage starting from the following input IR:

func.func @matmul_small(%lhs : tensor<8x16xbf16>,
    %rhs : tensor<16x32xbf16>) -> tensor<8x32xbf16> {
  %empty = tensor.empty() : tensor<8x32xbf16>
  %cst = arith.constant 0.0 : bf16
  %fill = linalg.fill ins(%cst : bf16) outs(%empty : tensor<8x32xbf16>) -> tensor<8x32xbf16>
  %2 = linalg.matmul ins(%lhs, %rhs : tensor<8x16xbf16>, tensor<16x32xbf16>)
      outs(%fill : tensor<8x32xbf16>) -> tensor<8x32xbf16>
  return %2 : tensor<8x32xbf16>
}
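As a plain-language sanity reference (not part of the thread itself), the computation this input IR expresses - a zero fill followed by a matmul accumulation - can be sketched in C++, with float standing in for bf16 purely for host-side illustration:

```cpp
#include <cassert>
#include <vector>

// Reference semantics of the IR above: zero-fill the MxN output
// (linalg.fill), then accumulate the MxK x KxN product into it
// (linalg.matmul). Row-major layout; float stands in for bf16.
std::vector<float> matmul_small(const std::vector<float>& lhs,
                                const std::vector<float>& rhs,
                                int M = 8, int K = 16, int N = 32) {
  std::vector<float> out(M * N, 0.0f);  // linalg.fill with 0.0
  for (int m = 0; m < M; ++m)           // linalg.matmul accumulation
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        out[m * N + n] += lhs[m * K + k] * rhs[k * N + n];
  return out;
}
```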

Long summary:

  1. I went through some basics of the AIR/AIE dialects to understand better how to position the func.call.

  2. I figured the first place to fiddle with was air-to-aie - I was able to make some progress there by changing the aie::CoreOp creation itself.

  3. I tried backtracking and found the actual source where a fix would get us through to the above construct. It was in the air-par-to-herd pass - I updated the creation of air::HerdOp itself.

  4. I've added all my local fixes to MLIR-AIR along with the current IR log in the same gist as yesterday's: Fix + IR log.

@Abhishek-Varma
Contributor

Abhishek-Varma commented Feb 26, 2024

Updates:

  1. Added support for linalg.matmul along with the already added support for linalg.generic.
    Here is the WIP branch.

  2. After installing mlir-aie, I had difficulty getting hold of mm.o.
    I was able to get through the issue with a workaround - the root cause is that xchesscc_wrapper doesn't seem to use the aietools path passed to it, and xchesscc is not found on the xcorad* server.
    I've added the workaround here as a_workaround_for_generating_mm.o.sh.

  3. Updated the function signature of the matmul_scalar_* - both at mm.c as well as the IREE ukernel op I'm forming.
    I've added the updated mm.c file here mm.c.
    Was able to generate mm.o file after updating too (with the fix in point 2 above).

  4. Tried the e2e invocation but it fails at the XCLBin generation.
    Have added the e2e log here as ukernel_e2e_runtime.mlir.
    I even tried setting the absolute path to mm.o at the IR's link_with construct but to no avail.

Following are the commands which MLIR-AIE seems to use for testing microkernels via the Ryzen AI chess compiler:

1. RUN: xchesscc_wrapper aie2 -I aietools/include -c mm.cc -o ./mm.o
2. RUN: python aie2.py > ./aie.mlir
3. RUN: python aiecc.py --xbridge --aie-generate-cdo --aie-generate-ipu \
                                       --no-compile-host --xclbin-name=aie.xclbin \
                                       --ipu-insts-name=insts.txt ./aie.mlir
4. RUN: g++-13 test.cpp -o test.exe -std=c++23 -Wall \
                            xrt_flags -lrt -lstdc++ -lboost_program_options -lboost_filesystem
5. RUN: run_on_ipu ./test.exe -x aie.xclbin -k MLIR_AIE -i insts.txt

NOTE: The links attached are the same gist link (with updated content) that I've previously used to provide updates to this tracker.

@Abhishek-Varma
Contributor

Updates:

  1. The reason for the failure was error: Dialect 'hal' not found for custom op.
On triaging the issue, I found the main reason why this is happening. Take a look at the following:
func.func @matmul_small_dispatch_0_matmul_8x32x16_bf16(%arg0: memref<64xi32>, %arg1: memref<256xi32>, %arg2: memref<128xi32>) {
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xbf16>
  memref.assume_alignment %0, 64 : memref<8x16xbf16>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x32xbf16>
  memref.assume_alignment %1, 64 : memref<16x32xbf16>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x32xbf16>
  memref.assume_alignment %2, 64 : memref<8x32xbf16>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 1, 8, 8][0, 0, 8]) {id = 0 : i64, metadata = @airMemcpyId4} : memref<64xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 4, 16, 4][0, 4, 16]) {id = 1 : i64, metadata = @airMemcpyId5} : memref<256xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][1, 4, 8, 4][0, 4, 16]) {id = 2 : i64, metadata = @airMemcpyId20} : memref<128xi32>
  aiex.ipu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
  return
}

The hal constructs above weren't getting canonicalized away. Ideally, the following is the IR we should have gotten:

func.func @matmul_small_dispatch_0_matmul_8x32x16_bf16(%arg0: memref<64xbf16>, %arg1: memref<256xbf16>, %arg2: memref<128xbf16>) {
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<8x16xbf16>
  memref.assume_alignment %arg0, 64 : memref<8x16xbf16>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<16x32xbf16>
  memref.assume_alignment %arg1, 64 : memref<16x32xbf16>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<8x32xbf16>
  memref.assume_alignment %arg2, 64 : memref<8x32xbf16>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][4, 1, 8, 8][0, 0, 8]) {id = 0 : i64, metadata = @airMemcpyId4} : memref<64xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 4, 16, 4][0, 4, 16]) {id = 1 : i64, metadata = @airMemcpyId5} : memref<256xi32>
  aiex.ipu.dma_memcpy_nd(0, 0, %arg2[0, 0, 0, 0][1, 4, 8, 4][0, 4, 16]) {id = 2 : i64, metadata = @airMemcpyId20} : memref<128xi32>
  aiex.ipu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
  return
}

Note the changes I made above:
a. i32 -> bf16 for function arguments.
b. memref.assume_alignment %arg*, 64 : memref<8x32xbf16> instead of memref.assume_alignment %2, 64 : memref<8x32xbf16>.

You may take a look at the e2e log for this stage.

  2. On further triaging, AIRRtToIpuPass seemed to be the culprit, and therefore I tried commenting out this as well as this. It led to the following error: error: 'aiex.ipu.dma_memcpy_nd' op must be used with memref type with element width 32.

You may take a look at the e2e log for this stage.

  3. I switched over to instead using i32 as the element type.
    This entailed modifying the --iree-amdaie-lower-to-ukernel as well as changing mm.c and regenerating mm.o.

    IT WORKED!

    You may take a look at the IR log for the e2e matmul scalar i32.

Note: As for the next step, I'm going to prepare PRs to ensure matmul_scalar_i32 is checked in as the first ukernel-based approach. For this I also need to generalize certain hardcoded parameters and see how to get mm.o to build. The above works for the Pad based pipeline. After it gets checked in, I'll target the pack based approach for matmul i32.

@jtuyls
Contributor

jtuyls commented Feb 27, 2024

Hi @Abhishek-Varma, my understanding is that aiex.ipu.dma_memcpy_nd always has to operate on 32-bit data types right now (as AIE DMAs have 4-byte granularity), so the bf16 data type is interpreted as i32. I think we have the following options to solve this:

  1. Update the hal ops to work on i32 as well
  2. Update mlir-air/aie to accept other data types as well, but under certain constraints, like strides and sizes need to be multiples of 4 byte words.

Option 1 is probably the easiest way to get things working, as air/aie seems to have taken the 'reinterpret data type' approach to ensure multiples of 4-byte words in DMA operations. But I'm not sure what the plans are there longer term. @erwei-xilinx Any thoughts?

cc @MaheshRavishankar
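To make the 4-byte-granularity point concrete, here is a small hypothetical helper (my own sketch, not code from mlir-air or mlir-aie) that reinterprets an element count of a narrow type as a count of 32-bit words, and rejects transfers that are not word-aligned:

```cpp
#include <stdexcept>

// Hypothetical sketch of the 'reinterpret data type' approach: a DMA with
// 4-byte granularity sees a buffer of narrow elements (e.g. bf16, 2 bytes)
// as i32 words. Throws if the transfer is not a multiple of 4 bytes, which
// is the constraint option 2 above would have to enforce on sizes/strides.
int elementsToI32Words(int numElements, int elemBytes) {
  int totalBytes = numElements * elemBytes;
  if (totalBytes % 4 != 0)
    throw std::invalid_argument("transfer is not a multiple of 4 bytes");
  return totalBytes / 4;
}
```

For example, 128 bf16 elements (an 8x16 buffer, 256 bytes) map to 64 i32 words, which matches the memref<64xi32> signatures seen earlier in this thread.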

@newling
Contributor

newling commented Feb 28, 2024

memref.assume_alignment %2, 64 : memref<8x32xbf16>

Oh no you went down the same rabbit hole as I did 😄 The pass I added to fix it was Xilinx/mlir-air@e8a32b0 (it should be live in the latest iree-amd-aie now).

@MaheshRavishankar also mentioned a pass in IREE for this, could be https://github.com/openxla/iree/blob/09deadfb8a58d17a4cf136ce916a661836eff2cf/compiler/src/iree/compiler/Codegen/Transforms/Transforms.cpp#L595

@newling
Contributor

newling commented Feb 28, 2024

> I think we have following options to solve this:
>
>   1. Update the hal ops to work on i32 as well
>   2. Update mlir-air/aie to accept other data types as well, but under certain constraints, like strides and sizes need to be multiples of 4 byte words.

My understanding is that option 2 is already done

@MaheshRavishankar
Collaborator

MaheshRavishankar commented Feb 28, 2024

https://github.com/openxla/iree/blob/main/compiler%2Fsrc%2Firee%2Fcompiler%2FCodegen%2FCommon%2FCleanupBufferAllocViewPass.cpp is the pass in IREE that removes hal.interface.binding ops that are dead (i.e. whose only uses are in alignment operations).

I haven't looked at this PR itself to get all the details

@Abhishek-Varma
Contributor

So reinitializing the mlir-air submodules in iree-amd-aie helped.

The error reported earlier is not there anymore. @newling @MaheshRavishankar

For bf16 I'm now getting:

ld.lld: error: undefined symbol: me_primitive::control_rnd
>>> referenced by mm.o
>>>               /proj/xcohdstaff6/abhvarma/iree/build/mm.o:(void matmul_scalar<bfloat16, bfloat16, 4, 4, 4>(bfloat16*, unsigned int, bfloat16*, unsigned int, bfloat16*, unsigned int))
clang-17: error: ld.lld command failed with exit code 1 (use -v to see invocation)

I'm not sure why it complains within the same mm.o that I used for the i32 case too.

I'm attaching the IR e2e log for reference.

@Abhishek-Varma
Contributor

Updates besides the bf16 case:

  1. I've updated the revision incorporating all the "generalizability" I was aiming for in the iree-amdaie-lower-to-ukernel pass.
    It's ready for re-review : [AMD-AIE] Add a pass to lower to Ukernels #175

  2. I've raised a draft PR on MLIR-AIR : [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463
    Here we need to converge on two points:
    a. How do we glue in mm.o (building and all) to verify e2e?
    b. Who amongst IREE/MLIR-AIR/MLIR-AIE should take the onus of adding the path? In the draft PR I've listed possible solutions and have gone ahead with one of them - we can discuss this during sync.

@Abhishek-Varma
Contributor

Updates:

  1. I worked on updating --iree-amdaie-lower-to-ukernel to handle the linalg.generic we get at the end of the simple-pack pipeline.
  2. I also worked on figuring out how to eventually build the e2e test, which requires mm.o. Since we need link_with, I found the answer to the question I posed yesterday: I'm indeed letting IREE take the onus of spilling out where the ukernel is. The best way to do this was updating getFnAndDefAttrs of --iree-amdaie-lower-to-ukernel.

I'm attaching the IR e2e log - you'd see that the link_with construct now gets relayed cleanly from iree_codegen.ukernel -> func.call.

I'm happy because this further integrates cleanly with the MLIR-AIR changes for which I raised the draft PR. :)

We need to discuss how to glue in mm.o (and the corresponding changes) for e2e - I've created the following write up for us to discuss better.

RFC - DISCUSS UKERNEL INTEGRATION.

So, we are using mm.c to put down the pieces for ukernel-based lowering via IREE codegen. The idea is to first do the bare minimum required, i.e. a non-performant matmul run via the microkernel.

Section A (walk-through):

Following is how it has been tested to work (taking the matmul_scalar_i32_i32 API as an example):

  1. Through iree-amdaie-lower-to-ukernels we will create:
iree_codegen.ukernel "matmul_scalar_i32_i32" (
                      lhs_memref_base_pointer, lhs_memref_offset,
                      rhs_memref_base_pointer, rhs_memref_offset,
                      out_memref_base_pointer, out_memref_offset )
  2. The above op is converted to a func.call @matmul_scalar_i32_i32(.....).
  3. This is then 1:1 linked with the microkernel defined in the object file of mm.c, and it works.

Section B (fiddling with mlir-aie):

Now, let's look at the changes required to make this happen:

  1. The function declaration/definition of matmul_scalar was changed to include *_memref_offset arguments:
template <typename T_in, typename T_out, int M, int K, int N>
void matmul_scalar(T_in *a, unsigned offsetA, T_in *b, unsigned offsetB, T_out *c, unsigned offsetC) {
  event0();
  for (int row = 0; row < M; row++) {
    for (int col = 0; col < N; col++) {
      T_out running_sum = 0;
      for (int i = 0; i < K; i++) {
        running_sum += a[offsetA + row * K + i] * b[offsetB + i * N + col];
      }
      c[offsetC + row * N + col] += running_sum;
    }
  }
  event1();
}
  2. Arrangements to include/support i32 via the microkernel were made (the matmul_variant related APIs were commented out):
#define combos(X)                                                              \
  X(int32, i32, int32, i32, 4, 4, 4)                                           \
  X(int16, i16, int16, i16, 4, 4, 4)                                           \
  X(bfloat16, bf16, bfloat16, bf16, 4, 8, 4)                                   \
  X(bfloat16, bf16, float, f32, 4, 8, 4)
  3. The following command was run to generate mm.o (I might've missed it, but I couldn't find mm.o being generated while building mlir-aie - therefore I had to generate it explicitly after fighting with mlir-aie/build/bin/xchesscc_wrapper):
./build/bin/xchesscc_wrapper AIE2 -I \
         /proj/xbuilds/SWIP/2023.2_1013_2256/installs/lin64/Vitis/2023.2/aietools/include \
         -c reference_designs/ipu-xrt/matrix_multiplication/mm.cc -o mm.o
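For what it's worth, the matmul_scalar template from B.1 can be exercised off-device by stubbing the event0/event1 AIE event intrinsics; the stubs and test sizes below are my assumptions for a host-side check, not how the kernel runs on the core:

```cpp
// Host-side stubs for the AIE event intrinsics so the template compiles
// off-device (on the AIE core these are real tracing intrinsics).
static void event0() {}
static void event1() {}

// The matmul_scalar template as shown in B.1 above.
template <typename T_in, typename T_out, int M, int K, int N>
void matmul_scalar(T_in *a, unsigned offsetA, T_in *b, unsigned offsetB,
                   T_out *c, unsigned offsetC) {
  event0();
  for (int row = 0; row < M; row++) {
    for (int col = 0; col < N; col++) {
      T_out running_sum = 0;
      for (int i = 0; i < K; i++)
        running_sum += a[offsetA + row * K + i] * b[offsetB + i * N + col];
      // Accumulates into c, so c must be zero-initialized by the caller.
      c[offsetC + row * N + col] += running_sum;
    }
  }
  event1();
}
```

Note that the output is accumulated (`+=`), which is why the surrounding IR pairs this call with a zero fill of the output buffer.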

Section C (point-of-convergence)

We need to converge on how we would want to use microkernels (starting with the non-performant variant) in iree-amd-aie's CI.

  1. We clearly need to deal with Sections B.1 and B.2 - can this be handled in MLIR-AIE, or do we have the required version in iree-amd-aie?
  2. With respect to Section B.3 - can we simply take the xchesscc_wrapper and keep it within iree-amd-aie? And then invoke that with Vitis and the microkernel architecture to generate mm.o (object files) while we build IREE?
  3. With respect to Section B.2 (required for the simple-pack pipeline) - we also need to perform a similar modification to matmul_vector - any suggestions?

The following comment, which I added in my WIP branch, helps reinforce why I'm suggesting Section C.2:

static std::string getPathToUKernelObjectFile(std::string ukernelObjectFile) {
  // TODO(avarma): The idea is that we build the microkernel object files while
  // building IREE with iree-amd-aie plugin. This way we know where the object
  // files is going to reside and by extension the onus of spilling out the path
  // to the same is going to be on IREE(iree-amd-aie); therefore we would want
  // this path to bear that onus. For now, since it is a WIP/RFC, I'm hardcoding
  // it to the path on my xcorad* machine.
  std::string pathToParentFolder = "/proj/xcohdstaff6/abhvarma/iree/build/";
  return pathToParentFolder + ukernelObjectFile;
}

I've added the above function to the iree-amdaie-lower-to-ukernel pass, and that's the "beautiful" aspect of it: all information required by the lower stack is now captured by iree_codegen.ukernel.

@Abhishek-Varma
Contributor

Updates:

  1. While trying to integrate yesterday's --iree-amdaie-lower-to-ukernels with mlir-air, I found that one more round of updates was required in MLIR-AIR to be able to fetch the link_with construct properly.
    I worked on implementing that and have pushed my changes to MLIR-AIR along with test cases: [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463

  2. I was experimenting with my RFC suggestion from yesterday and have added a WIP which we can use for the sync discussion - Compile microkernel during build.

@Abhishek-Varma
Contributor

Updates:

  1. Was working on basic integration - figured it's best to add another flag to point to the ukernel path.
  2. Cleaned up the implementation and have raised a PR: [AMD-AIE] Update lower-to-ukernel pass #193
  3. Was trying to build MLIR-AIR - ran into a few build issues - will ask around.
  4. Tried instead to look into the CI failure log and have addressed it: [AIR][AMD-AIE] Add changes to enable IREE microkernel based lowering Xilinx/mlir-air#463
  5. Trying to see how to use iree_cc_library instead of add_custom_target/function to generate the .o file - the base idea is going to remain the same (for now) w.r.t. my proposed design - WIP of .o file's generation via iree-amd-aie.

@Abhishek-Varma
Contributor

My updates:

  1. Using @daveliddell's pointers, I first tried generating the xclbin via aiecc.py - no issues there.
    With the same system setup I retried generating the xclbin via IREE - it still led to the same issue reported earlier:
    ld.lld: error: undefined symbol: me_primitive::control_rnd
>>> referenced by mm.o
>>>               /proj/xcohdstaff6/abhvarma/iree/build/mm.o:(void matmul_scalar<bfloat16, bfloat16, 4, 4, 4>(bfloat16*, unsigned int, bfloat16*, unsigned int, bfloat16*, unsigned int))
clang-17: error: ld.lld command failed with exit code 1 (use -v to see invocation)
NOTE: `me_primitive::control_rnd` is a Vitis-specific symbol.
  2. On testing the updated MLIR-AIR logic for the Pad based pipeline, I bumped into an issue of link_with not getting relayed to aie.core.
    I have raised a PR that fixes this: [AIR][AMD-AIE] Fix link_with relaying for nested func.call ops Xilinx/mlir-air#473

@Abhishek-Varma
Contributor

My updates for the day are as follows:

  1. Added required changes and updated mlir-air/mlir-aie submodules - [AMD-AIE] Update mlir-air/mlir-aie submodules #203
  2. Worked on the CMakeLists.txt that I was adding to help compile .o microkernel object file while building MLIR-AIE : [AMD-AIE] Build microkernels for AMD-AIE Xilinx/mlir-aie#1109
  3. Used 1 and 2 to further confirm that the first e2e vanilla microkernel of i32 type for the Pad based pipeline can soon be reproducible by all of us!

CC: @MaheshRavishankar

@Abhishek-Varma
Contributor

My updates :-

  1. I have addressed the review comment on MLIR-AIE .o ukernel building PR and it is merged now - [AMD-AIE] Build microkernels for AMD-AIE Xilinx/mlir-aie#1109

  2. The issue with bf16 was that the compile command required --iree-amd-aie-enable-chess - it took time to figure this out since I was chasing red herrings.
    This wasn't the case with the i32 type, since no chess intrinsics were involved there - it worked well with Peano!

  3. Despite the above, there was the following error:

WorkersError in "/home/user/mlir-aie/reference_designs/ipu-xrt/matrix_multiplication/build/aie.mlir.prj/segment_0_core_0_2.bcf" 
line 41, column 15: syntax error

I worked on narrowing it down by executing aiecc.py in parallel and manually comparing the logs.
I compared the failing command and then the individual contents of the files, and was able to figure out that the issue is with the parsing of absolute paths by Chess. This again wasn't the case with Peano - but since this is the first time the link_with construct is being used with Chess, the issue has now popped up.

  4. I further confirmed my speculation by reproducing it with aiecc.py itself (which was otherwise working) - I have therefore raised an issue in MLIR-AIE, since I wasn't aware of a better place to do so: Chess issue with absolute path link_with | _include _file Xilinx/mlir-aie#1120

  5. I then used mm.o instead of the absolute path to mm.o and confirmed that we are able to generate a .vmfb for bf16 types! It is getting stuck at runtime though.
    Here is the e2e IR for bf16.

CC: @MaheshRavishankar

@Abhishek-Varma
Contributor

Adding an update on the current state of ukernel support for Pad-Pack:

  1. We will have a performant Pad-Pack ukernel working once the Generalize tile/pack PR and the Update Pad-Pack ukernel in MLIR-AIE build PR get merged.

  2. @erwei-xilinx pointed out a performance mismatch in the current implementation and rightly noted that it's because of the non-performant linalg.fill.

    We have a performant ukernel variant that addresses the slow zero fill issue.

    I worked on updating the --iree-amdaie-lower-to-ukernels pass to lower a zero-filling linalg.fill to a ukernel.

    Attaching my WIP branch, which lowers the linalg.fill to the ukernel as intended.

  3. The above leads to an issue in MLIR-AIR - I worked on triaging it, and it originates in the air-dependency pass (I had to add a fix to this pass previously while working on ukernels, and it looks like I might have to update it further).

Input that is leading to the issue:

%base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %alloc_9 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
func.call @zero_bf16(%base_buffer, %c0_7) : (memref<bf16, 2 : i32>, index) -> ()
scf.for %arg19 = %c0_7 to %c256_8 step %c8 {
  %7 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg19]
  %alloc_10 = memref.alloc() : memref<8x16x4x8xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_10[%c0_7] [%c4096] [%c1_6], %arg16[%c0_7, %5, %7] [%c8, %c64, %c8] [%c8, %c2048_5, %c1_6]) : (memref<8x16x4x8xbf16, 2 : i32>, memref<128x2048xbf16, 1 : i32>)
  %alloc_11 = memref.alloc() : memref<16x8x8x4xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_11[%c0_7] [%c4096] [%c1_6], %arg17[%c0_7, %7, %6] [%c16_4, %c64, %c4] [%c4, %c128_3, %c1_6]) : (memref<16x8x8x4xbf16, 2 : i32>, memref<2048x128xbf16, 1 : i32>)
  %base_buffer_12, %offset_13, %sizes_14:4, %strides_15:4 = memref.extract_strided_metadata %alloc_10 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_16, %offset_17, %sizes_18:4, %strides_19:4 = memref.extract_strided_metadata %alloc_11 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  func.call @matmul_bf16_bf16(%base_buffer_12, %c0_7, %base_buffer_16, %c0_7, %base_buffer, %c0_7) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  memref.dealloc %alloc_10 : memref<8x16x4x8xbf16, 2 : i32>
  memref.dealloc %alloc_11 : memref<16x8x8x4xbf16, 2 : i32>
}

Input that generates correct IR:

linalg.fill ins(%cst : bf16) outs(%alloc_9 : memref<16x16x4x4xbf16, 2 : i32>)
scf.for %arg19 = %c0_7 to %c256_8 step %c8 {
  %7 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg19]
  %alloc_10 = memref.alloc() : memref<8x16x4x8xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_10[%c0_7] [%c4096] [%c1_6], %arg16[%c0_7, %5, %7] [%c8, %c64, %c8] [%c8, %c2048_5, %c1_6]) : (memref<8x16x4x8xbf16, 2 : i32>, memref<128x2048xbf16, 1 : i32>)
  %alloc_11 = memref.alloc() : memref<16x8x8x4xbf16, 2 : i32>
  air.dma_memcpy_nd (%alloc_11[%c0_7] [%c4096] [%c1_6], %arg17[%c0_7, %7, %6] [%c16_4, %c64, %c4] [%c4, %c128_3, %c1_6]) : (memref<16x8x8x4xbf16, 2 : i32>, memref<2048x128xbf16, 1 : i32>)
  %base_buffer, %offset, %sizes:4, %strides:4 = memref.extract_strided_metadata %alloc_10 : memref<8x16x4x8xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_12, %offset_13, %sizes_14:4, %strides_15:4 = memref.extract_strided_metadata %alloc_11 : memref<16x8x8x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  %base_buffer_16, %offset_17, %sizes_18:4, %strides_19:4 = memref.extract_strided_metadata %alloc_9 : memref<16x16x4x4xbf16, 2 : i32> -> memref<bf16, 2 : i32>, index, index, index, index, index, index, index, index, index
  func.call @matmul_bf16_bf16(%base_buffer, %c0_7, %base_buffer_12, %c0_7, %base_buffer_16, %c0_7) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  memref.dealloc %alloc_10 : memref<8x16x4x8xbf16, 2 : i32>
  memref.dealloc %alloc_11 : memref<16x8x8x4xbf16, 2 : i32>
}

What is happening here is that, while constructing air.execute dependencies, the pass doesn't take into account the air.execute that wraps the output buffer, and as a result it fails at a much later stage of the pipeline, in the DmaToChannel pass.
Both of the above inputs are VALID - air-dependency is failing to handle the former case.

Pasting below the invalid air.execute:

Current case (I'm intentionally showing the pretty/assembly format of the invalid IR):

  %async_token_47 = air.execute [%arg20, %async_token_45, %async_token_43] {
    func.call @matmul_bf16_bf16(%results_42#0, %c0_25, %results_44#0, %c0_25, %results_34#0, %c0_25) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  } {id = 22 : i32}

Expected case :-

  %async_token_47 = air.execute [%arg20, %async_token_45, %async_token_43, %async_token_31] {
    func.call @matmul_bf16_bf16(%results_42#0, %c0_25, %results_44#0, %c0_25, %results_34#0, %c0_25) : (memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index, memref<bf16, 2 : i32>, index) -> ()
  } {id = 22 : i32}
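The difference between the two air.execute ops above is one missing token (%async_token_31, the producer of the output buffer). A toy model of the dependency collection at issue - my own sketch, not the actual MLIR-AIR implementation - shows how only walking operands defined inside the loop drops the hoisted output buffer's token:

```cpp
#include <set>
#include <string>
#include <vector>

// Toy model of an operand of the func.call: the async token of its producing
// air.execute, and whether the producer sits inside the scf.for body.
struct Value {
  std::string token;
  bool definedInLoop;
};

// Buggy variant (mirrors the "current case" above): only operands produced
// inside the loop contribute their tokens, so the hoisted output buffer's
// token (async_token_31 in the IR above) is dropped.
std::set<std::string> collectDepsBuggy(const std::vector<Value>& operands) {
  std::set<std::string> deps;
  for (const auto& v : operands)
    if (v.definedInLoop) deps.insert(v.token);
  return deps;
}

// Fixed variant (mirrors the "expected case"): every operand with an async
// producer contributes its token, regardless of where it was defined.
std::set<std::string> collectDepsFixed(const std::vector<Value>& operands) {
  std::set<std::string> deps;
  for (const auto& v : operands) deps.insert(v.token);
  return deps;
}
```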

I'll dig further on this and try adding a fix for the same as well.

@erwei-xilinx
Contributor

@Abhishek-Varma Thanks for looking into this.
