[AMDGPU] Fix an issue where the wrong index is used to calculate the byte provider when the op is extract_vector_elt #91697
Conversation
@llvm/pr-subscribers-backend-amdgpu
Author: Shilei Tian (shiltian)
Full diff: https://github.com/llvm/llvm-project/pull/91697.diff
4 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 33bdd6195a040..4d3662f672687 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -12071,12 +12071,8 @@ calculateByteProvider(const SDValue &Op, unsigned Index, unsigned Depth,
return std::nullopt;
auto VecIdx = IdxOp->getZExtValue();
auto ScalarSize = Op.getScalarValueSizeInBits();
- if (ScalarSize != 32) {
- Index = ScalarSize == 8 ? VecIdx : VecIdx * 2 + Index;
- }
-
- return calculateSrcByte(ScalarSize == 32 ? Op : Op.getOperand(0),
- StartingIndex, Index);
+ return calculateSrcByte(Op.getOperand(0), StartingIndex,
+ ScalarSize / 8 * VecIdx + Index);
}
case AMDGPUISD::PERM: {
diff --git a/llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll b/llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll
index 1ba2491d2210e..5fd9b1d855095 100644
--- a/llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll
+++ b/llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll
@@ -2706,25 +2706,33 @@ define amdgpu_kernel void @v_insertelement_v16i16_6(ptr addrspace(1) %out, ptr a
; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
; VI-NEXT: s_load_dword s4, s[4:5], 0x10
; VI-NEXT: v_lshlrev_b32_e32 v8, 5, v0
-; VI-NEXT: v_mov_b32_e32 v12, 0x3020504
+; VI-NEXT: v_mov_b32_e32 v10, 0x3020504
; VI-NEXT: s_waitcnt lgkmcnt(0)
-; VI-NEXT: v_mov_b32_e32 v1, s3
-; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v8
-; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
-; VI-NEXT: v_add_u32_e32 v4, vcc, 16, v0
-; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc
-; VI-NEXT: flat_load_dwordx4 v[0:3], v[0:1]
+; VI-NEXT: v_mov_b32_e32 v0, s3
+; VI-NEXT: v_add_u32_e32 v4, vcc, s2, v8
+; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v0, vcc
+; VI-NEXT: flat_load_dwordx4 v[0:3], v[4:5]
+; VI-NEXT: v_add_u32_e32 v4, vcc, 16, v4
+; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v5, vcc
; VI-NEXT: flat_load_dwordx4 v[4:7], v[4:5]
+; VI-NEXT: s_mov_b32 s2, 0x3020504
; VI-NEXT: v_mov_b32_e32 v9, s1
; VI-NEXT: v_add_u32_e32 v8, vcc, s0, v8
; VI-NEXT: v_addc_u32_e32 v9, vcc, 0, v9, vcc
-; VI-NEXT: v_add_u32_e32 v10, vcc, 16, v8
-; VI-NEXT: v_addc_u32_e32 v11, vcc, 0, v9, vcc
; VI-NEXT: s_waitcnt vmcnt(1)
-; VI-NEXT: v_perm_b32 v3, s4, v3, v12
-; VI-NEXT: s_waitcnt vmcnt(0)
-; VI-NEXT: flat_store_dwordx4 v[10:11], v[4:7]
+; VI-NEXT: v_perm_b32 v3, s4, v3, v10
+; VI-NEXT: v_perm_b32 v2, v2, v2, s2
+; VI-NEXT: v_perm_b32 v1, v1, v1, s2
+; VI-NEXT: v_perm_b32 v0, v0, v0, s2
; VI-NEXT: flat_store_dwordx4 v[8:9], v[0:3]
+; VI-NEXT: s_waitcnt vmcnt(1)
+; VI-NEXT: v_perm_b32 v7, v7, v7, s2
+; VI-NEXT: v_add_u32_e32 v0, vcc, 16, v8
+; VI-NEXT: v_perm_b32 v6, v6, v6, s2
+; VI-NEXT: v_perm_b32 v5, v5, v5, s2
+; VI-NEXT: v_perm_b32 v4, v4, v4, s2
+; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v9, vcc
+; VI-NEXT: flat_store_dwordx4 v[0:1], v[4:7]
; VI-NEXT: s_endpgm
;
; CI-LABEL: v_insertelement_v16i16_6:
diff --git a/llvm/test/CodeGen/AMDGPU/match-perm-extract-vector-elt-bug.ll b/llvm/test/CodeGen/AMDGPU/match-perm-extract-vector-elt-bug.ll
new file mode 100644
index 0000000000000..486ddba799dfe
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/match-perm-extract-vector-elt-bug.ll
@@ -0,0 +1,126 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -verify-machineinstrs -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -verify-machineinstrs -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -verify-machineinstrs -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX11 %s
+
+define protected amdgpu_kernel void @test(ptr addrspace(1) %srcA, ptr addrspace(1) %srcB, ptr addrspace(1) %dst) {
+; GFX9-LABEL: test:
+; GFX9: ; %bb.0: ; %entry
+; GFX9-NEXT: s_load_dword s7, s[4:5], 0x24
+; GFX9-NEXT: s_load_dword s8, s[4:5], 0x40
+; GFX9-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX9-NEXT: s_load_dwordx2 s[2:3], s[4:5], 0x10
+; GFX9-NEXT: s_waitcnt lgkmcnt(0)
+; GFX9-NEXT: s_and_b32 s4, s7, 0xffff
+; GFX9-NEXT: s_mul_i32 s6, s6, s4
+; GFX9-NEXT: s_add_i32 s8, s8, s6
+; GFX9-NEXT: v_add_u32_e32 v0, s8, v0
+; GFX9-NEXT: v_ashrrev_i32_e32 v1, 31, v0
+; GFX9-NEXT: v_lshlrev_b64 v[4:5], 4, v[0:1]
+; GFX9-NEXT: v_mov_b32_e32 v1, s1
+; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, s0, v4
+; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, v1, v5, vcc
+; GFX9-NEXT: global_load_dwordx4 v[0:3], v[0:1], off
+; GFX9-NEXT: s_mov_b32 s0, 0x3020504
+; GFX9-NEXT: v_mov_b32_e32 v6, s3
+; GFX9-NEXT: v_add_co_u32_e32 v4, vcc, s2, v4
+; GFX9-NEXT: v_addc_co_u32_e32 v5, vcc, v6, v5, vcc
+; GFX9-NEXT: s_waitcnt vmcnt(0)
+; GFX9-NEXT: v_perm_b32 v2, v2, v2, s0
+; GFX9-NEXT: v_perm_b32 v0, v0, v0, s0
+; GFX9-NEXT: v_not_b32_e32 v3, v3
+; GFX9-NEXT: v_not_b32_e32 v1, v1
+; GFX9-NEXT: v_not_b32_e32 v2, v2
+; GFX9-NEXT: v_not_b32_e32 v0, v0
+; GFX9-NEXT: global_store_dwordx4 v[4:5], v[0:3], off
+; GFX9-NEXT: s_endpgm
+;
+; GFX10-LABEL: test:
+; GFX10: ; %bb.0: ; %entry
+; GFX10-NEXT: s_clause 0x1
+; GFX10-NEXT: s_load_dword s0, s[4:5], 0x24
+; GFX10-NEXT: s_load_dword s2, s[4:5], 0x40
+; GFX10-NEXT: s_waitcnt lgkmcnt(0)
+; GFX10-NEXT: s_and_b32 s3, s0, 0xffff
+; GFX10-NEXT: s_load_dwordx2 s[0:1], s[4:5], 0x0
+; GFX10-NEXT: s_mul_i32 s6, s6, s3
+; GFX10-NEXT: v_add3_u32 v0, s2, s6, v0
+; GFX10-NEXT: s_load_dwordx2 s[2:3], s[4:5], 0x10
+; GFX10-NEXT: v_ashrrev_i32_e32 v1, 31, v0
+; GFX10-NEXT: v_lshlrev_b64 v[4:5], 4, v[0:1]
+; GFX10-NEXT: s_waitcnt lgkmcnt(0)
+; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, s0, v4
+; GFX10-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, s1, v5, vcc_lo
+; GFX10-NEXT: v_add_co_u32 v4, vcc_lo, s2, v4
+; GFX10-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s3, v5, vcc_lo
+; GFX10-NEXT: global_load_dwordx4 v[0:3], v[0:1], off
+; GFX10-NEXT: s_waitcnt vmcnt(0)
+; GFX10-NEXT: v_perm_b32 v2, v2, v2, 0x3020504
+; GFX10-NEXT: v_perm_b32 v0, v0, v0, 0x3020504
+; GFX10-NEXT: v_not_b32_e32 v3, v3
+; GFX10-NEXT: v_not_b32_e32 v1, v1
+; GFX10-NEXT: v_not_b32_e32 v2, v2
+; GFX10-NEXT: v_not_b32_e32 v0, v0
+; GFX10-NEXT: global_store_dwordx4 v[4:5], v[0:3], off
+; GFX10-NEXT: s_endpgm
+;
+; GFX11-LABEL: test:
+; GFX11: ; %bb.0: ; %entry
+; GFX11-NEXT: s_clause 0x1
+; GFX11-NEXT: s_load_b32 s2, s[0:1], 0x24
+; GFX11-NEXT: s_load_b32 s4, s[0:1], 0x40
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_and_b32 s5, s2, 0xffff
+; GFX11-NEXT: s_load_b64 s[2:3], s[0:1], 0x0
+; GFX11-NEXT: s_mul_i32 s15, s15, s5
+; GFX11-NEXT: s_load_b64 s[0:1], s[0:1], 0x10
+; GFX11-NEXT: v_add3_u32 v0, s4, s15, v0
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX11-NEXT: v_ashrrev_i32_e32 v1, 31, v0
+; GFX11-NEXT: v_lshlrev_b64 v[4:5], 4, v[0:1]
+; GFX11-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX11-NEXT: v_add_co_u32 v0, vcc_lo, s2, v4
+; GFX11-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, s3, v5, vcc_lo
+; GFX11-NEXT: v_add_co_u32 v4, vcc_lo, s0, v4
+; GFX11-NEXT: v_add_co_ci_u32_e32 v5, vcc_lo, s1, v5, vcc_lo
+; GFX11-NEXT: global_load_b128 v[0:3], v[0:1], off
+; GFX11-NEXT: s_waitcnt vmcnt(0)
+; GFX11-NEXT: v_perm_b32 v2, v2, v2, 0x3020504
+; GFX11-NEXT: v_perm_b32 v0, v0, v0, 0x3020504
+; GFX11-NEXT: v_not_b32_e32 v3, v3
+; GFX11-NEXT: v_not_b32_e32 v1, v1
+; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_4)
+; GFX11-NEXT: v_not_b32_e32 v2, v2
+; GFX11-NEXT: v_not_b32_e32 v0, v0
+; GFX11-NEXT: global_store_b128 v[4:5], v[0:3], off
+; GFX11-NEXT: s_nop 0
+; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
+; GFX11-NEXT: s_endpgm
+entry:
+ %test.kernarg.segment = call nonnull align 16 dereferenceable(280) ptr addrspace(4) @llvm.amdgcn.kernarg.segment.ptr()
+ %srcA.kernarg.offset4 = bitcast ptr addrspace(4) %test.kernarg.segment to ptr addrspace(4)
+ %srcA.load = load ptr addrspace(1), ptr addrspace(4) %srcA.kernarg.offset4, align 16
+ %dst.kernarg.offset = getelementptr inbounds i8, ptr addrspace(4) %test.kernarg.segment, i64 16
+ %dst.load = load ptr addrspace(1), ptr addrspace(4) %dst.kernarg.offset, align 16
+ %0 = tail call ptr addrspace(4) @llvm.amdgcn.implicitarg.ptr()
+ %1 = getelementptr inbounds i8, ptr addrspace(4) %0, i64 40
+ %2 = load i64, ptr addrspace(4) %1, align 8
+ %3 = tail call i32 @llvm.amdgcn.workgroup.id.x()
+ %4 = getelementptr inbounds i8, ptr addrspace(4) %0, i64 12
+ %5 = load i16, ptr addrspace(4) %4, align 4
+ %6 = zext i16 %5 to i32
+ %7 = mul i32 %3, %6
+ %8 = tail call i32 @llvm.amdgcn.workitem.id.x()
+ %9 = add i32 %7, %8
+ %10 = zext i32 %9 to i64
+ %11 = add i64 %2, %10
+ %sext = shl i64 %11, 32
+ %idxprom = ashr exact i64 %sext, 32
+ %arrayidx = getelementptr inbounds <16 x i8>, ptr addrspace(1) %srcA.load, i64 %idxprom
+ %12 = load <16 x i8>, ptr addrspace(1) %arrayidx, align 16
+ %not = xor <16 x i8> %12, <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
+ %arrayidx2 = getelementptr inbounds <16 x i8>, ptr addrspace(1) %dst.load, i64 %idxprom
+ store <16 x i8> %not, ptr addrspace(1) %arrayidx2, align 16
+ ret void
+}
diff --git a/llvm/test/CodeGen/AMDGPU/permute_i8.ll b/llvm/test/CodeGen/AMDGPU/permute_i8.ll
index 8ac332197215f..7ca9ae359a499 100644
--- a/llvm/test/CodeGen/AMDGPU/permute_i8.ll
+++ b/llvm/test/CodeGen/AMDGPU/permute_i8.ll
@@ -3816,13 +3816,15 @@ define hidden void @extract_v13i64(ptr addrspace(1) %in0, ptr addrspace(1) %in1,
; GFX10-LABEL: extract_v13i64:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX10-NEXT: s_clause 0x1
-; GFX10-NEXT: global_load_dwordx4 v[8:11], v[0:1], off
-; GFX10-NEXT: global_load_dwordx4 v[12:15], v[0:1], off offset:16
+; GFX10-NEXT: s_clause 0x2
+; GFX10-NEXT: global_load_dwordx4 v[8:11], v[0:1], off offset:48
+; GFX10-NEXT: global_load_dwordx4 v[11:14], v[0:1], off
+; GFX10-NEXT: global_load_dwordx4 v[14:17], v[0:1], off offset:64
+; GFX10-NEXT: ; kill: killed $vgpr0 killed $vgpr1
; GFX10-NEXT: s_waitcnt vmcnt(1)
-; GFX10-NEXT: v_perm_b32 v0, v9, v8, 0x3020504
+; GFX10-NEXT: v_perm_b32 v0, v12, v13, 0x1000504
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_perm_b32 v1, v11, v12, 0x1000706
+; GFX10-NEXT: v_perm_b32 v1, v10, v14, 0x1000504
; GFX10-NEXT: global_store_dword v[4:5], v0, off
; GFX10-NEXT: global_store_dword v[6:7], v1, off
; GFX10-NEXT: s_setpc_b64 s[30:31]
@@ -3830,14 +3832,15 @@ define hidden void @extract_v13i64(ptr addrspace(1) %in0, ptr addrspace(1) %in1,
; GFX9-LABEL: extract_v13i64:
; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT: global_load_dwordx4 v[8:11], v[0:1], off
-; GFX9-NEXT: global_load_dwordx4 v[12:15], v[0:1], off offset:16
-; GFX9-NEXT: s_mov_b32 s4, 0x3020504
-; GFX9-NEXT: s_mov_b32 s5, 0x1000706
+; GFX9-NEXT: global_load_dwordx4 v[8:11], v[0:1], off offset:48
+; GFX9-NEXT: global_load_dwordx4 v[11:14], v[0:1], off
+; GFX9-NEXT: global_load_dwordx4 v[14:17], v[0:1], off offset:64
+; GFX9-NEXT: s_mov_b32 s4, 0x1000504
+; GFX9-NEXT: ; kill: killed $vgpr0 killed $vgpr1
; GFX9-NEXT: s_waitcnt vmcnt(1)
-; GFX9-NEXT: v_perm_b32 v0, v9, v8, s4
+; GFX9-NEXT: v_perm_b32 v0, v12, v13, s4
; GFX9-NEXT: s_waitcnt vmcnt(0)
-; GFX9-NEXT: v_perm_b32 v1, v11, v12, s5
+; GFX9-NEXT: v_perm_b32 v1, v10, v14, s4
; GFX9-NEXT: global_store_dword v[4:5], v0, off
; GFX9-NEXT: global_store_dword v[6:7], v1, off
; GFX9-NEXT: s_waitcnt vmcnt(0)
-      return calculateSrcByte(ScalarSize == 32 ? Op : Op.getOperand(0),
-                              StartingIndex, Index);
+      return calculateSrcByte(Op.getOperand(0), StartingIndex,
+                              ScalarSize / 8 * VecIdx + Index);
I feel I might miss some corner case here but I can't figure that out. However, this approach does look pretty straightforward.
I see a regression for this class of test: define hidden void @new(ptr addrspace(1) %in0, ptr addrspace(1) %in1, ptr addrspace(1) %out0) {
@jrbyrnes I tried your case but didn't find anything wrong. I set a breakpoint; here is when it was hit the first time, trying to match:
Here is the trace of the byte:
This looks correct to me. Here is when the breakpoint was hit the second time. Still, it was trying to match:
The trace is still the same:
Should include the "Fixes: SWDEV-..." comment in the description.
Is this one of the tests in the file? What do you mean by regression?
Do we want to include that information in the upstream PR?
Yes
ping
ping +1
Specifically, the format "Fixes: SWDEV-xxxxxx"
What was this regression? Is this one of the existing tests?
Thanks for the ping -- this got buried in emails while I was away. By regression, I mean code quality, not correctness. Using the example provided:

llc -mtriple=amdgcn -mcpu=gfx908 -verify-machineinstrs test.ll

git fetch origin pull/91697/head:fix-cbp (with PR)
git revert bb779e6 (without PR)

I have seen a similar pattern in existing real code (CK). I don't think it is covered in existing lit tests.
With the PR, we have source bytes: t22, t25, t25, and t19. Without the PR, I see source bytes: t22, t25, t25, t25. The issue here seems to be too many source nodes, and trying to combine them is not profitable. It seems the answer is that if we are going to look through extract_vector_elt in calculateByteProvider, we should also look through it in calculateSrcByte. But doing so may cause your crash to reappear (I don't understand how this is causing a crash?)
It didn't cause a crash. It caused wrong results in just one OpenCL conformance test, because the wrong register is used.

Correct asm:

Wrong asm:

I narrowed it down to this place and found out that the index was wrong when calculating the source byte.
That might be beyond the scope of this PR. We can figure out later when it is better to look into this.
Ah, I see the issue. Thanks. It is because the current index calculation condition assumes fewer than 32 bits. Can you try:
It works. Why do we want to differentiate them here?
We don't gain anything from looking into the vectors for scalar sizes >= 32. calculateSrcByte will just return the Op since vector handling hasn't been built yet, and then we will need to re-extract the dword when building the perm (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/SIISelLowering.cpp#L12193) -- by disallowing them, we avoid some unnecessary computation. Additionally, it does not introduce the regression mentioned above. On the other hand, we want to continue looking through to the vector for 8- and 16-bit scalars, as the vector that we are extracting from is usually the common source of the permutation (and not the extracted value itself).
Good to know. Thanks for your explanation. I have updated the patch. |
Thanks, LGTM
Fixes: SWDEV-460097