[AMDGPU] Enable unaligned scratch accesses #110219
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: code in some places already assumed that unaligned accesses with the specialized flat scratch instructions are allowed independent of FeatureUnalignedScratchAccess. This patch uses this interpretation everywhere.

Part of SWDEV-455845.

Patch is 1.19 MiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/110219.diff

11 Files Affected:
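Before the truncated diff, here is a minimal C++ sketch of the check this revision of the patch installs for the private and flat address spaces. All names are illustrative stand-ins rather than the real GCNSubtarget/SITargetLowering interfaces; only the predicate names mirror the helpers touched in the diff below.

#include <cstdint>

// Illustrative stand-in for the subtarget queries used by this patch.
struct SubtargetInfo {
  bool UnalignedScratchAccess; // FeatureUnalignedScratchAccess
  bool UnalignedAccessMode;    // hardware unaligned-access-mode bit
  bool FlatScratch;            // whether flat scratch instructions are used

  // Mirrors the new GCNSubtarget::hasUnalignedScratchAccessEnabled():
  // the feature only takes effect when the unaligned access mode is on.
  bool hasUnalignedScratchAccessEnabled() const {
    return UnalignedScratchAccess && UnalignedAccessMode;
  }
  bool enableFlatScratch() const { return FlatScratch; }
};

// Sketch of the unified check for private and flat address spaces: a
// misaligned access is accepted when it is at least 4-byte aligned, when
// flat scratch instructions are in use, or when unaligned scratch accesses
// are enabled; unaligned accesses are treated as legal but not "fast".
bool allowsMisalignedPrivateOrFlatAccess(const SubtargetInfo &ST,
                                         uint64_t AlignInBytes,
                                         bool *IsFast = nullptr) {
  bool AlignedBy4 = AlignInBytes >= 4;
  if (IsFast)
    *IsFast = AlignedBy4;
  return AlignedBy4 || ST.enableFlatScratch() ||
         ST.hasUnalignedScratchAccessEnabled();
}

Under this sketch, a 1-byte-aligned access is only rejected (and later split by legalization) when the unaligned-scratch capability is off and flat scratch is not in use.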
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 3626fd8bc78c15..bb548575a3e083 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1171,9 +1171,9 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
FeatureAddNoCarryInsts, FeatureGFX8Insts, FeatureGFX7GFX8GFX9Insts,
FeatureScalarFlatScratchInsts, FeatureScalarAtomics, FeatureR128A16,
FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
- FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess,
- FeatureNegativeScratchOffsetBug, FeatureGWS, FeatureDefaultComponentZero,
- FeatureVmemWriteVgprInOrder
+ FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+ FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
+ FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder
]
>;
@@ -1192,9 +1192,9 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
FeatureVOP3Literal, FeatureDPP8, FeatureExtendedImageInsts,
FeatureNoDataDepHazard, FeaturePkFmacF16Inst,
FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureG16,
- FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, FeatureImageInsts,
- FeatureGDS, FeatureGWS, FeatureDefaultComponentZero,
- FeatureMaxHardClauseLength63,
+ FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+ FeatureUnalignedDSAccess, FeatureImageInsts, FeatureGDS, FeatureGWS,
+ FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
FeatureVmemWriteVgprInOrder
@@ -1216,9 +1216,9 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
FeatureVOP3Literal, FeatureDPP8, FeatureExtendedImageInsts,
FeatureNoDataDepHazard, FeaturePkFmacF16Inst,
FeatureA16, FeatureFastDenormalF32, FeatureG16,
- FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, FeatureGDS,
- FeatureGWS, FeatureDefaultComponentZero,
- FeatureMaxHardClauseLength32,
+ FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+ FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
+ FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
FeatureVmemWriteVgprInOrder
]
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 4cf7733a260ff0..909b0e2f1c9861 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -387,8 +387,9 @@ bool GCNTTIImpl::isLegalToVectorizeMemChain(unsigned ChainSizeInBytes,
// them later if they may access private memory. We don't have enough context
// here, and legalization can handle it.
if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
- return (Alignment >= 4 || ST->hasUnalignedScratchAccess()) &&
- ChainSizeInBytes <= ST->getMaxPrivateElementSize();
+ return (Alignment >= 4 || ST->hasUnalignedScratchAccessEnabled() ||
+ ST->enableFlatScratch()) &&
+ ChainSizeInBytes <= ST->getMaxPrivateElementSize();
}
return true;
}
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e6b7342d5fffcf..45d3ded9fa7bd5 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -590,6 +590,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
return UnalignedScratchAccess;
}
+ bool hasUnalignedScratchAccessEnabled() const {
+ return UnalignedScratchAccess && UnalignedAccessMode;
+ }
+
bool hasUnalignedAccessMode() const {
return UnalignedAccessMode;
}
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 885ecab891b1f5..bf0f806ca50011 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1831,26 +1831,17 @@ bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
Subtarget->hasUnalignedDSAccessEnabled();
}
- if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
- bool AlignedBy4 = Alignment >= Align(4);
- if (IsFast)
- *IsFast = AlignedBy4;
-
- return AlignedBy4 ||
- Subtarget->enableFlatScratch() ||
- Subtarget->hasUnalignedScratchAccess();
- }
-
// FIXME: We have to be conservative here and assume that flat operations
// will access scratch. If we had access to the IR function, then we
// could determine if any private memory was used in the function.
- if (AddrSpace == AMDGPUAS::FLAT_ADDRESS &&
- !Subtarget->hasUnalignedScratchAccess()) {
+ if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS ||
+ AddrSpace == AMDGPUAS::FLAT_ADDRESS) {
bool AlignedBy4 = Alignment >= Align(4);
if (IsFast)
*IsFast = AlignedBy4;
- return AlignedBy4;
+ return AlignedBy4 || Subtarget->enableFlatScratch() ||
+ Subtarget->hasUnalignedScratchAccessEnabled();
}
// So long as they are correct, wide global memory operations perform better
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
index e67f3620d013c7..bc33dc2a9d1926 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
@@ -1,10 +1,10 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=bonaire -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=CI %s
# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=VI %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9 %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX9PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX11PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=GFX11PLUS %s
---
name: test_load_flat_s1_align1
@@ -30,14 +30,23 @@ body: |
; VI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
; VI-NEXT: $vgpr0 = COPY [[AND]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s1_align1
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
- ; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
- ; GFX9-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
- ; GFX9-NEXT: $vgpr0 = COPY [[AND]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s1_align1
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
+ ; GFX9PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s1_align1
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX11PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
+ ; GFX11PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s1) = G_LOAD %0 :: (load (s1), align 1, addrspace 0)
%2:_(s32) = G_ZEXT %1
@@ -68,14 +77,23 @@ body: |
; VI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
; VI-NEXT: $vgpr0 = COPY [[AND]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s2_align1
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
- ; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
- ; GFX9-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
- ; GFX9-NEXT: $vgpr0 = COPY [[AND]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s2_align1
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
+ ; GFX9PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s2_align1
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX11PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
+ ; GFX11PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s2) = G_LOAD %0 :: (load (s2), align 1, addrspace 0)
%2:_(s32) = G_ZEXT %1
@@ -102,12 +120,19 @@ body: |
; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s8_align4
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
- ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s8_align4
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s8_align4
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s8) = G_LOAD %0 :: (load (s8), align 4, addrspace 0)
%2:_(s32) = G_ANYEXT %1
@@ -134,12 +159,19 @@ body: |
; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s8_align1
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
- ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s8_align1
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s8_align1
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s8) = G_LOAD %0 :: (load (s8), align 1, addrspace 0)
%2:_(s32) = G_ANYEXT %1
@@ -166,12 +198,19 @@ body: |
; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s16_align4
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
- ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s16_align4
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s16_align4
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s16) = G_LOAD %0 :: (load (s16), align 4, addrspace 0)
%2:_(s32) = G_ANYEXT %1
@@ -198,12 +237,19 @@ body: |
; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s16_align2
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
- ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s16_align2
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s16_align2
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s16) = G_LOAD %0 :: (load (s16), align 2, addrspace 0)
%2:_(s32) = G_ANYEXT %1
@@ -242,18 +288,25 @@ body: |
; VI-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
; VI-NEXT: $vgpr0 = COPY [[OR]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s16_align1
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s8))
- ; GFX9-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
- ; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s8) from unknown-address + 1)
- ; GFX9-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
- ; GFX9-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
- ; GFX9-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
- ; GFX9-NEXT: $vgpr0 = COPY [[OR]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s16_align1
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s8))
+ ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+ ; GFX9PLUS-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s8) from unknown-address + 1)
+ ; GFX9PLUS-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
+ ; GFX9PLUS-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
+ ; GFX9PLUS-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[OR]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s16_align1
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 1)
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s16) = G_LOAD %0 :: (load (s16), align 1, addrspace 0)
%2:_(s32) = G_ANYEXT %1
@@ -280,12 +333,19 @@ body: |
; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s32_align4
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
- ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s32_align4
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s32_align4
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
+ ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
%0:_(p0) = COPY $vgpr0_vgpr1
%1:_(s32) = G_LOAD %0 :: (load (s32), align 4, addrspace 0)
$vgpr0 = COPY %1
@@ -323,18 +383,25 @@ body: |
; VI-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
; VI-NEXT: $vgpr0 = COPY [[OR]](s32)
;
- ; GFX9-LABEL: name: test_load_flat_s32_align2
- ; GFX9: liveins: $vgpr0_vgpr1
- ; GFX9-NEXT: {{ $}}
- ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
- ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s16))
- ; GFX9-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
- ; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
- ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s16) from unknown-address + 2)
- ; GFX9-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
- ; GFX9-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
- ; GFX9-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
- ; GFX9-NEXT: $vgpr0 = COPY [[OR]](s32)
+ ; GFX9PLUS-LABEL: name: test_load_flat_s32_align2
+ ; GFX9PLUS: liveins: $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: {{ $}}
+ ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX9PLUS-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s16))
+ ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+ ; GFX9PLUS-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
+ ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s16) from unknown-address + 2)
+ ; GFX9PLUS-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
+ ; GFX9PLUS-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
+ ; GFX9PLUS-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
+ ; GFX9PLUS-NEXT: $vgpr0 = COPY [[OR]](s32)
+ ;
+ ; GFX11PLUS-LABEL: name: test_load_flat_s32_align2
+ ; GFX11PLUS: liveins: $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: {{ $}}
+ ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+ ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32),...
[truncated]
LGTM
     bool AlignedBy4 = Alignment >= Align(4);
     if (IsFast)
       *IsFast = AlignedBy4;

-    return AlignedBy4;
+    return AlignedBy4 || Subtarget->enableFlatScratch() ||
enableFlatScratch should be checked; the scratch instructions should still respect whether the unaligned access mode is enabled.
Note that we don't do that in the current trunk state: see the old line 1840. Removing Subtarget->enableFlatScratch() from the OR would make code generation more pessimistic for cases where unaligned scratch is not explicitly enabled but flat scratch is, e.g., gfx12. Should we change that anyway?
I think this is just a bug. The question of whether we should assume unaligned access is enabled or not is a question of the driver (it really should just always be enabled, but I'm not sure it is). The place to fix that is somewhere else, and not in these low level usage checks
Do you mean (1) that Subtarget->enableFlatScratch() should be dropped from the OR, or (2) that it should become (Subtarget->enableFlatScratch() && Subtarget->hasUnalignedAccessMode()) (i.e., it would then depend on whether the mode bit is set, but would not require the UnalignedScratchAccess feature)?
Dropped from the OR. The unaligned access mode is orthogonal to which scratch ABI is in use.
IIRC the difficulty is that Linux always enables unaligned access and Windows does not. As an incorrect way of dealing with this, the subtarget constructor assumes unaligned access is enabled for amdhsa (which will be wrong if executed on the compatibility layer on top of PAL on Windows) and not for amdpal. Don't remember what we do for unknown / mesa3d.
Done in 2e43c4c.
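The post-review code is not shown in the truncated diff, but based on this thread the condition presumably reduces to the following sketch (same illustrative SubtargetInfo as in the sketch near the top, repeated so the snippet stands alone): enableFlatScratch() is gone from the disjunction, so the scratch ABI no longer influences the legality of misaligned accesses.

#include <cstdint>

// Illustrative stand-in, as in the earlier sketch.
struct SubtargetInfo {
  bool UnalignedScratchAccess; // FeatureUnalignedScratchAccess
  bool UnalignedAccessMode;    // hardware unaligned-access-mode bit
  bool hasUnalignedScratchAccessEnabled() const {
    return UnalignedScratchAccess && UnalignedAccessMode;
  }
};

// After the review: only 4-byte alignment or the unaligned-scratch
// capability (feature plus mode bit) makes a misaligned private/flat
// access acceptable.
bool allowsMisalignedPrivateOrFlatAccess(const SubtargetInfo &ST,
                                         uint64_t AlignInBytes,
                                         bool *IsFast = nullptr) {
  bool AlignedBy4 = AlignInBytes >= 4;
  if (IsFast)
    *IsFast = AlignedBy4;
  return AlignedBy4 || ST.hasUnalignedScratchAccessEnabled();
}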
-    return (Alignment >= 4 || ST->hasUnalignedScratchAccess()) &&
-           ChainSizeInBytes <= ST->getMaxPrivateElementSize();
+    return (Alignment >= 4 || ST->hasUnalignedScratchAccessEnabled() ||
+            ST->enableFlatScratch()) &&
Same here
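With the same treatment applied to the vectorizer-side check quoted above, the condition would presumably look like the following sketch. The free function and plain-unsigned parameters are simplifications; the real code is GCNTTIImpl::isLegalToVectorizeMemChain, and its final post-review form is not part of the truncated diff.

// Illustrative stand-in for the subtarget queries, as in the sketches above.
struct SubtargetInfo {
  bool UnalignedScratchAccess;    // FeatureUnalignedScratchAccess
  bool UnalignedAccessMode;       // hardware unaligned-access-mode bit
  unsigned MaxPrivateElementSize; // largest private-memory element size
  bool hasUnalignedScratchAccessEnabled() const {
    return UnalignedScratchAccess && UnalignedAccessMode;
  }
};

// A private-address chain is vectorizable when it is 4-byte aligned or
// unaligned scratch accesses are enabled, and the chain does not exceed
// the maximum private element size.
bool isLegalToVectorizePrivateChain(const SubtargetInfo &ST,
                                    unsigned ChainSizeInBytes,
                                    unsigned AlignInBytes) {
  return (AlignInBytes >= 4 || ST.hasUnalignedScratchAccessEnabled()) &&
         ChainSizeInBytes <= ST.MaxPrivateElementSize;
}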
Force-pushed from d75f9a3 to 2e43c4c.
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: scratch_load_ubyte v4, v1, off glc
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: scratch_load_ubyte v4, v7, off glc
These tests should probably be switched to using an amdhsa triple in a precommit, so we still end up using the aligned accesses. We probably want dedicated aligned and unaligned cases in the future
I opened #110672 to do that.
As a proxy criterion, mesa targets have unaligned-access-mode (which determines whether the hardware allows unaligned memory accesses) not set whereas amdhsa targets do. This PR changes tests to use amdhsa instead of mesa and inserts additional checks with unaligned-access-mode unset explicitly. This is in preparation for PR llvm#110219, which will generate different code depending on the unaligned-access-mode.
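As an illustration of that precommit's approach, such a test might pair the default amdhsa RUN lines with one that unsets the mode explicitly, roughly as below; the ALIGNED check prefix is hypothetical, and -mattr=-unaligned-access-mode is assumed to be the spelling for clearing the feature.

# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -mattr=-unaligned-access-mode -O0 -run-pass=legalizer %s -o - | FileCheck -check-prefix=ALIGNED %s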
Force-pushed from 2e43c4c to 0d0c4e0.
Rebased on top of the merged PR #110672 to update the GlobalISel tests.
; GFX12-NEXT: s_wait_storecnt 0x0
; GFX12-NEXT: scratch_load_b64 v[0:1], v0, off scope:SCOPE_SYS
; GFX12-NEXT: scratch_store_b8 v0, v2, off offset:1 scope:SCOPE_SYS
These cases shouldn't have been broken up? This is the unaligned enabled path?
That's because this PR adds the FeatureUnalignedScratchAccess feature only to gfx9-11. I'd need to check how and when to update it for gfx12.
I don't think there was any change here in gfx12
Force-pushed from 0e137e4 to aa8ba73.
make flat scratch not imply that unaligned scratch accesses are valid
add FeatureUnalignedScratchAccess to gfx12
Force-pushed from aa8ba73 to c128ad8.
This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics.
Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas, before, code in some places assumed that the latter implies the former.
Part of SWDEV-455845.