
[AMDGPU] Enable unaligned scratch accesses #110219

Merged
ritter-x2a merged 3 commits into llvm:main on Oct 11, 2024

Conversation

@ritter-x2a (Member) commented Sep 27, 2024

This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now orthogonal, whereas before, some code assumed that the latter implies the former.

Part of SWDEV-455845.

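In short, the patch adds a subtarget query that requires both the FeatureUnalignedScratchAccess feature and the unaligned-access-mode bit, and consults that query when deciding whether a misaligned private or flat access is legal. The following standalone C++ model is only a sketch of that decision logic, not the actual LLVM code (the real changes are in GCNSubtarget.h and SIISelLowering.cpp, see the diff below); SubtargetModel, allowsMisalignedPrivateOrFlat, and the plain boolean fields are illustrative names.

// Standalone model of the decision logic in this patch; illustrative
// names, not the actual LLVM classes.
#include <cstdint>

struct SubtargetModel {
  bool UnalignedScratchAccess; // FeatureUnalignedScratchAccess
  bool UnalignedAccessMode;    // the unaligned-access-mode bit
  bool FlatScratchEnabled;     // stands in for enableFlatScratch()

  // Mirrors the new GCNSubtarget::hasUnalignedScratchAccessEnabled():
  // unaligned scratch accesses are usable only when both the hardware
  // feature and the unaligned-access mode bit are set.
  bool hasUnalignedScratchAccessEnabled() const {
    return UnalignedScratchAccess && UnalignedAccessMode;
  }
};

// Mirrors the updated private/flat address-space check in
// SITargetLowering::allowsMisalignedMemoryAccessesImpl as posted in this
// diff (the review below asks for the FlatScratchEnabled clause to be
// dropped).
bool allowsMisalignedPrivateOrFlat(const SubtargetModel &ST,
                                   uint64_t AlignInBytes, bool *IsFast) {
  bool AlignedBy4 = AlignInBytes >= 4;
  if (IsFast)
    *IsFast = AlignedBy4; // only 4-byte-aligned accesses are reported fast
  return AlignedBy4 || ST.FlatScratchEnabled ||
         ST.hasUnalignedScratchAccessEnabled();
}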
@llvmbot (Collaborator) commented Sep 27, 2024

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Fabian Ritter (ritter-x2a)

Changes

This allows us to emit wide generic and scratch memory accesses when we do not have alignment information. In cases where accesses happen to be properly aligned or where generic accesses do not go to scratch memory, this improves performance of the generated code by a factor of up to 16x and reduces code size, especially when lowering memcpy and memmove intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more consistent: in some places, the code already assumed that unaligned accesses with the specialized flat scratch instructions are allowed independently of FeatureUnalignedScratchAccess. This patch uses that interpretation consistently.

Part of SWDEV-455845.


Patch is 1.19 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110219.diff

11 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+9-9)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+3-2)
  • (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+4)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+4-13)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir (+2638-1716)
  • (modified) llvm/test/CodeGen/AMDGPU/chain-hi-to-lo.ll (+11-12)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-address-space.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-crash-issue63986.ll (+14-84)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-libcall.ll (+258-2180)
  • (modified) llvm/test/CodeGen/AMDGPU/memcpy-param-combinations.ll (+950-5566)
  • (modified) llvm/test/CodeGen/AMDGPU/memmove-param-combinations.ll (+759-4437)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 3626fd8bc78c15..bb548575a3e083 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1171,9 +1171,9 @@ def FeatureGFX9 : GCNSubtargetFeatureGeneration<"GFX9",
    FeatureAddNoCarryInsts, FeatureGFX8Insts, FeatureGFX7GFX8GFX9Insts,
    FeatureScalarFlatScratchInsts, FeatureScalarAtomics, FeatureR128A16,
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureSupportsXNACK,
-   FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess,
-   FeatureNegativeScratchOffsetBug, FeatureGWS, FeatureDefaultComponentZero,
-   FeatureVmemWriteVgprInOrder
+   FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+   FeatureUnalignedDSAccess, FeatureNegativeScratchOffsetBug, FeatureGWS,
+   FeatureDefaultComponentZero, FeatureVmemWriteVgprInOrder
   ]
 >;
 
@@ -1192,9 +1192,9 @@ def FeatureGFX10 : GCNSubtargetFeatureGeneration<"GFX10",
    FeatureVOP3Literal, FeatureDPP8, FeatureExtendedImageInsts,
    FeatureNoDataDepHazard, FeaturePkFmacF16Inst,
    FeatureA16, FeatureSMemTimeInst, FeatureFastDenormalF32, FeatureG16,
-   FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, FeatureImageInsts,
-   FeatureGDS, FeatureGWS, FeatureDefaultComponentZero,
-   FeatureMaxHardClauseLength63,
+   FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+   FeatureUnalignedDSAccess, FeatureImageInsts, FeatureGDS, FeatureGWS,
+   FeatureDefaultComponentZero, FeatureMaxHardClauseLength63,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF64GlobalInsts,
    FeatureAtomicFMinFMaxF32FlatInsts, FeatureAtomicFMinFMaxF64FlatInsts,
    FeatureVmemWriteVgprInOrder
@@ -1216,9 +1216,9 @@ def FeatureGFX11 : GCNSubtargetFeatureGeneration<"GFX11",
    FeatureVOP3Literal, FeatureDPP8, FeatureExtendedImageInsts,
    FeatureNoDataDepHazard, FeaturePkFmacF16Inst,
    FeatureA16, FeatureFastDenormalF32, FeatureG16,
-   FeatureUnalignedBufferAccess, FeatureUnalignedDSAccess, FeatureGDS,
-   FeatureGWS, FeatureDefaultComponentZero,
-   FeatureMaxHardClauseLength32,
+   FeatureUnalignedBufferAccess, FeatureUnalignedScratchAccess,
+   FeatureUnalignedDSAccess, FeatureGDS, FeatureGWS,
+   FeatureDefaultComponentZero, FeatureMaxHardClauseLength32,
    FeatureAtomicFMinFMaxF32GlobalInsts, FeatureAtomicFMinFMaxF32FlatInsts,
    FeatureVmemWriteVgprInOrder
   ]
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index 4cf7733a260ff0..909b0e2f1c9861 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -387,8 +387,9 @@ bool GCNTTIImpl::isLegalToVectorizeMemChain(unsigned ChainSizeInBytes,
   // them later if they may access private memory. We don't have enough context
   // here, and legalization can handle it.
   if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
-    return (Alignment >= 4 || ST->hasUnalignedScratchAccess()) &&
-      ChainSizeInBytes <= ST->getMaxPrivateElementSize();
+    return (Alignment >= 4 || ST->hasUnalignedScratchAccessEnabled() ||
+            ST->enableFlatScratch()) &&
+           ChainSizeInBytes <= ST->getMaxPrivateElementSize();
   }
   return true;
 }
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index e6b7342d5fffcf..45d3ded9fa7bd5 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -590,6 +590,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
     return UnalignedScratchAccess;
   }
 
+  bool hasUnalignedScratchAccessEnabled() const {
+    return UnalignedScratchAccess && UnalignedAccessMode;
+  }
+
   bool hasUnalignedAccessMode() const {
     return UnalignedAccessMode;
   }
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 885ecab891b1f5..bf0f806ca50011 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1831,26 +1831,17 @@ bool SITargetLowering::allowsMisalignedMemoryAccessesImpl(
            Subtarget->hasUnalignedDSAccessEnabled();
   }
 
-  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS) {
-    bool AlignedBy4 = Alignment >= Align(4);
-    if (IsFast)
-      *IsFast = AlignedBy4;
-
-    return AlignedBy4 ||
-           Subtarget->enableFlatScratch() ||
-           Subtarget->hasUnalignedScratchAccess();
-  }
-
   // FIXME: We have to be conservative here and assume that flat operations
   // will access scratch.  If we had access to the IR function, then we
   // could determine if any private memory was used in the function.
-  if (AddrSpace == AMDGPUAS::FLAT_ADDRESS &&
-      !Subtarget->hasUnalignedScratchAccess()) {
+  if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS ||
+      AddrSpace == AMDGPUAS::FLAT_ADDRESS) {
     bool AlignedBy4 = Alignment >= Align(4);
     if (IsFast)
       *IsFast = AlignedBy4;
 
-    return AlignedBy4;
+    return AlignedBy4 || Subtarget->enableFlatScratch() ||
+           Subtarget->hasUnalignedScratchAccessEnabled();
   }
 
   // So long as they are correct, wide global memory operations perform better
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
index e67f3620d013c7..bc33dc2a9d1926 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/legalize-load-flat.mir
@@ -1,10 +1,10 @@
 # NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
 # RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=bonaire -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=CI %s
 # RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=fiji -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=VI %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9 %s
-# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9 %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx900 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1010 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX9PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1100 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX11PLUS %s
+# RUN: llc -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx1200 -O0 -run-pass=legalizer  %s -o - | FileCheck -check-prefix=GFX11PLUS %s
 
 ---
 name: test_load_flat_s1_align1
@@ -30,14 +30,23 @@ body: |
     ; VI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
     ; VI-NEXT: $vgpr0 = COPY [[AND]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s1_align1
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
-    ; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
-    ; GFX9-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
-    ; GFX9-NEXT: $vgpr0 = COPY [[AND]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s1_align1
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
+    ; GFX9PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s1_align1
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX11PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
+    ; GFX11PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s1) = G_LOAD %0 :: (load (s1), align 1, addrspace 0)
     %2:_(s32) = G_ZEXT %1
@@ -68,14 +77,23 @@ body: |
     ; VI-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
     ; VI-NEXT: $vgpr0 = COPY [[AND]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s2_align1
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
-    ; GFX9-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
-    ; GFX9-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
-    ; GFX9-NEXT: $vgpr0 = COPY [[AND]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s2_align1
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
+    ; GFX9PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s2_align1
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX11PLUS-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 3
+    ; GFX11PLUS-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[LOAD]], [[C]]
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[AND]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s2) = G_LOAD %0 :: (load (s2), align 1, addrspace 0)
     %2:_(s32) = G_ZEXT %1
@@ -102,12 +120,19 @@ body: |
     ; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
     ; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s8_align4
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
-    ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s8_align4
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s8_align4
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8), align 4)
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s8) = G_LOAD %0 :: (load (s8), align 4, addrspace 0)
     %2:_(s32) = G_ANYEXT %1
@@ -134,12 +159,19 @@ body: |
     ; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
     ; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s8_align1
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
-    ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s8_align1
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s8_align1
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s8))
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s8) = G_LOAD %0 :: (load (s8), align 1, addrspace 0)
     %2:_(s32) = G_ANYEXT %1
@@ -166,12 +198,19 @@ body: |
     ; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
     ; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s16_align4
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
-    ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s16_align4
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s16_align4
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 4)
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s16) = G_LOAD %0 :: (load (s16), align 4, addrspace 0)
     %2:_(s32) = G_ANYEXT %1
@@ -198,12 +237,19 @@ body: |
     ; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
     ; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s16_align2
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
-    ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s16_align2
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s16_align2
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16))
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s16) = G_LOAD %0 :: (load (s16), align 2, addrspace 0)
     %2:_(s32) = G_ANYEXT %1
@@ -242,18 +288,25 @@ body: |
     ; VI-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
     ; VI-NEXT: $vgpr0 = COPY [[OR]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s16_align1
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s8))
-    ; GFX9-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
-    ; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s8) from unknown-address + 1)
-    ; GFX9-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
-    ; GFX9-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
-    ; GFX9-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
-    ; GFX9-NEXT: $vgpr0 = COPY [[OR]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s16_align1
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s8))
+    ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+    ; GFX9PLUS-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s8) from unknown-address + 1)
+    ; GFX9PLUS-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 8
+    ; GFX9PLUS-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
+    ; GFX9PLUS-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[OR]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s16_align1
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s16), align 1)
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s16) = G_LOAD %0 :: (load (s16), align 1, addrspace 0)
     %2:_(s32) = G_ANYEXT %1
@@ -280,12 +333,19 @@ body: |
     ; VI-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
     ; VI-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s32_align4
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
-    ; GFX9-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s32_align4
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s32_align4
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32))
+    ; GFX11PLUS-NEXT: $vgpr0 = COPY [[LOAD]](s32)
     %0:_(p0) = COPY $vgpr0_vgpr1
     %1:_(s32) = G_LOAD %0 :: (load (s32), align 4, addrspace 0)
     $vgpr0 = COPY %1
@@ -323,18 +383,25 @@ body: |
     ; VI-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
     ; VI-NEXT: $vgpr0 = COPY [[OR]](s32)
     ;
-    ; GFX9-LABEL: name: test_load_flat_s32_align2
-    ; GFX9: liveins: $vgpr0_vgpr1
-    ; GFX9-NEXT: {{  $}}
-    ; GFX9-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
-    ; GFX9-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s16))
-    ; GFX9-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
-    ; GFX9-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
-    ; GFX9-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s16) from unknown-address + 2)
-    ; GFX9-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
-    ; GFX9-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
-    ; GFX9-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
-    ; GFX9-NEXT: $vgpr0 = COPY [[OR]](s32)
+    ; GFX9PLUS-LABEL: name: test_load_flat_s32_align2
+    ; GFX9PLUS: liveins: $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: {{  $}}
+    ; GFX9PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX9PLUS-NEXT: [[ZEXTLOAD:%[0-9]+]]:_(s32) = G_ZEXTLOAD [[COPY]](p0) :: (load (s16))
+    ; GFX9PLUS-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+    ; GFX9PLUS-NEXT: [[PTR_ADD:%[0-9]+]]:_(p0) = G_PTR_ADD [[COPY]], [[C]](s64)
+    ; GFX9PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[PTR_ADD]](p0) :: (load (s16) from unknown-address + 2)
+    ; GFX9PLUS-NEXT: [[C1:%[0-9]+]]:_(s32) = G_CONSTANT i32 16
+    ; GFX9PLUS-NEXT: [[SHL:%[0-9]+]]:_(s32) = G_SHL [[LOAD]], [[C1]](s32)
+    ; GFX9PLUS-NEXT: [[OR:%[0-9]+]]:_(s32) = G_OR [[SHL]], [[ZEXTLOAD]]
+    ; GFX9PLUS-NEXT: $vgpr0 = COPY [[OR]](s32)
+    ;
+    ; GFX11PLUS-LABEL: name: test_load_flat_s32_align2
+    ; GFX11PLUS: liveins: $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: {{  $}}
+    ; GFX11PLUS-NEXT: [[COPY:%[0-9]+]]:_(p0) = COPY $vgpr0_vgpr1
+    ; GFX11PLUS-NEXT: [[LOAD:%[0-9]+]]:_(s32) = G_LOAD [[COPY]](p0) :: (load (s32),...
[truncated]

@rampitec (Collaborator) left a comment:

LGTM

bool AlignedBy4 = Alignment >= Align(4);
if (IsFast)
  *IsFast = AlignedBy4;

-    return AlignedBy4;
+    return AlignedBy4 || Subtarget->enableFlatScratch() ||
Contributor:

enableFlatScratch shouldn't be checked here; the scratch instructions should still respect whether the unaligned access mode is enabled.

Member Author:

Note that we don't do that in the current trunk state: see the old line 1840. Removing Subtarget->enableFlatScratch() from the OR would make code generation more pessimistic for cases where unaligned scratch is not explicitly enabled but flat scratch is, e.g., gfx12. Should we change that anyway?

Contributor:

I think this is just a bug. Whether we should assume unaligned access is enabled is a question for the driver (it really should just always be enabled, but I'm not sure it is). The place to fix that is somewhere else, not in these low-level usage checks.

Member Author:

Do you mean (1) that Subtarget->enableFlatScratch() should be dropped from the OR, or (2) that it should become (Subtarget->enableFlatScratch() && Subtarget->hasUnalignedAccessMode()) (i.e., it would then depend on whether the mode bit is set, but would not require the UnalignedScratchAccess feature)?

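For concreteness, the two alternatives would amount to roughly the following conditions. These are hypothetical sketches with plain boolean parameters standing in for the subtarget queries; neither is code from the patch.

// Option (1): drop enableFlatScratch() from the OR entirely.
bool optionDropFlatScratch(bool AlignedBy4, bool UnalignedScratchEnabled) {
  return AlignedBy4 || UnalignedScratchEnabled;
}

// Option (2): keep a flat-scratch clause, but gate it on the
// unaligned-access mode bit instead of FeatureUnalignedScratchAccess.
bool optionGateOnModeBit(bool AlignedBy4, bool UnalignedScratchEnabled,
                         bool FlatScratch, bool UnalignedAccessMode) {
  return AlignedBy4 || (FlatScratch && UnalignedAccessMode) ||
         UnalignedScratchEnabled;
}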
Contributor:

Dropped from the OR. The unaligned access mode is orthogonal to which scratch ABI is in use.

IIRC the difficulty is that Linux always enables unaligned access and Windows does not. As an incorrect way of dealing with this, the subtarget constructor assumes unaligned access is enabled for amdhsa (which will be wrong if executed on the compatibility layer on top of PAL on Windows) and not for amdpal. I don't remember what we do for unknown / mesa3d.

Member Author:

Done in 2e43c4c.

-    return (Alignment >= 4 || ST->hasUnalignedScratchAccess()) &&
-      ChainSizeInBytes <= ST->getMaxPrivateElementSize();
+    return (Alignment >= 4 || ST->hasUnalignedScratchAccessEnabled() ||
+            ST->enableFlatScratch()) &&
Contributor:

Same here

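Assuming the same resolution is applied here as well (enableFlatScratch() dropped from the condition), the vectorization-legality check for private memory would reduce to something like the sketch below. Plain parameters stand in for the subtarget queries, and the function name is illustrative, not the actual LLVM signature.

// Sketch of GCNTTIImpl::isLegalToVectorizeMemChain for the private
// address space with enableFlatScratch() dropped; illustrative signature.
bool isLegalToVectorizePrivateChain(unsigned ChainSizeInBytes,
                                    unsigned AlignmentInBytes,
                                    bool UnalignedScratchEnabled,
                                    unsigned MaxPrivateElementSize) {
  return (AlignmentInBytes >= 4 || UnalignedScratchEnabled) &&
         ChainSizeInBytes <= MaxPrivateElementSize;
}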
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: scratch_load_ubyte v4, v1, off glc
; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: scratch_load_ubyte v4, v7, off glc
Contributor:

These tests should probably be switched to using an amdhsa triple in a precommit, so we still end up using the aligned accesses. We probably want dedicated aligned and unaligned cases in the future.

Member Author:

I opened #110672 to do that.

ritter-x2a added a commit to ritter-x2a/llvm-project that referenced this pull request Oct 1, 2024
As a proxy criterion, mesa targets have unaligned-access-mode (which
determines whether the hardware allows unaligned memory accesses) not
set whereas amdhsa targets do. This PR changes tests to use amdhsa
instead of mesa and inserts additional checks with unaligned-access-mode
unset explicitly.

This is in preparation for PR llvm#110219, which will generate different
code depending on the unaligned-access-mode.
ritter-x2a added a commit that referenced this pull request Oct 1, 2024
…110672)

As a proxy criterion, mesa targets have unaligned-access-mode (which
determines whether the hardware allows unaligned memory accesses) not
set whereas amdhsa targets do. This PR changes tests to use amdhsa
instead of mesa and inserts additional checks with unaligned-access-mode
unset explicitly.

This is in preparation for PR #110219, which will generate different
code depending on the unaligned-access-mode.
@ritter-x2a (Member Author):

Rebased on top of the merged PR #110672 to update the GlobalISel tests.

; GFX12-NEXT: s_wait_storecnt 0x0
; GFX12-NEXT: scratch_load_b64 v[0:1], v0, off scope:SCOPE_SYS
; GFX12-NEXT: scratch_store_b8 v0, v2, off offset:1 scope:SCOPE_SYS
Contributor:

These cases shouldn't have been broken up? This is the unaligned enabled path?

Member Author:

That's because this PR adds the FeatureUnalignedScratchAccess feature only to gfx9-11. I'd need to check how and when to update it for gfx12.

Contributor:

I don't think there was any change here in gfx12.

VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this pull request Oct 2, 2024 (…lvm#110672)

VitaNuo pushed a commit to VitaNuo/llvm-project that referenced this pull request Oct 2, 2024 (…lvm#110672)

Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Oct 3, 2024 (…lvm#110672)

xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024 (…lvm#110672)
This allows us to emit wide generic and scratch memory accesses when we do not
have alignment information. In cases where accesses happen to be properly
aligned or where generic accesses do not go to scratch memory, this improves
performance of the generated code by a factor of up to 16x and reduces code
size, especially when lowering memcpy and memmove intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: Code has already assumed that unaligned accesses with the
specialized flat scratch instructions are allowed independent of
FeatureUnalignedScratchAccess at some places. This patch always uses this
interpretation.

Part of SWDEV-455845.
make flat scratch not imply that unaligned scratch accesses are valid
add FeatureUnalignedScratchAccess to gfx12
ritter-x2a merged commit 173c682 into llvm:main on Oct 11, 2024
8 checks passed
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024

bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024