[mlir][gpu] Add address space modifier to Barrier #110527

Open · wants to merge 1 commit into base: main

Conversation

Contributor

@FMarno FMarno commented Sep 30, 2024

Add an address space modifier to the GPU Barrier. All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

I've put up an RFC here (with a bit of a bad start).

@llvmbot
Collaborator

llvmbot commented Sep 30, 2024

@llvm/pr-subscribers-mlir

Author: Finlay (FMarno)

Changes

Add an address space modifier to the GPU Barrier. All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

I've put up an RFC here (with a bit of a bad start).


Full diff: https://github.com/llvm/llvm-project/pull/110527.diff

8 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUBase.td (+2)
  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+17-2)
  • (modified) mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp (+24-5)
  • (modified) mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td (+1-1)
  • (modified) mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td (+1-1)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+3)
  • (modified) mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir (+17-2)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+6)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
index 860f8933672038..ccb1678aef9192 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -99,6 +99,8 @@ def GPU_AddressSpaceEnum : GPU_I32Enum<
 def GPU_AddressSpaceAttr :
   GPU_I32EnumAttr<"address_space", GPU_AddressSpaceEnum>;
 
+def GPU_AddressSpaceAttrArray : TypedArrayAttrBase<GPU_AddressSpaceAttr, "GPU Address Space array">;
+
 //===----------------------------------------------------------------------===//
 // GPU Types.
 //===----------------------------------------------------------------------===//
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 6098eb34d04d52..9d89068c72969b 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1355,7 +1355,8 @@ def GPU_ShuffleOp : GPU_Op<
   ];
 }
 
-def GPU_BarrierOp : GPU_Op<"barrier"> {
+def GPU_BarrierOp : GPU_Op<"barrier">,
+    Arguments<(ins OptionalAttr<GPU_AddressSpaceAttrArray> :$address_spaces)> {
   let summary = "Synchronizes all work items of a workgroup.";
   let description = [{
     The "barrier" op synchronizes all work items of a workgroup. It is used
@@ -1371,11 +1372,25 @@ def GPU_BarrierOp : GPU_Op<"barrier"> {
     accessing the same memory can be avoided by synchronizing work items
     in-between these accesses.
 
+    The address space of visible memory accesses can be modified by adding a
+    list of address spaces required to be visible. By default all address spaces
+    are included.
+
+    ```mlir
+    // only workgroup address spaces accesses required to be visible
+    gpu.barrier memfence [#gpu.address_space<workgroup>]
+    // no memory accesses required to be visible
+    gpu.barrier memfence []
+    // all memory accesses required to be visible
+    gpu.barrier
+    ```
+
     Either none or all work items of a workgroup need to execute this op
     in convergence.
   }];
-  let assemblyFormat = "attr-dict";
+  let assemblyFormat = "(`memfence` $address_spaces^)? attr-dict";
   let hasCanonicalizer = 1;
+  let builders = [OpBuilder<(ins)>];
 }
 
 def GPU_GPUModuleOp : GPU_Op<"module", [
diff --git a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
index 739a34e0aa610e..f9e8e397f93f27 100644
--- a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
+++ b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
@@ -116,12 +116,31 @@ struct GPUBarrierConversion final : ConvertOpToLLVMPattern<gpu::BarrierOp> {
         lookupOrCreateSPIRVFn(moduleOp, funcName, flagTy, voidTy,
                               /*isMemNone=*/false, /*isConvergent=*/true);
 
-    // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE`.
-    // See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
-    constexpr int64_t localMemFenceFlag = 1;
+    // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE` and
+    // `CLK_GLOBAL_MEM_FENCE`. See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
+    constexpr int32_t localMemFenceFlag = 1;
+    constexpr int32_t globalMemFenceFlag = 2;
+    int32_t memFenceFlag = 0;
+    std::optional<ArrayAttr> addressSpaces = adaptor.getAddressSpaces();
+    if (addressSpaces) {
+      for (Attribute attr : addressSpaces.value()) {
+        auto addressSpace = cast<gpu::AddressSpaceAttr>(attr).getValue();
+        switch (addressSpace) {
+        case gpu::AddressSpace::Global:
+          memFenceFlag = memFenceFlag | globalMemFenceFlag;
+          break;
+        case gpu::AddressSpace::Workgroup:
+          memFenceFlag = memFenceFlag | localMemFenceFlag;
+          break;
+        case gpu::AddressSpace::Private:
+          break;
+        }
+      }
+    } else {
+      memFenceFlag = localMemFenceFlag | globalMemFenceFlag;
+    }
     Location loc = op->getLoc();
-    Value flag =
-        rewriter.create<LLVM::ConstantOp>(loc, flagTy, localMemFenceFlag);
+    Value flag = rewriter.create<LLVM::ConstantOp>(loc, flagTy, memFenceFlag);
     rewriter.replaceOp(op, createSPIRVBuiltinCall(loc, rewriter, func, flag));
     return success();
   }
diff --git a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
index f513bb1a0a8265..0fcda38631a9b0 100644
--- a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
+++ b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
 include "mlir/Dialect/GPU/IR/GPUOps.td"
 include "mlir/Dialect/LLVMIR/NVVMOps.td"
 
-def : Pat<(GPU_BarrierOp), (NVVM_Barrier0Op)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (NVVM_Barrier0Op)>;
 
 #endif // MLIR_CONVERSION_GPUTONVVM_TD
diff --git a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
index 8d2f30a9a16835..d3bb7748134374 100644
--- a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
+++ b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
 include "mlir/Dialect/GPU/IR/GPUOps.td"
 include "mlir/Dialect/LLVMIR/ROCDLOps.td"
 
-def : Pat<(GPU_BarrierOp), (ROCDL_BarrierOp)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (ROCDL_BarrierOp)>;
 
 #endif // MLIR_CONVERSION_GPUTOROCDL_TD
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 956877497d9338..156d6b8fe15951 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1351,6 +1351,9 @@ void BarrierOp::getCanonicalizationPatterns(RewritePatternSet &results,
   results.add(eraseRedundantGpuBarrierOps);
 }
 
+void BarrierOp::build(mlir::OpBuilder &odsBuilder,
+                      mlir::OperationState &odsState) {}
+
 //===----------------------------------------------------------------------===//
 // GPUFuncOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
index 910105ddf69586..4767565ea05501 100644
--- a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
+++ b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
@@ -213,14 +213,29 @@ gpu.module @barriers {
 
   // CHECK-LABEL: gpu_barrier
   func.func @gpu_barrier() {
-    // CHECK:         [[FLAGS:%.*]] = llvm.mlir.constant(1 : i32) : i32
-    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[FLAGS]]) {
+    // CHECK:         [[GLOBAL_AND_LOCAL_FLAG:%.*]] = llvm.mlir.constant(3 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG]]) {
     // CHECK-SAME-DAG:  no_unwind
     // CHECK-SAME-DAG:  convergent
     // CHECK-SAME-DAG:  will_return
     // CHECK-NOT:       memory_effects = #llvm.memory_effects
     // CHECK-SAME:    } : (i32) -> ()
     gpu.barrier
+    // CHECK:         [[GLOBAL_AND_LOCAL_FLAG2:%.*]] = llvm.mlir.constant(3 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG2]])
+    gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+    // CHECK:         [[LOCAL_FLAG:%.*]] = llvm.mlir.constant(1 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[LOCAL_FLAG]])
+    gpu.barrier memfence [#gpu.address_space<workgroup>]
+    // CHECK:         [[GLOBAL_FLAG:%.*]] = llvm.mlir.constant(2 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_FLAG]])
+    gpu.barrier memfence [#gpu.address_space<global>]
+    // CHECK:         [[NONE_FLAG:%.*]] = llvm.mlir.constant(0 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG]])
+    gpu.barrier memfence []
+    // CHECK:         [[NONE_FLAG2:%.*]] = llvm.mlir.constant(0 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG2]])
+    gpu.barrier memfence [#gpu.address_space<private>]
     return
   }
 }
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index b9c0a0e79e8f2a..2bba66f786f189 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -141,6 +141,12 @@ module attributes {gpu.container_module} {
       %shfl3, %pred3 = gpu.shuffle idx %arg0, %offset, %width : f32
 
       "gpu.barrier"() : () -> ()
+      gpu.barrier
+      gpu.barrier memfence [#gpu.address_space<workgroup>]
+      gpu.barrier memfence [#gpu.address_space<global>]
+      gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+      gpu.barrier memfence [#gpu.address_space<private>]
+      gpu.barrier memfence []
 
       "some_op"(%bIdX, %tIdX) : (index, index) -> ()
       %42 = memref.load %arg1[%bIdX] : memref<?xf32, 1>

@llvmbot
Collaborator

llvmbot commented Sep 30, 2024

@llvm/pr-subscribers-mlir-gpu

Author: Finlay (FMarno)

(Same description and full diff as in the @llvm/pr-subscribers-mlir notification above.)

Member

@kuhar kuhar left a comment

I don't understand the new semantics as described in the op documentation.

Comment on lines +1375 to +1377
The address space of visible memory accesses can be modified by adding a
list of address spaces required to be visible. By default all address spaces
are included.
Member

What does it mean for memory to be visible? What memory accesses is this referencing?

Contributor

I might suggest "completed" or "committed" as synonyms here, if I've understood the semantics right.

That is, any memory operations involving the given memory spaces that were issued before the barrier must have their results reflected after the barrier completes.

Or, put another way, any operation touching the listed memory spaces must happen-before any operation on those memory spaces after the barrier.

(This would allow us to not necessarily have amdgpu.lds_barrier, since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>]; see the sketch below.)
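A minimal sketch of that substitution, assuming the `memfence` syntax introduced by this patch (`amdgpu.lds_barrier` is the existing AMDGPU dialect op):

```mlir
// Today: a dedicated op for an LDS-only (workgroup memory) barrier.
amdgpu.lds_barrier
// With this patch, the same intent could be written on gpu.barrier directly.
gpu.barrier memfence [#gpu.address_space<workgroup>]
```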

Contributor Author

I'm not trying to say that the memory is visible, but that the memory accesses are visible. I think it makes sense when paired with the previous paragraph:

and all memory accesses made by these work items prior to the op are
visible to all work items in the workgroup.

I think what @krzysz00 is saying matches.

I'm open to suggestions on how to make it more clear.

Member

@kuhar kuhar Sep 30, 2024

Some good points there, @krzysz00. I only know the amdgpu implementation and that's why I'd like to make sure the semantics make sense for everyone in a portable way.

The way I understand it, it's the backend compiler's job to track any memory dependencies / synchronization within the same subgroup, and we only need gpu.barrier (the plain version) when the dependency can be across subgroups.

(This would allow us to not necessarily have amdgpu.lds_barrier, since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>])

I don't see how this aligns with the stated goal in the RFC:

This could reduce the cost of synchronization.

All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

lds_barrier is strictly more work than just gpu.barrier, no? It's s_barrier and 'please flush the shared memory fifo'. At least that's what C++ libraries do: https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/utility/synchronization.hpp#L20-L25.

Maybe I'm missing something.

Contributor

gpu.barrier is stronger than amdgpu.lds_barrier. The current semantics of gpu.barrier are not AMD's s_barrier; they're (atomic fence) + s_barrier.

(I think there are even explicit calls to LLVM memory fencing in the AMD lowering path for gpu.barrier.)

That is, gpu.barrier imposes a happens-before over all GPU memory by default, and this patch lets people loosen that.

Contributor

(and those libraries are implementing amdgpu.lds_barrier. If they wanted to match gpu.barrier, they'd be waiting on vmcnt(0) as well.)

Member

I'm not trying to say that the memory is visible, but that the memory accesses are visible.

I don't know what this means either. I don't want to be pedantic here, but this is just not the wording I'm used to / have seen in this context. I did some searching and found this thread that talks about something adjacent: memory access availability vs. visibility https://community.khronos.org/t/difference-between-availability-and-visibility/7401/2.

I'm open to suggestions on how to make it more clear.

Maybe something like this (I reworded a few places):

    The "barrier" op synchronizes all work items of a workgroup. It is used
    to coordinate communication between the work items of the workgroup.
    `gpu.barrier` waits until all work items in the workgroup have reached this point
    and all memory accesses made by these work items prior to the op have
    completed and are visible to all work items in the workgroup.
    
    Data races between work items accessing the same memory locations can
    be avoided by synchronizing work items in-between these accesses.

    By default, all address spaces are included in `gpu.barrier`. This is equivalent to:
    ```mlir
    gpu.barrier memfence [#gpu.address_space<global>,
                          #gpu.address_space<workgroup>,
                          #gpu.address_space<private>]
    ```
    
    The `memfence` argument weakens the synchronization requirements such that
    only the listed address spaces need to have their accesses completed and visible across the
    workgroup. 

Contributor Author

I think that description basically covers it, but I would move a couple of things around. Also, I wouldn't say that memfence necessarily weakens the memory fencing requirement, but it can be used for that.

@fabianmcg fabianmcg self-requested a review September 30, 2024 17:06
Contributor

@fabianmcg fabianmcg left a comment

I'm preemptively blocking here because I think there has not been enough attention on the RFC.

I think this PR is conflating memory fences and thread barriers. As @krzysz00 said, gpu.barrier is stronger than a memory fence, and currently everyone expects gpu.barrier to behave as a thread barrier.

I think a better approach might be needed, e.g. adding a new op such as gpu.memfence.
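A hypothetical sketch of that alternative (gpu.memfence is not an existing op; its name and syntax here are illustrative only):

```mlir
// Orders memory accesses to the listed address spaces across the workgroup,
// without requiring threads to wait for each other.
gpu.memfence [#gpu.address_space<workgroup>]
// Full thread barrier with its existing sync + fence semantics, unchanged.
gpu.barrier
```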

@krzysz00
Contributor

@fabianmcg I would say more accurately that MLIR's gpu.barrier is currently defined to include a memory fence, so this is a proposed weakening

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

@fabianmcg I would say more accurately that MLIR's gpu.barrier is currently defined to include a memory fence, so this is a proposed weakening

I agree: thread barriers, including gpu.barrier, usually imply memory fences; however, their semantics are stronger, and people expect that by definition.

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

@kuhar
Member

kuhar commented Sep 30, 2024

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

This would be a useful primitive to have IMO. I'd expect these to not wait for all threads to reach the same program point, right? The use case I have in mind is to have more fine-grained control over the number of outstanding memory accesses within a subgroup (i.e., 'please flush this fifo for me').

@krzysz00
Contributor

... I take back some of what I said earlier. On AMDGPU/ROCDL, the lowering of gpu.barrier comes down to:

```tablegen
def ROCDL_BarrierOp : ROCDL_Op<"barrier"> {
  string llvmBuilder = [{
    llvm::LLVMContext &llvmContext = builder.getContext();
    builder.CreateFence(llvm::AtomicOrdering::Release,
                        llvmContext.getOrInsertSyncScopeID("workgroup"));
    createIntrinsicCall(builder, llvm::Intrinsic::amdgcn_s_barrier);
    builder.CreateFence(llvm::AtomicOrdering::Acquire,
                        llvmContext.getOrInsertSyncScopeID("workgroup"));
  }];
  let assemblyFormat = "attr-dict";
}
```

which in theory should just be an LDS fence, but I think there are standing compiler issues that make it not that.
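For readers less familiar with the LLVM builder API: rendered in MLIR's LLVM dialect purely for illustration, the fences that llvmBuilder creates around the s_barrier intrinsic look roughly like this (the actual lowering emits LLVM IR directly during translation, so this MLIR form never exists in the pipeline):

```mlir
llvm.fence syncscope("workgroup") release
// ... call to the amdgcn s_barrier intrinsic happens here ...
llvm.fence syncscope("workgroup") acquire
```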

@kuhar
Member

kuhar commented Sep 30, 2024

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

This would be a useful primitive to have IMO. I'd expect these to not wait for all threads to reach the same program point, right? The use case I have in mind is to have more fine-grained control over the number of outstanding memory accesses within a subgroup (i.e., 'please flush this fifo for me').

I.e., ideally, I'd like to have separate ops for waiting for all threads to reach the same execution point (say, gpu.synchronize) and for specifying fences (say, gpu.memfence). Then I'd imagine gpu.barrier being defined as doing both (with the maximal memory fence scope).
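A hypothetical sketch of that decomposition (neither gpu.synchronize nor gpu.memfence exists today; names and syntax are illustrative only):

```mlir
// gpu.barrier would then be shorthand for a thread barrier plus a maximal fence:
gpu.synchronize
gpu.memfence [#gpu.address_space<global>,
              #gpu.address_space<workgroup>,
              #gpu.address_space<private>]
```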

@krzysz00
Contributor

I think I agree with you.

Would this definition make gpu.barrier correspond to HIP/CUDA's __syncthreads()? I have a suspicion that that was the original semantics of gpu.barrier.

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

I'd expect these to not wait for all threads to reach the same program point, right?

A fence never makes thread sync guarantees, only memory ordering guarantees (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions). This precisely highlights the issue with this change: those are separate concepts.

Would this definition make gpu.barrier correspond to HIP/CUDA's __syncthreads()? Since I have a suspicion that that was the original semantics of gpu.barrier

Yes, __syncthreads = gpu.barrier.

IE ideally, I'd like to have separate ops for waiting for all threads to reach the execution point (say, gpu.synchronize) and for specifying fences (say, gpu.memfence). Then I'd imagine for gpu.barrier to be defined as doing both (with the maximal memory fence scope).

I'm inclined to keep barrier as it is and only add memfence. The name barrier is usually used by other parallel programming models to mean thread barriers, e.g. https://www.openmp.org/spec-html/5.0/openmpsu90.html

@grypp
Member

grypp commented Sep 30, 2024

I'm not sure I understood the op semantics. Can we please see what the nvvm lowering will look like?

Introducing a specific memfence op would be fine, but we need an RFC if you want to make something complex.

@kuhar
Member

kuhar commented Sep 30, 2024

I'm inclined to keep barrier as it is and only add memfence. The name barrier is usually used by other parallel programming models to mean thread barriers, e.g. https://www.openmp.org/spec-html/5.0/openmpsu90.html

Which precisely highlights the issue with this change, those are separate concepts.

Yes, __syncthreads = gpu.barrier.

At least on amdgpu, __syncthreads is barrier + shared memory fence: https://github.com/ROCm/clr/blob/939c7887793f8280a3196cebc81ba1d07743f068/hipamd/include/hip/amd_detail/amd_device_functions.h#L767-L790.

@FMarno I think it would be very useful to have a survey of related functions/intrinsics across APIs (cuda, amdgpu, opencl, spirv, etc.) to see what semantics other frameworks have adopted. These may not be as portable and separable as I initially imagined (I'm far from being an expert here). Would you be able to do that? We could continue it on the RFC thread.

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

At least on amdgpu, __syncthreads is barrier + shared memory fence: https://github.com/ROCm/clr/blob/939c7887793f8280a3196cebc81ba1d07743f068/hipamd/include/hip/amd_detail/amd_device_functions.h#L767-L790.

Yes, that's also the case for NVIDIA: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#barrier-and-memory-fence
and matches the description of the Op https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L1361-L1372

What I was referring to is that, in the context of the gpu dialect, gpu.barrier = __syncthreads.

@kuhar
Member

kuhar commented Sep 30, 2024

What I was referring to is that, in the context of the gpu dialect, gpu.barrier = __syncthreads.

Right, so if the goal is to make it weaker, it seems useful to decouple synchronization from fences. Seems like this is what SPIR-V decided to do: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_barrier_instructions

@fabianmcg
Contributor

it seems useful to decouple synchronization from fences.

I would argue that it would be useful to add new ops where those semantics are decoupled, allowing more fine-grained ops to be issued, rather than changing barrier, since people expect the sync + fence implications. However, that can be discussed in the RFC.

@FMarno
Contributor Author

FMarno commented Oct 1, 2024

I've added some clarifying comments to the RFC; maybe it's easier to continue the discussion there.
