[mlir][gpu] Add address space modifier to Barrier #110527

Open · wants to merge 1 commit into base: main

Conversation

Contributor

@FMarno FMarno commented Sep 30, 2024

Add an address space modifier to the GPU Barrier. All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

I've put up an RFC here (with a bit of a bad start).

@llvmbot
Collaborator

llvmbot commented Sep 30, 2024

@llvm/pr-subscribers-mlir

Author: Finlay (FMarno)

Changes

Add an address space modifier to the GPU Barrier. All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

I've put up an RFC here (with a bit of a bad start).


Full diff: https://github.com/llvm/llvm-project/pull/110527.diff

8 Files Affected:

  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUBase.td (+2)
  • (modified) mlir/include/mlir/Dialect/GPU/IR/GPUOps.td (+17-2)
  • (modified) mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp (+24-5)
  • (modified) mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td (+1-1)
  • (modified) mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td (+1-1)
  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+3)
  • (modified) mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir (+17-2)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+6)
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
index 860f8933672038..ccb1678aef9192 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -99,6 +99,8 @@ def GPU_AddressSpaceEnum : GPU_I32Enum<
 def GPU_AddressSpaceAttr :
   GPU_I32EnumAttr<"address_space", GPU_AddressSpaceEnum>;
 
+def GPU_AddressSpaceAttrArray : TypedArrayAttrBase<GPU_AddressSpaceAttr, "GPU Address Space array">;
+
 //===----------------------------------------------------------------------===//
 // GPU Types.
 //===----------------------------------------------------------------------===//
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 6098eb34d04d52..9d89068c72969b 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1355,7 +1355,8 @@ def GPU_ShuffleOp : GPU_Op<
   ];
 }
 
-def GPU_BarrierOp : GPU_Op<"barrier"> {
+def GPU_BarrierOp : GPU_Op<"barrier">,
+    Arguments<(ins OptionalAttr<GPU_AddressSpaceAttrArray> :$address_spaces)> {
   let summary = "Synchronizes all work items of a workgroup.";
   let description = [{
     The "barrier" op synchronizes all work items of a workgroup. It is used
@@ -1371,11 +1372,25 @@ def GPU_BarrierOp : GPU_Op<"barrier"> {
     accessing the same memory can be avoided by synchronizing work items
     in-between these accesses.
 
+    The address space of visible memory accesses can be modified by adding a
+    list of address spaces required to be visible. By default all address spaces
+    are included.
+
+    ```mlir
+    // only workgroup address spaces accesses required to be visible
+    gpu.barrier memfence [#gpu.address_space<workgroup>]
+    // no memory accesses required to be visible
+    gpu.barrier memfence []
+    // all memory accesses required to be visible
+    gpu.barrier
+    ```
+
     Either none or all work items of a workgroup need to execute this op
     in convergence.
   }];
-  let assemblyFormat = "attr-dict";
+  let assemblyFormat = "(`memfence` $address_spaces^)? attr-dict";
   let hasCanonicalizer = 1;
+  let builders = [OpBuilder<(ins)>];
 }
 
 def GPU_GPUModuleOp : GPU_Op<"module", [
diff --git a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
index 739a34e0aa610e..f9e8e397f93f27 100644
--- a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
+++ b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
@@ -116,12 +116,31 @@ struct GPUBarrierConversion final : ConvertOpToLLVMPattern<gpu::BarrierOp> {
         lookupOrCreateSPIRVFn(moduleOp, funcName, flagTy, voidTy,
                               /*isMemNone=*/false, /*isConvergent=*/true);
 
-    // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE`.
-    // See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
-    constexpr int64_t localMemFenceFlag = 1;
+    // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE` and
+    // `CLK_GLOBAL_MEM_FENCE`. See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
+    constexpr int32_t localMemFenceFlag = 1;
+    constexpr int32_t globalMemFenceFlag = 2;
+    int32_t memFenceFlag = 0;
+    std::optional<ArrayAttr> addressSpaces = adaptor.getAddressSpaces();
+    if (addressSpaces) {
+      for (Attribute attr : addressSpaces.value()) {
+        auto addressSpace = cast<gpu::AddressSpaceAttr>(attr).getValue();
+        switch (addressSpace) {
+        case gpu::AddressSpace::Global:
+          memFenceFlag = memFenceFlag | globalMemFenceFlag;
+          break;
+        case gpu::AddressSpace::Workgroup:
+          memFenceFlag = memFenceFlag | localMemFenceFlag;
+          break;
+        case gpu::AddressSpace::Private:
+          break;
+        }
+      }
+    } else {
+      memFenceFlag = localMemFenceFlag | globalMemFenceFlag;
+    }
     Location loc = op->getLoc();
-    Value flag =
-        rewriter.create<LLVM::ConstantOp>(loc, flagTy, localMemFenceFlag);
+    Value flag = rewriter.create<LLVM::ConstantOp>(loc, flagTy, memFenceFlag);
     rewriter.replaceOp(op, createSPIRVBuiltinCall(loc, rewriter, func, flag));
     return success();
   }
diff --git a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
index f513bb1a0a8265..0fcda38631a9b0 100644
--- a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
+++ b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
 include "mlir/Dialect/GPU/IR/GPUOps.td"
 include "mlir/Dialect/LLVMIR/NVVMOps.td"
 
-def : Pat<(GPU_BarrierOp), (NVVM_Barrier0Op)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (NVVM_Barrier0Op)>;
 
 #endif // MLIR_CONVERSION_GPUTONVVM_TD
diff --git a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
index 8d2f30a9a16835..d3bb7748134374 100644
--- a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
+++ b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
 include "mlir/Dialect/GPU/IR/GPUOps.td"
 include "mlir/Dialect/LLVMIR/ROCDLOps.td"
 
-def : Pat<(GPU_BarrierOp), (ROCDL_BarrierOp)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (ROCDL_BarrierOp)>;
 
 #endif // MLIR_CONVERSION_GPUTOROCDL_TD
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 956877497d9338..156d6b8fe15951 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1351,6 +1351,9 @@ void BarrierOp::getCanonicalizationPatterns(RewritePatternSet &results,
   results.add(eraseRedundantGpuBarrierOps);
 }
 
+void BarrierOp::build(mlir::OpBuilder &odsBuilder,
+                      mlir::OperationState &odsState) {}
+
 //===----------------------------------------------------------------------===//
 // GPUFuncOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
index 910105ddf69586..4767565ea05501 100644
--- a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
+++ b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
@@ -213,14 +213,29 @@ gpu.module @barriers {
 
   // CHECK-LABEL: gpu_barrier
   func.func @gpu_barrier() {
-    // CHECK:         [[FLAGS:%.*]] = llvm.mlir.constant(1 : i32) : i32
-    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[FLAGS]]) {
+    // CHECK:         [[GLOBAL_AND_LOCAL_FLAG:%.*]] = llvm.mlir.constant(3 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG]]) {
     // CHECK-SAME-DAG:  no_unwind
     // CHECK-SAME-DAG:  convergent
     // CHECK-SAME-DAG:  will_return
     // CHECK-NOT:       memory_effects = #llvm.memory_effects
     // CHECK-SAME:    } : (i32) -> ()
     gpu.barrier
+    // CHECK:         [[GLOBAL_AND_LOCAL_FLAG2:%.*]] = llvm.mlir.constant(3 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG2]])
+    gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+    // CHECK:         [[LOCAL_FLAG:%.*]] = llvm.mlir.constant(1 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[LOCAL_FLAG]])
+    gpu.barrier memfence [#gpu.address_space<workgroup>]
+    // CHECK:         [[GLOBAL_FLAG:%.*]] = llvm.mlir.constant(2 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[GLOBAL_FLAG]])
+    gpu.barrier memfence [#gpu.address_space<global>]
+    // CHECK:         [[NONE_FLAG:%.*]] = llvm.mlir.constant(0 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG]])
+    gpu.barrier memfence []
+    // CHECK:         [[NONE_FLAG2:%.*]] = llvm.mlir.constant(0 : i32) : i32
+    // CHECK:         llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG2]])
+    gpu.barrier memfence [#gpu.address_space<private>]
     return
   }
 }
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index b9c0a0e79e8f2a..2bba66f786f189 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -141,6 +141,12 @@ module attributes {gpu.container_module} {
       %shfl3, %pred3 = gpu.shuffle idx %arg0, %offset, %width : f32
 
       "gpu.barrier"() : () -> ()
+      gpu.barrier
+      gpu.barrier memfence [#gpu.address_space<workgroup>]
+      gpu.barrier memfence [#gpu.address_space<global>]
+      gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+      gpu.barrier memfence [#gpu.address_space<private>]
+      gpu.barrier memfence []
 
       "some_op"(%bIdX, %tIdX) : (index, index) -> ()
       %42 = memref.load %arg1[%bIdX] : memref<?xf32, 1>

@llvmbot
Collaborator

llvmbot commented Sep 30, 2024

@llvm/pr-subscribers-mlir-gpu

Author: Finlay (FMarno)

(Same description and full diff as in the @llvm/pr-subscribers-mlir notification above.)

Member

@kuhar kuhar left a comment

I don't understand the new semantics as described in the op documentation.

Comment on lines +1375 to +1377
The address space of visible memory accesses can be modified by adding a
list of address spaces required to be visible. By default all address spaces
are included.
Member

What does it mean for memory to be visible? What memory accesses is this referencing?

Contributor

I might suggest "completed" or "committed" as synonyms here, if I've understood the semantics right.

That is, any memory operations involving the given memory spaces that were issued before the barrier must have their results reflected after the barrier completes.

Or, put another way, any operation touching the listed memory spaces must happen-before any operation on those memory spaces after the barrier.

(This would allow us to not necessarily have amdgpu.lds_barrier, since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>]; see the sketch below.)
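A minimal sketch of that substitution, assuming the `memfence` syntax introduced by this patch (`amdgpu.lds_barrier` is the existing AMDGPU dialect op):

```mlir
// Today: a dedicated op for an LDS-only (workgroup memory) barrier.
amdgpu.lds_barrier
// With this patch, the same intent could be written on gpu.barrier directly.
gpu.barrier memfence [#gpu.address_space<workgroup>]
```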

Contributor Author

I'm not trying to say that the memory is visible, but that the memory accesses are visible. I think it makes sense when paired with the previous paragraph:

and all memory accesses made by these work items prior to the op are
visible to all work items in the workgroup.

I think what @krzysz00 is saying matches.

I'm open to suggestions on how to make it more clear.

Member

@kuhar kuhar Sep 30, 2024

Some good points there, @krzysz00. I only know the amdgpu implementation and that's why I'd like to make sure the semantics make sense for everyone in a portable way.

The way I understand it, it's the backend compiler's job to track any memory dependencies / synchronization within the same subgroup, and we only need gpu.barrier (the plain version) when the dependency can be across subgroups.

(This would allow us to not necessarily have amdgpu.lds_barrier, since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>])

I don't see how this aligns with the stated goal in the RFC:

This could reduce the cost of synchronization.

All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

lds_barrier is strictly more work than just gpu.barrier, no? It's s_barrier and 'please flush the shared memory fifo'. At least that's what C++ libraries do: https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/utility/synchronization.hpp#L20-L25.

Maybe I'm missing something.

Contributor

gpu.barrier is stronger than amdgpu.lds_barrier. The current semantics of gpu.barrier are not AMD's s_barrier; they're (atomic fence) + s_barrier.

(I think there are even explicit calls to LLVM memory fencing in the AMD lowering path for gpu.barrier.)

That is, gpu.barrier imposes a happens-before over all GPU memory by default, and this patch lets people loosen that.

Contributor

(and those libraries are implementing amdgpu.lds_barrier. If they wanted to match gpu.barrier, they'd be waiting on vmcnt(0) as well.)

Member

I'm not trying to say that the memory is visible, but that the memory accesses are visible.

I don't know what this means either. I don't want to be pedantic here, but this is just not the wording I'm used to / have seen in this context. I did some searching and found this thread that talks about something adjacent: memory access availability vs. visibility https://community.khronos.org/t/difference-between-availability-and-visibility/7401/2.

I'm open to suggestions on how to make it more clear.

Maybe something like this (I reworded a few places):

    The "barrier" op synchronizes all work items of a workgroup. It is used
    to coordinate communication between the work items of the workgroup.
    `gpu.barrier` waits until all work items in the workgroup have reached this point
    and all memory accesses made by these work items prior to the op have
    completed and are visible to all work items in the workgroup.
    
    Data races between work items accessing the same memory locations can
    be avoided by synchronizing work items in-between these accesses.

    By default, all address spaces are included in `gpu.barrier`. This is equivalent to:
    ```mlir
    gpu.barrier memfence [#gpu.address_space<global>,
                          #gpu.address_space<workgroup>,
                          #gpu.address_space<private>]
    ```
    
    The `memfence` argument weakens the synchronization requirements such that
    only the listed address spaces need to have their accesses completed and visible across the
    workgroup. 

Contributor Author

I think that description basically covers it, but I would move a couple of things around. Also, I wouldn't say that memfence necessarily weakens the memory fencing requirement, but it can be used for that.

@fabianmcg fabianmcg self-requested a review September 30, 2024 17:06
Contributor

@fabianmcg fabianmcg left a comment

I'm preemptively blocking here because I think there has not been enough attention on the RFC.

I think this PR is conflating memory fences and thread barriers. As @krzysz00 said, gpu.barrier is stronger than a memory fence, and currently everyone expects gpu.barrier to behave as a thread barrier.

I think a better approach might be needed, e.g. adding a new op such as gpu.memfence.
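A hypothetical sketch of that alternative (gpu.memfence is not an existing op; its name and syntax here are illustrative only):

```mlir
// Orders memory accesses to the listed address spaces across the workgroup,
// without requiring threads to wait for each other.
gpu.memfence [#gpu.address_space<workgroup>]
// Full thread barrier with its existing sync + fence semantics, unchanged.
gpu.barrier
```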

@krzysz00
Contributor

@fabianmcg I would say more accurately that MLIR's gpu.barrier is currently defined to include a memory fence, so this is a proposed weakening

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

@fabianmcg I would say more accurately that MLIR's gpu.barrier is currently defined to include a memory fence, so this is a proposed weakening

I agree: thread barriers, including gpu.barrier, usually imply memory fences; however, their semantics are stronger, and people expect that by definition.

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

@kuhar
Member

kuhar commented Sep 30, 2024

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

This would be a useful primitive to have IMO. I'd expect these to not wait for all threads to reach the same program point, right? The use case I have in mind is to have more fine-grained control over the number of outstanding memory accesses within a subgroup (i.e., 'please flush this fifo for me').

@krzysz00
Contributor

... I take back some of what I said earlier. On AMDGPU/ROCDL, the lowering of gpu.barrier comes down to:

```tablegen
def ROCDL_BarrierOp : ROCDL_Op<"barrier"> {
  string llvmBuilder = [{
    llvm::LLVMContext &llvmContext = builder.getContext();
    builder.CreateFence(llvm::AtomicOrdering::Release,
                        llvmContext.getOrInsertSyncScopeID("workgroup"));
    createIntrinsicCall(builder, llvm::Intrinsic::amdgcn_s_barrier);
    builder.CreateFence(llvm::AtomicOrdering::Acquire,
                        llvmContext.getOrInsertSyncScopeID("workgroup"));
  }];
  let assemblyFormat = "attr-dict";
}
```

which in theory should just be an LDS fence, but I think there are standing compiler issues that make it not that.
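For readers less familiar with the LLVM builder API: rendered in MLIR's LLVM dialect purely for illustration, the fences that llvmBuilder creates around the s_barrier intrinsic look roughly like this (the actual lowering emits LLVM IR directly during translation, so this MLIR form never exists in the pipeline):

```mlir
llvm.fence syncscope("workgroup") release
// ... call to the amdgcn s_barrier intrinsic happens here ...
llvm.fence syncscope("workgroup") acquire
```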

@kuhar
Member

kuhar commented Sep 30, 2024

That's why I think a memfence operation might be better, as it makes clear that barrier and memfence have different semantics.

This would be a useful primitive to have IMO. I'd expect these to not wait for all threads to reach the same program point, right? The use case I have in mind is to have more fine-grained control over the number of outstanding memory accesses within a subgroup (i.e., 'please flush this fifo for me').

I.e., ideally, I'd like to have separate ops for waiting for all threads to reach the same execution point (say, gpu.synchronize) and for specifying fences (say, gpu.memfence). Then I'd imagine gpu.barrier being defined as doing both (with the maximal memory fence scope).
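A hypothetical sketch of that decomposition (neither gpu.synchronize nor gpu.memfence exists today; names and syntax are illustrative only):

```mlir
// gpu.barrier would then be shorthand for a thread barrier plus a maximal fence:
gpu.synchronize
gpu.memfence [#gpu.address_space<global>,
              #gpu.address_space<workgroup>,
              #gpu.address_space<private>]
```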

@krzysz00
Contributor

I think I agree with you.

Would this definition make gpu.barrier correspond to HIP/CUDA's __syncthreads()? I have a suspicion that that was the original semantics of gpu.barrier.

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

I'd expect these to not wait for all threads to reach the same program point, right?

A fence never makes thread sync guarantees, only memory ordering guarantees (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions). This precisely highlights the issue with this change: those are separate concepts.

Would this definition make gpu.barrier correspond to HIP/CUDA's __syncthreads()? Since I have a suspicion that that was the original semantics of gpu.barrier

Yes, __syncthreads = gpu.barrier.

IE ideally, I'd like to have separate ops for waiting for all threads to reach the execution point (say, gpu.synchronize) and for specifying fences (say, gpu.memfence). Then I'd imagine for gpu.barrier to be defined as doing both (with the maximal memory fence scope).

I'm inclined to keep barrier as it is and only add memfence. The name barrier is usually used by other parallel programming models to mean thread barriers, e.g. https://www.openmp.org/spec-html/5.0/openmpsu90.html

@grypp
Member

grypp commented Sep 30, 2024

I'm not sure I understood the op semantics. Can we please see what the nvvm lowering will look like?

Introducing a specific memfence op would be fine, but we need an RFC if you want to make something complex.

@kuhar
Member

kuhar commented Sep 30, 2024

I'm inclined to keep barrier as it is and only add memfence. The name barrier is usually used by other parallel programming models to mean thread barriers, e.g. https://www.openmp.org/spec-html/5.0/openmpsu90.html

Which precisely highlights the issue with this change, those are separate concepts.

Yes, __syncthreads = gpu.barrier.

At least on amdgpu, __syncthreads is barrier + shared memory fence: https://github.com/ROCm/clr/blob/939c7887793f8280a3196cebc81ba1d07743f068/hipamd/include/hip/amd_detail/amd_device_functions.h#L767-L790.

@FMarno I think it would be very useful to have a survey of related functions/intrinsics across APIs (cuda, amdgpu, opencl, spirv, etc.) to see what semantics other frameworks have adopted. These may not be as portable and separable as I initially imagined (I'm far from being an expert here). Would you be able to do that? We could continue it on the RFC thread.

@fabianmcg
Contributor

fabianmcg commented Sep 30, 2024

At least on amdgpu, __syncthreads is barrier + shared memory fence: https://github.com/ROCm/clr/blob/939c7887793f8280a3196cebc81ba1d07743f068/hipamd/include/hip/amd_detail/amd_device_functions.h#L767-L790.

Yes, that's also the case for NVIDIA: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#barrier-and-memory-fence
and matches the description of the Op https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td#L1361-L1372

What I was referring to is that, in the context of the gpu dialect, gpu.barrier = __syncthreads.

@kuhar
Member

kuhar commented Sep 30, 2024

What I was referring to is that, in the context of the gpu dialect, gpu.barrier = __syncthreads.

Right, so if the goal is to make it weaker, it seems useful to decouple synchronization from fences. Seems like this is what SPIR-V decided to do: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_barrier_instructions

@fabianmcg
Contributor

it seems useful to decouple synchronization from fences.

I would argue that it would be useful to add new ops where those semantics are decoupled, allowing more fine-grained ops to be issued, rather than changing barrier, since people expect the sync + fence implications. However, that can be discussed in the RFC.

@FMarno
Contributor Author

FMarno commented Oct 1, 2024

I've added some clarifying comments to the RFC; maybe it's easier to continue the discussion there.
