[mlir][gpu] Add address space modifier to Barrier #110527
Conversation
@llvm/pr-subscribers-mlir

Author: Finlay (FMarno)

Changes

Add an address space modifier to the GPU Barrier. All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced. I've put up an RFC here (with a bit of a bad start).

Full diff: https://github.com/llvm/llvm-project/pull/110527.diff

8 Files Affected:
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
index 860f8933672038..ccb1678aef9192 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -99,6 +99,8 @@ def GPU_AddressSpaceEnum : GPU_I32Enum<
def GPU_AddressSpaceAttr :
GPU_I32EnumAttr<"address_space", GPU_AddressSpaceEnum>;
+def GPU_AddressSpaceAttrArray : TypedArrayAttrBase<GPU_AddressSpaceAttr, "GPU Address Space array">;
+
//===----------------------------------------------------------------------===//
// GPU Types.
//===----------------------------------------------------------------------===//
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 6098eb34d04d52..9d89068c72969b 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1355,7 +1355,8 @@ def GPU_ShuffleOp : GPU_Op<
];
}
-def GPU_BarrierOp : GPU_Op<"barrier"> {
+def GPU_BarrierOp : GPU_Op<"barrier">,
+ Arguments<(ins OptionalAttr<GPU_AddressSpaceAttrArray> :$address_spaces)> {
let summary = "Synchronizes all work items of a workgroup.";
let description = [{
The "barrier" op synchronizes all work items of a workgroup. It is used
@@ -1371,11 +1372,25 @@ def GPU_BarrierOp : GPU_Op<"barrier"> {
accessing the same memory can be avoided by synchronizing work items
in-between these accesses.
+ The address space of visible memory accesses can be modified by adding a
+ list of address spaces required to be visible. By default all address spaces
+ are included.
+
+ ```mlir
+ // only workgroup address spaces accesses required to be visible
+ gpu.barrier memfence [#gpu.address_space<workgroup>]
+ // no memory accesses required to be visible
+ gpu.barrier memfence []
+ // all memory accesses required to be visible
+ gpu.barrier
+ ```
+
Either none or all work items of a workgroup need to execute this op
in convergence.
}];
- let assemblyFormat = "attr-dict";
+ let assemblyFormat = "(`memfence` $address_spaces^)? attr-dict";
let hasCanonicalizer = 1;
+ let builders = [OpBuilder<(ins)>];
}
def GPU_GPUModuleOp : GPU_Op<"module", [
diff --git a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
index 739a34e0aa610e..f9e8e397f93f27 100644
--- a/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
+++ b/mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
@@ -116,12 +116,31 @@ struct GPUBarrierConversion final : ConvertOpToLLVMPattern<gpu::BarrierOp> {
lookupOrCreateSPIRVFn(moduleOp, funcName, flagTy, voidTy,
/*isMemNone=*/false, /*isConvergent=*/true);
- // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE`.
- // See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
- constexpr int64_t localMemFenceFlag = 1;
+ // Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE` and
+ // `CLK_GLOBAL_MEM_FENCE`. See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
+ constexpr int32_t localMemFenceFlag = 1;
+ constexpr int32_t globalMemFenceFlag = 2;
+ int32_t memFenceFlag = 0;
+ std::optional<ArrayAttr> addressSpaces = adaptor.getAddressSpaces();
+ if (addressSpaces) {
+ for (Attribute attr : addressSpaces.value()) {
+ auto addressSpace = cast<gpu::AddressSpaceAttr>(attr).getValue();
+ switch (addressSpace) {
+ case gpu::AddressSpace::Global:
+ memFenceFlag = memFenceFlag | globalMemFenceFlag;
+ break;
+ case gpu::AddressSpace::Workgroup:
+ memFenceFlag = memFenceFlag | localMemFenceFlag;
+ break;
+ case gpu::AddressSpace::Private:
+ break;
+ }
+ }
+ } else {
+ memFenceFlag = localMemFenceFlag | globalMemFenceFlag;
+ }
Location loc = op->getLoc();
- Value flag =
- rewriter.create<LLVM::ConstantOp>(loc, flagTy, localMemFenceFlag);
+ Value flag = rewriter.create<LLVM::ConstantOp>(loc, flagTy, memFenceFlag);
rewriter.replaceOp(op, createSPIRVBuiltinCall(loc, rewriter, func, flag));
return success();
}
diff --git a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
index f513bb1a0a8265..0fcda38631a9b0 100644
--- a/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
+++ b/mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
include "mlir/Dialect/GPU/IR/GPUOps.td"
include "mlir/Dialect/LLVMIR/NVVMOps.td"
-def : Pat<(GPU_BarrierOp), (NVVM_Barrier0Op)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (NVVM_Barrier0Op)>;
#endif // MLIR_CONVERSION_GPUTONVVM_TD
diff --git a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
index 8d2f30a9a16835..d3bb7748134374 100644
--- a/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
+++ b/mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
include "mlir/Dialect/GPU/IR/GPUOps.td"
include "mlir/Dialect/LLVMIR/ROCDLOps.td"
-def : Pat<(GPU_BarrierOp), (ROCDL_BarrierOp)>;
+def : Pat<(GPU_BarrierOp : $op $memory_fence), (ROCDL_BarrierOp)>;
#endif // MLIR_CONVERSION_GPUTOROCDL_TD
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index 956877497d9338..156d6b8fe15951 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1351,6 +1351,9 @@ void BarrierOp::getCanonicalizationPatterns(RewritePatternSet &results,
results.add(eraseRedundantGpuBarrierOps);
}
+void BarrierOp::build(mlir::OpBuilder &odsBuilder,
+ mlir::OperationState &odsState) {}
+
//===----------------------------------------------------------------------===//
// GPUFuncOp
//===----------------------------------------------------------------------===//
diff --git a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
index 910105ddf69586..4767565ea05501 100644
--- a/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
+++ b/mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
@@ -213,14 +213,29 @@ gpu.module @barriers {
// CHECK-LABEL: gpu_barrier
func.func @gpu_barrier() {
- // CHECK: [[FLAGS:%.*]] = llvm.mlir.constant(1 : i32) : i32
- // CHECK: llvm.call spir_funccc @_Z7barrierj([[FLAGS]]) {
+ // CHECK: [[GLOBAL_AND_LOCAL_FLAG:%.*]] = llvm.mlir.constant(3 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG]]) {
// CHECK-SAME-DAG: no_unwind
// CHECK-SAME-DAG: convergent
// CHECK-SAME-DAG: will_return
// CHECK-NOT: memory_effects = #llvm.memory_effects
// CHECK-SAME: } : (i32) -> ()
gpu.barrier
+ // CHECK: [[GLOBAL_AND_LOCAL_FLAG2:%.*]] = llvm.mlir.constant(3 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG2]])
+ gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+ // CHECK: [[LOCAL_FLAG:%.*]] = llvm.mlir.constant(1 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[LOCAL_FLAG]])
+ gpu.barrier memfence [#gpu.address_space<workgroup>]
+ // CHECK: [[GLOBAL_FLAG:%.*]] = llvm.mlir.constant(2 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_FLAG]])
+ gpu.barrier memfence [#gpu.address_space<global>]
+ // CHECK: [[NONE_FLAG:%.*]] = llvm.mlir.constant(0 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG]])
+ gpu.barrier memfence []
+ // CHECK: [[NONE_FLAG2:%.*]] = llvm.mlir.constant(0 : i32) : i32
+ // CHECK: llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG2]])
+ gpu.barrier memfence [#gpu.address_space<private>]
return
}
}
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index b9c0a0e79e8f2a..2bba66f786f189 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -141,6 +141,12 @@ module attributes {gpu.container_module} {
%shfl3, %pred3 = gpu.shuffle idx %arg0, %offset, %width : f32
"gpu.barrier"() : () -> ()
+ gpu.barrier
+ gpu.barrier memfence [#gpu.address_space<workgroup>]
+ gpu.barrier memfence [#gpu.address_space<global>]
+ gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
+ gpu.barrier memfence [#gpu.address_space<private>]
+ gpu.barrier memfence []
"some_op"(%bIdX, %tIdX) : (index, index) -> ()
%42 = memref.load %arg1[%bIdX] : memref<?xf32, 1>
I don't understand the new semantics as described in the op documentation.
> The address space of visible memory accesses can be modified by adding a
> list of address spaces required to be visible. By default all address spaces
> are included.
What does it mean for memory to be visible? What memory accesses is this referencing?
I might suggest "completed" or "committed" as synonyms here, if I've understood the semantics right.
That is, any memory operations involving the given memory spaces that were issued before the barrier must have their results reflected after the barrier completes.
Or, another way, any operation touching the listed memory spaces must be happens-before with any operation on those memory spaces after the barrier.
(This would allow us to not necessarily have amdgpu.lds_barrier, since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>].)
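A sketch of that suggested equivalence, using the `memfence` syntax from this PR; whether the lowerings actually line up is exactly what is debated below:

```mlir
// Existing target-specific op: a barrier that only fences LDS (workgroup memory).
amdgpu.lds_barrier
// Proposed portable spelling: only workgroup-memory accesses are required
// to be visible at the barrier.
gpu.barrier memfence [#gpu.address_space<workgroup>]
```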
I'm not trying to say that the memory is visible, but that the memory accesses are visible. I think it makes sense when paired with the previous paragraph:

> and all memory accesses made by these work items prior to the op are
> visible to all work items in the workgroup.
I think what @krzysz00 is saying matches.
I'm open to suggestions on how to make it more clear.
Some good points there, @krzysz00. I only know the amdgpu implementation and that's why I'd like to make sure the semantics make sense for everyone in a portable way.
The way I understand it, it's the backend compiler's job to track any memory dependencies / synchronization within the same subgroup, and we only need gpu.barrier (the plain version) when the dependency can be across subgroups.

> (This would allow us to not necessarily have amdgpu.lds_barrier - since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>])

I don't see how this aligns with the stated goal in the RFC:

> This could reduce the cost of synchronization.
> All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

lds_barrier is strictly more work than just gpu.barrier, no? It's s_barrier and "please flush the shared memory fifo". At least that's what C++ libraries do: https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/utility/synchronization.hpp#L20-L25.
Maybe I'm missing something.
gpu.barrier is stronger than amdgpu.lds_barrier. The current semantics of gpu.barrier are not AMD's s_barrier, they're (atomic fence) + s_barrier.
(I think there are even explicit calls to LLVM memory fencing in the AMD lowering path for gpu.barrier.)
That is, gpu.barrier imposes a happens-before over all GPU memory by default, and this patch lets people loosen that.
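In the syntax added by this diff, that loosening reads roughly as follows (forms taken from the PR's tests):

```mlir
// Today's default: happens-before over all GPU address spaces.
gpu.barrier
// Loosened: only global-memory accesses must be visible after the barrier.
gpu.barrier memfence [#gpu.address_space<global>]
// Loosened to execution-only synchronization: no visibility requirement.
gpu.barrier memfence []
```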
(And those libraries are implementing amdgpu.lds_barrier. If they wanted to match gpu.barrier, they'd be waiting on vmcnt(0) as well.)
> I'm not trying to say that the memory is visible, but that the memory accesses are visible.

I don't know what this means either. I don't want to be pedantic here, but this is just not the wording I'm used to / have seen in this context. I did some searching and found this thread that talks about something adjacent, memory access availability vs. visibility: https://community.khronos.org/t/difference-between-availability-and-visibility/7401/2
> I'm open to suggestions on how to make it more clear.
Maybe something like this (I reworded a few places):
The "barrier" op synchronizes all work items of a workgroup. It is used
to coordinate communication between the work items of the workgroup.
`gpu.barrier` waits until all work items in the workgroup have reached this point
and all memory accesses made by these work items prior to the op have
completed and are visible to all work items in the workgroup.
Data races between work items accessing the same memory locations can
be avoided by synchronizing work items in-between these accesses.
By default, all address spaces are included in `gpu.barrier`. This is equivalent to:
```mlir
gpu.barrier memfence [#gpu.address_space<global>,
#gpu.address_space<workgroup>,
#gpu.address_space<private>]
```
The `memfence` argument weakens the synchronization requirements such that
only the listed address spaces need to have their accesses completed and visible across the
workgroup.
I think that description basically covers it, but I would move a couple things around. Also I wouldn't say that memfence necessarily weakens the memory fencing requirement, but it can be used for that.
I'm preemptively blocking here because I think there has not been enough attention on the RFC.
I think this PR is conflating memory fences and thread barriers and, as @krzysz00 said, gpu.barrier is stronger than a memory fence, and currently everyone expects gpu.barrier to behave as a thread barrier.
I think a better approach might be needed, e.g. adding a new op gpu.memfence.
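A hypothetical sketch of that split; the `gpu.memfence` op below does not exist, its name and syntax are only illustrative:

```mlir
// Hypothetical memory-only fence: orders workgroup-memory accesses,
// but does not wait for other work items.
gpu.memfence [#gpu.address_space<workgroup>]
// Existing op: full thread barrier with its current (stronger) semantics.
gpu.barrier
```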
@fabianmcg I would say more accurately that MLIR's
I agree, thread barriers including That's why I think a
This would be a useful primitive to have IMO. I'd expect these to not wait for all threads to reach the same program point, right? The use case I have in mind is to have more fine-grained control on the number of outstanding memory accesses within a subgroup (i.e. "please flush this fifo for me").
... I take back some of what I said earlier. On AMDGPU/ROCDL, the lowering of
which in theory should just be an LDS fence but I think there are standing compiler issues that make it not that
I.e. ideally, I'd like to have separate ops for waiting for all threads to reach the execution point (say,
I think I agree with you. Would this definition make
A fence never makes thread sync guarantees, only memory ordering (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions). Which precisely highlights the issue with this change: those are separate concepts.
Yes,
I'm inclined to keep
I'm not sure I understood the op semantics. Can we please show how the NVVM lowering will look? Introducing a specific memfence op would be fine. But we need an RFC if you want to make something complex.
At least on amdgpu,
@FMarno I think it would be very useful to have a survey of related functions/intrinsics across APIs (CUDA, amdgpu, OpenCL, SPIR-V, etc.) to see what semantics other frameworks adopted. These may not be as portable and separable as I initially imagined (I'm far from being an expert here). Would you be able to do that? We could resume it on the RFC thread.
Yes, that's also the case for NVIDIA: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#barrier-and-memory-fence What I was referring to is that in the context of the
Right, so if the goal is to make it weaker, it seems useful to decouple synchronization from fences. Seems like this is what SPIR-V decided to do: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#_barrier_instructions
I would argue that it would be useful to add new ops where those semantics are decoupled, allowing issuing more fine-grained ops. Not to change
I've added some clarifying comments to the RFC, maybe it's easier to continue the discussion there.