[mlir][gpu] Add address space modifier to Barrier #110527

Open · wants to merge 1 commit into base: main
2 changes: 2 additions & 0 deletions mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -99,6 +99,8 @@ def GPU_AddressSpaceEnum : GPU_I32Enum<
def GPU_AddressSpaceAttr :
GPU_I32EnumAttr<"address_space", GPU_AddressSpaceEnum>;

def GPU_AddressSpaceAttrArray : TypedArrayAttrBase<GPU_AddressSpaceAttr, "GPU Address Space array">;

//===----------------------------------------------------------------------===//
// GPU Types.
//===----------------------------------------------------------------------===//
19 changes: 17 additions & 2 deletions mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1355,7 +1355,8 @@ def GPU_ShuffleOp : GPU_Op<
];
}

def GPU_BarrierOp : GPU_Op<"barrier"> {
def GPU_BarrierOp : GPU_Op<"barrier">,
Arguments<(ins OptionalAttr<GPU_AddressSpaceAttrArray> :$address_spaces)> {
let summary = "Synchronizes all work items of a workgroup.";
let description = [{
The "barrier" op synchronizes all work items of a workgroup. It is used
@@ -1371,11 +1372,25 @@ def GPU_BarrierOp : GPU_Op<"barrier"> {
accessing the same memory can be avoided by synchronizing work items
in-between these accesses.

The address space of visible memory accesses can be modified by adding a
list of address spaces required to be visible. By default all address spaces
are included.
Comment on lines +1375 to +1377
Member:

What does it mean for memory to be visible? What memory accesses is this referencing?

Contributor:

I might suggest "completed" or "committed" as synonyms here, if I've understood the semantics right.

That is, any memory operations involving the given memory spaces that were issued before the barrier must have their results reflected after the barrier completes.

Or, another way, any operation touching the listed memory spaces must be happens-before with any operation on those memory spaces after the barrier.

(This would allow us to not necessarily have amdgpu.lds_barrier - since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>].)
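
As a hypothetical sketch of that happens-before guarantee (the module, function, and buffer names below are made up for illustration and are not part of this patch; boundary handling is omitted):

```mlir
gpu.module @kernels {
  // Each work item publishes a value into workgroup memory, then reads a
  // neighbouring element after the barrier. Only workgroup-memory accesses
  // need to become visible, so the weaker barrier would suffice here.
  gpu.func @exchange(%buf: memref<256xf32, #gpu.address_space<workgroup>>) kernel {
    %tid = gpu.thread_id x
    %c1 = arith.constant 1 : index
    %cst = arith.constant 1.0 : f32
    memref.store %cst, %buf[%tid] : memref<256xf32, #gpu.address_space<workgroup>>
    // The stores above must happen-before the loads below.
    gpu.barrier memfence [#gpu.address_space<workgroup>]
    %next = arith.addi %tid, %c1 : index
    %val = memref.load %buf[%next] : memref<256xf32, #gpu.address_space<workgroup>>
    gpu.return
  }
}
```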

Contributor Author:

I'm not trying to say that the memory is visible, but that the memory accesses are visible. I think it makes sense when paired with the previous paragraph:

> and all memory accesses made by these work items prior to the op are
> visible to all work items in the workgroup.

I think what @krzysz00 is saying matches.

I'm open to suggestions on how to make it more clear.

Member (@kuhar, Sep 30, 2024):

Some good points there, @krzysz00. I only know the amdgpu implementation and that's why I'd like to make sure the semantics make sense for everyone in a portable way.

The way I understand it, it's the backend compiler's job to track any memory dependencies / synchronization within the same subgroup, and we only need gpu.barrier (the plain version) when the dependency can be across subgroups.

> (This would allow us to not necessarily have amdgpu.lds_barrier - since we'd be able to express that as gpu.barrier [#gpu.address_space<workgroup>].)

I don't see how this aligns with the stated goal in the RFC:

> This could reduce the cost of synchronization.

All work-items in the workgroup are still required to reach the barrier, but the address space visibility can be reduced.

lds_barrier is strictly more work than just gpu.barrier, no? It's s_barrier and 'please flush the shared memory fifo'. At least that's what C++ libraries do: https://github.com/ROCm/composable_kernel/blob/de3e3b642402eac5b4a466f6a2fa5e9f022ba680/include/ck/utility/synchronization.hpp#L20-L25.

Maybe I'm missing something.

Contributor:

gpu.barrier is stronger than amdgpu.lds_barrier. The current semantics of gpu.barrier are not AMD's s_barrier, they're (atomic fence) + s_barrier.

(I think there are even explicit calls to LLVM memory fencing in the AMD lowering path for gpu.barrier).

That is, gpu.barrier imposes a happens-before over all GPU memory by default, and this patch lets people loosen that.
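
For illustration only, that "(atomic fence) + s_barrier" shape could be sketched in the LLVM/ROCDL dialects roughly as below; this is a hand-written approximation, not the literal output of the GPUToROCDL lowering, which may fold the fences into the barrier op itself:

```mlir
// Release pending writes before the hardware barrier, then acquire after it,
// so memory accesses issued before the barrier are visible to the workgroup
// once the barrier completes.
llvm.fence syncscope("workgroup") release
rocdl.barrier
llvm.fence syncscope("workgroup") acquire
```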

Contributor:

(And those libraries are implementing amdgpu.lds_barrier. If they wanted to match gpu.barrier, they'd be waiting on vmcnt(0) as well.)

Member:

> I'm not trying to say that the memory is visible, but that the memory accesses are visible.

I don't know what this means either. I don't want to be pedantic here, but this is just not the wording I'm used to / have seen in this context. I did some searching and found this thread that talks about something adjacent (memory access availability vs. visibility): https://community.khronos.org/t/difference-between-availability-and-visibility/7401/2.

> I'm open to suggestions on how to make it more clear.

Maybe something like this (I reworded a few places):

    The "barrier" op synchronizes all work items of a workgroup. It is used
    to coordinate communication between the work items of the workgroup.
    `gpu.barrier` waits until all work items in the workgroup have reached this point
    and all memory accesses made by these work items prior to the op have
    completed and are visible to all work items in the workgroup.
    
    Data races between work items accessing the same memory locations can
    be avoided by synchronizing work items in-between these accesses.

    By default, all address spaces are included in `gpu.barrier`. This is equivalent to:
    ```mlir
    gpu.barrier memfence [#gpu.address_space<global>,
                          #gpu.address_space<workgroup>,
                          #gpu.address_space<private>]
    ```
    
    The `memfence` argument weakens the synchronization requirements such that
    only the listed address spaces need to have their accesses completed and
    visible across the workgroup.

Contributor Author:

I think that description basically covers it, but I would move a couple of things around. Also, I wouldn't say that memfence necessarily weakens the memory fencing requirement, but it can be used for that.


```mlir
// Only workgroup address space accesses are required to be visible.
gpu.barrier memfence [#gpu.address_space<workgroup>]
// No memory accesses are required to be visible.
gpu.barrier memfence []
// All memory accesses are required to be visible.
gpu.barrier
```

Either none or all work items of a workgroup need to execute this op
in convergence.
}];
let assemblyFormat = "attr-dict";
let assemblyFormat = "(`memfence` $address_spaces^)? attr-dict";
let hasCanonicalizer = 1;
let builders = [OpBuilder<(ins)>];
}

def GPU_GPUModuleOp : GPU_Op<"module", [
29 changes: 24 additions & 5 deletions mlir/lib/Conversion/GPUToLLVMSPV/GPUToLLVMSPV.cpp
@@ -116,12 +116,31 @@ struct GPUBarrierConversion final : ConvertOpToLLVMPattern<gpu::BarrierOp> {
lookupOrCreateSPIRVFn(moduleOp, funcName, flagTy, voidTy,
/*isMemNone=*/false, /*isConvergent=*/true);

// Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE`.
// See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
constexpr int64_t localMemFenceFlag = 1;
// Value used by SPIR-V backend to represent `CLK_LOCAL_MEM_FENCE` and
// `CLK_GLOBAL_MEM_FENCE`. See `llvm/lib/Target/SPIRV/SPIRVBuiltins.td`.
constexpr int32_t localMemFenceFlag = 1;
constexpr int32_t globalMemFenceFlag = 2;
int32_t memFenceFlag = 0;
std::optional<ArrayAttr> addressSpaces = adaptor.getAddressSpaces();
if (addressSpaces) {
for (Attribute attr : addressSpaces.value()) {
auto addressSpace = cast<gpu::AddressSpaceAttr>(attr).getValue();
switch (addressSpace) {
case gpu::AddressSpace::Global:
memFenceFlag = memFenceFlag | globalMemFenceFlag;
break;
case gpu::AddressSpace::Workgroup:
memFenceFlag = memFenceFlag | localMemFenceFlag;
break;
case gpu::AddressSpace::Private:
break;
}
}
} else {
memFenceFlag = localMemFenceFlag | globalMemFenceFlag;
}
Location loc = op->getLoc();
Value flag =
rewriter.create<LLVM::ConstantOp>(loc, flagTy, localMemFenceFlag);
Value flag = rewriter.create<LLVM::ConstantOp>(loc, flagTy, memFenceFlag);
rewriter.replaceOp(op, createSPIRVBuiltinCall(loc, rewriter, func, flag));
return success();
}
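
To make the flag mapping above concrete, a short illustration (the resulting constants match the tests updated below):

```mlir
// CLK_*_MEM_FENCE flag value passed to the `_Z7barrierj` builtin:
gpu.barrier                                            // 3 = local | global (default)
gpu.barrier memfence [#gpu.address_space<workgroup>]   // 1 = CLK_LOCAL_MEM_FENCE
gpu.barrier memfence [#gpu.address_space<global>]      // 2 = CLK_GLOBAL_MEM_FENCE
gpu.barrier memfence [#gpu.address_space<private>]     // 0 = no fence flag
gpu.barrier memfence []                                // 0 = no fence flag
```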
2 changes: 1 addition & 1 deletion mlir/lib/Conversion/GPUToNVVM/GPUToNVVM.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
include "mlir/Dialect/GPU/IR/GPUOps.td"
include "mlir/Dialect/LLVMIR/NVVMOps.td"

def : Pat<(GPU_BarrierOp), (NVVM_Barrier0Op)>;
def : Pat<(GPU_BarrierOp : $op $memory_fence), (NVVM_Barrier0Op)>;

#endif // MLIR_CONVERSION_GPUTONVVM_TD
2 changes: 1 addition & 1 deletion mlir/lib/Conversion/GPUToROCDL/GPUToROCDL.td
@@ -17,6 +17,6 @@ include "mlir/IR/PatternBase.td"
include "mlir/Dialect/GPU/IR/GPUOps.td"
include "mlir/Dialect/LLVMIR/ROCDLOps.td"

def : Pat<(GPU_BarrierOp), (ROCDL_BarrierOp)>;
def : Pat<(GPU_BarrierOp : $op $memory_fence), (ROCDL_BarrierOp)>;

#endif // MLIR_CONVERSION_GPUTOROCDL_TD
3 changes: 3 additions & 0 deletions mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -1351,6 +1351,9 @@ void BarrierOp::getCanonicalizationPatterns(RewritePatternSet &results,
results.add(eraseRedundantGpuBarrierOps);
}

void BarrierOp::build(mlir::OpBuilder &odsBuilder,
                      mlir::OperationState &odsState) {}

//===----------------------------------------------------------------------===//
// GPUFuncOp
//===----------------------------------------------------------------------===//
19 changes: 17 additions & 2 deletions mlir/test/Conversion/GPUToLLVMSPV/gpu-to-llvm-spv.mlir
@@ -213,14 +213,29 @@ gpu.module @barriers {

// CHECK-LABEL: gpu_barrier
func.func @gpu_barrier() {
// CHECK: [[FLAGS:%.*]] = llvm.mlir.constant(1 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[FLAGS]]) {
// CHECK: [[GLOBAL_AND_LOCAL_FLAG:%.*]] = llvm.mlir.constant(3 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG]]) {
// CHECK-SAME-DAG: no_unwind
// CHECK-SAME-DAG: convergent
// CHECK-SAME-DAG: will_return
// CHECK-NOT: memory_effects = #llvm.memory_effects
// CHECK-SAME: } : (i32) -> ()
gpu.barrier
// CHECK: [[GLOBAL_AND_LOCAL_FLAG2:%.*]] = llvm.mlir.constant(3 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_AND_LOCAL_FLAG2]])
gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
// CHECK: [[LOCAL_FLAG:%.*]] = llvm.mlir.constant(1 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[LOCAL_FLAG]])
gpu.barrier memfence [#gpu.address_space<workgroup>]
// CHECK: [[GLOBAL_FLAG:%.*]] = llvm.mlir.constant(2 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[GLOBAL_FLAG]])
gpu.barrier memfence [#gpu.address_space<global>]
// CHECK: [[NONE_FLAG:%.*]] = llvm.mlir.constant(0 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG]])
gpu.barrier memfence []
// CHECK: [[NONE_FLAG2:%.*]] = llvm.mlir.constant(0 : i32) : i32
// CHECK: llvm.call spir_funccc @_Z7barrierj([[NONE_FLAG2]])
gpu.barrier memfence [#gpu.address_space<private>]
return
}
}
6 changes: 6 additions & 0 deletions mlir/test/Dialect/GPU/ops.mlir
@@ -141,6 +141,12 @@ module attributes {gpu.container_module} {
%shfl3, %pred3 = gpu.shuffle idx %arg0, %offset, %width : f32

"gpu.barrier"() : () -> ()
gpu.barrier
gpu.barrier memfence [#gpu.address_space<workgroup>]
gpu.barrier memfence [#gpu.address_space<global>]
gpu.barrier memfence [#gpu.address_space<global>, #gpu.address_space<workgroup>]
gpu.barrier memfence [#gpu.address_space<private>]
gpu.barrier memfence []

"some_op"(%bIdX, %tIdX) : (index, index) -> ()
%42 = memref.load %arg1[%bIdX] : memref<?xf32, 1>