[OpenMP][MLIR] Add omp.distribute op to the OMP dialect #67720
Conversation
@llvm/pr-subscribers-flang-openmp @llvm/pr-subscribers-mlir-openmp

Changes: This patch adds the omp.distribute operation to the OMP dialect. The purpose is to be able to represent the distribute construct in OpenMP with the associated clauses. The effect of the operation is to distribute the loop iterations of the loop(s) contained inside the region across multiple teams.

Full diff: https://github.com/llvm/llvm-project/pull/67720.diff

3 Files Affected:
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
index 5da05476a683725..89d2ac342281ccb 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
@@ -636,6 +636,57 @@ def YieldOp : OpenMP_Op<"yield",
let assemblyFormat = [{ ( `(` $results^ `:` type($results) `)` )? attr-dict}];
}
+//===----------------------------------------------------------------------===//
+// Distribute construct [2.9.4.1]
+//===----------------------------------------------------------------------===//
+def DistributeOp : OpenMP_Op<"distribute", [AttrSizedOperandSegments,
+ RecursiveMemoryEffects]> {
+ let summary = "distribute construct";
+ let description = [{
+ The distribute construct specifies that the iterations of one or more loops
+ will be executed by the initial teams in the context of their implicit
+ tasks. The iterations are distributed across the initial threads of all
+ initial teams that execute the teams region to which the distribute region
+ binds.
+
+ The distribute loop construct specifies that the iterations of the loop(s)
+ will be executed in parallel by threads in the current context. These
+ iterations are spread across threads that already exist in the enclosing
+ region. The lower and upper bounds specify a half-open range: the
+ range includes the lower bound but does not include the upper bound. If the
+ `inclusive` attribute is specified then the upper bound is also included.
+
+ The `dist_schedule_static` attribute specifies the schedule for this
+ loop, determining how the loop is distributed across the parallel threads.
+ The optional `chunk_size` associated with this further controls the
+ distribution.
+
+ // TODO: private_var, firstprivate_var, lastprivate_var, collapse
+ }];
+ let arguments = (ins
+ UnitAttr:$dist_schedule_static,
+ Optional<IntLikeType>:$chunk_size,
+ Variadic<AnyType>:$allocate_vars,
+ Variadic<AnyType>:$allocators_vars,
+ OptionalAttr<OrderKindAttr>:$order_val);
+
+ let regions = (region AnyRegion:$region);
+
+ let assemblyFormat = [{
+ oilist(`dist_schedule_static` $dist_schedule_static
+ |`chunk_size` `(` $chunk_size `:` type($chunk_size) `)`
+ |`order` `(` custom<ClauseAttr>($order_val) `)`
+ |`allocate` `(`
+ custom<AllocateAndAllocator>(
+ $allocate_vars, type($allocate_vars),
+ $allocators_vars, type($allocators_vars)
+ ) `)`
+ ) $region attr-dict
+ }];
+
+ let hasVerifier = 1;
+}
+
//===----------------------------------------------------------------------===//
// 2.10.1 task Construct
//===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 2bf9355ed62676b..3c3fbe2981b018c 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -1053,6 +1053,22 @@ LogicalResult SimdLoopOp::verify() {
return success();
}
+//===----------------------------------------------------------------------===//
+// Verifier for Distribute construct [2.9.4.1]
+//===----------------------------------------------------------------------===//
+
+LogicalResult DistributeOp::verify() {
+ if (this->getChunkSize() && !this->getDistScheduleStatic())
+ return emitOpError() << "chunk size set without "
+ "dist_schedule_static being present";
+
+ if (getAllocateVars().size() != getAllocatorsVars().size())
+ return emitError(
+ "expected equal sizes for allocate and allocator variables");
+
+ return success();
+}
+
//===----------------------------------------------------------------------===//
// ReductionOp
//===----------------------------------------------------------------------===//
diff --git a/mlir/test/Dialect/OpenMP/ops.mlir b/mlir/test/Dialect/OpenMP/ops.mlir
index 13cbea6c9923c22..a509c6cb1626677 100644
--- a/mlir/test/Dialect/OpenMP/ops.mlir
+++ b/mlir/test/Dialect/OpenMP/ops.mlir
@@ -479,6 +479,36 @@ func.func @omp_simdloop_pretty_multiple(%lb1 : index, %ub1 : index, %step1 : ind
return
}
+// CHECK-LABEL: omp_distribute
+func.func @omp_distribute(%chunk_size : i32, %data_var : memref<i32>) -> () {
+ // CHECK: omp.distribute
+ "omp.distribute" () ({
+ omp.terminator
+ }) {} : () -> ()
+ // CHECK: omp.distribute
+ omp.distribute {
+ omp.terminator
+ }
+ // CHECK: omp.distribute dist_schedule_static
+ omp.distribute dist_schedule_static {
+ omp.terminator
+ }
+ // CHECK: omp.distribute dist_schedule_static chunk_size(%{{.+}} : i32)
+ omp.distribute dist_schedule_static chunk_size(%chunk_size : i32) {
+ omp.terminator
+ }
+ // CHECK: omp.distribute order(concurrent)
+ omp.distribute order(concurrent) {
+ omp.terminator
+ }
+ // CHECK: omp.distribute allocate(%{{.+}} : memref<i32> -> %{{.+}} : memref<i32>)
+ omp.distribute allocate(%data_var : memref<i32> -> %data_var : memref<i32>) {
+ omp.terminator
+ }
+return
+}
+
+
// CHECK-LABEL: omp_target
func.func @omp_target(%if_cond : i1, %device : si32, %num_threads : i32, %map1: memref<?xi32>, %map2: memref<?xi32>) -> () {
Nice! The change seems OK to me. However, it rekindles the need for an MLIR form that doesn't break this operation's expected properties. More specifically, since by construction distribute needs to be associated with loops, it needs some way to maintain that constraint. Being a separate op, it cannot guarantee this property across transformations. It also cannot guarantee the data-sharing property that the loop's IV must be private, since an inner loop may be "canonicalized" and would no longer maintain the user's IV. It also makes it unclear who executes instructions between distribute and the actual loop (likely each team), but that code is not distributed across iterations even though it currently would appear inside the omp.distribute construct. Thus, it would be ideal if these properties were captured at construction time and could not be broken by later passes.

I've given it some thought, and I imagine that doing some of this checking in the verifier is hard when the omp dialect is used with dialects that don't have a native loop representation, like the LLVM dialect. So I am OK with your PR as-is, but I would be happier to see this operation evolve in a way where the constraints are maintained by construction instead of just hoping it doesn't go through passes that invalidate the properties.

By the way, I haven't evaluated the merits of the omp.canonical_loop proposal/implementation (and I noticed you commented, which makes me think you are aware of it). But, from a high level, I think that is likely the right direction for omp.distribute as well.
Could you share your plan before we proceed?
I just created a dependent PR with lowering support from PFT to create instances of this operation, in case that helps understanding the representation proposed here.
You make a good point about this representation being slightly different. Since the canonical loop proposal might take some time before it's finished, in my opinion a good approach could be to mimic
The plan is to have some representation of 'distribute' that does not rely on meta-values (CLIs) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse). To summarize the thoughts from below: separating meta-ops (unroll, tile, etc.) that affect code generation from operations that execute and affect the semantics of the program seems to make sense. One possible way to separate them would be for executable operations to use regions and nesting, and for compile-time (meta) ops to use CLIs. In this case omp.distribute should use nesting. If it needs to produce a compile-time CLI value, that could be added later on.

The concern about transformations interfering with the representation is not very different from having a canonical loop with CLIs, since both solutions rely on transformations not knowing the semantics of the ops, which should prevent the code from being modified until lowering is done by the OpenMPIRBuilder. If there happen to be issues, we can fall back on using the IsolatedFromAbove trait.

There are other aspects of omp.canonical_loop where, in my mind, the operation is somewhat problematic, since it combines meta-values via yield and at the same time is a regular executable op. Maybe this is there because of SSA? There is also the discussion/issue about omp.structured_region and how to handle CLIs in the presence of control flow, where a static view of the program seems to make sense, meaning all canonical loops should be propagated up to any containing canonical loop. Also, the restriction about not allowing operations to be inserted between omp loop transformation ops would be hard to enforce unless a container op was introduced that could enclose a sequence of meta-ops; I think something like this was proposed on Discourse at some point. It would be cleaner, imo, if an omp.canonical_loop created a single CLI "handle" and container meta-ops were used to combine these values. As a side note, yield ops generally seem to be used to yield the induction variable to the next iteration in other loop constructs, so we might want to use a different op than omp.yield to propagate these values.
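To make the wrapper idea concrete, here is a minimal sketch using the syntax from this patch's tests; the contained `omp.wsloop` and its exact syntax are an assumption for illustration, not part of the patch:

```mlir
// omp.distribute only changes how the iterations of the loop(s) in its region
// are scheduled across teams; the loop body itself is untouched.
omp.distribute dist_schedule_static chunk_size(%chunk_size : i32) {
  omp.wsloop for (%i) : i32 = (%lb) to (%ub) step (%step) {
    // ... loop body ...
    omp.yield
  }
  omp.terminator
}
```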
I am thinking one way to resolve this would be to have a single load/store model of CLIs. Instead of omp.canonical_loop defining a CLI as a result, it would take the CLI value as an optional operand. That would also allow the loop structure to be defined via ops. Something like this:
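A minimal sketch of such a load/store model; the op names `omp.new_cli` and `omp.unroll` and the syntax are hypothetical:

```mlir
// The CLI is created first and attached to the canonical loop as an operand
// (the "store"); a transform op consumes it once (the "load") and defines a
// fresh CLI for any follow-up transformation.
%cli0 = omp.new_cli
omp.canonical_loop(%cli0) %iv : i32 in range(%tc) {
  // loop body
  omp.terminator
}
%cli1 = omp.unroll(%cli0) {factor = 4}  // deleting this meta-op still leaves valid code
```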
An added benefit would also be that deleting all the meta-ops would still result in correct code. The invariants should also be pretty few: a CLI value can only be referred to once by a transform-type op (load), and once by an omp.canonical_loop (store) if it is a leaf. A transform op will always produce a new CLI that can be used by a subsequent op. This encodes the ordering of transformations as data dependencies instead of control dependencies, which don't really exist anyway since this is a static view of the program; it also means there shouldn't be any restrictions on the ordering of ops other than def-before-use. Note that the omp.cli_nest is redundant, since it could be inferred from the structure of the code; however, if the code is manipulated outside of the meta-ops it can be used to verify that the code is in the expected form, which was a feature in the original omp.canonical_loop proposal.
Are you suggesting a different approach here? In the lowering in #67798, an `omp.wsloop` is created.
Will the contained loop be an `omp.canonical_loop`?

Regarding the new proposal, we might still have to propagate CLIs if at least one CLI is generated, like in the following example where the inner loop is partially unrolled. That unrolling generates a new CLI that has to somehow be propagated up, right?
I am proposing a different approach. The way I see it, it makes sense that omp.distribute is a wrapper op, because it does not affect the code of the loop, only the "schedule" of the loop: either each team runs all iterations, or each team runs a portion of the iterations.
I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams executing all (duplicated) iterations.
No, it could be some other kind of loop.
Sort of, it would not be propagated up in the regular code, just encoded by the meta-ops. In the example above we would get something like:
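A hedged sketch of what that could look like, reusing the hypothetical `omp.new_cli`/`omp.canonical_loop`/`omp.unroll` spellings from above plus an `omp.cli_nest` meta-op to record the nesting:

```mlir
%cli0 = omp.new_cli                      // outer loop handle
%cli1 = omp.new_cli                      // inner loop handle
omp.canonical_loop(%cli0) %i : i32 in range(%tc0) {
  omp.canonical_loop(%cli1) %j : i32 in range(%tc1) {
    omp.terminator
  }
  omp.terminator
}
omp.cli_nest(%cli0, %cli1)               // meta-op: %cli1 is nested inside %cli0
%cli2 = omp.unroll(%cli1) {factor = 2}   // partial unroll of the inner loop defines a new CLI
```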
Alternatively, if the order of the loop transforms doesn't matter (no dependence between them), there is no need to really encode the nesting, so I think it could be simplified to:
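And the simplified form, under the same assumptions, where the nesting is implied by the code structure:

```mlir
%cli0 = omp.new_cli
omp.canonical_loop %i : i32 in range(%tc0) {
  omp.canonical_loop(%cli0) %j : i32 in range(%tc1) {
    omp.terminator
  }
  omp.terminator
}
%cli1 = omp.unroll(%cli0) {factor = 2}   // only the transformed loop needs a CLI
```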
The CLIs get associated with a specific canonical loop, or they are created by a top-level loop transformation op.
The example lowering created a
In the other approach, the canonical loop was always created by an omp.canonical_loop declaration or by a loop transformation op. This way it was always easy to find the canonical loop given a CLI; now we will have to reach the loop through the uses of the CLI.

Associating CLIs at the top level will not always be possible, particularly if there are other operations in between, like a parallel operation. For the following example,
Should generate something like the following.
```mlir
%cli4 = omp.unroll(%cli1)
```
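A hedged sketch of the kind of IR being described here, with hypothetical op spellings, where an `omp.parallel` sits between the top-level CLI definition and the canonical loop it refers to:

```mlir
%cli1 = omp.new_cli
omp.distribute {
  omp.parallel {
    omp.canonical_loop(%cli1) %i : i32 in range(%tc) {
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}
%cli4 = omp.unroll(%cli1)   // a top-level transform must reach through the parallel region
```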
Not sure if that is correct or not.
For now we would only support omp.wsloop until we decide what to do with omp.canonical_loop and CLIs if we believe that this is an okay solution for distribute.
I don't think that there is much of a difference other than at the top level since it would still be necessary to look at uses to find loops through the omp.yield ops in the original proposal. Also understanding the structure of the code would require more analysis, but could be encoded in the ops using this approach if we see the need for it.
For the following example,
The code below should be equivalent (assuming the duplicated %cli4 was a typo). It does not really matter where the omp.unroll ops occur in the code, except that the uses of CLI values must be dominated by the definitions. A loop transformation happens at the location of the loop associated with a CLI, not where the loop transformation op is. If the order of the transforms is important, it would have to be encoded using CLI dependencies.
Thanks for bringing attention to this. I think that example should not produce an To me, at this time the options to achieve this are either to adopt
I guess it depends on how we want to handle the collapse clause and how to handle loops that are both workshare and distribute.
The issue here is that
If the outlining is done, then %cli2 would become a parameter to the outlined function, so the association wouldn't be lost by the outlining, but it would require an interprocedural view. How would this work with the original proposal? Would the CLIs be returned from the outlined function?
This also raises questions about possible loop transformations that involve multiple loops. How would e.g. loop interchange work in the presence of parallel, like the example above? If outlining happens, the interchange would be very difficult to achieve. It seems reasonable to do the transforms early and eliminate those ops before any other transformations take place. The transforms specified in the source code refer to the initial structure of the code and not the structure that may come about after doing a bunch of transforms like outlining.
I was only pointing out that we cannot have the OpenMP loop transformation operations at the top level. They will have to be nested whenever there are other operations, like a parallel, in between. The previous proposal would have looked like the following.
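A hedged sketch of how that might look with the earlier result-producing `omp.canonical_loop` form (spelling illustrative only), with the transform op nested next to the loop it refers to:

```mlir
omp.distribute {
  omp.parallel {
    %cli = omp.canonical_loop %i : i32 in range(%tc) {
      omp.terminator
    }
    omp.unroll(%cli)   // cannot move to the top level; the CLI is defined in here
    omp.terminator
  }
  omp.terminator
}
```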
Merged with main, added negative test, added MemoryEffects<[MemWrite]>.
Is the plan to lower as the following:
-> !$omp distribute
-> !$omp distribute simd
-> !$omp distribute parallel do
-> !$omp distribute parallel do simd
Yes, that is correct.
Since distribute is always specified on a loop, it might be good to retain that relation in the IR. You could do the following.

!$omp distribute
!$omp distribute simd
!$omp distribute parallel do
!$omp distribute parallel do simd
I'm wondering if it will be difficult to represent the various clauses that could be added to the different directives if we use this representation. I am also wondering, if there are additional ops other than omp.canonical_loop, whether it may still require some traversal of the code.
For clauses, all supported clauses will have to be added and the verifier should be used to ensure that only the permitted ones are supported.
I did not understand this point. For suggesting the above representation, I was concerned with three points:
Does that mean that clauses that are associated with omp.parallel would be duplicated in omp.distribute? I would be hesitant to duplicate things if it is possible to compose things instead.
I misunderstood the restrictions of distribute, so you can ignore this.
I'm not sure it is possible to have omp.wsloop_simd if there is a loop transform that could change which loop the simd operates on, so it would have to be an attribute or wrapper op.
Yes, I think these are valid points. Would it be enough for a verifier to check that only the allowed ops like parallel and simd (in addition to the omp.wsloop) are present inside of an omp.distribute, instead of encoding it directly in the IR as attributes?
I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.
Also, with the original proposal for omp.canonical_loop this would be possible.
Yes, but it will still apply to the first canonical loop or generated loop. The possible issues that I can think of, if we lose the association with the loop, are: Also, is it guaranteed that the OpenMP runtime call generation for
// Distribute construct [2.9.4.1]
//===----------------------------------------------------------------------===//
def DistributeOp : OpenMP_Op<"distribute", [AttrSizedOperandSegments,
                             MemoryEffects<[MemWrite]>]> {
@shraiysh could you comment on the MemWrite effect here. I believe you previously worked on adding RecursiveMemoryEffects.
I'm not sure if @shraiysh is working on OpenMP anymore. There is a bit of a nuance with respect to the memory effects. The memory effects apply to operations inside the distribute region, and to containing ops, but not to ops on the same level. If CSE or some other pass decides to hoist things outside the inner region, that would be illegal (so in some sense it does apply to operations moved to the same level). Similarly, it would be illegal to hoist the omp.distribute outside an omp.teams. It is safest to mark the op with MemoryEffects<[MemWrite]>.
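A small sketch of the hazard being described; the `llvm.load` is only a stand-in for any side-effect-free operation:

```mlir
omp.teams {
  omp.distribute {
    // %x looks region-invariant, but hoisting it above omp.distribute (or
    // hoisting the omp.distribute above omp.teams) changes which teams and
    // threads execute it.
    %x = llvm.load %p : !llvm.ptr -> i32
    omp.terminator
  }
  omp.terminator
}
```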
The distribute construct specifies that the iterations of one or more loops
will be executed by the initial teams in the context of their implicit
tasks. The iterations are distributed across the initial threads of all
initial teams that execute the teams region to which the distribute region
binds.
Please specify which loop this applies to. Is this the closest enclosing omp.canonical_loop?
The loops that distribute refers to don't exist until after the loop transformations have been performed, and if the loop transformations happen at the OpenMP translation stage and this is an issue for the handling of lastprivate, it will be a problem no matter what solution we pick, unless the loop transformations happen earlier in the compiler.

I think perhaps a single version of distribute is not enough to represent what we need. The distribute construct may have to be split up into a CLI-based or wrapper op, and additionally a lower-level representation which would be attributes on the loop ops. I believe it should always be possible at the loop level to determine if some iteration should be executed on a specific thread or not, so adding an attribute (plus the clauses that were part of the distribute) should be enough information.

The problem I see is that it will not be possible (or very convoluted) to directly generate code with omp.distribute higher up in the chain of CLIs, since at the time of generating code for the loops (e.g. omp.wsloop) we need to know if distribute is present or not. As you mention, the association is broken at that point, and unless that information is propagated down or scanned upward in the chain of CLIs (or the region hierarchy in the case of this PR), it is not possible to generate code without non-local information in either solution. I think the only alternative would be to do the loop transformations in MLIR and convert the omp.distribute op into attributes on the loop operations.
For the target device codegen, distribute is part of the generation for the loop, so it is just a different function call than the non-distribute version. There is no way to generate code for 'distribute' in isolation. Would it make sense to implement a more minimal solution to handle simple distribute uses for now by adding an attribute to omp.wsloop and having the frontend handle distribute?
I like the idea of potentially collecting all information on how a loop may be split up across different levels of parallelism in the `omp.wsloop` operation:

```mlir
// !$omp target teams distribute parallel do
omp.target {
  omp.teams {
    omp.parallel {
      omp.wsloop distribute for (%i) : i32 (%lb) to (%ub) step (%step) {
      }
    }
  }
}

// !$omp distribute
omp.wsloop distribute (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp distribute parallel do
omp.parallel {
  omp.wsloop distribute for (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp distribute parallel do simd
omp.parallel {
  omp.wsloop distribute for simd (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp distribute simd
omp.wsloop distribute simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp do
omp.wsloop for (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp do simd
omp.wsloop for simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp simd
omp.wsloop simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp parallel do
omp.parallel {
  omp.wsloop for (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp parallel do simd
omp.parallel {
  omp.wsloop for simd (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}
```

This ignores for the time being the introduction of `omp.canonical_loop`. The examples below explore how clauses attached to the different directives might be represented when they are combined:

```mlir
// !$omp distribute private(a)
// !$omp parallel do private(b)
omp.parallel {
  omp.wsloop private(%1, %2) distribute for ... {}
  // or
  omp.wsloop distribute_private(%1) for_private(%2) distribute for ... {}
}

// Would something like this be valid?
// !$omp distribute collapse(1)
// !$omp parallel do collapse(2)
omp.parallel {
  omp.wsloop collapse(%whichever_takes_precedence) distribute for ... {}
  // or
  omp.wsloop distribute_collapse(%1) for_collapse(%2) distribute for ... {}
}
```

This is the list of clauses, according to the 5.2 spec document, accepted by the
Thanks for detailing this, it seems like a good approach. We are still stuck on the loop transformations though, since they will prevent this solution if they remain until codegen. I'm basically a proponent of doing them quite early in the compiler, in a specialized pass which would, in effect, resolve all CLI values in the IR by performing the transformations. It seems reasonable since we are not doing code generation and emitting LLVM IR; we're just transforming some omp ops into other omp ops.
I'm not quite sure I understand the example. Are the construct(%cli... ) the "outputs" and the loops(%cli, ..) the ones being operated on, or do I have it backwards? Does the modified example below make sense?
@Meinersbur had objections to the initial nested proposal because there were no values to refer to specific loops after transformations were done, but that seems to be fixed here.
I think tile was the wrong construct, I meant collapse here.
|
I guess you are pointing towards a situation where there are two generated loops and the construct needs to work on both. For an example like the following:
I was suggesting something like the following:
I was looking for an example where only a subset of the created loops from construct(...) are used in the enclosing op in loops(...).
I think we need to get this PR moving again. From what I can tell, in both the original and the alternative solutions to handle CLIs, omp.distribute will be a wrapper op. I think we can add the CLI handling in future PRs.
LG.
Note that we will need the omp.canonical_loop to retain loop information at the LLVM dialect level. With @kparzysz's changes, it might be easier to add the canonical loops.
This patch adds the omp.distribute operation to the OMP dialect. The purpose is to be able to represent the distribute construct in OpenMP with the associated clauses. The effect of the operation is to distribute the loop iterations of the loop(s) contained inside the region across multiple teams.