
[OpenMP][MLIR] Add omp.distribute op to the OMP dialect #67720

Merged
merged 2 commits into llvm:main on Jan 24, 2024

Conversation

jsjodin
Contributor

@jsjodin jsjodin commented Sep 28, 2023

This patch adds the omp.distribute operation to the OMP dialect. The purpose is to be able to represent the distribute construct in OpenMP with its associated clauses. The effect of the operation is to distribute the loop iterations of the loop(s) contained inside the region across multiple teams.
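
As a quick illustration, here is a minimal sketch of the op in MLIR assembly, mirroring the test cases added in this patch (the surrounding teams construct and the wrapped loop are omitted):

// omp.distribute wrapping a region; dist_schedule_static and chunk_size
// together model the dist_schedule(static[, chunk_size]) clause.
omp.distribute dist_schedule_static chunk_size(%chunk_size : i32) {
  // the loop(s) whose iterations are distributed across teams go here
  omp.terminator
}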

@llvmbot
Collaborator

llvmbot commented Sep 28, 2023

@llvm/pr-subscribers-flang-openmp
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-openmp

Changes

This patch adds the omp.distribute operation to the OMP dialect. The purpose is to be able to represent the distribute construct in OpenMP with its associated clauses. The effect of the operation is to distribute the loop iterations of the loop(s) contained inside the region across multiple teams.


Full diff: https://github.com/llvm/llvm-project/pull/67720.diff

3 Files Affected:

  • (modified) mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td (+51)
  • (modified) mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp (+16)
  • (modified) mlir/test/Dialect/OpenMP/ops.mlir (+30)
diff --git a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
index 5da05476a683725..89d2ac342281ccb 100644
--- a/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
+++ b/mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td
@@ -636,6 +636,57 @@ def YieldOp : OpenMP_Op<"yield",
   let assemblyFormat = [{ ( `(` $results^ `:` type($results) `)` )? attr-dict}];
 }
 
+//===----------------------------------------------------------------------===//
+// Distribute construct [2.9.4.1]
+//===----------------------------------------------------------------------===//
+def DistributeOp : OpenMP_Op<"distribute", [AttrSizedOperandSegments,
+                              RecursiveMemoryEffects]> {
+  let summary = "distribute construct";
+  let description = [{
+    The distribute construct specifies that the iterations of one or more loops
+    will be executed by the initial teams in the context of their implicit
+    tasks. The iterations are distributed across the initial threads of all
+    initial teams that execute the teams region to which the distribute region
+    binds.
+
+    The distribute loop construct specifies that the iterations of the loop(s)
+    will be executed in parallel by threads in the current context. These
+    iterations are spread across threads that already exist in the enclosing
+    region. The lower and upper bounds specify a half-open range: the
+    range includes the lower bound but does not include the upper bound. If the
+    `inclusive` attribute is specified then the upper bound is also included.
+
+    The `dist_schedule_static` attribute specifies the schedule for this
+    loop, determining how the loop is distributed across the parallel threads.
+    The optional `chunk_size` operand associated with it further controls
+    this distribution.
+
+    // TODO: private_var, firstprivate_var, lastprivate_var, collapse
+  }];
+  let arguments = (ins
+             UnitAttr:$dist_schedule_static,
+             Optional<IntLikeType>:$chunk_size,
+             Variadic<AnyType>:$allocate_vars,
+             Variadic<AnyType>:$allocators_vars,
+             OptionalAttr<OrderKindAttr>:$order_val);
+
+  let regions = (region AnyRegion:$region);
+
+  let assemblyFormat = [{
+    oilist(`dist_schedule_static` $dist_schedule_static
+          |`chunk_size` `(` $chunk_size `:` type($chunk_size) `)`
+          |`order` `(` custom<ClauseAttr>($order_val) `)`
+          |`allocate` `(`
+             custom<AllocateAndAllocator>(
+               $allocate_vars, type($allocate_vars),
+               $allocators_vars, type($allocators_vars)
+             ) `)`
+    ) $region attr-dict
+  }];
+
+  let hasVerifier = 1;
+}
+
 //===----------------------------------------------------------------------===//
 // 2.10.1 task Construct
 //===----------------------------------------------------------------------===//
diff --git a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
index 2bf9355ed62676b..3c3fbe2981b018c 100644
--- a/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
+++ b/mlir/lib/Dialect/OpenMP/IR/OpenMPDialect.cpp
@@ -1053,6 +1053,22 @@ LogicalResult SimdLoopOp::verify() {
   return success();
 }
 
+//===----------------------------------------------------------------------===//
+// Verifier for Distribute construct [2.9.4.1]
+//===----------------------------------------------------------------------===//
+
+LogicalResult DistributeOp::verify() {
+  if (this->getChunkSize() && !this->getDistScheduleStatic())
+    return emitOpError() << "chunk size set without "
+                            "dist_schedule_static being present";
+
+  if (getAllocateVars().size() != getAllocatorsVars().size())
+    return emitError(
+        "expected equal sizes for allocate and allocator variables");
+
+  return success();
+}
+
 //===----------------------------------------------------------------------===//
 // ReductionOp
 //===----------------------------------------------------------------------===//
diff --git a/mlir/test/Dialect/OpenMP/ops.mlir b/mlir/test/Dialect/OpenMP/ops.mlir
index 13cbea6c9923c22..a509c6cb1626677 100644
--- a/mlir/test/Dialect/OpenMP/ops.mlir
+++ b/mlir/test/Dialect/OpenMP/ops.mlir
@@ -479,6 +479,36 @@ func.func @omp_simdloop_pretty_multiple(%lb1 : index, %ub1 : index, %step1 : ind
   return
 }
 
+// CHECK-LABEL: omp_distribute
+func.func @omp_distribute(%chunk_size : i32, %data_var : memref<i32>) -> () {
+  // CHECK: omp.distribute
+  "omp.distribute" () ({
+    omp.terminator
+  }) {} : () -> ()
+  // CHECK: omp.distribute
+  omp.distribute {
+    omp.terminator
+  }
+  // CHECK: omp.distribute dist_schedule_static
+  omp.distribute dist_schedule_static {
+    omp.terminator
+  }
+  // CHECK: omp.distribute dist_schedule_static chunk_size(%{{.+}} : i32)
+  omp.distribute dist_schedule_static chunk_size(%chunk_size : i32) {
+    omp.terminator
+  }
+  // CHECK: omp.distribute order(concurrent)
+  omp.distribute order(concurrent) {
+    omp.terminator
+  }
+  // CHECK: omp.distribute allocate(%{{.+}} : memref<i32> -> %{{.+}} : memref<i32>)
+  omp.distribute allocate(%data_var : memref<i32> -> %data_var : memref<i32>) {
+    omp.terminator
+  }
+return
+}
+
+
 // CHECK-LABEL: omp_target
 func.func @omp_target(%if_cond : i1, %device : si32,  %num_threads : i32, %map1: memref<?xi32>, %map2: memref<?xi32>) -> () {
 

@razvanlupusoru
Contributor

Nice! The change seems OK to me.

However, it rekindles the need for an MLIR form that doesn't break this operation's expected properties. More specifically, since by construction distribute needs to be associated with loops, it needs some way to maintain that constraint. Being separate, it cannot guarantee this property across transformations. It also cannot guarantee the data-sharing property that the loop's IV must be private, since an inner loop may be "canonicalized" and would no longer preserve the user's IV. It also makes it unclear who executes instructions between distribute and the actual loop - likely each team - but that code is not distributed across iterations even though it currently would appear inside the omp.distribute construct. Thus, it would be ideal if these properties were captured at construction time and could not be broken by later passes.

I've given it some thought, and I imagine that doing some of this checking in the verifier is hard when the omp dialect is used with dialects that don't have a native loop syntax - like the LLVM dialect. So I am OK with your PR as-is - but I would be happier to see this operation evolve in a way where the constraints are maintained by construction instead of just hoping it doesn't go through passes that invalidate the properties.

By the way, I haven't evaluated the merits of the omp.canonical_loop proposal/implementation (and I noticed you commented which makes me think you are aware). But, from a high level, I think that is likely the right direction for omp.distribute as well.

@kiranchandramohan
Contributor

Could you share your plan before we proceed?
The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

@skatrak
Contributor

skatrak commented Sep 29, 2023

I just created a dependent PR with lowering support from PFT to create instances of this operation, in case that helps with understanding the representation proposed here.

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

You make a good point about this representation being slightly different. Since the canonical loop proposal might take some time before it's finished, in my opinion a good approach could be to mimic omp.wsloop and produce new loop index variables as entry block arguments to this op's region. I guess in that case we'd have to think about what the omp.distribute loop bounds and steps would be for a given nested Do loop, and how they might interoperate with the corresponding omp.wsloop operation.
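
For concreteness, a hypothetical sketch of that alternative is shown below; the assembly syntax is made up by analogy with omp.wsloop and is not something this patch implements:

// Hypothetical: omp.distribute carrying its own bounds/step and defining the
// induction variable as an entry block argument, in the style of omp.wsloop.
omp.distribute (%iv) : index = (%lb) to (%ub) step (%step) {
  // loop body using %iv
  omp.yield
}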

@jsjodin
Contributor Author

jsjodin commented Sep 29, 2023

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

To summarize the thoughts from below. Separating meta-ops (unroll, tile etc) that affect code generation from operations that execute and affect the semantics of the program seems to make sense. One possible way to separate them would be for executable operations to use regions and nesting, and compile-time (meta) ops to use CLIs. In this case omp.distribute should use nesting. If it needs to produce a compile-time CLI value it could be added later on.

Regarding the concern about transformations interfering with the representation: this is not very different from having a canonical loop with CLIs, since both solutions rely on transformations not knowing the semantics of the ops, which should prevent the code from being modified until lowering is done by the OpenMPIRBuilder. If there happen to be issues we can fall back on using the IsolatedFromAbove trait.

There are other aspects of omp.canonical_loop where, in my mind, the operation is somewhat problematic, since it combines meta-values via yield and at the same time is a regular executable op. Maybe this is there because of SSA? There is also the discussion/issue about omp.structured_region and how to handle CLIs in the presence of control flow, where a static view of the program seems to make sense, meaning all canonical loops should be propagated up to any containing canonical loop. Also, the restriction about not allowing operations to be inserted between omp loop transformation ops is something that would be hard to enforce unless a container op was introduced that could enclose a sequence of meta-ops. I think something like this was proposed on discourse at some point. It would be cleaner imo if an omp.canonical_loop created a single CLI "handle" and container meta-ops were used to combine these values. As a side note, yield ops generally seem to be used to yield the induction variable to the next iteration in other loop constructs, so we might want to use a different op than omp.yield to propagate values.

@jsjodin
Contributor Author

jsjodin commented Oct 2, 2023

I am thinking one way to resolve this would be to have a single load/store model for CLIs. Instead of omp.canonical_loop defining a CLI as a value, it would take the CLI value as an optional operand. That would also allow the loop structure to be defined via ops. Something like this:

%cli1 = omp.cli()
%cli2 = omp.cli()
omp.canonical_loop  (%cli1, ...) {
   omp.canonical_loop(%cli2, ...) {
     ...
   }
 }
%cli3 = omp.cli_nest(%cli1, %cli2) // Encode the relationship
%cli4 = omp.unroll(%cli3#0)

An added benefit would also be that deleting all the meta-ops would still result in correct code. The invariants should also be pretty few. A CLI value can only be referred to once by a transform-type op (load), and once by an omp.canonical_loop (store) if it is a leaf. A transform op will always produce a new CLI that can be used by a subsequent op. This will encode the ordering of transformations as data dependencies instead of control dependencies, which don't really exist anyway since this is a static view of the program. This also means there shouldn't be any restrictions on the ordering of ops other than def before use. Note that the omp.cli_nest is redundant, since it could be inferred from the structure of the code; however, if the code is manipulated outside of the meta-ops, it can be used to verify that the code is in the expected form, which was a feature in the original omp.canonical_loop proposal.

@kiranchandramohan
Contributor

Since the canonical loop proposal might take some time before it's finished, in my opinion a good approach could be to mimic omp.wsloop and produce new loop index variables as entry block arguments to this op's region. I guess in that case we'd have to think about what the omp.distribute loop bounds and steps would be for a given nested Do loop, and how they might interoperate with the corresponding omp.wsloop operation.

Are you suggesting a different approach here? To have omp.distribute the same as omp.wsloop, i.e. a loop-like operation? Currently omp.wsloop subsumes the nested Do loop.

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

Will the contained loop be an omp.wsloop always?

Regarding the new proposal, we might still have to propagate CLIs if at least one CLI is generated. Like in the following example, where the inner loop is unrolled partially. That unrolling generates a new CLI that has to be somehow propagated up, right?

!$omp tile (4,5)
do i=1,n
  !$omp unroll partial
  do j=1,m
  end do
end do

@jsjodin
Contributor Author

jsjodin commented Oct 2, 2023

Since the canonical loop proposal might take some time before it's finished, in my opinion a good approach could be to mimic omp.wsloop and produce new loop index variables as entry block arguments to this op's region. I guess in that case we'd have to think about what the omp.distribute loop bounds and steps would be for a given nested Do loop, and how they might interoperate with the corresponding omp.wsloop operation.

Are you suggesting a different approach here? To have omp.distribute the same as omp.wsloop, i.e. a loop-like operation? Currently omp.wsloop subsumes the nested Do loop.

I am proposing a different approach. The way I see it, it makes sense that omp.distribute is a wrapper op, because it does not affect the code of the loop, only the "schedule" of the loop: either each team runs all iterations, or each team runs a portion of the iterations.
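
Roughly, as an illustrative sketch of that wrapper structure (the contents of the region are only indicative):

omp.teams {
  omp.distribute {
    // The wrapper changes only how the wrapped loop's iterations are divided
    // among the teams; the loop op and its body are left untouched.
    // (some loop op, e.g. omp.wsloop or a future omp.canonical_loop)
    omp.terminator
  }
  omp.terminator
}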

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams taking all (duplicating) iterations.

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

Will the contained loop be an omp.wsloop always?

No, it could be some other kind of loop.

Regarding the new proposal, we might still have to propagate CLIs if atleast one of the CLI is generated. Like in the following example, where the inner loop is unrolled partially. That unrolling generates a new CLI that has to be somehow propagated up right?

!$omp tile (4,5)
do i=1,n
  !$omp unroll partial
  do j=1,m
  end do
end do

Sort of, it would not be propagated up in the regular code, just encoded by the meta-ops. In the example above we would get something like:

%cli1 = omp.cli()
%cli2 = omp.cli()
canonical_loop(%i, 1, %n, %cli1) {
   canonical_loop(%j, 1, %m, %cli2) {
 }
}
%cli3 = omp.cli_nest(%cli1, %cli2)
%cli4 = omp.unroll_partial(%cli3#1)
%cli5 = omp.tile(%cli4#0)

Alternatively, if the order of the loop transforms doesn't matter (no dependence between them), there is no need to really encode the nesting, so I think it could be simplified to:

%cli1 = omp.cli()
%cli2 = omp.cli()
canonical_loop(%i, 1, %n, %cli1) {
  canonical_loop(%j, 1, %m, %cli2) {
  }
}
%cli3 = omp.unroll_partial(%cli2)
%cli4 = omp.tile(%cli1)

The CLIs get associated with a specific canonical loop, or they are created by a top-level loop transformation op.

@kiranchandramohan
Contributor

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams taking all (duplicating) iterations.

The example lowering created an omp.wsloop operation for an omp.distribute even though there was no parallel do. Hence the question.
https://github.com/llvm/llvm-project/pull/67798/files#diff-0652f88238afa05fb262dcebab875780ab553b3914ba7239512c45986198240d

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

Will the contained loop be an omp.wsloop always?

No, it could be some other kind of loop.

fir.do_loop, scf.for are converted to control-flow by the time they are in the OpenMP + LLVM dialect stage. So it has to be something encoded with OpenMP, like omp.canonical_loop.

The CLIs get associated with a specific canonical loop, or they are created by a top-level loop transformation op.

In the other approach, the canonical loop was always created by an omp.canonical_loop declaration or by a loop transformation op. This way it was always easy to find the canonical loop given a CLI. Now we will have to reach the omp.cli operation and look at its use.

Associating CLIs at the top-level will not always be possible, particularly if there are other operations, like a parallel operation. For the following example,

!$omp unroll
do i=1,l
  !$omp parallel
  do j=1,m
    !$omp unroll
    do k=1,n
    end do
  end do
end do

Should generate something like the following.

%cli1 = omp.cli()
canonical_loop(%i, 1, %l, %cli1) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       %cli2 = omp.cli()
       canonical_loop(%j, 1, %m, %cli2) {
       }
       %cli4 = omp.unroll(%cli2)
     }
   }
}

%cli4 = omp.unroll(%cli1)

@jsjodin
Contributor Author

jsjodin commented Oct 3, 2023

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams taking all (duplicating) iterations.

The example lowering created an omp.wsloop operation for an omp.distribute even though there was no parallel do. Hence the question. https://github.com/llvm/llvm-project/pull/67798/files#diff-0652f88238afa05fb262dcebab875780ab553b3914ba7239512c45986198240d

Not sure if that is correct or not.

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

Will the contained loop be an omp.wsloop always?

No, it could be some other kind of loop.

fir.do_loop, scf.for are converted to control-flow by the time they are in the OpenMP + LLVM dialect stage. So it has to be something encoded with OpenMP, like omp.canonical_loop.

For now we would only support omp.wsloop until we decide what to do with omp.canonical_loop and CLIs if we believe that this is an okay solution for distribute.

The CLIs get associated with a specific canonical loop, or they are created by a top-level loop transformation op.

In the other approach, the canonical loop was always created by an omp.canonical_loop declaration or by a loop transformation op. This way it was always easy to find the canonical loop given a CLI. Now we will have to reach the omp.cli operation and look at its use.

I don't think that there is much of a difference other than at the top level, since it would still be necessary to look at uses to find loops through the omp.yield ops in the original proposal. Also, understanding the structure of the code would require more analysis, but it could be encoded in the ops using this approach if we see the need for it.

Associating CLIs at the top-level will not always be possible, particularly if there are other operations, like a parallel operation.

I'm not sure what effect other operations would have. I understand that they might affect how the loop transformation is done, but not how they would affect the location of the transformation op.

For the following example,

!$omp unroll
do i=1,l
  !$omp parallel
  do j=1,m
    !$omp unroll
    do k=1,n
    end do
  end do
end do

Should generate something like the following.

%cli1 = omp.cli()
canonical_loop(%i, 1, %l, %cli1) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       %cli2 = omp.cli()
       canonical_loop(%j, 1, %m, %cli2) {
       }
       %cli4 = omp.unroll(%cli2)
     }
   }
}

%cli4 = omp.unroll(%cli1)

The code below should be equivalent (assuming the duplicated %cli4 was a typo). It does not really matter where the omp.unroll ops occur in the code, except that the uses of CLI values must be dominated by the definitions. A loop transformation happens at the location where the loop associated with a CLI is, not where the loop transformation op is. If the order of the transforms is important, it would have to be encoded using CLI dependencies.

%cli1 = omp.cli()
%cli2 = omp.cli()
canonical_loop(%i, 1, %l, %cli1) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       canonical_loop(%j, 1, %m, %cli2) {
       }
     }
   }
}
%cli3 = omp.unroll(%cli2)
%cli4 = omp.unroll(%cli1)

@skatrak
Contributor

skatrak commented Oct 3, 2023

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams taking all (duplicating) iterations.

The example lowering created an omp.wsloop operation for an omp.distribute even though there was no parallel do. Hence the question. https://github.com/llvm/llvm-project/pull/67798/files#diff-0652f88238afa05fb262dcebab875780ab553b3914ba7239512c45986198240d

Thanks for bringing attention to this. I think that example should not produce an omp.wsloop in that situation. We would need to have the omp.distribute operation somehow represent that loop by itself, since the semantics for omp.wsloop are not the same (worksharing iterations across threads in a team vs worksharing iterations across teams in a league).

To me, at this time the options to achieve this are either to adopt omp.canonical_loop and to integrate it with both omp.distribute and omp.wsloop or to follow omp.wsloop design in omp.distribute so that it defines its own loop indices as block arguments after being given the bounds and step. I understand that the first option would be what we'd like to have long term, but the second would allow us to keep adding offloading features while the omp.canonical_loop design is finalized.

@jsjodin
Contributor Author

jsjodin commented Oct 3, 2023

The example lowering created an omp.wsloop operation for an omp.distribute even though there was no parallel do. Hence the question. https://github.com/llvm/llvm-project/pull/67798/files#diff-0652f88238afa05fb262dcebab875780ab553b3914ba7239512c45986198240d

Thanks for bringing attention to this. I think that example should not produce an omp.wsloop in that situation. We would need to have the omp.distribute operation somehow represent that loop by itself, since the semantics for omp.wsloop are not the same (worksharing iterations across threads in a team vs worksharing iterations across teams in a league).

To me, at this time the options to achieve this are either to adopt omp.canonical_loop and to integrate it with both omp.distribute and omp.wsloop or to follow omp.wsloop design in omp.distribute so that it defines its own loop indices as block arguments after being given the bounds and step. I understand that the first option would be what we'd like to have long term, but the second would allow us to keep adding offloading features while the omp.canonical_loop design is finalized.

I guess it depends on how we want to handle the collapse clause and how to handle loops that are both workshare and distribute.

@kiranchandramohan
Contributor

In the lowering in #67798, an omp.wsloop is created for !$omp distribute. Is this always correct to do as per the standard? Is it based on existing Clang lowering?

I think an omp.wsloop is created from parallel do. Afaik 'distribute' is associated with teams, so that each team takes a chunk of the iterations instead of all teams taking all (duplicating) iterations.

The example lowering created an omp.wsloop operation for an omp.distribute even though there was no parallel do. Hence the question. https://github.com/llvm/llvm-project/pull/67798/files#diff-0652f88238afa05fb262dcebab875780ab553b3914ba7239512c45986198240d

Not sure if that is correct or not.

Could you share your plan before we proceed? The proposal of this patch of a distribute operation that (possibly) nests a loop seems to be different from both the existing worksharing-loop design (that includes the loop) and the canonical-loop proposal under design where the distribute will accept a CLI value that represents a loop.

The plan is to have some representation of 'distribute' that does not rely on meta-values (CLI) for now, since there are still a lot of unanswered questions. The version of the omp.distribute op in this patch basically works as a wrapper op that modifies how the contained loop(s) execute. It does rely on there being a contained loop (or several, if we want to consider collapse).

Will the contained loop be an omp.wsloop always?

No, it could be some other kind of loop.

fir.do_loop, scf.for are converted to control-flow by the time they are in the OpenMP + LLVM dialect stage. So it has to be something encoded with OpenMP, like omp.canonical_loop.

For now we would only support omp.wsloop until we decide what to do with omp.canonical_loop and CLIs if we believe that this is an okay solution for distribute.

The CLIs get associated with a specific canonical loop, or they are created by a top-level loop transformation op.

In the other approach, the canonical loop was always created by an omp.canonical_loop declaration or by a loop transformation op. This way it was always easy to find the canonical loop given a CLI. Now we will have to reach the omp.cli operation and look at its use.

I don't think that there is much of a difference other than at the top level, since it would still be necessary to look at uses to find loops through the omp.yield ops in the original proposal. Also, understanding the structure of the code would require more analysis, but it could be encoded in the ops using this approach if we see the need for it.

Associating CLIs at the top-level will not always be possible, particularly if there are other operations, like a parallel operation.

I'm not sure what effect other operations would have. I understand that they might affect how the loop transformation is done, but not how they would affect the location of the transformation op.

For the following example,

!$omp unroll
do i=1,l
  !$omp parallel
  do j=1,m
    !$omp unroll
    do k=1,n
    end do
  end do
end do

Should generate something like the following.

%cli1 = omp.cli()
canonical_loop(%i, 1, %l, %cli1) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       %cli2 = omp.cli()
       canonical_loop(%j, 1, %m, %cli2) {
       }
       %cli4 = omp.unroll(%cli2)
     }
   }
}

%cli4 = omp.unroll(%cli1)

The code below should be equivalent (assuming the duplicated %cli4 was a typo). It does not really matter where the omp.unroll ops occur in the code, except that the uses of CLI values must be dominated by the definitions. A loop transformation happens at the location where the loop associated with a CLI is, not where the loop transformation op is. If the order of the transforms is important, it would have to be encoded using CLI dependencies.

%cli1 = omp.cli()
%cli2 = omp.cli()
canonical_loop(%i, 1, %l, %cli1) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       canonical_loop(%j, 1, %m, %cli2) {
       }
     }
   }
}
%cli3 = omp.unroll(%cli2)
%cli4 = omp.unroll(%cli1)

The issue here is that omp.parallel can potentially be outlined into a function and it is a legal transformation. But if this transformation is performed, then the unroll of the inner loop is lost and existing data dependence will not prevent that.

@jsjodin
Contributor Author

jsjodin commented Oct 3, 2023

The issue here is that omp.parallel can potentially be outlined into a function and it is a legal transformation. But if this transformation is performed, then the unroll of the inner loop is lost and existing data dependence will not prevent that.

If the outlining is done, then %cli2 would become a parameter to the outlined function, so the unrolling wouldn't be lost, but it would require an interprocedural view. How would this work with the original proposal? Would the CLIs be returned from the outlined function?

@jsjodin
Contributor Author

jsjodin commented Oct 4, 2023

The issue here is that omp.parallel can potentially be outlined into a function and it is a legal transformation. But if this transformation is performed, then the unroll of the inner loop is lost and existing data dependence will not prevent that.

If the outlining is done, then %cli2 would become a parameter to the outlined function, so the unrolling wouldn't be lost, but it would require an interprocedural view. How would this work with the original proposal? Would the CLIs be returned from the outlined function?

This also raises questions about possible loop transformations that involve multiple loops. How would e.g. loop interchange work in the presence of parallel, like the example above? If outlining happens, the interchange would be very difficult to achieve. It seems reasonable to do the transforms early and eliminate those ops before any other transformations take place. The transforms specified in the source code refer to the initial structure of the code and not the structure that may come about after doing a bunch of transforms like outlining.

@kiranchandramohan
Contributor

I was only pointing out that we cannot have the OpenMP loop transformation operations at the top level. They will have to be nested whenever we are not dealing with omp.canonical_loop, like the case of parallel here.

The previous proposal would have looked like the following.

%cli1 = canonical_loop(%i, 1, %l) {
   omp.parallel {
     fir.do_loop (%j,1,%m) {
       %cli2 = canonical_loop(%k, 1, %n) {
       }
       %cli3 = omp.unroll(%cli2)
     }
   }
}
%cli4 = omp.unroll(%cli1)

@jsjodin
Contributor Author

jsjodin commented Oct 26, 2023

Merged with main, added negative test, added MemoryEffects<[MemWrite]>.

@jsjodin jsjodin requested a review from shraiysh October 26, 2023 18:10
@kiranchandramohan
Contributor

Is the plan to lower as the following:

-> !$omp distribute

omp.distribute {
  omp.canonical_loop {
  }
}

-> !$omp distribute simd

omp.distribute {
  omp.simd {
  }
}

-> !$omp distribute parallel do

omp.distribute {
  omp.parallel {
    omp.wsloop {
    }
  }
}

-> !$omp distribute parallel do simd

omp.distribute {
  omp.parallel {
    omp.wsloop {
      omp.simd {
      }
    }
  }
}

@jsjodin
Contributor Author

jsjodin commented Oct 27, 2023

Yes, that is correct.

@kiranchandramohan
Contributor

Since distribute is always specified on a loop, it might be good to retain that relation in the IR. You could do the following. Here the omp.canonical_loop is either the loop to which the distribute construct is applied in the source or could be one that is generated after a loop transformation construct.

!$omp distribute

omp.distribute {
  omp.canonical_loop {
  }
}

!$omp distribute simd

omp.distribute {simd} {
  omp.canonical_loop {
  }
}

!$omp distribute parallel do

omp.distribute {parallel do} {
  omp.canonical_loop {
  }
}

!$omp distribute parallel do simd

omp.distribute {parallel do simd} {
  omp.canonical_loop {
  }
}

@jsjodin
Contributor Author

jsjodin commented Nov 6, 2023

Since distribute is always specified on a loop, it might be good to retain that relation in the IR. You could do the following. Here the omp.canonical_loop is either the loop to which the distribute construct is applied in the source or could be one that is generated after a loop transformation construct.

omp.distribute {parallel do simd} {
  omp.canonical_loop {
  }
}

I'm wondering if it will be difficult to represent the various clauses that could be added to the different directives if we use this representation? I am also wondering: if there are additional ops other than omp.canonical_loop, then it may still require some traversal of the code.

@kiranchandramohan
Contributor

Since distribute is always specified on a loop, it might be good to retain that relation in the IR. You could do the following. Here the omp.canonical_loop is either the loop to which the distribute construct is applied in the source or could be one that is generated after a loop transformation construct.

omp.distribute {parallel do simd} {
  omp.canonical_loop {
  }
}

I'm wondering if it will be difficult to represent the various clauses that could be added to the different directives if we use this representation?

For clauses, all supported clauses will have to be added and the verifier should be used to ensure that only the permitted ones are supported.

I am also wondering: if there are additional ops other than omp.canonical_loop, then it may still require some traversal of the code.

I did not understand this point.

For suggesting the above representation, I was concerned with three points:

  1. Composite constructs: !$omp do simd is a composite construct and not a combined construct. Hence we should either model that as a separate operation (omp.wsloop_simd) or have these as attributes (see the sketch after this list).

  2. While omp.wsloop and omp.simd work on loops, omp.parallel does not, and having that in between constructs would prevent omp.distribute from working directly on a loop or a loop-related construct.

  3. In source code, distribute is always specified on a loop.
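
As a rough sketch of the two ways point 1 could be modeled (illustrative syntax only; neither form exists in the dialect today):

// Option A: a dedicated composite operation
omp.wsloop_simd for (%i) : i32 (%lb) to (%ub) step (%step) { ... }

// Option B: the composite modeled as a marker/attribute on the existing op
omp.wsloop for simd (%i) : i32 (%lb) to (%ub) step (%step) { ... }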

@jsjodin
Contributor Author

jsjodin commented Nov 7, 2023

Since distribute is always specified on a loop, it might be good to retain that relation in the IR. You could do the following. Here the omp.canonical_loop is either the loop to which the distribute construct is applied in the source or could be one that is generated after a loop transformation construct.

omp.distribute {parallel do simd} {
  omp.canonical_loop {
  }
}

I'm wondering if it will be difficult to represent the various clauses that could be added to the different directives if we use this representation?

For clauses, all supported clauses will have to be added and the verifier should be used to ensure that only the permitted ones are supported.

Does that mean that clauses that are associated with omp.parallel would be duplicated in omp.distribute? I would be hesitant to duplicate things if it is possible to compose things instead.

I am also wondering: if there are additional ops other than omp.canonical_loop, then it may still require some traversal of the code.

I did not understand this point.

I misunderstood the restrictions of distribute, so you can ignore this.

For suggesting the above representation, I was concerned with three points:

  1. Composite constructs: !$omp do simd is a composite construct and not a combined construct. Hence we should either model that as a separate operation (omp.wsloop_simd) or have these as attributes.

I'm not sure it is possible to have omp.wsloop_simd if there is a loop transform that could change which loop the simd operates on, so it would have to be an attribute or wrapper op.

  1. While omp.wsloop and omp.simd work on loops, omp.parallel does not, and having that in between constructs would prevent omp.distribute from working directly on a loop or a loop-related construct.
  2. In source code, distribute is always specified on a loop.

Yes, I think these are valid points. Would it be enough for a verifier to check that only the allowed ops like parallel and simd (in addition to the omp.wsloop) are present inside of an omp.distribute, instead of encoding it directly in the IR as attributes?
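
For illustration, a rough sketch of what such a verifier check might look like; the op class names and the walk over the region are assumptions, not code from this patch:

// Hypothetical check: only allow a fixed set of ops directly inside the
// omp.distribute region (sketch only).
LogicalResult verifyAllowedNestedOps(DistributeOp op) {
  for (Operation &nested : op.getRegion().getOps())
    if (!isa<ParallelOp, WsLoopOp, SimdLoopOp, TerminatorOp>(nested))
      return op.emitOpError()
             << "unexpected operation in distribute region: "
             << nested.getName();
  return success();
}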

@jsjodin
Contributor Author

jsjodin commented Nov 7, 2023

I am also wondering: if there are additional ops other than omp.canonical_loop, then it may still require some traversal of the code.

I did not understand this point.

I misunderstood the restrictions of distribute, so you can ignore this.

I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.

omp.distribute {
   %cli = omp.cli
   canonical_loop ... %cli
   omp.tile(%cli, ... )
}

Also with the original proposal for omp.canonical_loop this would be possible.

@kiranchandramohan
Contributor

I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.

omp.distribute {
   %cli = omp.cli
   canonical_loop ... %cli
   omp.tile(%cli, ... )
}

Also with the original proposal for omp.canonical_loop this would be possible.

Yes, but it will still apply to the first canonical loop or generated loop.

The possible issues that I can think of, if we lose the association with the loop, are:
-> Handling of lastprivate. It would be difficult to do this in a delayed fashion (at OpenMPTranslation time).
-> Handling of collapse. How do you propose to handle this?

Also, is it guaranteed that the OpenMP runtime call generation for distribute does not need to know the bounds and loop control of the loop?

// Distribute construct [2.9.4.1]
//===----------------------------------------------------------------------===//
def DistributeOp : OpenMP_Op<"distribute", [AttrSizedOperandSegments,
MemoryEffects<[MemWrite]>]> {
Contributor

@shraiysh could you comment on the MemWrite effect here? I believe you previously worked on adding RecursiveMemoryEffects.

Contributor Author

I'm not sure if @shraiysh is working on OpenMP anymore. There is a bit of a nuance with respect to the memory effects. The memory effects apply to operations inside the distribute region, and to containing ops, but not to ops on the same level. If CSE or some other pass decides to hoist things outside the inner region, that would be illegal (so in some sense it does apply to operations moved to the same level). Similarly, it would be illegal to hoist the omp.distribute outside an omp.teams. It is safest to mark the op with MemoryEffects<[MemWrite]>.
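
As a small illustration of the hoisting concern (the loop syntax is only indicative):

omp.teams {
  omp.distribute {
    omp.wsloop for (%i) : index = (%lb) to (%ub) step (%step) {
      // Hoisting this store out of the distribute region (e.g. by CSE) would
      // change which teams/iterations execute it, so the op's declared memory
      // effects must be conservative enough to forbid that.
      memref.store %v, %buf[%i] : memref<128xi32>
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}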

Comment on lines 646 to 650
The distribute construct specifies that the iterations of one or more loop
will be executed by the initial teams in the context of their implicit
tasks. The iterations are distributed across the initial threads of all
initial teams that execute the teams region to which the distribute region
binds.
Contributor

@kiranchandramohan kiranchandramohan Nov 16, 2023

Please specify which loop this applies to; is this the closest enclosing omp.canonical_loop?

@jsjodin
Contributor Author

jsjodin commented Nov 17, 2023

I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.

omp.distribute {
   %cli = omp.cli
   canonical_loop ... %cli
   omp.tile(%cli, ... )
}

Also with the original proposal for omp.canonical_loop this would be possible.

Yes, but it will still apply to the first canonical loop or generated loop.

The possible issues that I can think of, if we lose the association with the loop, are: -> Handling of lastprivate. It would be difficult to do this in a delayed fashion (at OpenMPTranslation time). -> Handling of collapse. How do you propose to handle this?

The loops that distribute refers to don't exist until after the loop transformations have been performed, and if the loop transformations happen at OpenMPTranslation time and this is an issue for handling lastprivate, it will be a problem no matter which solution we pick, unless the loop transformations happen earlier in the compiler.

I think perhaps a single version of distribute is not enough to represent what we need. The distribute construct may have to be split up into a CLI-version op or wrapper op, and additionally a lower-level representation, which would be attributes on the loop ops. I believe it should always be possible at the loop level to determine whether some iteration should be executed on a specific thread or not, so adding an attribute (plus the clauses that were part of the distribute) should be enough information.

The problem I see is that it will not be possible (or very convoluted) to directly generate code with omp.distribute higher up in the chain of CLIs, since at the time of generating code for the loops (e.g. omp.wsloop) we need to know if distribute is present or not. As you mention, the association is broken at that point and unless that information is propagated down or scanned upward in the chain of CLIs, or the region hierarchy in the case of this PR, it is not possible to generate code without non-local information in either solution. I think the only alternative would be to do the loop transformations in MLIR and convert the omp.distribute op into attributes on the loop operations.

Also, is it guaranteed that the OpenMP runtime call generation for distribute does not need to know the bounds and loop control of the loop?

For the target device codegen, distribute is part of the generation for the loop, so it is just a different function call than the non-distribute version. There is no way to generate code for 'distribute' in isolation.

Would it make sense to implement a more minimal solution to handle simple distribute uses for now, by adding an attribute to omp.wsloop and having the frontend handle distribute?

@skatrak
Contributor

skatrak commented Nov 17, 2023

I like the idea of potentially collecting all information on how a loop may be split up across different levels of parallelism in the omp.wsloop MLIR operation, just because it makes it simpler to call the right runtime function. In my mind, that approach taken to what I think to be its full realization would probably at least impact the distribute, do and simd directives, though maybe also loop or others. Assuming the "worksharing type" attribute can be printed/parsed how I envision it, these various OpenMP constructs could be represented as follows:

// !$omp target teams distribute parallel do
omp.target {
  omp.teams {
    omp.parallel {
      omp.wsloop distribute for (%i) : i32 (%lb) to (%ub) step (%step) {
      }
    }
  }
}

// !$omp distribute
omp.wsloop distribute (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp distribute parallel do
omp.parallel {
  omp.wsloop distribute for (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp distribute parallel do simd
omp.parallel {
  omp.wsloop distribute for simd (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp distribute simd
omp.wsloop distribute simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp do
omp.wsloop for (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp do simd
omp.wsloop for simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp simd
omp.wsloop simd (%i) : i32 (%lb) to (%ub) step (%step) {
}

// !$omp parallel do
omp.parallel {
  omp.wsloop for (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

// !$omp parallel do simd
omp.parallel {
  omp.wsloop for simd (%i) : i32 (%lb) to (%ub) step (%step) {
  }
}

This ignores for the time being the introduction of omp.canonical_loop. I assume that whichever approach is taken there would need to somehow interact with or take over omp.wsloop, so in principle that fact wouldn't be changed by this addition. The main shortcoming of this approach, as I see it, is that certain clauses may be applicable to several of these directives, but we would be trying to store them all in a single MLIR operation, so we may need to disambiguate by adding multiple variants of certain clauses. For example:

// !$omp distribute private(a)
// !$omp parallel do private(b)
omp.parallel {
  omp.wsloop private(%1, %2) distribute for ... {}
  // or
  omp.wsloop distribute_private(%1) for_private(%2) distribute for ... {}
}
// Would something like this be valid?
// !$omp distribute collapse(1)
// !$omp parallel do collapse(2)
omp.parallel {
  omp.wsloop collapse(%whichever_takes_precedence) distribute for ... {}
  // or
  omp.wsloop distribute_collapse(%1) for_collapse(%2) distribute for ... {}
}

This is the list of clauses, according to the 5.2 spec document, accepted by the distribute, do and simd constructs. There are several clauses that apply to multiple or all of these directives, so it may be the case for some that semantically it's not the same attaching them to one of these constructs as doing it to all of them, or it may be allowed for them to have different values for each directive.

Clause         distribute  do   simd
allocate       yes         yes  -
aligned        -           -    yes
collapse       yes         yes  yes
dist_schedule  yes         -    -
firstprivate   yes         yes  -
if             -           -    yes
lastprivate    yes         yes  yes
linear         -           yes  yes
nontemporal    -           -    yes
nowait         -           yes  -
order          yes         yes  yes
ordered        -           yes  -
private        yes         yes  yes
reduction      -           yes  yes
schedule       -           yes  -
safelen        -           -    yes
simdlen        -           -    yes

@jsjodin
Contributor Author

jsjodin commented Nov 17, 2023

Thanks for detailing this, it seems like a good approach. We are still stuck on the loop transformations though, since they will prevent this solution if they remain until codegen. I'm basically a proponent of doing them quite early in the compiler in a specialized pass, which would in effect eliminate all CLI values in the IR by performing the transformations. It seems reasonable since we are not doing code generation and emitting LLVM IR, we're just transforming some omp ops into other omp ops.
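
Purely as an illustration of that direction, here is a skeleton such a pass could take; the pass name and the matched op names are assumptions, and the actual loop rewriting is omitted:

// Hypothetical early pass: apply the OpenMP loop-transformation meta-ops
// (e.g. omp.tile, omp.unroll) and erase them, so that no CLI values survive
// into later passes or translation.
#include "mlir/Pass/Pass.h"
#include "llvm/ADT/SmallVector.h"

namespace {
struct ApplyOmpLoopTransformsPass
    : public mlir::PassWrapper<ApplyOmpLoopTransformsPass,
                               mlir::OperationPass<>> {
  MLIR_DEFINE_EXPLICIT_INTERNAL_INLINE_TYPE_ID(ApplyOmpLoopTransformsPass)

  void runOnOperation() override {
    llvm::SmallVector<mlir::Operation *> metaOps;
    getOperation()->walk([&](mlir::Operation *op) {
      llvm::StringRef name = op->getName().getStringRef();
      if (name == "omp.tile" || name == "omp.unroll")
        metaOps.push_back(op);
    });
    for (mlir::Operation *op : metaOps) {
      // The canonical loop(s) referenced through the CLI operands would be
      // rewritten here; that part is omitted in this sketch.
      op->erase();
    }
  }
};
} // namespace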

@kiranchandramohan
Contributor

I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.

omp.distribute {
   %cli = omp.cli
   canonical_loop ... %cli
   omp.tile(%cli, ... )
}

Also with the original proposal for omp.canonical_loop this would be possible.

The omp.new_cli operation is not strictly necessary. They could just be block arguments. We could go ahead with the nesting approach (with block arguments) and @jsjodin's approach of providing CLIs as operands of canonical loops and loop transformation ops. An example is given below

omp.distribute loops(%tloop)
bb0 (%tloop : !omp.cli):   
  omp.tile loops(%outer, %inner), construct(%tloop:!omp.cli) {
  bb0 (%outer, %inner : !omp.cli, !omp.cli):
    omp.canonical_loop %iv1 : i32 in [0, %tripcount), construct(%outer : !omp.cli){
      omp.canonical_loop %iv2 : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
        store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
      }
    }
  }

@jsjodin
Contributor Author

jsjodin commented Nov 17, 2023

The omp.new_cli operation is strictly not necessary. They could just be block arguments. We could go ahead with the nesting approach (with block arguments) and @jsjodin 's approach of providing CLI as operands of canonical loops and loop transformation ops. An example is given below

omp.distribute loops(%tloop)
bb0 (%tloop : !omp.cli):   
  omp.tile loops(%outer, %inner), construct(%tloop:!omp.cli) {
  bb0 (%outer, %inner : !omp.cli, !omp.cli):
    omp.canonical_loop %iv1 : i32 in [0, %tripcount), construct(%outer : !omp.cli){
      omp.canonical_loop %iv2 : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
        store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
      }
    }
  }

I'm not quite sure I understand the example. Are the construct(%cli... ) the "outputs" and the loops(%cli, ..) the ones being operated on, or do I have it backwards? Does the modified example below make sense?

omp.distribute loops(%tloop1) { 
 bb0 (%tloop1 : !omp.cli, %tloop2 : !omp.cli): // Is it legal to have %tloop2 without it being in omp.distribute? We only care about one of the loops.   
  omp.tile loops(%loop), construct(%tloop1:!omp.cli, %tloop2:!omp.cli) { // Input 1 loop, output 2 loops
  bb0 (%loop : !omp.cli):
      omp.canonical_loop %iv : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv] : memref<?x?xf32>
        store %a, %arrB[%iv] : memref<?x?xf32>
    }
  }
}

@Meinersbur had objections to the initial nested proposal because there were no values to refer to specific loops after transformations were done, but that seems to be fixed here.

@kiranchandramohan
Contributor

I was originally thinking about other regular operations; however, considering that there might be transformation ops, there could be other ops than just an omp.canonical_loop inside an omp.distribute region.

omp.distribute {
   %cli = omp.cli
   canonical_loop ... %cli
   omp.tile(%cli, ... )
}

Also with the original proposal for omp.canonical_loop this would be possible.

The omp.new_cli operation is strictly not necessary. They could just be block arguments. We could go ahead with the nesting approach (with block arguments) and @jsjodin 's approach of providing CLI as operands of canonical loops and loop transformation ops. An example is given below

omp.distribute loops(%tloop)
bb0 (%tloop : !omp.cli):   
  omp.tile loops(%outer, %inner), construct(%tloop:!omp.cli) {
  bb0 (%outer, %inner : !omp.cli, !omp.cli):
    omp.canonical_loop %iv1 : i32 in [0, %tripcount), construct(%outer : !omp.cli){
      omp.canonical_loop %iv2 : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
        store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
      }
    }
  }

I think tile was the wrong construct; I meant collapse here.

omp.distribute loops(%tloop)
bb0 (%tloop : !omp.cli):   
  omp.collapse loops(%outer, %inner), construct(%tloop:!omp.cli) {
  bb0 (%outer, %inner : !omp.cli, !omp.cli):
    omp.canonical_loop %iv1 : i32 in [0, %tripcount), construct(%outer : !omp.cli){
      omp.canonical_loop %iv2 : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv1, %iv2] : memref<?x?xf32>
        store %a, %arrB[%iv1, %iv2] : memref<?x?xf32>
      }
    }
  }

loops refers to the loops on which the construct will operate; construct points back to the construct that will modify this loop or loop transformation construct.

@kiranchandramohan
Contributor

I'm not quite sure I understand the example. Are the construct(%cli... ) the "outputs" and the loops(%cli, ..) the ones being operated on, or do I have it backwards? Does the modified example below make sense?

omp.distribute loops(%tloop1) { 
 bb0 (%tloop1 : !omp.cli, %tloop2 : !omp.cli): // Is it legal to have %tloop2 without it being in omp.distribute? We only care about one of the loops.   
  omp.tile loops(%loop), construct(%tloop1:!omp.cli, %tloop2:!omp.cli) { // Input 1 loop, output 2 loops
  bb0 (%loop : !omp.cli):
      omp.canonical_loop %iv : i32 in [0, %tc), construct(%inner : !omp.cli) {
        %a = load %arrA[%iv] : memref<?x?xf32>
        store %a, %arrB[%iv] : memref<?x?xf32>
    }
  }
}
omp.distribute loops(%tloop1) { 
 bb0 (%tloop1 : !omp.cli):
  omp.tile loops(%loop), construct(%tloop1:!omp.cli) { // Input 1 loop, output is two but further transformation by distribute only affects outermost.
  bb0 (%loop : !omp.cli):
      omp.canonical_loop %iv : i32 in [0, %tc), construct(%loop : !omp.cli) {
        %a = load %arrA[%iv] : memref<?x?xf32>
        store %a, %arrB[%iv] : memref<?x?xf32>
    }
  }
}

@kiranchandramohan
Contributor

I guess you are pointing towards a situation where there are two generated loops and the construct needs to work on both. For an example like the following:

!$omp distribute collapse(2)
!$omp tile size(4)
do i=0, tripcount
  arrB(i) = arrA(i)
end do

I was suggesting something like the following:

omp.distribute loops(%dloop) {
bb0 (%dloop : !omp.cli):   
  omp.collapse loops(%outer, %inner), construct(%dloop:!omp.cli) {
  bb0 (%outer, %inner : !omp.cli, !omp.cli):
    omp.tile loops(%loop) size(4) construct(%outer:!omp.cli, %inner:!omp.cli) {
    bb0(%loop: !omp.cli):
      omp.canonical_loop %iv1 : i32 in [0, %tripcount), construct(%loop : !omp.cli){
        %a = load %arrA[%iv1] : memref<?xf32>
        store %a, %arrB[%iv1] : memref<?xf32>
      }
   }
 }
}

@jsjodin
Contributor Author

jsjodin commented Nov 17, 2023

I was looking for an example where only a subset of the created loops from construct(...) are used in the enclosing op in loops(...).

@jsjodin
Contributor Author

jsjodin commented Jan 11, 2024

I think we need to get this PR moving again. From what I can tell, in both the original and the alternative solutions for handling CLIs, omp.distribute will be a wrapper op. I think we can add the CLI handling in future PRs.

Contributor

@kiranchandramohan kiranchandramohan left a comment

LG.

Note that we will need the omp.canonical_loop to retain loop information at the LLVM dialect level. With @kparzysz's changes, it might be easier to add the canonical loops.

@jsjodin jsjodin merged commit 17db9ef into llvm:main Jan 24, 2024
4 checks passed
@jsjodin jsjodin deleted the jleyonberg/distop branch October 1, 2024 14:54