
[flang] Introduce custom loop nest generation for loops in workshare construct #101445

Open
wants to merge 2 commits into users/ivanradanov/flang-workshare-emit from flang-workshare-elemental-lowering

Conversation

Contributor

@ivanradanov ivanradanov commented Aug 1, 2024

3/4

1/4 #101443
2/4 #101444
3/4 #101445
4/4 #101446

This alternative loop nest generation is used in 4/4 for the workshare lowering.
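
As a rough sketch of the intended use (hypothetical caller code; loc, builder, and extents stand for values already in scope in HLFIR lowering code, and the flag and member names follow the header change in the diff below):

hlfir::LoopNest nest = hlfir::genLoopNest(loc, builder, extents,
                                          /*isUnordered=*/true,
                                          /*emitWorkshareLoop=*/true);
// body and outerOp are valid both for the fir.do_loop nest and for the
// omp.workshare.loop_wrapper + omp.loop_nest form, so callers stay
// agnostic of which kind of loop nest was generated.
builder.setInsertionPointToStart(nest.body);
// ... emit the per-element computation here ...
builder.setInsertionPointAfter(nest.outerOp);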

@ivanradanov
Contributor Author

I am debating introducing a new operation workshare_loop_container which exists only to "contain" an omp.loop_nest between the lowering of an elemental and the lowering of the omp.workshare it is contained in.

so we would have this state:

omp.workshare {
  omp.workshare_loop_container {
    omp.loop_nest {}
  }
}

instead of:

omp.workshare {
  omp.wsloop {
    omp.loop_nest {}
  }
}

The latter may have come from a different lowering/codegen, and we are not sure what the semantics of that code would be.

This new operation can later be reused for the workdistribute lowering as well.

@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from c2cbd77 to 4da93bb Compare August 1, 2024 02:58
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 8068d60 to 14878e8 Compare August 2, 2024 08:17
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch 2 times, most recently from decd0c5 to cff85ab Compare August 4, 2024 07:14
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from e3460a0 to 3b8dbba Compare August 4, 2024 08:04
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from cff85ab to 6571aed Compare August 4, 2024 08:04
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 3b8dbba to 38322dc Compare August 4, 2024 13:08
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 35fb583 to a14789a Compare August 4, 2024 13:08
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 38322dc to 9046df2 Compare August 6, 2024 04:52
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from a14789a to f1b9e86 Compare August 6, 2024 04:53
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 9046df2 to 10b7a39 Compare August 19, 2024 04:01
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from f1b9e86 to 21c5c42 Compare August 19, 2024 04:02
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 10b7a39 to d17c552 Compare August 19, 2024 06:21
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 21c5c42 to a9f4b77 Compare August 19, 2024 06:22
@ivanradanov ivanradanov changed the title [flang] Introduce ws loop nest generation for HLFIR lowering [flang] Introduce custom HLFIR lowering for loops in workshare construct Aug 19, 2024
@ivanradanov ivanradanov marked this pull request as ready for review August 19, 2024 08:25
@llvmbot llvmbot added flang, flang:fir-hlfir labels Aug 19, 2024
@ivanradanov ivanradanov changed the title [flang] Introduce custom HLFIR lowering for loops in workshare construct [flang] Introduce custom loop nest generation for loops in workshare construct Aug 19, 2024
@llvmbot
Collaborator

llvmbot commented Aug 19, 2024

@llvm/pr-subscribers-flang-fir-hlfir

Author: Ivan R. Ivanov (ivanradanov)

Changes

3/4

1/4 #101443
2/4 #101444
3/4 #101445
4/4 #101446

This alternative loop nest generation is used in 4/4 for the workshare lowering.


Full diff: https://github.com/llvm/llvm-project/pull/101445.diff

7 Files Affected:

  • (modified) flang/include/flang/Optimizer/Builder/HLFIRTools.h (+7-5)
  • (modified) flang/lib/Lower/ConvertCall.cpp (+1-1)
  • (modified) flang/lib/Lower/OpenMP/ReductionProcessor.cpp (+2-2)
  • (modified) flang/lib/Optimizer/Builder/HLFIRTools.cpp (+39-13)
  • (modified) flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp (+2-1)
  • (modified) flang/lib/Optimizer/HLFIR/Transforms/LowerHLFIROrderedAssignments.cpp (+15-15)
  • (modified) flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp (+3-3)
diff --git a/flang/include/flang/Optimizer/Builder/HLFIRTools.h b/flang/include/flang/Optimizer/Builder/HLFIRTools.h
index 6b41025eea0780..f073f494b3fb21 100644
--- a/flang/include/flang/Optimizer/Builder/HLFIRTools.h
+++ b/flang/include/flang/Optimizer/Builder/HLFIRTools.h
@@ -357,8 +357,8 @@ hlfir::ElementalOp genElementalOp(
 
 /// Structure to describe a loop nest.
 struct LoopNest {
-  fir::DoLoopOp outerLoop;
-  fir::DoLoopOp innerLoop;
+  mlir::Operation *outerOp = nullptr;
+  mlir::Block *body = nullptr;
   llvm::SmallVector<mlir::Value> oneBasedIndices;
 };
 
@@ -366,11 +366,13 @@ struct LoopNest {
 /// \p isUnordered specifies whether the loops in the loop nest
 /// are unordered.
 LoopNest genLoopNest(mlir::Location loc, fir::FirOpBuilder &builder,
-                     mlir::ValueRange extents, bool isUnordered = false);
+                     mlir::ValueRange extents, bool isUnordered = false,
+                     bool emitWorkshareLoop = false);
 inline LoopNest genLoopNest(mlir::Location loc, fir::FirOpBuilder &builder,
-                            mlir::Value shape, bool isUnordered = false) {
+                            mlir::Value shape, bool isUnordered = false,
+                            bool emitWorkshareLoop = false) {
   return genLoopNest(loc, builder, getIndexExtents(loc, builder, shape),
-                     isUnordered);
+                     isUnordered, emitWorkshareLoop);
 }
 
 /// Inline the body of an hlfir.elemental at the current insertion point
diff --git a/flang/lib/Lower/ConvertCall.cpp b/flang/lib/Lower/ConvertCall.cpp
index fd873f55dd844e..0689d6e033dd9c 100644
--- a/flang/lib/Lower/ConvertCall.cpp
+++ b/flang/lib/Lower/ConvertCall.cpp
@@ -2128,7 +2128,7 @@ class ElementalCallBuilder {
           hlfir::genLoopNest(loc, builder, shape, !mustBeOrdered);
       mlir::ValueRange oneBasedIndices = loopNest.oneBasedIndices;
       auto insPt = builder.saveInsertionPoint();
-      builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
+      builder.setInsertionPointToStart(loopNest.body);
       callContext.stmtCtx.pushScope();
       for (auto &preparedActual : loweredActuals)
         if (preparedActual)
diff --git a/flang/lib/Lower/OpenMP/ReductionProcessor.cpp b/flang/lib/Lower/OpenMP/ReductionProcessor.cpp
index c3c1f363033c27..72a90dd0d6f29d 100644
--- a/flang/lib/Lower/OpenMP/ReductionProcessor.cpp
+++ b/flang/lib/Lower/OpenMP/ReductionProcessor.cpp
@@ -375,7 +375,7 @@ static void genBoxCombiner(fir::FirOpBuilder &builder, mlir::Location loc,
   // know this won't miss any opportuinties for clever elemental inlining
   hlfir::LoopNest nest = hlfir::genLoopNest(
       loc, builder, shapeShift.getExtents(), /*isUnordered=*/true);
-  builder.setInsertionPointToStart(nest.innerLoop.getBody());
+  builder.setInsertionPointToStart(nest.body);
   mlir::Type refTy = fir::ReferenceType::get(seqTy.getEleTy());
   auto lhsEleAddr = builder.create<fir::ArrayCoorOp>(
       loc, refTy, lhs, shapeShift, /*slice=*/mlir::Value{},
@@ -389,7 +389,7 @@ static void genBoxCombiner(fir::FirOpBuilder &builder, mlir::Location loc,
       builder, loc, redId, refTy, lhsEle, rhsEle);
   builder.create<fir::StoreOp>(loc, scalarReduction, lhsEleAddr);
 
-  builder.setInsertionPointAfter(nest.outerLoop);
+  builder.setInsertionPointAfter(nest.outerOp);
   builder.create<mlir::omp::YieldOp>(loc, lhsAddr);
 }
 
diff --git a/flang/lib/Optimizer/Builder/HLFIRTools.cpp b/flang/lib/Optimizer/Builder/HLFIRTools.cpp
index 8d0ae2f195178c..333331378841ed 100644
--- a/flang/lib/Optimizer/Builder/HLFIRTools.cpp
+++ b/flang/lib/Optimizer/Builder/HLFIRTools.cpp
@@ -20,6 +20,7 @@
 #include "mlir/IR/IRMapping.h"
 #include "mlir/Support/LLVM.h"
 #include "llvm/ADT/TypeSwitch.h"
+#include <mlir/Dialect/OpenMP/OpenMPDialect.h>
 #include <optional>
 
 // Return explicit extents. If the base is a fir.box, this won't read it to
@@ -855,26 +856,51 @@ mlir::Value hlfir::inlineElementalOp(
 
 hlfir::LoopNest hlfir::genLoopNest(mlir::Location loc,
                                    fir::FirOpBuilder &builder,
-                                   mlir::ValueRange extents, bool isUnordered) {
+                                   mlir::ValueRange extents, bool isUnordered,
+                                   bool emitWorkshareLoop) {
+  emitWorkshareLoop = emitWorkshareLoop && isUnordered;
   hlfir::LoopNest loopNest;
   assert(!extents.empty() && "must have at least one extent");
-  auto insPt = builder.saveInsertionPoint();
+  mlir::OpBuilder::InsertionGuard guard(builder);
   loopNest.oneBasedIndices.assign(extents.size(), mlir::Value{});
   // Build loop nest from column to row.
   auto one = builder.create<mlir::arith::ConstantIndexOp>(loc, 1);
   mlir::Type indexType = builder.getIndexType();
-  unsigned dim = extents.size() - 1;
-  for (auto extent : llvm::reverse(extents)) {
-    auto ub = builder.createConvert(loc, indexType, extent);
-    loopNest.innerLoop =
-        builder.create<fir::DoLoopOp>(loc, one, ub, one, isUnordered);
-    builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
-    // Reverse the indices so they are in column-major order.
-    loopNest.oneBasedIndices[dim--] = loopNest.innerLoop.getInductionVar();
-    if (!loopNest.outerLoop)
-      loopNest.outerLoop = loopNest.innerLoop;
+  if (emitWorkshareLoop) {
+    auto wslw = builder.create<mlir::omp::WorkshareLoopWrapperOp>(loc);
+    loopNest.outerOp = wslw;
+    builder.createBlock(&wslw.getRegion());
+    mlir::omp::LoopNestOperands lnops;
+    lnops.loopInclusive = builder.getUnitAttr();
+    for (auto extent : llvm::reverse(extents)) {
+      lnops.loopLowerBounds.push_back(one);
+      lnops.loopUpperBounds.push_back(extent);
+      lnops.loopSteps.push_back(one);
+    }
+    auto lnOp = builder.create<mlir::omp::LoopNestOp>(loc, lnops);
+    builder.create<mlir::omp::TerminatorOp>(loc);
+    mlir::Block *block = builder.createBlock(&lnOp.getRegion());
+    for (auto extent : llvm::reverse(extents))
+      block->addArgument(extent.getType(), extent.getLoc());
+    loopNest.body = block;
+    builder.create<mlir::omp::YieldOp>(loc);
+    for (unsigned dim = 0; dim < extents.size(); dim++)
+      loopNest.oneBasedIndices[extents.size() - dim - 1] =
+          lnOp.getRegion().front().getArgument(dim);
+  } else {
+    unsigned dim = extents.size() - 1;
+    for (auto extent : llvm::reverse(extents)) {
+      auto ub = builder.createConvert(loc, indexType, extent);
+      auto doLoop =
+          builder.create<fir::DoLoopOp>(loc, one, ub, one, isUnordered);
+      loopNest.body = doLoop.getBody();
+      builder.setInsertionPointToStart(loopNest.body);
+      // Reverse the indices so they are in column-major order.
+      loopNest.oneBasedIndices[dim--] = doLoop.getInductionVar();
+      if (!loopNest.outerOp)
+        loopNest.outerOp = doLoop;
+    }
   }
-  builder.restoreInsertionPoint(insPt);
   return loopNest;
 }
 
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp b/flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp
index a70a6b388c4b1a..b608677c526310 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp
@@ -31,6 +31,7 @@
 #include "mlir/Pass/Pass.h"
 #include "mlir/Pass/PassManager.h"
 #include "mlir/Transforms/DialectConversion.h"
+#include "mlir/Dialect/OpenMP/OpenMPDialect.h"
 #include "llvm/ADT/TypeSwitch.h"
 
 namespace hlfir {
@@ -793,7 +794,7 @@ struct ElementalOpConversion
     hlfir::LoopNest loopNest =
         hlfir::genLoopNest(loc, builder, extents, !elemental.isOrdered());
     auto insPt = builder.saveInsertionPoint();
-    builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
+    builder.setInsertionPointToStart(loopNest.body);
     auto yield = hlfir::inlineElementalOp(loc, builder, elemental,
                                           loopNest.oneBasedIndices);
     hlfir::Entity elementValue(yield.getElementValue());
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/LowerHLFIROrderedAssignments.cpp b/flang/lib/Optimizer/HLFIR/Transforms/LowerHLFIROrderedAssignments.cpp
index 85dd517cb57914..645abf65d10a32 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/LowerHLFIROrderedAssignments.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/LowerHLFIROrderedAssignments.cpp
@@ -464,7 +464,7 @@ void OrderedAssignmentRewriter::pre(hlfir::RegionAssignOp regionAssignOp) {
       // if the LHS is not).
       mlir::Value shape = hlfir::genShape(loc, builder, lhsEntity);
       elementalLoopNest = hlfir::genLoopNest(loc, builder, shape);
-      builder.setInsertionPointToStart(elementalLoopNest->innerLoop.getBody());
+      builder.setInsertionPointToStart(elementalLoopNest->body);
       lhsEntity = hlfir::getElementAt(loc, builder, lhsEntity,
                                       elementalLoopNest->oneBasedIndices);
       rhsEntity = hlfir::getElementAt(loc, builder, rhsEntity,
@@ -484,7 +484,7 @@ void OrderedAssignmentRewriter::pre(hlfir::RegionAssignOp regionAssignOp) {
     for (auto &cleanupConversion : argConversionCleanups)
       cleanupConversion();
     if (elementalLoopNest)
-      builder.setInsertionPointAfter(elementalLoopNest->outerLoop);
+      builder.setInsertionPointAfter(elementalLoopNest->outerOp);
   } else {
     // TODO: preserve allocatable assignment aspects for forall once
     // they are conveyed in hlfir.region_assign.
@@ -493,7 +493,7 @@ void OrderedAssignmentRewriter::pre(hlfir::RegionAssignOp regionAssignOp) {
   generateCleanupIfAny(loweredLhs.elementalCleanup);
   if (loweredLhs.vectorSubscriptLoopNest)
     builder.setInsertionPointAfter(
-        loweredLhs.vectorSubscriptLoopNest->outerLoop);
+        loweredLhs.vectorSubscriptLoopNest->outerOp);
   generateCleanupIfAny(oldRhsYield);
   generateCleanupIfAny(loweredLhs.nonElementalCleanup);
 }
@@ -518,8 +518,8 @@ void OrderedAssignmentRewriter::pre(hlfir::WhereOp whereOp) {
       hlfir::Entity savedMask{maybeSaved->first};
       mlir::Value shape = hlfir::genShape(loc, builder, savedMask);
       whereLoopNest = hlfir::genLoopNest(loc, builder, shape);
-      constructStack.push_back(whereLoopNest->outerLoop.getOperation());
-      builder.setInsertionPointToStart(whereLoopNest->innerLoop.getBody());
+      constructStack.push_back(whereLoopNest->outerOp);
+      builder.setInsertionPointToStart(whereLoopNest->body);
       mlir::Value cdt = hlfir::getElementAt(loc, builder, savedMask,
                                             whereLoopNest->oneBasedIndices);
       generateMaskIfOp(cdt);
@@ -527,7 +527,7 @@ void OrderedAssignmentRewriter::pre(hlfir::WhereOp whereOp) {
         // If this is the same run as the one that saved the value, the clean-up
         // was left-over to be done now.
         auto insertionPoint = builder.saveInsertionPoint();
-        builder.setInsertionPointAfter(whereLoopNest->outerLoop);
+        builder.setInsertionPointAfter(whereLoopNest->outerOp);
         generateCleanupIfAny(maybeSaved->second);
         builder.restoreInsertionPoint(insertionPoint);
       }
@@ -539,8 +539,8 @@ void OrderedAssignmentRewriter::pre(hlfir::WhereOp whereOp) {
     mask.generateNoneElementalPart(builder, mapper);
     mlir::Value shape = mask.generateShape(builder, mapper);
     whereLoopNest = hlfir::genLoopNest(loc, builder, shape);
-    constructStack.push_back(whereLoopNest->outerLoop.getOperation());
-    builder.setInsertionPointToStart(whereLoopNest->innerLoop.getBody());
+    constructStack.push_back(whereLoopNest->outerOp);
+    builder.setInsertionPointToStart(whereLoopNest->body);
     mlir::Value cdt = generateMaskedEntity(mask);
     generateMaskIfOp(cdt);
     return;
@@ -754,7 +754,7 @@ OrderedAssignmentRewriter::generateYieldedLHS(
       loweredLhs.vectorSubscriptLoopNest = hlfir::genLoopNest(
           loc, builder, loweredLhs.vectorSubscriptShape.value());
       builder.setInsertionPointToStart(
-          loweredLhs.vectorSubscriptLoopNest->innerLoop.getBody());
+          loweredLhs.vectorSubscriptLoopNest->body);
     }
     loweredLhs.lhs = temp->second.fetch(loc, builder);
     return loweredLhs;
@@ -772,7 +772,7 @@ OrderedAssignmentRewriter::generateYieldedLHS(
         hlfir::genLoopNest(loc, builder, *loweredLhs.vectorSubscriptShape,
                            !elementalAddrLhs.isOrdered());
     builder.setInsertionPointToStart(
-        loweredLhs.vectorSubscriptLoopNest->innerLoop.getBody());
+        loweredLhs.vectorSubscriptLoopNest->body);
     mapper.map(elementalAddrLhs.getIndices(),
                loweredLhs.vectorSubscriptLoopNest->oneBasedIndices);
     for (auto &op : elementalAddrLhs.getBody().front().without_terminator())
@@ -798,11 +798,11 @@ OrderedAssignmentRewriter::generateMaskedEntity(MaskedArrayExpr &maskedExpr) {
   if (!maskedExpr.noneElementalPartWasGenerated) {
     // Generate none elemental part before the where loops (but inside the
     // current forall loops if any).
-    builder.setInsertionPoint(whereLoopNest->outerLoop);
+    builder.setInsertionPoint(whereLoopNest->outerOp);
     maskedExpr.generateNoneElementalPart(builder, mapper);
   }
   // Generate the none elemental part cleanup after the where loops.
-  builder.setInsertionPointAfter(whereLoopNest->outerLoop);
+  builder.setInsertionPointAfter(whereLoopNest->outerOp);
   maskedExpr.generateNoneElementalCleanupIfAny(builder, mapper);
   // Generate the value of the current element for the masked expression
   // at the current insertion point (inside the where loops, and any fir.if
@@ -1242,7 +1242,7 @@ void OrderedAssignmentRewriter::saveLeftHandSide(
   LhsValueAndCleanUp loweredLhs = generateYieldedLHS(loc, region);
   fir::factory::TemporaryStorage *temp = nullptr;
   if (loweredLhs.vectorSubscriptLoopNest)
-    constructStack.push_back(loweredLhs.vectorSubscriptLoopNest->outerLoop);
+    constructStack.push_back(loweredLhs.vectorSubscriptLoopNest->outerOp);
   if (loweredLhs.vectorSubscriptLoopNest && !rhsIsArray(regionAssignOp)) {
     // Vector subscripted entity for which the shape must also be saved on top
     // of the element addresses (e.g. the shape may change in each forall
@@ -1265,7 +1265,7 @@ void OrderedAssignmentRewriter::saveLeftHandSide(
     // subscripted LHS.
     auto &vectorTmp = temp->cast<fir::factory::AnyVectorSubscriptStack>();
     auto insertionPoint = builder.saveInsertionPoint();
-    builder.setInsertionPoint(loweredLhs.vectorSubscriptLoopNest->outerLoop);
+    builder.setInsertionPoint(loweredLhs.vectorSubscriptLoopNest->outerOp);
     vectorTmp.pushShape(loc, builder, shape);
     builder.restoreInsertionPoint(insertionPoint);
   } else {
@@ -1291,7 +1291,7 @@ void OrderedAssignmentRewriter::saveLeftHandSide(
   if (loweredLhs.vectorSubscriptLoopNest) {
     constructStack.pop_back();
     builder.setInsertionPointAfter(
-        loweredLhs.vectorSubscriptLoopNest->outerLoop);
+        loweredLhs.vectorSubscriptLoopNest->outerOp);
   }
 }
 
diff --git a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
index c5b809514c54c6..c4aed6b79df923 100644
--- a/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
+++ b/flang/lib/Optimizer/HLFIR/Transforms/OptimizedBufferization.cpp
@@ -483,7 +483,7 @@ llvm::LogicalResult ElementalAssignBufferization::matchAndRewrite(
   // hlfir.elemental region inside the inner loop
   hlfir::LoopNest loopNest =
       hlfir::genLoopNest(loc, builder, extents, !elemental.isOrdered());
-  builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
+  builder.setInsertionPointToStart(loopNest.body);
   auto yield = hlfir::inlineElementalOp(loc, builder, elemental,
                                         loopNest.oneBasedIndices);
   hlfir::Entity elementValue{yield.getElementValue()};
@@ -554,7 +554,7 @@ llvm::LogicalResult BroadcastAssignBufferization::matchAndRewrite(
       hlfir::getIndexExtents(loc, builder, shape);
   hlfir::LoopNest loopNest =
       hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true);
-  builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
+  builder.setInsertionPointToStart(loopNest.body);
   auto arrayElement =
       hlfir::getElementAt(loc, builder, lhs, loopNest.oneBasedIndices);
   builder.create<hlfir::AssignOp>(loc, rhs, arrayElement);
@@ -649,7 +649,7 @@ llvm::LogicalResult VariableAssignBufferization::matchAndRewrite(
       hlfir::getIndexExtents(loc, builder, shape);
   hlfir::LoopNest loopNest =
       hlfir::genLoopNest(loc, builder, extents, /*isUnordered=*/true);
-  builder.setInsertionPointToStart(loopNest.innerLoop.getBody());
+  builder.setInsertionPointToStart(loopNest.body);
   auto rhsArrayElement =
       hlfir::getElementAt(loc, builder, rhs, loopNest.oneBasedIndices);
   rhsArrayElement = hlfir::loadTrivialScalar(loc, builder, rhsArrayElement);

@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from d17c552 to 621b017 Compare August 19, 2024 08:40
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from b37194f to 082b892 Compare August 19, 2024 08:40
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 621b017 to 5e01e41 Compare August 21, 2024 14:06
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 082b892 to 7cce1db Compare August 21, 2024 14:07

github-actions bot commented Aug 21, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 5e01e41 to 70daa01 Compare August 22, 2024 03:48
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 7cce1db to 51efcce Compare August 22, 2024 03:48
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 70daa01 to fb06794 Compare August 22, 2024 08:59
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 51efcce to c3ec4f1 Compare August 22, 2024 08:59
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from fb06794 to de32599 Compare August 22, 2024 09:08
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from c3ec4f1 to d5fbe9c Compare August 22, 2024 09:09
@skatrak
Contributor

skatrak commented Aug 22, 2024

I am debating introducing a new operation workshare_loop_container which exists only to "contain" an omp.loop_nest between the lowering of an elemental and the lowering of the omp.workshare it is contained in.

so we would have this state:

omp.workshare {
  omp.workshare_loop_container {
    omp.loop_nest {}
  }
}

instead of:

omp.workshare {
  omp.wsloop {
    omp.loop_nest {}
  }
}

The latter may have come from a different lowering/codegen, and we are not sure what the semantics of that code would be.

This new operation can later be reused for the workdistribute lowering as well.

I've been thinking about it and I believe we could actually use omp.wsloop rather than creating a new loop wrapper just for this. The reason is that, according to the spec, "the only OpenMP constructs that may be closely nested inside a workshare construct are the atomic, critical and parallel constructs" (5.2 spec, page 246). So there's no other way we could end up with an omp.wsloop nested directly inside of omp.workshare other than to represent these elementals.

The only difference is that perhaps we could enforce tighter restrictions in the omp.wsloop MLIR verifier when it's nested inside omp.workshare (e.g. that it's not composite, that it doesn't have any clauses specified, etc.).

@ivanradanov
Contributor Author

@skatrak
I considered that as well; however, I did not like it because the semantics of the block that the operation is contained in are different. wsloop expects its parent block to be a parallel block which all threads will execute, and all of those threads will share the work of the nested loop nest. Whereas the workshare.loop_nest op is semantically executed by a single thread (because the workshare directive acts like it preserves the semantics of single-threaded Fortran execution).

I don't think it is currently a problem; however, if in the future someone adds an optimization or transformation that assumes it is nested in a parallel, it may break wsloops which happen to be nested in a workshare instead. (As you said, it is possible to check whether it is supposed to be a workshare or parallel wsloop, but an op changing its semantics or transformations depending on what it is nested in feels more error prone.)

So that is why I opted for this approach.

@skatrak
Contributor

skatrak commented Aug 22, 2024

wsloop expects its parent block to be a parallel block which all threads will execute and all of those threads will share the work of the nested loop nest.

Yes, the omp.wsloop binds to the current team (usually the innermost omp.parallel parent). It doesn't have to be the direct parent, though, there can be other constructs in between.

Whereas the workshare.loop_nest op is semantically executed by a single thread (because the workshare directive acts like it preserves the semantics of single-threaded Fortran execution).

My understanding is that omp.workshare would be the operation defining a region with sequential execution (as if it were a single thread of the enclosing parallel region), but then there can be worksharing loops inside where all threads split loop iterations, which is why I proposed using omp.wsloop. Thinking about this, maybe this could be implemented based on existing OpenMP operations:

subroutine workshare(A, B, C)
  integer, parameter :: N = 10
  integer, intent(in) :: A(N), B(N)
  integer, intent(out) :: C(N)
  integer :: tmp

  !$omp parallel workshare
  C = A + B
  tmp = N
  C = C + A
  !$omp end parallel workshare
end subroutine workshare
func.func @workshare(%A : ..., %B : ..., %C : ...) {
  %N = arith.constant 10 : i32
  %tmp = fir.alloca i32
  omp.parallel {
    omp.wsloop {
      omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
        // C(%i) = A(%i) + B(%i)
        omp.yield
      }
      omp.terminator
    }
    omp.single {
      fir.store %N to %tmp
      omp.terminator
    }
    omp.wsloop {
      omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
        // C(%i) = C(%i) + A(%i)
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
}

Maybe support for this operation could be just based on changes to how the MLIR representation is built in the first place, what do you think? Otherwise, something more similar to your proposal for workdistribute, introducing only the omp.workshare operation, keeping fir.do_loop inside and having some sort of pass to translate this to omp.wsloop and omp.single (or splitting the parent omp.parallel) would be possible too. It just feels a bit too complex in comparison.

@ivanradanov
Contributor Author

ivanradanov commented Aug 23, 2024

Maybe support for this operation could be just based on changes to how the MLIR representation is built in the first place, what do you think?

This is partly what this implementation aims to do. In fact, after the pass that lowers the omp.workshare operation we are left with IR very close to the one you showed in your example. (please take a look at some of the tests in #101446)

The approach taken here is similar to the omp.workdistribute implementation, in that the purpose of the omp.workshare and omp.workshare.loop_wrapper ops is to preserve the high-level optimizations available when using HLFIR; after we are done with the LowerWorkshare pass, both omp.workshare and omp.workshare.loop_wrapper disappear.

The sole purpose of the omp.workshare.loop_wrapper op is to be able to more explicitly mark loops that need to be "parallelized" by the workshare construct and preserve that information through the pipeline. Its lifetime is from the frontend (Fortran->{HLFIR,FIR}) up to the LowerWorkshare pass, which runs after we are done with HLFIR optimizations (after HLFIR->FIR lowering); the same goes for omp.workshare.

The problem with trying to convert fir.do_loops to omp.wsloop is that it is harder to keep track of where they came from: did they come from an array intrinsic which needs to be parallelized, or was it just a do loop which the programmer wrote in the workshare, which must not be parallelized?
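
For instance, after lowering, both kinds of loops could appear side by side inside the same omp.workshare, and only a marker op tells them apart. A hypothetical sketch, in the compact notation used earlier in this thread (wrapper spelling as used later in this discussion; bodies elided):

omp.workshare {
  // Came from a do loop the programmer wrote: must not be parallelized.
  fir.do_loop %i = %lb to %ub step %step {
  }
  // Came from an array intrinsic/assignment: to be divided into units of
  // work by the enclosing omp.workshare.
  omp.workshare.loop_wrapper {
    omp.loop_nest {}
  }
}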

@skatrak
Contributor

skatrak commented Aug 23, 2024

Maybe support for this operation could be just based on changes to how the MLIR representation is built in the first place, what do you think?

This is partly what this implementation aims to do. In fact, after the pass that lowers the omp.workshare operation we are left with IR very close to the one you showed in your example. (please take a look at some of the tests in #101446)

The approach taken here is similar to the omp.workdistribute implementation, in that the purpose of the omp.workshare and omp.workshare.loop_wrapper ops is to preserve the high-level optimizations available when using HLFIR; after we are done with the LowerWorkshare pass, both omp.workshare and omp.workshare.loop_wrapper disappear.

I see that this approach significantly reduces the amount of OpenMP-specific handling that needs to be done when creating Fortran loops, so I don't have an issue with adding the omp.workshare and omp.workdistribute ops and creating a late-running transformation pass, rather than directly generating the "lower-level" set of OpenMP operations that represent the semantics of these constructs.

The sole purpose of the omp.workshare.loop_wrapper op is to be able to more explicitly mark loops that need to be "parallelized" by the workshare construct and preserve that information through the pipeline. Its lifetime is from the frontend (Fortran->{HLFIR,FIR}) up to the LowerWorkshare pass, which runs after we are done with HLFIR optimizations (after HLFIR->FIR lowering); the same goes for omp.workshare.

The problem with trying to convert fir.do_loops to omp.wsloop is that it is harder to keep track of where they came from: did they come from an array intrinsic which needs to be parallelized, or was it just a do loop which the programmer wrote in the workshare, which must not be parallelized?

I guess what I still don't understand is the need for omp.work{share, distribute}.loop_wrapper operations. For telling apart a sequential loop from a parallel loop inside of a workshare or workdistribute construct, we already have fir.do_loop and omp.wsloop + omp.loop_nest. If I'm not wrong, this PR prepares the genLoopNest function to later specify when to produce a parallel or a sequential loop inside of a workshare construct so that you can create omp.workshare.loop_wrapper or fir.do_loop. What I'm saying is that we can just use omp.wsloop in place of the first because it already has the meaning of "share iterations of the following loop across threads in the current team of threads". And that team of threads is defined by the parent omp.parallel, rather than omp.workshare. I can't think of a semantic difference between encountering omp.wsloop directly nested inside of omp.parallel vs nested inside of an omp.workshare nested inside of omp.parallel. What changes is everything else, which in the second case is functionally equivalent to being inside of an omp.single. Does that make sense or am I still missing something?

@ivanradanov
Contributor Author

ivanradanov commented Aug 23, 2024

No, you are right; sorry for the back and forth. As you said, since a wsloop can only be nested in an omp.parallel, it is immediately obvious that it binds to the omp.parallel threads, so that makes sense.

My only concern was that at some point some transformation (perhaps in the future, because I don't think anything transforms wsloops currently) could make the assumption that either all or none of the threads of the team an omp.parallel launches will execute the parent block of a wsloop that binds to that team. (For example, an optimization/transformation could add an operation immediately before the wsloop which is supposed to be executed by all threads (or none) in the omp.parallel; that operation would then be erroneously wrapped in an omp.single in LowerWorkshare.)

I thought this was a fair assumption for an optimization/transformation to make, because if, for example, only one of the threads executes a wsloop, it would not produce the intended result. So the intention was to guard against a potential error like that. Let me know if I am wrong here, since I am sure people here have more experience than me on this.

I can see that if no transformation can make that assumption, then it is perfectly safe to use omp.wsloop instead of workshare.loop_wrapper. I am fine with both ways and can make that change if you think it is better. (In fact, that is what the initial version of this PR did; I decided to introduce the workshare.loop_wrapper later because I was concerned about a potential issue like the above.)

Thank you for the in depth review.

@ivanradanov ivanradanov requested a review from tblah August 23, 2024 14:09
@skatrak
Contributor

skatrak commented Aug 26, 2024

My only concern was that at some point some transformation (perhaps in the future, because I don't think anything transforms wsloops currently) could make the assumption that either all or none of the threads of the team an omp.parallel launches will execute the parent block of a wsloop that binds to that team. (For example, an optimization/transformation could add an operation immediately before the wsloop which is supposed to be executed by all threads (or none) in the omp.parallel; that operation would then be erroneously wrapped in an omp.single in LowerWorkshare.)

I thought this was a fair assumption for an optimization/transformation to make, because if, for example, only one of the threads executes a wsloop, it would not produce the intended result. So the intention was to guard against a potential error like that. Let me know if I am wrong here, since I am sure people here have more experience than me on this.

Thank you for bringing attention to this potential issue I hadn't considered; I think it's a productive discussion. I'm not aware of any transformations like that existing at the moment either, but it looks like they would certainly break if they ran while the omp.workshare operation still remained. However, they would work if they ran after the pass lowering omp.workshare to a set of omp.single regions for the code in between omp.wsloops. That way we would not have to introduce a new loop wrapper, and we could also create passes assuming the parent region of an omp.wsloop is executed by all threads in the team. I don't think that should be an issue, since in principle it makes sense to me that the omp.workshare transformation would run immediately after PFT to MLIR lowering. What do you think about that alternative?

@ivanradanov
Contributor Author

... However, they would work if they ran after the pass lowering omp.workshare to a set of omp.single regions for the code in between omp.wsloops. That way we would not have to introduce a new loop wrapper, and we could also create passes assuming the parent region of an omp.wsloop is executed by all threads in the team. I don't think that should be an issue, since in principle it makes sense to me that the omp.workshare transformation would run immediately after PFT to MLIR lowering. What do you think about that alternative?

Ideally, the omp.workshare lowering will run after the HLFIR to FIR lowering, because missing the high-level optimizations that HLFIR provides can result in very bad performance (unneeded temporary arrays, unnecessary copies, non-fused array computation, etc.). The workshare lowering transforms the omp.workshare.loop_wrappers into omp.wsloops, so they are gone after that.

Another factor is that there may not be PFT->loop lowerings for many constructs that need to be divided into units of work, so we may need to first generate HLFIR and alter the lowerings from HLFIR to FIR to get the omp.wsloop (or omp.workshare.loop_wrapper). This means that there will be portions of the pipeline (from PFT->HLFIR until HLFIR->FIR) where an omp.wsloop nested in an omp.workshare would be the wrong representation.

Are there any concerns with adding omp.workshare.loop_wrapper? I do not see that big of an overhead (maintenance or compile time) resulting from its addition, while it makes things clearer and more robust in my opinion.
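
As a hedged sketch (structured after the earlier example in this thread; bounds and bodies elided), the LowerWorkshare pass described above would rewrite something like

omp.parallel {
  omp.workshare {
    // scalar assignment (single-threaded semantics)
    omp.workshare.loop_wrapper {
      omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
        // array element computation
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }
  omp.terminator
}

into

omp.parallel {
  omp.single {
    // scalar assignment
    omp.terminator
  }
  omp.wsloop {
    omp.loop_nest (%i) : i32 = (%...) to (%...) inclusive step (%...) {
      // array element computation
      omp.yield
    }
    omp.terminator
  }
  omp.terminator
}

so that neither omp.workshare nor omp.workshare.loop_wrapper survives past that point in the pipeline.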

@skatrak
Contributor

skatrak commented Aug 29, 2024

Ideally, the omp.workshare lowering will run after the HLFIR to FIR lowering, because missing the high-level optimizations that HLFIR provides can result in very bad performance (unneeded temporary arrays, unnecessary copies, non-fused array computation, etc.). The workshare lowering transforms the omp.workshare.loop_wrappers into omp.wsloops, so they are gone after that.

I see, though this doesn't necessarily mean that approach wouldn't work. It means we'd only have to delay the point at which transformations for omp.wsloop that assume the parent region is executed by all threads would be allowed. Either way they would have to be done after omp.workshare lowering, whenever that happened.

Another factor is that there may not be PFT->loop lowerings for many constructs that need to be divided into units of work, so we may need to first generate HLFIR and alter the lowerings from HLFIR to FIR to get the omp.wsloop (or omp.workshare.loop_wrapper). This means that there will be portions of the pipeline (from PFT->HLFIR until HLFIR->FIR) where an omp.wsloop nested in an omp.workshare would be the wrong representation.

Can you share a case where this would happen? I agree that we wouldn't want to produce some IR that doesn't keep consistent semantics for a given operation across the pipeline. In that case, adding another operation might indeed be the right solution.

Are there any concerns with adding omp.workshare.loop_wrapper? I do not see that big of an overhead (maintenance or compile time) resulting from its addition, while it makes things clearer and more robust in my opinion.

My main concern is from the dialect design perspective. It would be confusing to have two separate "worksharing loop" operations with one being used on its own and the other one in conjunction with the omp.workshare operation, but both basically representing the same thing (splitting loop iterations across threads in the team). That's why I'm trying to explore other options that may result in a better representation before committing to it.

@ivanradanov
Contributor Author

Can you share a case where this would happen? I agree that we wouldn't want to produce some IR that doesn't keep consistent semantics for a given operation across the pipeline. In that case, adding another operation might indeed be the right solution.

I was under the impression that direct PFT to FIR lowering is deprecated, so things like array notation (e.g. z = x + y, where x, y, z are arrays) always go through hlfir.elemental and then to fir loops. I am not sure if the PFT->FIR lowering for that exists, but if PFT->FIR is deprecated then we should probably use the HLFIR lowering for this.

My main concern is from the dialect design perspective. It would be confusing to have two separate "worksharing loop" operations with one being used on its own and the other one in conjunction with the omp.workshare operation, but both basically representing the same thing (splitting loop iterations across threads in the team). That's why I'm trying to explore other options that may result in a better representation before committing to it.

I think the operations describe very different things. The similarity in naming is an unfortunate consequence of the workshare construct having the same name as a workshare loop (I am open to more descriptive name suggestions). How I read it is: omp.wsloop is "each thread from the team that encounters it executes its share of the loop nest", whereas omp.workshare.loop_wrapper is "this loop nest is marked for dividing into units of work by the encompassing omp.workshare" (as per the standard). Semantically, it is just a loop nest that is executed by a single thread, and only when the workshare lowering transforms it into an omp.wsloop does it turn into a worksharing loop.

@skatrak
Contributor

skatrak commented Sep 9, 2024

Sorry for the delay getting back to this! I don't have any remaining blocking concerns with your proposal, but I think it would be good if others shared their opinions before proceeding with this approach.

Contributor

@tblah tblah left a comment


LGTM

Contributor

@jeanPerier jeanPerier left a comment


Looks reasonable to me.

@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from de32599 to 2ff1ac1 Compare October 19, 2024 16:37
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from d5fbe9c to 8afc111 Compare October 19, 2024 16:37
Emit loop nests in a custom wrapper

Only emit unordered loops as omp loops

Fix uninitialized memory bug in genLoopNest
@ivanradanov ivanradanov force-pushed the users/ivanradanov/flang-workshare-emit branch from 2ff1ac1 to e23cf32 Compare October 19, 2024 17:22
@ivanradanov ivanradanov force-pushed the flang-workshare-elemental-lowering branch from 8afc111 to d8cfd38 Compare October 19, 2024 17:22