[flang] Lower omp.workshare to other omp constructs #101446

Open

wants to merge 14 commits into base: users/ivanradanov/flang-workshare-elemental-lowering
Conversation

@ivanradanov (Contributor) commented Aug 1, 2024

4/4

There are two points which need some discussion in this PR:

  1. We need to make a value computed in an omp.single accessible to all threads of the omp.parallel region. This is achieved by allocating temporary memory outside the omp.parallel, storing the value into it inside the omp.single, and then reloading it from all threads (see the sketch after this list). However, from reading the standard I don't think we are guaranteed that the workshare is lexically nested in the omp.parallel: there could be an omp.parallel { func.call @contains_workshare }, in which case we would not be able to access the omp.parallel. So I think adding support in the runtime for yielding a value from an omp.single could be the fix for this.

  2. For the temporary allocations above, not all types are supported by fir.alloca, so I need to use llvm.alloca and an unrealized_conversion_cast to be able to allocate a temporary for a fir.ref type. This too could be fixed by introducing yielding from omp.single.
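A minimal MLIR sketch of the temporary-memory pattern from point 1, assuming a lexically nested region and an illustrative i32 value (the ops are real FIR/OpenMP dialect ops, but the surrounding code is invented for illustration):

```mlir
// Allocate the temporary outside the parallel region so it is shared.
%tmp = fir.alloca i32
omp.parallel {
  omp.single {
    // One thread computes the value and stores it to the shared temporary.
    %v = arith.constant 42 : i32
    fir.store %v to %tmp : !fir.ref<i32>
    omp.terminator
  }
  // After the implicit barrier of omp.single, every thread reloads the value.
  %v2 = fir.load %tmp : !fir.ref<i32>
  omp.terminator
}
```

If the omp.workshare is only reached through a call inside the omp.parallel (point 1's @contains_workshare case), there is no enclosing omp.parallel in the same function to hoist the alloca to, which is why yielding a value directly from omp.single is attractive.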

1/4 #101443
2/4 #101444
3/4 #101445
4/4 #101446

WIP #104748 adds HLFIR lowering that makes use of this pipeline.

@tblah (Contributor) left a comment:

Thank you for your work so far. This is a great start.

What is the plan for transforming do loops generated by lowering (e.g. that do not become hlfir.elemental operations and are not generated by hlfir bufferization)?

flang/include/flang/Optimizer/CMakeLists.txt (review thread resolved, outdated)
@@ -344,6 +345,7 @@ inline void createHLFIRToFIRPassPipeline(
   pm.addPass(hlfir::createLowerHLFIRIntrinsics());
   pm.addPass(hlfir::createBufferizeHLFIR());
   pm.addPass(hlfir::createConvertHLFIRtoFIR());
+  pm.addPass(flangomp::createLowerWorkshare());
Contributor comment:

The other OpenMP passes are added in createOpenMPFIRPassPipeline, which is only called when -fopenmp is used. It would be convenient if this new pass could stay with the other OpenMP passes.

Currently those passes are run immediately after lowering. There are comments which say they have to be run immediately after lowering, but at a glance it isn't obvious why they couldn't be run here after HLFIR. @agozillon what do you think?

Contributor reply:

Sorry, just seen this ping! The comment is primarily there to state that the passes should be run immediately after lowering from the parse tree to IR (HLFIR/FIR/OMP), as they make a lot of changes to convert things into a more final form for the OMP dialect with respect to target. It was previously a lot more important, as we had a form of outlining that ripped target regions out of their functions into separate functions. That's no longer there, but we still have some passes that modify the IR at this stage into a more finalized form for target: in particular OMPMapInfoFinalization, which will generate some new maps for descriptor types; OMPMarkDeclareTarget, which will implicitly mark functions declare target; and another that removes functions unnecessary for the device. There is also a pass (or will be) for do concurrent which I believe outlines loops into target regions as well.

But TL;DR: there is a lot going on in those passes that it would be preferable to keep happening immediately after lowering from the parse tree, so that later passes can depend on the information being in the "correct" format; whether that "immediate" location should move to after this HLFIR lowering or remain where it currently is, I am unsure!

@skatrak @jsjodin may also have some feedback/input to this.

@ivanradanov (Contributor, Author) replied:

I opted to keep the rest of the OpenMP passes as they are and have added a bool argument to control whether to run the lower-workshare pass.

Resolved review threads (outdated):
- flang/lib/Optimizer/HLFIR/Transforms/BufferizeHLFIR.cpp
- flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp (multiple threads)
- flang/include/flang/Optimizer/OpenMP/Passes.td
@ivanradanov (Contributor, Author) commented Aug 2, 2024

Thank you a lot @tblah for taking a look and for the helpful comments.

> Thank you for your work so far. This is a great start.
>
> What is the plan for transforming do loops generated by lowering (e.g. that do not become hlfir.elemental operations and are not generated by hlfir bufferization)?

I am looking at this for the standard.

I intend to go through the various constructs that need to be separated into units of work and provide an alternative lowering for them, so that they will get parallelized when we lower the workdistribute operation.

To accurately keep track of the constructs that need to be parallelized for workdistribute, I am debating adding a new loop_nest wrapper op for that, as discussed here.

@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from 26d0051 to 386157c on August 2, 2024 08:17
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from 4da93bb to decd0c5 on August 2, 2024 08:18
@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from 386157c to 7006c75 on August 4, 2024 07:14
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from decd0c5 to cff85ab on August 4, 2024 07:14
@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from fbd2f2d to fd782e9 on August 4, 2024 08:04
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from cff85ab to 6571aed on August 4, 2024 08:04
@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch 2 times, most recently from e21aeb8 to 7af84b2, on August 4, 2024 11:24
@ivanradanov (Contributor, Author) commented:

@kiranchandramohan @tblah
I think this warrants another look if you have some time.

I have iterated a bit and opted to have an omp loop-nest wrapper op which signals to the workshare lowering which specific loops need to be parallelized (i.e. converted to wsloop { loop_nest }).

This will allow us to emit this in the frontend if it is needed and be more precise about the exact loops that need to be parallelized.

So the LowerWorksharePass that I have implemented here is tasked with parallelizing the loops nested in workshare_loop_wrapper, and both the Fortran->mlir frontend and the HLFIR lowering passes would be responsible for emitting the workshare_loop_wrapper ops where appropriate (see the sketch below). For that I have started with some of the obvious lowerings in the hlfir bufferizations, but perhaps that can be done gradually and not everything needs to be covered by this PR. Let me know what you think.
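A rough before/after sketch of that contract, assuming the wrapper op is spelled omp.workshare.loop_wrapper and using invented loop bounds (both fragments sit inside an enclosing omp.parallel):

```mlir
// Before lower-workshare: the frontend / HLFIR passes tag loops that form
// units of work inside the workshare region.
omp.workshare {
  omp.workshare.loop_wrapper {
    omp.loop_nest (%i) : index = (%lb) to (%ub) inclusive step (%step) {
      // ... one unit of work per iteration ...
      omp.yield
    }
  }
  omp.terminator
}

// After lower-workshare: the tagged loop becomes a worksharing loop, while the
// remaining ops of the workshare region get wrapped in omp.single.
omp.wsloop {
  omp.loop_nest (%i) : index = (%lb) to (%ub) inclusive step (%step) {
    // ... one unit of work per iteration ...
    omp.yield
  }
}
```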

@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from 7af84b2 to 058fb57 on August 4, 2024 13:08
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from 35fb583 to a14789a on August 4, 2024 13:08
@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from 058fb57 to f8834c9 on August 6, 2024 04:52
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from a14789a to f1b9e86 on August 6, 2024 04:53
@tblah (Contributor) left a comment:

Thank you for all of the updates!

> So the LowerWorksharePass that I have implemented here is tasked with parallelizing the loops nested in workshare_loop_wrapper, and both the Fortran->mlir frontend and the HLFIR lowering passes would be responsible for emitting the workshare_loop_wrapper ops where appropriate. For that I have started with some of the obvious lowerings in the hlfir bufferizations, but perhaps that can be done gradually and not everything needs to be covered by this PR. Let me know what you think.

Doing it gradually sounds good to me. When you take this out of draft, please document in the commit message exactly what is and is not supported at this point.

flang/lib/Optimizer/OpenMP/LowerWorkshare.cpp (review thread resolved)
@ivanradanov requested review from mjklemm and removed the review request for dcci on September 25, 2024 06:02
@tblah (Contributor) commented Sep 25, 2024

Thanks for the updates and good job noticing the bug with unstructured control flow.

> I have added a comment explaining the limitation of not allowing CFG in workshare for now and an appropriate TODO message for that.

My concern with the TODO message is that some code that previously compiled using the lowering of WORKSHARE as SINGLE will now hit this TODO. This is okay with me so long as it is fixed soon (before LLVM 20). Otherwise, could these cases continue to be lowered as SINGLE for now?

@ivanradanov (Contributor, Author) replied:

> My concern with the TODO message is that some code that previously compiled using the lowering of WORKSHARE as SINGLE will now hit this TODO. This is okay with me so long as it is fixed soon (before LLVM 20). Otherwise, could these cases continue to be lowered as SINGLE for now?

I have updated it to lower to omp.single and emit a warning in CFG cases.
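A sketch of that fallback, with an invented multi-block body (the conservative behaviour matches the old WORKSHARE-as-SINGLE lowering; the exact IR the pass emits may differ):

```mlir
// Before: the workshare body contains unstructured control flow (multiple
// blocks), so it cannot be split into units of work yet.
omp.workshare {
  cf.br ^bb1
^bb1:
  // ... arbitrary CFG ...
  omp.terminator
}

// After: the whole body is serialized in an omp.single and a warning is issued.
omp.single {
  cf.br ^bb1
^bb1:
  // ... arbitrary CFG ...
  omp.terminator
}
```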

@mjklemm (Contributor) commented Oct 11, 2024

Is this PR in an acceptable state to be merged? I went through the other three, and it seems that the 4 PRs are all converging and are either ready to merge or very close.

@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from 4257950 to d64c6c4 on October 19, 2024 16:37
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from d5fbe9c to 8afc111 on October 19, 2024 16:37
Change to workshare loop wrapper op

Move single op declaration

Schedule pass properly

Correctly handle nested loop nests to be parallelized by workshare

Leave comments for shouldUseWorkshareLowering

Use copyprivate to scatter val from omp.single (see the sketch after this commit list)

TODO still need to implement copy function
TODO transitive check for usage outside of omp.single not implemented yet

Transitively check for users outside of single op

TODO need to implement copy func
TODO need to hoist allocas outside of single regions

Add tests

Hoist allocas

More tests

Emit body for copy func

Test the tmp storing logic

Clean up trivially dead ops

Only handle single-block regions for now

Fix tests for custom assembly for loop wrapper

Only run the lower workshare pass if openmp is enabled

Implement some missing functionality

Fix tests

Fix test

Iterate backwards to find all trivially dead ops

Add explanation comment for createCopyFun

Update test
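As a note on the copyprivate commits above, here is a rough sketch of how omp.single's copyprivate clause broadcasts a value computed by one thread; the copy function name and the i32 payload are invented, standing in for what createCopyFun generates:

```mlir
// Invented helper copying the broadcast value between per-thread locations;
// the pass generates the real one (see createCopyFun).
func.func private @_copy_i32(%dst : !fir.ref<i32>, %src : !fir.ref<i32>) {
  %v = fir.load %src : !fir.ref<i32>
  fir.store %v to %dst : !fir.ref<i32>
  return
}

omp.parallel {
  %tmp = fir.alloca i32
  // The executing thread stores its result; copyprivate then replicates it
  // into every other thread's %tmp via @_copy_i32.
  omp.single copyprivate(%tmp -> @_copy_i32 : !fir.ref<i32>) {
    %v = arith.constant 7 : i32
    fir.store %v to %tmp : !fir.ref<i32>
    omp.terminator
  }
  %r = fir.load %tmp : !fir.ref<i32>
  omp.terminator
}
```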
@ivanradanov force-pushed the users/ivanradanov/flang-workshare branch from d64c6c4 to e62341d on October 19, 2024 17:22
@ivanradanov force-pushed the users/ivanradanov/flang-workshare-elemental-lowering branch from 8afc111 to d8cfd38 on October 19, 2024 17:22
@ivanradanov (Contributor, Author) commented:

I have rebased this on the latest main and also marked the follow-up #104748 as ready for review. That follow-up PR contains code and tests which are needed to fully check this implementation as well.

I think this stack is currently in a good state to be merged.

PRs 1/4 #101443, 2/4 #101444, and 3/4 #101445 are already approved and good to go, but 2/4 #101444 must be merged together with this PR, because otherwise it will result in compilation failures for omp workshare.

Thus, it would be great if this PR can be reviewed as well and we can proceed with merging if it looks good.

(The build failures are only on Windows, come from the main branch, and are not introduced by this PR.)

@Thirumalai-Shaktivel (Member) commented:

All the PRs LGTM, this works for my test cases.
