[flang] Lower omp.workshare to other omp constructs #101446
base: users/ivanradanov/flang-workshare-elemental-lowering
Conversation
Thank you for your work so far. This is a great start.
What is the plan for transforming do loops generated by lowering (e.g. those that do not become hlfir.elemental operations and are not generated by hlfir bufferization)?
@@ -344,6 +345,7 @@ inline void createHLFIRToFIRPassPipeline(
   pm.addPass(hlfir::createLowerHLFIRIntrinsics());
   pm.addPass(hlfir::createBufferizeHLFIR());
   pm.addPass(hlfir::createConvertHLFIRtoFIR());
+  pm.addPass(flangomp::createLowerWorkshare());
The other OpenMP passes are added in createOpenMPFIRPassPipeline, which is only called when -fopenmp is used. It would be convenient if this new pass could stay with the other OpenMP passes.
Currently those passes are run immediately after lowering. There are comments which say they have to be run immediately after lowering, but at a glance it isn't obvious why they couldn't be run here, after HLFIR. @agozillon what do you think?
Sorry, I've just seen this ping! The comment is primarily there to state that the passes should be run immediately after lowering from the parse tree to IR (HLFIR/FIR/OMP), as they make a lot of changes to convert things into a more final form for the OMP dialect with respect to target. It was previously a lot more important, as we had a form of outlining that ripped target regions out of their functions into separate functions. That's no longer there, but we do still have some passes that modify the IR at this stage into a more finalized form for target: in particular OMPMapInfoFinalization, which will generate some new maps for descriptor types, OMPMarkDeclareTarget, which will implicitly mark functions declare target, and another that removes functions unnecessary for the device. There is also a pass (or will be) for do concurrent which I believe outlines loops into target regions as well.
But TL;DR, there is a lot going on in those passes that would be preferable to keep happening immediately after lowering from the parse tree, so later passes can depend on the information being in the "correct" format. Whether that "immediate" location moves to after this HLFIR lowering or remains where it currently is, I am unsure!
@skatrak @jsjodin may also have some feedback/input to this.
I opted to keep the rest of the OpenMP passes as they are and have added a bool argument to control whether to run the lower-workshare pass (a rough sketch of that gating is below).
Thank you a lot @tblah for taking a look and for the helpful comments.
I am looking at this for the standard. I intend to go through the various constructs that need to be separated into units of work and provide an alternative lowering for them so that they will get parallelized when we lower the workdistribute operation. To accurately keep track of the constructs that need to be parallelized for workdistribute, I am debating adding a new loop_nest wrapper for that, as discussed here.
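A minimal sketch of what that gating could look like, assuming the flag is plumbed through as a plain boolean parameter (the parameter name `enableOpenMP` and the exact signature are illustrative, not taken from the patch):

```cpp
// Sketch only: mirrors the pipeline hunk quoted above, with a hypothetical
// enableOpenMP flag guarding the new pass. Assumes the usual flang/MLIR pass
// headers are available; the real signature in the patch may differ.
inline void createHLFIRToFIRPassPipeline(mlir::PassManager &pm,
                                         bool enableOpenMP) {
  pm.addPass(hlfir::createLowerHLFIRIntrinsics());
  pm.addPass(hlfir::createBufferizeHLFIR());
  pm.addPass(hlfir::createConvertHLFIRtoFIR());
  // Lower omp.workshare to other OpenMP constructs only when -fopenmp is on.
  if (enableOpenMP)
    pm.addPass(flangomp::createLowerWorkshare());
}
```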
@kiranchandramohan @tblah I have iterated on this a bit and opted to have an omp loop-nest wrapper op which signals to the workshare lowering which specific loops need to be parallelized (i.e. converted to wsloop { loop_nest }). This will allow us to emit this in the frontend if it is needed and be more precise about the exact loops that need to be parallelized. So the LowerWorksharePass that I have implemented here is tasked with parallelizing the loops nested in workshare_loop_wrapper and both the Fortran->mlir frontend and the hlfir lowering passes would be responsible for emitting the workshare_loop_wrapper ops where appropriate. For that I have started with some of the obvious lowerings in the hlfir bufferizations, but perhaps that can be done gradually and not everything needs to be covered by this PR. Let me know what you think.
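In source-level terms (an illustrative C++/OpenMP analogue only; WORKSHARE itself is Fortran-only and this is not code from the patch), the intent is that an array operation inside a workshare region becomes a worksharing loop, so that inside the enclosing parallel region its iterations are split across threads rather than all executed by one thread. That is what rewriting workshare_loop_wrapper into wsloop { loop_nest } expresses at the IR level.

```cpp
// Illustrative analogue only. The Fortran
//   !$omp parallel
//   !$omp workshare
//     a = b + c
//   !$omp end workshare
//   !$omp end parallel
// behaves roughly like the worksharing loop below: the elemental loop over
// the array assignment is divided among the threads of the parallel region.
#include <vector>

void add_arrays(std::vector<double> &a, const std::vector<double> &b,
                const std::vector<double> &c) {
  const int n = static_cast<int>(a.size());
  #pragma omp parallel
  {
    // Each thread executes a share of the iterations; this corresponds to
    // wsloop { loop_nest } after the lowering described above.
    #pragma omp for
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
  }
}
```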
Thank you for all of the updates!
So the LowerWorksharePass that I have implemented here is tasked with parallelizing the loops nested in workshare_loop_wrapper and both the Fortran->mlir frontend and the hlfir lowering passes would be responsible for emitting the workshare_loop_wrapper ops where appropriate. For that I have started with some of the obvious lowerings in the hlfir bufferizations, but perhaps that can be done gradually and not everything needs to be covered by this PR. Let me know what you think.
Doing it gradually sounds good to me. When you take this out of draft, please document in the commit message exactly what is and is not supported at this point.
Thanks for the updates and good job noticing the bug with unstructured control flow.
My concern with the TODO message is that some code that previously compiled (using the lowering of WORKSHARE as SINGLE) will now hit this TODO. This is okay with me so long as it is fixed soon (before LLVM 20). Otherwise, could these cases continue to be lowered as SINGLE for now?
I have updated it to lower to omp.single and emit a warning in the unstructured control flow (CFG) cases.
Is this PR in an acceptable state to be merged? I went through the other three and it seems that the four PRs are all converging and are either ready for merge or very close.
Change to workshare loop wrapper op
Move single op declaration
Schedule pass properly
Correctly handle nested loop nests to be parallelized by workshare
Leave comments for shouldUseWorkshareLowering
Use copyprivate to scatter val from omp.single
TODO still need to implement copy function
TODO transitive check for usage outside of omp.single not implemented yet
Transitively check for users outside of single op
TODO need to implement copy func
TODO need to hoist allocas outside of single regions
Add tests
Hoist allocas
More tests
Emit body for copy func
Test the tmp storing logic
Clean up trivially dead ops
Only handle single-block regions for now
Fix tests for custom assembly for loop wrapper
Only run the lower workshare pass if openmp is enabled
Implement some missing functionality
Fix tests
Fix test
Iterate backwards to find all trivially dead ops
Add explanation comment for createCopyFun
Update test
I have rebased this on the latest main and also marked the follow-up #104748 as ready for review. That follow-up PR contains code and tests which are needed to fully check this implementation as well. I think this stack is currently in a good state to be merged. 1/4 #101443, 2/4 #101444, and 3/4 #101445 are already approved and good to go, but 2/4 #101444 must be merged together with this PR because otherwise it will result in compilation failures. Thus, it would be great if this PR could be reviewed as well, and we can proceed with merging if it looks good. (The build failures are only on Windows, come from the main branch, and are not introduced by this PR.)
All the PRs LGTM, this works for my test cases. |
4/4
There are two points which need some discussion in this PR:
1. We need to make a value computed in an omp.single accessible to all threads of the omp.parallel region. This is achieved by allocating temporary memory outside the omp.parallel, storing to it in the omp.single, and then reloading it from all threads (a source-level analogue is sketched after this list). However, from reading the standard I don't think we are guaranteed that the workshare is nested directly in the omp.parallel, so there could be an omp.parallel { func.call @contains_workshare }, and then we would not be able to access the omp.parallel. So I think adding support in the runtime to be able to yield a value from an omp.single could be the fix for this.
2. For the temporary allocations above, not all types are supported by fir.alloca, so I need to use llvm.alloca and an unrealized conversion cast to be able to allocate a temporary for a fir.ref type. This too can be fixed by introducing yielding from omp.single.
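As a source-level picture of the broadcast behaviour being reconstructed in point 1 (illustrative C++/OpenMP only, not code from this patch): the copyprivate clause on single, which the commit messages above mention using, is what lets the one thread that executes the single region hand the computed value to every other thread's private copy.

```cpp
// Illustration of single + copyprivate: one thread computes the value, and
// copyprivate broadcasts it to the private copies of all other threads
// before they leave the construct.
#include <cstdio>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    int val;  // private to each thread of the parallel region
    #pragma omp single copyprivate(val)
    {
      val = 42;  // executed by exactly one thread
    }
    // Every thread now observes val == 42.
    std::printf("thread %d sees %d\n", omp_get_thread_num(), val);
  }
  return 0;
}
```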
1/4 #101443
2/4 #101444
3/4 #101445
4/4 #101446
WIP #104748 adds HLFIR lowerings that make use of this pipeline.