should uses of outer-scope names from parallel code be restricted? #11
If the purpose of this is to try to do something about races, then what about races on objects accessed indirectly?
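A hypothetical sketch of that concern (not from the thread): even if the outer-scope pointer is itself const-captured, the heap object it points to can still be mutated from every iteration, a race that a name-based restriction cannot see. _Cilk_for (Intel Cilk Plus) stands in here for whatever parallel-loop construct CPLEX adopts; compiling it requires a Cilk-enabled compiler.

    /* counter is an outer-scope name; const-capturing it copies only
       the pointer value. The store through it still races. */
    #include <stdlib.h>

    int main(void) {
        int *counter = malloc(sizeof *counter);   /* heap object */
        if (!counter) return 1;
        *counter = 0;
        _Cilk_for (long i = 0; i < 1000; i++) {
            *counter += 1;   /* unsynchronized read-modify-write: race */
        }
        free(counter);
        return 0;
    }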
The idea here is to reduce the chance of creating races. For example: […]

The use of […]

One disadvantage of the above is that expressions like […]. Another disadvantage is that no parallel language that I'm aware of has this restriction. I don't know of any usage experience. It certainly deviates from both Cilk and OpenMP.
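For contrast with the indirect case above, a hypothetical sketch (not the elided example) of the direct pattern such a restriction would catch: an outer-scope automatic mutated from parallel iterations.

    #include <stdio.h>

    int main(void) {
        int sum = 0;                          /* outer-scope automatic */
        _Cilk_for (int i = 0; i < 1000; i++) {
            sum += i;   /* race on sum; the proposed restriction (or
                           const capture) would reject this use */
        }
        printf("%d\n", sum);                  /* value is indeterminate */
        return 0;
    }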
Stack-based arrays are rarely large enough to merit parallel computation; most of the time one is dealing with allocated storage of some kind or, sure, static (global) storage. I suggest that we leave static (global) variables as they are and const-capture automatics. They will rarely be arrays of significant size. Arrays of size 4 are interesting for graphics as point multipliers and the like, but those are small enough to be const-captured. (Note that since they would be const, they could be shared among the worker threads; this is what happens with Apple's Blocks. See the sketch below.)
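A minimal sketch of the Blocks behavior referred to here (clang; on non-Apple platforms -fblocks and a Blocks runtime library are needed): the captured automatic is a const copy, so reads compile and writes do not, unless the variable is marked __block.

    #include <stdio.h>

    int main(void) {
        int scale = 4;               /* automatic, captured as a const copy */
        int (^times)(int) = ^(int x) {
            /* scale = 5;   <- compile error: variable is not assignable */
            return x * scale;        /* reading the copy is fine */
        };
        printf("%d\n", times(10));   /* prints 40 */
        return 0;
    }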
What if my parallel computation involves reading from a thousand-element automatic (i.e. stack-based) array and a million-element heap array while writing to a thousand-element automatic array? Is that going to be supported under this restriction? Below is an excerpt from a widely used HPC application (NWChem; the associated file is ccsd_t_kernels_omp.F) in which I allocate a four-dimensional automatic array whose dimensions are usually around 30 (i.e. 6.48 MB). You can see that my current Fortran implementation computes on this array inside OpenMP regions.

      subroutine sd_t_d1_1(h3d,h2d,h1d,p6d,p5d,p4d,h7d,
     2             triplesx,t2sub,v2sub)
      IMPLICIT NONE
      integer h3d,h2d,h1d,p6d,p5d,p4d,h7d
      integer h3,h2,h1,p6,p5,p4,h7
      double precision triplesx(h3d,h2d,h1d,p6d,p5d,p4d)
      double precision t2sub(h7d,p4d,p5d,h1d)
      double precision v2sub(h3d,h2d,p6d,h7d)
      double precision v2tmp(h7d,h3d,h2d,p6d)
!$omp parallel do collapse(3)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p6=1,p6d
       do h7=1,h7d
        do h2=1,h2d
!dec$ vector always nontemporal
!DIR$ IVDEP
         do h3=1,h3d
          v2tmp(h7,h3,h2,p6) = v2sub(h3,h2,p6,h7)
         enddo
        enddo
       enddo
      enddo
!$omp parallel do collapse(OMP_COLLAPSE_LEVEL)
!$omp& default(shared) schedule(static)
!$omp& private(h1,h2,h3,p4,p5,p6,h7)
      do p5=1,p5d
       do p6=1,p6d
        do p4=1,p4d
         do h1=1,h1d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
          do h2=1,h2d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
           do h3=1,h3d
!!!dir$ loop count min(8)
!dec$ unroll_and_jam = 8
!dir$ simd
!DIR$ IVDEP
            do h7=1,h7d
             triplesx(h3,h2,h1,p6,p5,p4)=triplesx(h3,h2,h1,p6,p5,p4)
     1        -t2sub(h7,p4,p5,h1)*v2tmp(h7,h3,h2,p6)
            enddo
           enddo
          enddo
         enddo
        enddo
       enddo
      enddo
!$omp end parallel do
      return
      end

This is but one type of computation where automatic arrays are large enough to merit parallel computation. And yes, I really do want to be able to write the same code in pure C some day. I have already started converting these functions to C in ccsd_t_kernels_f2c.c. I'd like not to have to use OpenMP if possible.
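For reference, a rough C99 sketch (my adaptation, not the actual ccsd_t_kernels_f2c.c code) of the first loop nest above, with the Fortran column-major dimensions reversed for C's row-major layout. v2tmp would be declared as an automatic variable-length array in the caller, e.g. double v2tmp[p6d][h2d][h3d][h7d];. OpenMP is used pending a CPLEX equivalent.

    void sd_t_d1_1_permute(int h3d, int h2d, int p6d, int h7d,
                           const double v2sub[h7d][p6d][h2d][h3d],
                           double v2tmp[p6d][h2d][h3d][h7d])
    {
        /* permute v2sub into v2tmp so later reads are contiguous */
        #pragma omp parallel for collapse(3) schedule(static)
        for (int p6 = 0; p6 < p6d; p6++)
            for (int h7 = 0; h7 < h7d; h7++)
                for (int h2 = 0; h2 < h2d; h2++)
                    for (int h3 = 0; h3 < h3d; h3++)  /* unit stride in v2sub */
                        v2tmp[p6][h2][h3][h7] = v2sub[h7][p6][h2][h3];
    }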
My FORTRAN education stopped at WATFOR in the early 70s, and my reading knowledge of OpenMP is even thinner, so take the following with a grain of salt.

A couple of small points. Stack arrays of this size are not common in my non-HPC experience; one has to call pthread_create with attributes that request a huge stack, and presumably that is what the CPLEX runtime would do. The major point you are trying to make, I think, is that there will be cases where large automatic arrays prove useful. So in the "make simple things simple, difficult things possible" design sense, wouldn't forcing the array to decay to a pointer explicitly work here? If it's common enough, it might be worth extra syntax.

My view, obviously, is that safety should be the default (const capture) and that riskier behaviors, such as the concurrent initialization of v2tmp, be asserted parallel-correct by the programmer through the extra syntax. The extra syntax isn't strictly necessary, since the programmer could hand-roll the address capture (see the sketch below), but that is less elegant for sure.
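What the hand-rolled address capture might look like (hypothetical, assuming const-capture-by-default semantics; compute is a placeholder function): taking the array's address yields a pointer, and it is the pointer value, not the array, that gets const-captured, so the parallel body can still write through it.

    #include <stdio.h>

    static double compute(int i) { return 2.0 * i; }   /* placeholder */

    int main(void) {
        double out[1000];              /* automatic array in this frame */
        double (*out_p)[1000] = &out;  /* explicit "decay": only the
                                          pointer is captured */
        _Cilk_for (int i = 0; i < 1000; i++)
            (*out_p)[i] = compute(i);  /* programmer asserts no race:
                                          each iteration writes its own
                                          slot */
        printf("%f\n", out[999]);
        return 0;
    }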
I will try to make the C implementation match the Fortran one. Then I can try to convert the OpenMP to CPLEX to see how things look. The code right now doesn't need tasking, but it would if the data were block-sparse. Currently, NWChem expresses block-sparsity at a higher level in the code than what you see in my example.
The point is to change the layout in physical memory so that subsequent accesses are fast. Alignment and contiguous access are important for vectorization.
It depends. Will compilers be able to preserve information about the multidimensional array structure through the auto-vectorization passes once the array has decayed to a pointer?
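One C99 answer to that question: a pointer to a variably modified type carries the full array shape, so indexing through it preserves the stride information a vectorizer needs. Names here are illustrative only.

    #include <stdlib.h>

    static void scale(int n, int m, double (*a)[n][m])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                (*a)[i][j] *= 2.0;   /* shape-preserving indexing */
    }

    int main(void) {
        int n = 1000, m = 30;
        double (*buf)[n][m] = malloc(sizeof *buf);  /* the same idea works
                                                       for &automatic_array */
        if (!buf) return 1;
        (*buf)[0][0] = 1.0;
        scale(n, m, buf);
        free(buf);
        return 0;
    }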
I agree with this philosophy. Let me think a bit more about how to implement what I want in CPLEX and see how it looks.
If the rationale is to make races less likely, it would make sense to treat globals differently only if we knew that uses of them were less likely to be involved in races than uses of autos. Do we know that? If so, how? If not, what is the rationale for treating them differently?
(jeff) So v2tmp is purely a performance "hack" (in a good sense). I don't doubt that for massive data sets this can be a critical win, and it can be supported in my proposal by manually degrading the array to a pointer, or by having syntax for that.

(clark) The race question is important: the current proposal would allow non-atomic automatic variables to be shared and mutated across iterations and back to the originating stack frame, which might seem to work when there isn't much parallelism and might mostly work in fact. I am concerned about non-mutating uses as well. By const-capturing them we put a copy near the other captures, possibly improving locality, and we plant a huge stake in the ground: an attempt to mutate one gets a compiler error. It helps teach the issues.

Globals in a more general programming environment than HPC are already known to have racing issues and are rarely (in my non-HPC experience) used in library or other code that is multi-threaded. The HPC use cases are different in that globals, especially coming from a FORTRAN background, are used to hold the primary data. Their racing issues are well understood; CPLEX already requires that the parallel code not induce a race, and copying globals seems counter-intuitive. It turns out that Apple's Blocks don't copy globals, so a similar conceptual treatment leads to fewer concepts to learn across parallel and, potentially, concurrent code with closures. A weird issue that came up at the time was that function calls are, in effect, pointer references to global code: should those be const-copied too? That question led in part to the decision not to capture globals. In the years since Blocks shipped there have been no complaints about globals not being captured, but we didn't have _Atomic then (see the sketch below). So to some degree, yes, I'd say the HPC experience of using non-atomic globals safely is relevant field experience.
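A minimal C11 sketch of that last point (my illustration, not from the thread): a global is left uncaptured, as with Blocks, but mutation from parallel code is made race-free by declaring it _Atomic. As before, _Cilk_for stands in for a CPLEX loop.

    #include <stdatomic.h>
    #include <stdio.h>

    _Atomic long hits;   /* shared global: not copied, not captured */

    int main(void) {
        _Cilk_for (int i = 0; i < 1000; i++)
            atomic_fetch_add(&hits, 1);   /* well-defined increment */
        printf("%ld\n", (long)hits);      /* prints 1000 */
        return 0;
    }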
Lacking any explanation of the nature of your concern, I don't know what to do with this statement.
A few comments and observations: […]