Implement SplitPytatoArrayContext that attempts a trivial parallelization strategy #216
Conversation
kaushikcfd commented on Jan 17, 2023 (edited)
- CI should pass once "Fix codegen for temporary variables declaration" (loopy#738) is in.
- ⚠️ Concurrency across reductions is not targeted, i.e. this results in poor performance when computing error norms, etc. (Maybe we should fix this before merging?) In the meshmode version, out of desperation, we froze such reductions eagerly.
  - Update: Added a transformation specifically for "reduce-to-scalar" operations to parallelize the reduction using loopy transformations.
- We need a better name than SplitPytatoArrayContext.
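The "reduce-to-scalar" strategy mentioned above can be illustrated with a plain NumPy sketch. This is only an illustration of the two-stage idea (the function name and `n_groups` parameter are hypothetical, not from the PR, which applies the split via loopy transformations on the generated kernel): each "work-group" reduces a contiguous chunk to a partial result concurrently, and a cheap second stage combines the partials.

```python
import numpy as np


def two_stage_sum(x: np.ndarray, n_groups: int = 4) -> float:
    """Sketch of a work-item-parallel reduce-to-scalar.

    Stage 1: each of the ``n_groups`` "work-groups" reduces its chunk
    to a partial sum (these would run concurrently on a GPU).
    Stage 2: a single small reduction combines the partials.
    """
    chunks = np.array_split(x, n_groups)
    partials = np.array([c.sum() for c in chunks])  # stage 1, parallel
    return float(partials.sum())                    # stage 2, cheap


x = np.arange(1000, dtype=np.float64)
assert two_stage_sum(x) == x.sum()
```

Without such a split, a scalar reduction occupies a single work-item, which is what makes unparallelized error-norm computations slow.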
force-pushed 615e2b5 to 5249a38
Making this a draft until the CI is fixed.
force-pushed adf8d2b to b736f3a
force-pushed 11583d7 to fe7517e
force-pushed be0878b to ac39271
force-pushed 41ecb8d to bb03c86
force-pushed bb03c86 to b8c991a
Thanks for working on this. LGTM other than these (fairly minor) issues.
```diff
@@ -714,7 +716,7 @@ def test_array_equal(actx_factory):
 def test_array_context_einsum_array_manipulation(actx_factory, spec):
     actx = actx_factory()

-    mat = actx.from_numpy(np.random.randn(10, 10))
+    mat = actx.from_numpy(np.random.randn(16, 16))
```
Is there a specific reason why these have to be a nice power of two now?
```python
import pytato


class SplitPytatoPyOpenCLArrayContext(PytatoPyOpenCLArrayContext):
```
I'm not sure why this thing is named "split". Maybe "generic parallelizing"/"basic parallelizing"?
```
Copyright (C) 2023 Andreas Kloeckner
Copyright (C) 2022 Matthias Diener
Copyright (C) 2022 Matt Smith
```
All these (other than @kaushikcfd) should be U of I BOT. None of us hold copyright to our work.
```diff
@@ -0,0 +1,143 @@
+"""
```
Add to docs somewhere?
```
We deliberately avoid using :class:`pytato.transform.CombineMapper` since
the mapper's caching structure would still lead to recomputing
the union of sets for the results of a revisited node.
```
Not sure I understand this comment. Could you explain? And: shouldn't we fix this in CombineMapper
(if we can)? (or at least file an issue in pytato?) (also above)
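A minimal sketch of the caching concern being discussed (toy `Node`/mapper classes, not the actual pytato API): even when a revisited node's result is served from the cache, each *parent* that references it still recomputes the set union over that cached result, so a DAG with heavy sharing pays the union cost once per reference rather than once per node.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    children: Tuple["Node", ...] = ()


class CombineToSet:
    """Toy combine-mapper: maps each node to the set of names below it."""

    def __init__(self) -> None:
        self.cache: Dict[int, FrozenSet[str]] = {}
        self.unions = 0  # counts set-union work, including over cache hits

    def rec(self, node: Node) -> FrozenSet[str]:
        try:
            return self.cache[id(node)]
        except KeyError:
            pass
        result = frozenset({node.name})
        for child in node.children:
            # Even if `child`'s result is cached, this union is
            # recomputed for every parent that references it.
            result = result | self.rec(child)
            self.unions += 1
        self.cache[id(node)] = result
        return result


shared = Node("leaf")
root = Node("root", (Node("a", (shared,)), Node("b", (shared,))))
m = CombineToSet()
m.rec(root)
```

Here `shared` is visited via two parents, so four unions are performed for a four-node DAG; with large arrays of predecessors the repeated unions dominate.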
```python
def _split_reduce_to_scalar_across_work_items(
        kernel: lp.LoopKernel,
        callables: Mapping[str, InKernelCallable],
```
A bit weird that `callables` is needed here/by `precompute_for_single_kernel`.
```python
        device: "pyopencl.Device",
        ) -> lp.LoopKernel:

    assert len({kernel.id_to_insn[insn_id].reduction_inames()
```
Could elevate reduction inames to a property of the `_LoopNest`?
```python
    # collect loop nests of instructions that assign to scalars in the array
    # program.
    insn_id_to_loop_nest: Mapping[str, _LoopNest] = {
```
If `insn_id_to_loop_nest` isn't used, why not go straight to `all_loop_nests`?
```
This routine **assumes** that the entrypoint in *t_unit* global
barriers inserted as per :func:`_get_call_kernel_insn_ids`.
```
Grammar? I'm not sure I understand.
```python
# a mapping from shape to the available base storages from temp variables
# that were dead.
shape_to_available_base_storage: Dict[int, Set[str]] = defaultdict(set)
```

Suggested change:

```suggestion
# a mapping from size in bytes to the available base storages from temp variables
# that were dead.
nbytes_to_available_base_storage: Dict[int, Set[str]] = defaultdict(set)
```
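The suggestion above keys the free-list of dead base storages by byte size rather than by shape, which lets temporaries of different shapes but equal footprint recycle the same storage. A minimal sketch of that recycling scheme (the `alloc`/`release` helpers and storage-naming are hypothetical, not from the PR):

```python
from collections import defaultdict
from typing import Dict, Set

# size in bytes -> names of base storages whose temporaries are dead
nbytes_to_available_base_storage: Dict[int, Set[str]] = defaultdict(set)
_counter = 0


def release(name: str, nbytes: int) -> None:
    """Return a dead temporary's base storage to the free-list."""
    nbytes_to_available_base_storage[nbytes].add(name)


def alloc(nbytes: int) -> str:
    """Reuse a free base storage of the same byte size, else make a new one."""
    global _counter
    free = nbytes_to_available_base_storage[nbytes]
    if free:
        return free.pop()
    _counter += 1
    return f"base_storage_{_counter}"


a = alloc(1024)   # fresh storage
release(a, 1024)  # the temporary using `a` is now dead
b = alloc(1024)   # recycled: a shape-(256,) float32 and a
                  # shape-(16, 16) float32 both fit in 1024 bytes
assert b == a
```

Keying by shape would keep those two 1024-byte temporaries in separate pools and miss the reuse.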
Otherwise Intel OpenCL gets its integer arithmetic wrong.
force-pushed b8c991a to b4488ba