
GPU Support #574

Status: Open. Wants to merge 10 commits into base: master.
Conversation

@kaushikcfd (Contributor) commented Feb 11, 2020:

Draft because:

  • Clean commit history

@kaushikcfd (Author):
cc @inducer

Review thread on pyop2/base.py (outdated):
"""
AVAILABLE_ON_HOST_ONLY = 1
AVAILABLE_ON_DEVICE_ONLY = 2
AVAILABLE_ON_BOTH = 3
Contributor:
Maybe IntFlag instead of IntEnum? (just something to perhaps consider...)
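For illustration, a minimal sketch of the IntFlag variant being suggested (the class name here is hypothetical; the member names are taken from the hunk above):

    from enum import IntFlag

    class DataAvailability(IntFlag):  # hypothetical name
        AVAILABLE_ON_HOST_ONLY = 1
        AVAILABLE_ON_DEVICE_ONLY = 2
        # With IntFlag, "both" is naturally the union of the two bits
        # (and the _ONLY suffixes would probably be renamed).
        AVAILABLE_ON_BOTH = AVAILABLE_ON_HOST_ONLY | AVAILABLE_ON_DEVICE_ONLY

    state = DataAvailability.AVAILABLE_ON_BOTH
    assert state & DataAvailability.AVAILABLE_ON_HOST_ONLY  # membership via bitwise ops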

Resolved review threads on:
pyop2/backends/__init__.py (outdated)
pyop2/backends/cuda.py (×4; three outdated)
pyop2/compilation.py (×2, outdated)
pyop2/configuration.py (outdated)
pyop2/offload_utils.py
pyop2/transforms/gpu_utils.py (outdated)
@connorjward changed the title from "GPU Suppport" to "GPU Support" (Jul 12, 2022)
Resolved review threads on pyop2/backends/cpu.py (×2, outdated)
@wence- (Member) left a comment:
Didn't manage everything, but a bunch of comments/queries.

Resolved review threads on:
pyop2/backends/cuda.py
pyop2/types/dat.py
pyop2/types/dataset.py
pyop2/transforms/snpt.py

From pyop2/transforms/snpt.py:
return arg


def snpt_transform(kernel, block_size):
Member:
Completely impenetrable name.

@kaushikcfd (Author):
Renamed it to split_n_across_workgroups, better?
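For context, a minimal sketch of what a transform with that name typically does in loopy (assuming the loop being split is an iname called "n"; the PR's actual implementation may differ):

    import loopy as lp

    def split_n_across_workgroups(kernel, block_size):
        # Split the "n" loop into chunks of block_size, mapping chunks
        # to workgroups (g.0) and entries within a chunk to work items
        # (l.0), i.e. CUDA blocks and threads.
        return lp.split_iname(kernel, "n", block_size,
                              outer_tag="g.0", inner_tag="l.0")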

if self.can_be_represented_as_petscvec():
# report to petsc that we are altering data on the CPU
with self.vec as petscvec:
petscvec.array_w
Member:
Again, comments about vec state management apply here, and mutatis mutandis, the rest of the PR.
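For readers unfamiliar with the idiom in the hunk above: petsc4py distinguishes read-only from writable array access, and only the latter marks the Vec's data as changed. A minimal standalone sketch:

    from petsc4py import PETSc

    v = PETSc.Vec().createSeq(4)
    _ = v.array_r  # read-only view; does not mark the Vec as modified
    _ = v.array_w  # writable view; tells PETSc the CPU copy may have
                   # changed, which is what the hunk above relies on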

grid = tuple(int(evaluate(glens[i], parameters)) if i < len(glens) else 1
for i in range(2))
block = tuple(int(evaluate(llens[i], parameters)) if i < len(llens) else 1
for i in range(3))
Member:
This cached data is generated by a computer. Therefore there should be no need to parse and evaluate it, because the on-disk representation can just be the in-memory representation.

@kaushikcfd (Author) commented Jul 13, 2022:
> This cached data is generated by a computer

I preferred a human-readable format to keep the representation valid across PyOP2 versions, which might not be the case with pickling. (Happy to serialize in a different format if there's a better alternative.)

> no need to parse and evaluate it

I don't think we can avoid the evaluate calls: the grid sizes are typically symbolic expressions of the form (end - start) // 32 + 1, so they have to be evaluated at runtime during a GlobalKernel.__call__ invocation.
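For illustration, the parse-then-evaluate round trip being described, sketched with pymbolic (the start/end values are made up):

    from pymbolic import parse, evaluate

    # The human-readable cached form of a grid size ...
    grid_0 = parse("(end - start) // 32 + 1")

    # ... must be evaluated at runtime once start/end are known:
    n_blocks = evaluate(grid_0, {"start": 0, "end": 1000})  # -> 32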

Member:

Pickle version incompatibilities are fine (they would appear when you upgrade python and we don't guarantee very much about the portability of on disk cached data between firedrake-update calls, let alone python version upgrades).

What you're attempting to do is cache a function that computes these things. So in the code_to_compile case, just write that function, and pickle it.

Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?

@kaushikcfd (Author):

> Pickle version incompatibilities are fine

Cool, will change. (Thanks!)

> Although is this a consequence of the on-disk kernel/compiled code not remembering the loopy representation?

Yes, the compiled code stores only the plain CUDA/OpenCL device kernel. Loopy's on-disk caching mechanism works by storing both the device kernel and the host code that would end up calling that device kernel. But adapting PyOP2 to call loopy kernels directly would be a bit disruptive, and I think loopy's cache dir being $XDG_CACHE_DIR has run into some problems in the past.

@kaushikcfd (Author):

> just write that function,

Planning to simply generate a python function corresponding to the expression: inducer/pymbolic#107
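A minimal sketch of that idea using pymbolic's existing compile helper (shown for illustration; the linked issue is about generating such functions more directly):

    from pymbolic import parse, compile

    # Turn the cached symbolic grid size into a plain Python function
    grid_fn = compile(parse("(end - start) // 32 + 1"), ["start", "end"])
    grid_fn(0, 1000)  # -> 32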

Reviewer:

For my education, how come you chose to not go with the generated invokers?

@kaushikcfd (Author):

For PyOpenCL we could borrow the invoker, but for PyCUDA/C-target we don't have an invoker in loopy (yet).

@kaushikcfd (Author):

  • Also, for the pyopencl invoker, a minor concern is that its disk-caching semantics don't play well with PyOP2's.
  • I feel adding an invoker for the PyCUDA target shouldn't be too difficult. Taking a look at it.

Reviewer:

There is invoker generation for the C target. I'd expect that putting together CUDA invokers wouldn't be a massive effort based on a copy-paste job of the CL ones. Do you think that might be the right approach?

@kaushikcfd (Author) commented Aug 3, 2022:

I am working on integrating the PyCUDA/PyOpenCL invokers. For the C target, reorganizing pyop2/compilation.py is a bigger undertaking and out of scope for this PR.

Resolved review thread on pyop2/backends/cuda.py (outdated).
.callables_table))
f.write("("+",".join(str(glen) for glen in glens)+",)")
f.write("\n")
f.write("("+",".join(str(llen) for llen in llens)+",)")
Member:

This is data to be consumed by a computer, so don't make it human-readable and require parsing.

@kaushikcfd (Author):

Do you recommend pickling?
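For concreteness, the pickled alternative to the string-formatted file above might look like this (a sketch; fpath is hypothetical, glens/llens as in the hunk):

    import pickle

    # writing: dump the tuples directly instead of formatting strings
    with open(fpath, "wb") as fp:
        pickle.dump((tuple(glens), tuple(llens)), fp)

    # reading: no parsing or string evaluation needed
    with open(fpath, "rb") as fp:
        glens, llens = pickle.load(fp)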

Comment on lines 481 to 504
fpath = self.extra_args_cache_file_path
extra_args_np = [arg
for _, arg in natsorted(numpy.load(fpath).items(),
key=lambda x: x[0])]
Member:

This is tremendously fragile. Please think of a better way of saving this data in a way that doesn't rely on dubiously-stable properties of numpy.savez/load.

@kaushikcfd (Author):

Agreed. While np.savez-ing the arrays, I now also store a string array recording the keys under which the arrays were saved. See https://github.com/OP2/PyOP2/compare/9b7eb97878f2b6609d64affe896c5d4d901a2934..0e58f08d3ea1487e01c45df2270fa9e5790a845d.
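The linked change roughly amounts to recording the key order explicitly; a minimal sketch of that idiom (names here are hypothetical):

    import numpy

    arrays = [numpy.arange(3), numpy.ones(2)]
    keys = ["arg0", "arg1"]

    # save the key order alongside the arrays themselves
    numpy.savez(fpath, _keys=numpy.array(keys),
                **dict(zip(keys, arrays)))

    # load in the recorded order, independent of dict iteration order
    with numpy.load(fpath) as data:
        extra_args_np = [data[k] for k in data["_keys"]]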

@connorjward (Collaborator) commented:
@kaushikcfd just to make you aware of a couple of things:

  • We are currently in the process of ingesting PyOP2 (and TSFC) into the Firedrake repository, so this repository will shortly be archived.
  • We are hoping to soon drop PyOP2 entirely in favour of pyop3. We then, assuming funding, want to add GPU support there. We fully intend to use a lot of what you have done here but will have to adapt it a fair amount.

@kaushikcfd (Author):
Hi Connor!
Sounds good.

> We fully intend to use a lot of what you have done here but will have to adapt it a fair amount.

Let me know if anything in this patch isn't clear.
