
Initial commit of the OpenCL integrators and drivers by CPS. #4

Open · wants to merge 6 commits into master
Conversation

@stonecp stonecp commented Jun 1, 2017

Kyle and Nick,
Here's my initial commit of the OpenCL code used for the WIP paper on SIMD parallelism for kinetics.

@skyreflectedinmirrors (Collaborator)

@stonecp Thanks Chris!

I will work on getting this integrated into our framework

@kyleniemeyer (Member)

@arghdos ping on this?

@skyreflectedinmirrors (Collaborator)

@kyleniemeyer @stonecp

Ok, I see a few things we'll want to resolve before incorporating this:

  1. This package is really intended to provide integrators for generic ODEs. Accordingly, the chemical kinetics library in cklib.c/cklib.h/ck_driver.cl doesn't particularly belong here. What I've done in the past is have pyJac populate the necessary source files for the RHS/Jacobian functions in accelerInt's mechanism/ folder; these are then compiled and linked into the final executable against the ODE integration drivers, which expect RHS/Jacobian functions of a given form.

So my question here is: do we want to spin off this functionality into its own pyJac-like package? Additionally, how exactly are the binary mechanism files created? I don't see that code in here.

  2. We have different signatures for all the RHS / Jacobian functions, so we'll probably want to come up with a unified expected API. I suppose it might have to differ between CUDA and OpenCL, as OpenCL doesn't support bundling dynamically sized __global arrays into a struct and passing it to a kernel (as I do with CUDA for convenience, and to get around the maximum-argument limitation).

@stonecp (Author) commented Aug 30, 2017

I could switch over to using pyJac for the RHS function, which would also give me the Jacobian. I could stop using my cklib (a Chemkin-like clone) in that case.

The objects in OpenCL are JIT compiled, so I just push the source file directly into the OCL run-time library and out pop the objects. I could take the pyJac-generated source and load that at run time. Does the RHS from pyJac populate an array (e.g., wdot[]), or are all values scalars as in the past? I'll have to review the RHS API.
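A minimal sketch of that JIT flow (error handling elided; the context ctx, device dev, and kernel name "dydt" here are illustrative assumptions, not this PR's actual code):

#include <CL/cl.h>
#include <string.h>

/* Sketch: hand the OpenCL runtime the kernel source text and
 * JIT-compile it for the target device.                       */
cl_kernel build_kernel_from_source(cl_context ctx, cl_device_id dev,
                                   const char* src)
{
    cl_int err;
    size_t len = strlen(src);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &len, &err);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL1.2", NULL, NULL);
    /* "dydt" stands in for whatever entry point the generated source defines */
    return clCreateKernel(prog, "dydt", &err);
}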

For the SIMD stuff, I may have to go upstream to change the datatype written by pyJac ... for example, from double to double8. Again, I'll have to review the pyJac output. If the datatype in the generated source were C++ templated, this would be easy ... but that won't work with OCL, sadly. Yet another reason to stop using it.

@skyreflectedinmirrors (Collaborator) commented Aug 30, 2017

> The objects in OpenCL are JIT compiled, so I just push the source file directly into the OCL run-time library and out pop the objects. I could take the pyJac-generated source and load that at run time. Does the RHS from pyJac populate an array (e.g., wdot[]), or are all values scalars as in the past? I'll have to review the RHS API.

So the C version of the RHS operates on simple arrays:

void dydt (const double t, const double pres, const double * __restrict__ y, double * __restrict__ dy)

The CUDA version takes an additional struct argument, mechanism_memory:

__device__ void dydt (const double t, const double pres, const double * __restrict__ y, double * __restrict__ dy, const mechanism_memory * __restrict__ d_mem)

which contains pointers to preallocated global device work arrays (e.g., concentrations, forward/reverse rates, etc.). I used this instead of local (in the CUDA sense) memory arrays because I've had problems before with exceeding the maximum local (CUDA) memory allowance per thread (especially for Jacobian arrays). It's generally simpler to use global device arrays and pass them in; however...

For OpenCL this would require a workaround, but it's potentially possible. One option is to simply declare a maximum size and only force a JIT recompile if it's exceeded. My current OpenCL work for the new pyJac avoids this by having the arrays be dynamically sized (with length divisible by the work-item size), which requires no recompiling (ideal for, say, a CFD situation where the number of problems may change with every iteration).
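For concreteness, such a struct might look roughly like the following (a sketch only; the field names are illustrative, not pyJac's actual layout):

/* Sketch of a preallocated global work-array struct in the spirit
 * of mechanism_memory; every pointer is a device (global) buffer. */
typedef struct
{
    double* y;          /* state vectors           */
    double* dy;         /* RHS output              */
    double* conc;       /* species concentrations  */
    double* fwd_rates;  /* forward reaction rates  */
    double* rev_rates;  /* reverse reaction rates  */
    double* jac;        /* Jacobian work array     */
} mechanism_memory;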

> For the SIMD stuff, I may have to go upstream to change the datatype written by pyJac ... for example, from double to double8. Again, I'll have to review the pyJac output. If the datatype in the generated source were C++ templated, this would be easy ... but that won't work with OCL, sadly. Yet another reason to stop using it.

It's all C-based, and it should be relatively easy (if somewhat tedious) to add a new OpenCL language target to the current version of pyJac that simply writes all doubles as double8, etc. The new version of pyJac is generated using loopy, and I can fairly easily add support for vector types (I've currently just used regular arrays and let OpenCL's implicit vectorizer figure it out for simplicity, but it would be good to see whether there is a significant performance benefit from using the vector types explicitly).
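As a rough sketch of that scalar-to-vector change in OpenCL C (a toy example, not actual pyJac output):

/* Sketch: the same toy update written with scalar and explicit-vector
 * types. With double8, each vector lane carries one of eight problems. */
void update_scalar(const double* y, double* dy)
{
    dy[0] = -1000.0 * y[0];    /* one problem per work-item    */
}

void update_vector(const double8* y, double8* dy)
{
    dy[0] = -1000.0 * y[0];    /* eight problems per work-item */
}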

One more issue that currently isn't well handled in accelerInt: user data. Currently the RHS function expects a user parameter (here, the pressure), but this should clearly be revised toward some sort of CVODEs-like user_data parameter that gets passed to the RHS / Jacobian functions.

@kyleniemeyer (Member)

@stonecp @arghdos where did we leave this?

@skyreflectedinmirrors (Collaborator) commented Sep 10, 2018

@kyleniemeyer I'm beginning the process of coupling of pyJac-v2 to accelerInt this week (as soon as all my unit tests pass).

RE: the memory structure & user data, I've changed pyJac-v2 to adopt a model closer to what @stonecp used in this PR; essentially there is a work buffer from which all the internally used buffers are extracted via pointer arithmetic, e.g.:

double8* __restrict__ P_arr = rwk + 1 * get_num_groups(0);

This has the added benefit of de-duplicating a lot of internal memory storage, and it fits well with the copy-in/out driver that @stonecp created to power the initial-condition loop. pyJac-v2 now supports explicit-SIMD types for wide vectorization as well (though currently only with my own branch of loopy, as I still need to get that PR merged).
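Spelled out slightly more (a sketch; the offsets, array names, and the assumption that rwk is a double8 work buffer are illustrative):

/* Sketch: carving typed sub-buffers out of a single work buffer via
 * pointer arithmetic, one slot per work-group.                       */
const size_t G = get_num_groups(0);
double8* __restrict__ T_arr = rwk + 0 * G;  /* temperatures  */
double8* __restrict__ P_arr = rwk + 1 * G;  /* pressures     */
double8* __restrict__ phi   = rwk + 2 * G;  /* state vector  */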

So, long story short the bulk of the issues I raised above have been dealt with, and V2 should be fairly easy to couple to the new solver.

One thing I would still like feedback on: what should we expect for user_data? For reference, CVODEs expects a user RHS function like:

typedef int (*CVRhsFn)(realtype t, N_Vector y, N_Vector ydot, void *user_data);

and a Jacobian function of:

typedef int (*CVDlsJacFn)(realtype t, N_Vector y, N_Vector fy, SUNMatrix Jac, void *user_data, N_Vector tmp1, N_Vector tmp2, N_Vector tmp3);
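The key point is the opaque user_data pointer that the integrator threads through to the callback, which casts it back to a concrete type. A minimal sketch (the payload struct and RHS body are hypothetical, not CVODEs code):

#include <nvector/nvector_serial.h>

/* Sketch of the CVODES user_data pattern: cast the opaque pointer
 * back to a user-defined payload inside the callback.             */
typedef struct { double pressure; } my_udata;   /* hypothetical payload */

int my_rhs(realtype t, N_Vector y, N_Vector ydot, void* user_data)
{
    const my_udata* ud = (const my_udata*)user_data;
    NV_Ith_S(ydot, 0) = -ud->pressure * NV_Ith_S(y, 0);  /* toy RHS */
    return 0;
}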

Currently V2 uses functions like:

void jacobian(double const *restrict P_arr, double const *restrict phi, double *restrict jac, double *restrict rwk)

void species_rates(double const *restrict P_arr, double const *restrict phi, double *restrict dphi, double *restrict rwk)

  • Note: species_rates is a misnomer; I'm planning on changing it to source_rates or the like when I start fixing all the Flake errors currently in V2 for the first "true" release.
  • Additionally, P_arr is replaced with V_arr for constant-volume problems.

Obviously this works for us, but I think it would be worthwhile (with an eye on our future accelerInt paper) to come up with a more generalized calling interface. I was thinking something like:

void jacobian(double const *__restrict__ phi, double *__restrict__ jac, double *__restrict__ rwk, void const *__restrict__ user_data)

But this presents a few challenges in a vectorized environment, namely:

  • Do all platforms (CUDA, OpenCL) support void* passing of generic data (e.g., structs?)

I believe the answer to this is yes, even if it's annoying to copy over arrays of structs, etc.

  • Do we want to include the 'time' variable in the source-rate / Jacobian signatures?

My opinion is yes for compatibility, even if our application has no use for it.

  • How do we handle the issue of per-thread (in CUDA terminology) user-data in a wide-vectorization?

To see what I mean, take our case of the pressure array, which contains a different pressure for each initial condition. In the 'driver' kernel model, however, the RHS function expects to be called on a local copy (in the sense of 'I copied this data from the global state vector of all initial conditions into my own private working buffer'), and indexing a void pointer to produce that per-condition slice is tricky; hence a call like:

species_rates(&P_arr[initial_condition], phi_local, dphi_local, rwk)

(where initial_condition is the index of the IC we're currently solving) won't work if the type of P_arr is void* (as it is for the 'user data')! Additionally, I imagine there are certain problems where the user data doesn't vary per initial condition (e.g., if the pressure were constant over all the initial conditions we were solving), which I suppose we might allow via a user option.
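To make the unpacking problem concrete (a sketch; the wrapper names call_typed/call_opaque are hypothetical, and species_rates is the V2 signature above):

/* With a typed pointer, the per-condition offset is trivial ...     */
void call_typed(const double* P_arr, const double* phi_local,
                double* dphi_local, double* rwk, int initial_condition)
{
    species_rates(&P_arr[initial_condition], phi_local, dphi_local, rwk);
}

/* ... but void* has no element size, so the driver cannot form
 * &user_data[initial_condition] without knowing the user's type.    */
void call_opaque(const void* user_data, const double* phi_local,
                 double* dphi_local, double* rwk, int initial_condition)
{
    /* species_rates(&user_data[initial_condition], ...);  <- ill-formed C */
}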

So there are a few solutions here I'd like both your feedback on:

  1. We simply pass the initial-condition index / indices into the RHS / Jacobian functions so that they can properly unpack the user data, e.g., something like this ugly pointer math:
double8* P = (double8*)(&((double*)user_data)[initial_condition]);

The downside here is that the user has to correctly implement some fairly tricky pointer arithmetic, and we have to expose the initial_condition index to the RHS and Jacobian (i.e., place it inside pyJac or write our own wrapper) for that sole reason.

  2. We force the user to set up the data type at compilation time, i.e., they must define a user_data type such as:
struct user_data
{
    double myCoolParameter;
    int and_some_offset;
}; 

or #define user_data double (for a single parameter), which can then be included in the solver itself so that we have type information.

The plus side here is that we skip the concerns above, with the downside that all of this has to happen at compile time, making it difficult to link against run-time-defined objects (@kyleniemeyer this might be a bigger issue when we get around to hooking accelerInt up to the scipy stack).
