kynan edited this page Feb 23, 2013 · 1 revision

This is an attempt to standardise a backend API for PyOP2. The design objectives are:

  1. To increase the level of standardisation between backends.
  2. To separate code generation from execution. This allows for the possibility of using the backend code generator in a static compilation mode. It also facilitates lazy evaluation approaches.

Currently, this structure follows the OpenCL backend more closely than the sequential backend, since I think the OpenCL backend is closer to allowing for the clean separation required.

This proposal is based on the PyOP2 API. The user documentation section is particularly relevant as it documents the public API of the library.

ParLoop

backend.ParLoop(kernel, it_space, *args)

A ParLoop object records the kernel, iteration space and arguments of a parallel loop. We should have a base ParLoop class, and each backend should subclass it to implement its own generation routines. At this stage, only the opencl backend actually implements ParLoop like this.

The backend ParLoop must implement the following methods:

generate(self)

Perform code generation using the kernel, iteration space and argument list. The generated code is returned as a string. Any caching of generated code is handled inside this method, so it is an internal implementation detail whether generation occurs in any given case or a cached version is returned.
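A minimal sketch of how caching could stay hidden inside generate() is given here. It is not PyOP2's implementation: the `_cache` dictionary, the `_cache_key` and `_generate_code` helpers and the `ToyParLoop` subclass are all hypothetical names, and the sketch assumes the kernel, iteration space and arguments are hashable.

```python
class ParLoop:
    """Base class: records the kernel, iteration space and arguments."""

    # Hypothetical cache of generated code, shared by all ParLoop objects.
    _cache = {}

    def __init__(self, kernel, it_space, *args):
        self.kernel = kernel
        self.it_space = it_space
        self.args = args

    def _cache_key(self):
        # Assumption: kernel, it_space and args are hashable. The backend
        # class name is part of the key so backends never share entries.
        return (type(self).__name__, self.kernel, self.it_space, self.args)

    def _generate_code(self):
        # Hypothetical hook: each backend subclass does the real work here.
        raise NotImplementedError("backends must implement code generation")

    def generate(self):
        """Return the generated code as a string, generating it at most once."""
        key = self._cache_key()
        if key not in ParLoop._cache:
            ParLoop._cache[key] = self._generate_code()
        return ParLoop._cache[key]


class ToyParLoop(ParLoop):
    """Stand-in backend that counts how often generation really runs."""

    generated = 0

    def _generate_code(self):
        ToyParLoop.generated += 1
        return "void wrap_%s(void) { /* ... */ }" % self.kernel
```

Calling `ToyParLoop("mass", "cells", "x").generate()` twice returns the same string while `ToyParLoop.generated` increases only once, so the caller cannot tell whether generation happened or a cached version was returned.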

Do we want an options dictionary to pass generation options in at this layer?

compute(self)

Execute the ParLoop. Code generation is performed inside the compute call.

Generated kernel signature

Currently, the sequential and opencl backends do slightly different things. The sequential backend does not attempt to find unique dats, so it may pass pointers to the same dat/map pair multiple times. For each direct dat or global in the par_loop call, the kernel just gets a pointer to the data. For each indirect dat, the kernel gets a pointer to the data and a pointer to the map. For each matrix argument, the kernel gets a pointer to the matrix object and, for each (rowmap, colmap) pair, a pointer to each of these maps. Finally, the generated kernel gets pointers to the data of each known op2 Const object. Passing constants by pointer eases caching of generated code, since constant values can change between par_loop calls. Note that Const objects are passed sorted lexicographically by name (so 'a_const' appears before 'b_const').

So, for example, a generated kernel signature may look like:

void generated_kernel(T1 *direct_dat1, T2 *indirect_dat2, int *map2, T2 *indirect_dat3, int *map3,
                      T4 *mat4, int *rmap4, int *cmap4, T5 *global5, T6 *global6,
                      T7 *a_const, T8 *b_const, T9 *c_const);
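The argument ordering described above for the sequential backend can be sketched as a small function that assembles the signature string. This is an illustration only, not the sequential backend's generator: the `(kind, ctype, name)` tuples, the `consts` dictionary and the `_map`, `_rmap` and `_cmap` suffixes are hypothetical, and each matrix is assumed to carry a single (rowmap, colmap) pair.

```python
def sequential_signature(args, consts):
    """Build a generated-kernel signature for the sequential convention.

    args   -- list of (kind, ctype, name) tuples in par_loop order, where
              kind is "direct", "global", "indirect" or "mat" (hypothetical
              encoding used only for this sketch)
    consts -- dict mapping Const name to its C type
    """
    params = []
    for kind, ctype, name in args:
        # Every argument contributes a pointer to its data (or matrix object).
        params.append("%s *%s" % (ctype, name))
        if kind == "indirect":
            # Indirect dats are followed by a pointer to their map.
            params.append("int *%s_map" % name)
        elif kind == "mat":
            # Matrices are followed by their row map and column map.
            params.append("int *%s_rmap" % name)
            params.append("int *%s_cmap" % name)
        elif kind not in ("direct", "global"):
            raise ValueError("unknown argument kind: %r" % (kind,))
    # Const objects come last, sorted lexicographically by name.
    for name in sorted(consts):
        params.append("%s *%s" % (consts[name], name))
    return "void generated_kernel(%s);" % ", ".join(params)
```

No attempt is made to remove duplicates here, which mirrors the sequential behaviour described above: the same dat/map pair appearing twice in `args` is simply emitted twice.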

The opencl backend does more work. It passes only unique dat/map pairs to the generated kernel (the generated code knows about this). The arguments of the opencl generated kernel are, in order:

  1. For each unique dat, a pointer to the data.
  2. For each unique read-only Global, a pointer to the data buffer.
  3. For each unique reduction Global, a pointer to the allocated reduction buffer.
  4. For each Constant, a pointer to the data buffer.

If the par_loop is direct, the final argument is the set size. If the par_loop is indirect, the following arguments come next instead:

  1. For each unique indirect arg (plan->ninds of them), a pointer to plan->ind_map[offset].
  2. For each indirect arg, a pointer to plan->loc_map[offset].
  3. For each unique matrix argument, a pointer to the data array and the CSR row and column pointers, followed by pointers to each of the rowmap and colmap maps for the matrix.
  4. The further plan data pointers, in the order plan->ind_sizes, plan->ind_offs, plan->blkmap, plan->offset, plan->nelems, plan->nthrcol, plan->thrcol.
  5. The block_offset, as the final argument.

For an indirect par_loop, the generated kernel is executed plan->ncolors times, with the block_offset changing on each execution while all other arguments stay constant.
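The repeated execution of an indirect kernel can be sketched as a host-side loop. This is a sketch under assumptions, not the opencl backend's code: `launch` stands in for whatever enqueues the kernel, and the text above does not say how block_offset changes, so the sketch assumes it advances by the number of blocks in each colour, supplied in a hypothetical `blocks_per_colour` list.

```python
def run_coloured(launch, fixed_args, ncolors, blocks_per_colour):
    """Call launch once per colour, varying only the trailing block_offset.

    launch            -- callable standing in for the kernel launch
    fixed_args        -- arguments that stay constant across executions
    ncolors           -- number of colours in the plan (plan->ncolors)
    blocks_per_colour -- assumed number of blocks in each colour
    """
    block_offset = 0
    for colour in range(ncolors):
        # block_offset is always the final kernel argument.
        launch(*fixed_args, block_offset)
        # Assumption: the next colour starts after this colour's blocks.
        block_offset += blocks_per_colour[colour]
```

With three colours containing 4, 2 and 5 blocks, `launch` would be called with block offsets 0, 4 and 6 while every other argument is identical.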

##Minimal static data types.

The implementations of the OP2 objects in pyop2.base do not enforce the existence of set sizes, data and other runtime-only attributes. Runtime-only methods are then added in pyop2.runtime_base and in the specific backend modules. This means that a static generator can operate by using the types directly from pyop2.base in combination with ParLoop from the backend for which code is to be generated. A possible API to enable this would be to modify op2.init:

op2.init(backend="opencl", static=True)

This would actually load the base backend, which would not implement ParLoop, and would then import ParLoop from the opencl module. For a certain amount of additional safety, it could then overwrite the compute method on ParLoop with an error message to prevent accidents.
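Overwriting compute in static mode might look like the following sketch. `OpenCLParLoop` is a stand-in class written for this example rather than the real opencl ParLoop, and `make_static` and `_compute_disabled` are hypothetical helpers that op2.init could call when static=True.

```python
class OpenCLParLoop:
    """Stand-in for a backend ParLoop; not the real opencl implementation."""

    def __init__(self, kernel, it_space, *args):
        self.kernel = kernel
        self.it_space = it_space
        self.args = args

    def generate(self):
        return "/* generated code for %s */" % self.kernel

    def compute(self):
        return "executed %s" % self.kernel


def _compute_disabled(self):
    # Replacement for compute(): fail loudly instead of trying to execute.
    raise RuntimeError(
        "compute() is not available: op2.init was called with static=True")


def make_static(parloop_cls):
    """Overwrite compute on the given ParLoop class and return the class."""
    parloop_cls.compute = _compute_disabled
    return parloop_cls
```

After `make_static(OpenCLParLoop)`, calling generate() still returns code, while compute() raises a RuntimeError. Overwriting the method on a subclass instead would leave the original class usable at runtime, at the cost of one more class to keep track of.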

##Runtime implementation.

For a runtime implementation, the wrapping of ParLoop in a par_loop call might be as simple as:

def par_loop(kernel, it_space, *args):
    ParLoop(kernel, it_space, *args).compute()

Instantiating the ParLoop object has the advantage that it facilitates lazy evaluation, since we don't actually have to make the compute() method call at this point: the backend implementation is free to do something far smarter.