I've been thinking about this for some time, and I am not sure it's possible to provide a useful abstraction over constant memory that would work for both OpenCL and CUDA. The way constant memory works is too different between the two platforms.
### OpenCL
A standard global memory buffer is decorated with the `__constant` keyword when passed to a kernel. That enables use of the constant cache with that buffer.
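For illustration, a minimal OpenCL C sketch (the kernel and all names are made up for this example); the buffer itself is an ordinary global-memory allocation on the host side, and only the parameter qualifier changes:

```c
__kernel void axpy(__global float *y,
                   __constant float *coef, /* same buffer, read through the constant cache */
                   uint n)
{
    uint i = get_global_id(0);
    if (i < n) y[i] += coef[i % 32];
}
```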
### CUDA
A `__constant__` array is created at program scope (outside of any kernel) and initialized with a call to the `cudaMemcpyToSymbol` API. Any kernel in that program may then use the array; there is no need to pass it as a kernel parameter.
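A corresponding CUDA sketch (again with illustrative names); note that the array is a program-scope symbol rather than a kernel parameter:

```cuda
#include <cuda_runtime.h>

__constant__ float coef[32];     // program-scope constant memory

__global__ void axpy(float *y, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += coef[i % 32];   // no parameter needed: the symbol is visible
}

// Host side: initialize the symbol once before launching any kernel that uses it.
void upload_coef(const float *host_coef)
{
    cudaMemcpyToSymbol(coef, host_coef, sizeof(coef));
}
```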
Now, vexcl creates a single-kernel program for each unique vector expression it encounters in the code. I see two options for implementing `constant_array<T,N>` for the CUDA backend:
1. There are 'callbacks' that each new type can fill in to help vexcl generate the kernel source and pass the actual arguments to the kernel. I could add another callback that is invoked right after the kernel is compiled, but before its first use (somewhere around here). `constant_array<T,N>` would use that callback to copy its contents to constant memory with `cudaMemcpyToSymbol` (see the sketch after this list).
2. I could call `cudaMemcpyToSymbol` each time a kernel using `constant_array<T,N>` is launched.
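A rough sketch of how the first option might look; the `on_kernel_compiled` hook and all names below are hypothetical, standing in for whatever callback vexcl would actually expose:

```cuda
#include <array>
#include <cuda_runtime.h>

// Symbol that would be emitted into the generated single-kernel program.
__constant__ double dev_A[32];

struct constant_array_d32 {       // hypothetical stand-in for constant_array<double, 32>
    std::array<double, 32> host;

    // Would be registered as the new callback and invoked by vexcl exactly
    // once, right after the kernel is compiled and before its first launch.
    void on_kernel_compiled() const {
        cudaMemcpyToSymbol(dev_A, host.data(), sizeof(dev_A));
    }
};
```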
The first approach has the drawback that users won't be able to change the contents of `constant_array<T,N>`, since its contents are copied to the GPU only once, when the kernel is compiled. Moreover, vexcl has no means to differentiate between expressions that differ only in the contents of their terminals, so the following would not work either:
```cpp
constant_array<double, 32> A(...);
constant_array<double, 32> B(...);

x = func(A, ...); // Expression uses A; ok.
x = func(B, ...); // Expression uses B, but has the same type as above:
                  // A was already copied to the kernel and will be used here as well.
```
The second approach incurs the noticeable overhead of a memory transfer (with `cudaMemcpyToSymbol`) every time a kernel is launched. Since the primary use of constant memory is speed optimization, this seems counterproductive.
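Reusing the hypothetical names from the sketch above, the second option would amount to an extra transfer on every launch:

```cuda
// Option 2, sketched (reusing dev_A and constant_array_d32 from above):
// the contents may now change between launches, but every launch pays for
// the host-to-device copy.
void launch_with(constant_array_d32 const &a)
{
    cudaMemcpyToSymbol(dev_A, a.host.data(), sizeof(dev_A)); // per-launch transfer
    // ... then launch the generated kernel that reads dev_A ...
}
```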