I've been thinking about this for some time, and I am not sure it's possible to provide a useful abstraction over constant memory that would work for both OpenCL and CUDA. The way constant memory works is too different between the two platforms.
### OpenCL
A standard global memory buffer is decorated with the `__constant` keyword when passed to a kernel. That enables use of the constant cache with that buffer.
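For illustration, a minimal OpenCL C sketch (the kernel and all names are made up for this example); the buffer itself is an ordinary global-memory allocation on the host side, and only the parameter qualifier changes:

```c
__kernel void axpy(__global float *y,
                   __constant float *coef, /* same buffer, read through the constant cache */
                   uint n)
{
    uint i = get_global_id(0);
    if (i < n) y[i] += coef[i % 32];
}
```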
### CUDA
A `__constant__` array is created at program scope (outside of any kernel) and initialized with a call to the `cudaMemcpyToSymbol` API. Any kernel in that program may then use the array; there is no need to pass it as a kernel parameter.
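A corresponding CUDA sketch (again with illustrative names); note that the array is a program-scope symbol rather than a kernel parameter:

```cuda
#include <cuda_runtime.h>

__constant__ float coef[32];     // program-scope constant memory

__global__ void axpy(float *y, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += coef[i % 32];   // no parameter needed: the symbol is visible
}

// Host side: initialize the symbol once before launching any kernel that uses it.
void upload_coef(const float *host_coef)
{
    cudaMemcpyToSymbol(coef, host_coef, sizeof(coef));
}
```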
Now, vexcl creates a single-kernel program for each unique vector expression it encounters in the code. I see two options for implementing `constant_array<T,N>` for the CUDA backend:
1. There are 'callbacks' that each new type can fill in to help vexcl generate the kernel source and pass the actual arguments to the kernel. I could add another callback that is invoked right after the kernel is compiled, but before its first use (somewhere around here). `constant_array<T,N>` would use that callback to copy its contents to constant memory with `cudaMemcpyToSymbol` (see the sketch after this list).
2. I could call `cudaMemcpyToSymbol` each time a kernel using `constant_array<T,N>` is launched.
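A rough sketch of how the first option might look; the `on_kernel_compiled` hook and all names below are hypothetical, standing in for whatever callback vexcl would actually expose:

```cuda
#include <array>
#include <cuda_runtime.h>

// Symbol that would be emitted into the generated single-kernel program.
__constant__ double dev_A[32];

struct constant_array_d32 {       // hypothetical stand-in for constant_array<double, 32>
    std::array<double, 32> host;

    // Would be registered as the new callback and invoked by vexcl exactly
    // once, right after the kernel is compiled and before its first launch.
    void on_kernel_compiled() const {
        cudaMemcpyToSymbol(dev_A, host.data(), sizeof(dev_A));
    }
};
```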
The first approach has the drawback that users won't be able to change the contents of `constant_array<T,N>`, since its contents are copied to the GPU only once, when the kernel is compiled. Moreover, vexcl has no means to differentiate between expressions that differ only in the contents of their terminals, so the following would not work either:
```cpp
constant_array<double, 32> A(...);
constant_array<double, 32> B(...);

x = func(A, ...); // Expression uses A; ok.
x = func(B, ...); // Expression uses B, but has the same type as above:
                  // A was already copied to the kernel and will be used here as well.
```
The second approach incurs the noticeable overhead of a memory transfer (with `cudaMemcpyToSymbol`) every time a kernel is launched. Since the primary use of constant memory is speed optimization, this seems counterproductive.
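Reusing the hypothetical names from the sketch above, the second option would amount to an extra transfer on every launch:

```cuda
// Option 2, sketched (reusing dev_A and constant_array_d32 from above):
// the contents may now change between launches, but every launch pays for
// the host-to-device copy.
void launch_with(constant_array_d32 const &a)
{
    cudaMemcpyToSymbol(dev_A, a.host.data(), sizeof(dev_A)); // per-launch transfer
    // ... then launch the generated kernel that reads dev_A ...
}
```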